ColdFusion Instance Issues, Round 2

As some of you might know, my workplace is no stranger to odd server problems, such as instances just wanting to return 503 errors. Today I have a new one: one of our applications refuses to run at full speed if it's in anything other than the default instance.

The Servers

I think it might be best to spell out the hardware and software configuration of the servers I'm talking about, so it's not lost in the text. It's important to note that all of these tests are talking to the same database and database server. All server names have been changed to protect the innocent.

Komodo Pair
Load balanced identical pair of servers, Komodo1 and Komodo2.
Hardware: Quad core 3 GHz (4 cores), 6 GB of RAM
OS & CF: Microsoft Windows Server 2003 R2 (32-bit), CF 8 32-bit
Instances: CMS was running in its own instance pair, Komodo1i1 and Komodo2i1, and later in cfusion.

Jaguar Pair
Load balanced identical pair of servers, Jaguar1 and Jaguar2.
Hardware: Dual Quad core 1.6 GHz (8 cores), 8 GB of RAM
OS & CF: Microsoft Windows Server 2003 R2 (32-bit), CF 8 32-bit
Instances: CMS was running in its own instance pair, Jaguar1i1 and Jaguar2i1.

Jaybird
Hardware: Dual core 1.8 GHz (2 cores), 2 GB of RAM
OS & CF: Server 2003 (32-bit), ColdFusion 8 32-bit
Instances: CMS was running in the main instance, cfusion.

Adder
Our test machine.
Hardware: Dual core 3 GHz (2 cores), 4 GB of RAM
OS & CF: Server 2003 64-bit, ColdFusion 8 64-bit
Instances: CMS was running in the main instance, cfusion.

The Problem

We have a CMS that powers some of our client websites, and several months ago we moved it to a load balanced pair of servers (Jaguar Pair), hoping to see some better performance. The site was kind of query heavy (older coding and lots of recursive querying), and it had been housed with a number of other sites on a shared server (Jaybird). After moving it we didn't really see a performance boost, but we knew it had to be faster. I mean, it's a better server, how could it not be faster?

Fast forward to now, and after some user complaints, we're observing serious speed differences between our CMS on the Jaguar Pair and a development copy on Adder, our test machine. I'm not talking about a little sluggishness, but requests taking 6-8 times longer to return on the Jaguar Pair than on Adder. Specifically, we were testing a page that reliably returned in less than a second on Adder, but took around 8 seconds on the Jaguar Pair.
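
A comparison like this can be scripted. What follows is a minimal sketch in Python, not the tooling used for the tests described here; the function names are made up for illustration, and any URLs passed in would need to point at the same page on each server.

```python
import statistics
import time
import urllib.request


def fetch_page(url, timeout=30):
    """Download the page body; the content is discarded, only the time matters."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()


def median_seconds(url, runs=5, fetch=fetch_page, clock=time.perf_counter):
    """Request `url` `runs` times and return the median wall-clock time in seconds.

    `fetch` and `clock` are parameters so the timing logic can be exercised
    without touching the network.
    """
    samples = []
    for _ in range(runs):
        started = clock()
        fetch(url)
        samples.append(clock() - started)
    return statistics.median(samples)


def slowdown(slow_url, fast_url, **kwargs):
    """Return how many times longer `slow_url` takes than `fast_url`."""
    return median_seconds(slow_url, **kwargs) / median_seconds(fast_url, **kwargs)
```

Calling `slowdown()` with the Jaguar Pair URL first and the Adder URL second would, for timings like the ones above (around 8 seconds versus under a second), report a ratio of roughly 8 or more. The median is used rather than the mean so that a single slow first request, such as a cold template cache, does not skew the result.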

The Tests

So, the first thing we tried to do was see if the network traffic was the same. Our server team tracked packet transfers back and forth between the servers to make sure that everything was going to the same location, and along the same paths. That part checked out, but it still showed that it simply took 8 times as long to get a response back from one server.

Next up was to make a copy of the CMS back onto Jaybird and do some testing, so that we could look at a single instance on a 32-bit OS. Testing showed that Jaybird did not experience slowness either, so the 32-bit OS was not to blame.

After that we wanted to see if it was something to do with server configurations, so we made a new instance pair on our other load balanced server pair, Komodo, and got everything set up, only to see the same performance problems on the Komodo Pair as we saw on the Jaguar Pair. So this behavior was not limited to a bad configuration on the Jaguar Pair or something like that.

Failing that test, we decided to see if there was some sort of overhead being incurred by CF instance clustering, so we isolated Komodo1i1 and removed it from the CF cluster. Tests showed that Komodo1i1 still lagged at 8 seconds to return results, even though it was running as a single instance.

Lastly, we removed our test site from an instance entirely and let it run in the main instance, cfusion, only to find that the response times suddenly dropped back to normal.

The Conclusion

The only conclusion that we can reach is that there is some sort of overhead that the instances are running into that the main cfusion instance is not experiencing. In all of these tests the only way we found to bring the response times back in line was to remove the program from an instance and let it run in the default instance, such as on Jaybird, Adder, and later Komodo. Any time we have it in an instance, such as on the Komodo Pair or the Jaguar Pair, the system slows to a crawl.

Running our CMS under the main instance obviously isn't a desirable solution, since we want to be able to have multiple instances on these load balanced servers. We do have other applications running on other instances on the Komodo Pair and the Jaguar Pair, so we can't really just run everything in the main instance, and we haven't been noticing a slowdown with those applications. It's possible that we are experiencing a slowdown with those applications as well, but that it's not as noticeable since they are not as query heavy as the CMS.

Does anyone have an idea what might be causing this slowdown?

 

Comments

marc esher: No idea at all, but i'm subscribing to see what others have to say.

I'm curious though: have you tried hooking up a java profiler to see if you can get a different "look" into the server? I've used YourKit with CF in the past and it's been helpful, though not in diagnosing weird stuff like this. anyways, just a thought.

good luck!
Jon Hartmann: @marc: Thanks for the idea; I'll see what we can find with that. I know that the network guys were using something called WireShark to look at the transfer of information, and I think they used a SeeFusion trial on it as well, although they weren't able to identify a problem with it.
marc esher: could it possibly be a connector issue? I know this would probably not be a fun task, but what would happen if you swapped out your webserver for another (IIS to apache)? or even just reinstalled the webserver and then reassociated the instances? I remember a while back on my dev machine, I had this weirdo situation where every so often, requests would hang for quite a while, yet the execution times were always reported in the milliseconds, which meant to me that the problem was happening in between CF and the webserver, not within CF itself.
Sean Corfield: I would definitely look at the connector configuration. I have seen similar behavior once before when the connector was configured to connect to multiple instances and if there was a problem with the first configured instance, it could take several seconds to "attach" to the second instance for each request.

My thoughts on the JRun connector are fairly well known(!) so I would suggest testing by hitting the JRun Web Server port directly to see whether the problem is CF or the connector/web server.
Jon Hartmann: @marc + @sean Thanks for the ideas, I'll pass this on to the server guys and see if they can find something. I'll post back here once I've gotten some results.
Remy Becher: I'm a bit late to the party, but try enabling verbose logging on the connector (locate the connector's jrun.ini, set verbose=true and restart the IIS site or the entire service) to see detailed information about what the connector is doing and, more importantly, what it's waiting for. Keep in mind this generates a lot of data quickly!
Jon Hartmann: @Remy: Good idea, thanks for that! I'm still waiting on the engineering guys to let me know what's going on with further investigating this issue. They have told me that they aren't going to bother testing Apache vs IIS, but I think they might be looking into doing IIS on that machine.
Jon Hartmann: I got some more information from our engineering team. They separated an instance from one of the CF clusters on the Komodo Pair, then deleted the instance and recreated it with verbose logging to see if they could get any more info, and lo and behold, it came up and was running smoothly. I think our next plan of action is to isolate one of the instances on the Jaguar Pair and rebuild its connectors as well, to see how performance is affected there.
Jon Hartmann: Rebuilding the connectors was the key. Two weeks of mucking around trying to figure out what to do for like a 2 second fix.
Jon Hartmann, July 2011

I'm Jon Hartmann and I'm a JavaScript fanatic, UX/UI evangelist and former ColdFusion master. I blog about mysterious error messages, user interface design questions, and all things baffling and irksome about programming for the web.

Learn more about me on LinkedIn.