ColdFusion Instance Issue

I'm running into a weird issue this morning at work. We have two servers behind a load balancer, and on those servers we have two ColdFusion instances clustered together (one on each; call them Thing1 and Thing2). Both the hardware load balancer and the CF cluster are set up to round-robin the requests. Every 20 minutes or so, all you can get from one of the sites is 503 error pages. When looking into it, you can get to the ColdFusion Administrator for the main instance on each server, and to Thing1's admin, but not to Thing2's CF Admin (it throws 503s as well). From the main instance's admin on Thing2's server, you can go in and stop the instance, and immediately the 503 errors disappear. However, after about 5 seconds, Thing2 starts itself up again automatically, and 20 minutes or so later things grind to a halt again.

I'm checking the ColdFusion and JRun logs now to see if I can find anything about why this might be happening, and will update as I go.

Update 1: Having problems finding the log files... found them in JRun4\logs finally:

02/05 09:59:28 Information [jrpp-2] - Application Started
02/05 10:02:42 user FileServlet: init
java.lang.RuntimeException: Request timed out waiting for an available thread to run. You may want to consider increasing the number of active threads in the thread pool.
    at jrunx.scheduler.ThreadPool$Throttle.enter(ThreadPool.java:123)
    at jrunx.scheduler.ThreadPool$ThreadThrottle.invokeRunnable(ThreadPool.java:425)
    at jrunx.scheduler.WorkerThread.run(WorkerThread.java:66)
java.lang.RuntimeException: Request timed out waiting for an available thread to run. You may want to consider increasing the number of active threads in the thread pool.
    at jrunx.scheduler.ThreadPool$Throttle.enter(ThreadPool.java:123)
    at jrunx.scheduler.ThreadPool$ThreadThrottle.invokeRunnable(ThreadPool.java:425)
    at jrunx.scheduler.WorkerThread.run(WorkerThread.java:66)
java.lang.RuntimeException: Request timed out waiting for an available thread to run. You may want to consider increasing the number of active threads in the thread pool.
    at jrunx.scheduler.ThreadPool$Throttle.enter(ThreadPool.java:123)
    at jrunx.scheduler.ThreadPool$ThreadThrottle.invokeRunnable(ThreadPool.java:425)
    at jrunx.scheduler.WorkerThread.run(WorkerThread.java:66)
java.lang.RuntimeException: Request timed out waiting for an available thread to run. You may want to consider increasing the number of active threads in the thread pool.
    at jrunx.scheduler.ThreadPool$Throttle.enter(ThreadPool.java:123)
    at jrunx.scheduler.ThreadPool$ThreadThrottle.invokeRunnable(ThreadPool.java:425)
    at jrunx.scheduler.WorkerThread.run(WorkerThread.java:66)
02/05 10:16:34 error
[1]java.lang.InterruptedException
    at java.lang.Object.wait(Native Method)
    at java.lang.Object.wait(Object.java:443)
    at edu.emory.mathcs.backport.java.util.concurrent.TimeUnit.timedWait(TimeUnit.java:301)
    at edu.emory.mathcs.backport.java.util.concurrent.helpers.WaitQueue$WaitNode.doTimedWait(WaitQueue.java:79)
    at edu.emory.mathcs.backport.java.util.concurrent.Semaphore$FairSync.attempt(Semaphore.java:346)
    at edu.emory.mathcs.backport.java.util.concurrent.Semaphore.tryAcquire(Semaphore.java:571)
    at coldfusion.CfmServlet.lock(CfmServlet.java:255)
    at coldfusion.CfmServlet.service(CfmServlet.java:174)
    at coldfusion.bootstrap.BootstrapServlet.service(BootstrapServlet.java:89)
    at jrun.servlet.FilterChain.doFilter(FilterChain.java:86)
    at coldfusion.monitor.event.MonitoringServletFilter.doFilter(MonitoringServletFilter.java:42)
    at coldfusion.bootstrap.BootstrapFilter.doFilter(BootstrapFilter.java:46)
    at jrun.servlet.FilterChain.doFilter(FilterChain.java:94)
    at jrun.servlet.FilterChain.service(FilterChain.java:101)
    at jrun.servlet.ServletInvoker.invoke(ServletInvoker.java:106)
    at jrun.servlet.JRunInvokerChain.invokeNext(JRunInvokerChain.java:42)
    at jrun.servlet.JRunRequestDispatcher.invoke(JRunRequestDispatcher.java:284)
    at jrun.servlet.ServletEngineService.dispatch(ServletEngineService.java:543)
    at jrun.servlet.jrpp.JRunProxyService.invokeRunnable(JRunProxyService.java:203)
    at jrunx.scheduler.ThreadPool$ThreadThrottle.invokeRunnable(ThreadPool.java:428)
    at jrunx.scheduler.WorkerThread.run(WorkerThread.java:66)
[0]javax.servlet.ServletException
    at coldfusion.CfmServlet.service(CfmServlet.java:199)
    at coldfusion.bootstrap.BootstrapServlet.service(BootstrapServlet.java:89)
    at jrun.servlet.FilterChain.doFilter(FilterChain.java:86)
    at coldfusion.monitor.event.MonitoringServletFilter.doFilter(MonitoringServletFilter.java:42)
    at coldfusion.bootstrap.BootstrapFilter.doFilter(BootstrapFilter.java:46)
    at jrun.servlet.FilterChain.doFilter(FilterChain.java:94)
    at jrun.servlet.FilterChain.service(FilterChain.java:101)
    at jrun.servlet.ServletInvoker.invoke(ServletInvoker.java:106)
    at jrun.servlet.JRunInvokerChain.invokeNext(JRunInvokerChain.java:42)
    at jrun.servlet.JRunRequestDispatcher.invoke(JRunRequestDispatcher.java:284)
    at jrun.servlet.ServletEngineService.dispatch(ServletEngineService.java:543)
    at jrun.servlet.jrpp.JRunProxyService.invokeRunnable(JRunProxyService.java:203)
    at jrunx.scheduler.ThreadPool$ThreadThrottle.invokeRunnable(ThreadPool.java:428)
    at jrunx.scheduler.WorkerThread.run(WorkerThread.java:66)

02/05 10:16:34 error
[1]java.lang.InterruptedException
    at java.lang.Object.wait(Native Method)
    at java.lang.Object.wait(Object.java:443)
    at edu.emory.mathcs.backport.java.util.concurrent.TimeUnit.timedWait(TimeUnit.java:301)
    at edu.emory.mathcs.backport.java.util.concurrent.helpers.WaitQueue$WaitNode.doTimedWait(WaitQueue.java:79)
    at edu.emory.mathcs.backport.java.util.concurrent.Semaphore$FairSync.attempt(Semaphore.java:346)
    at edu.emory.mathcs.backport.java.util.concurrent.Semaphore.tryAcquire(Semaphore.java:571)
    at coldfusion.CfmServlet.lock(CfmServlet.java:255)
    at coldfusion.CfmServlet.service(CfmServlet.java:174)
    at coldfusion.bootstrap.BootstrapServlet.service(BootstrapServlet.java:89)
    at jrun.servlet.FilterChain.doFilter(FilterChain.java:86)
    at coldfusion.monitor.event.MonitoringServletFilter.doFilter(MonitoringServletFilter.java:42)
    at coldfusion.bootstrap.BootstrapFilter.doFilter(BootstrapFilter.java:46)
    at jrun.servlet.FilterChain.doFilter(FilterChain.java:94)
    at jrun.servlet.FilterChain.service(FilterChain.java:101)
    at jrun.servlet.ServletInvoker.invoke(ServletInvoker.java:106)
    at jrun.servlet.JRunInvokerChain.invokeNext(JRunInvokerChain.java:42)
    at jrun.servlet.JRunRequestDispatcher.invoke(JRunRequestDispatcher.java:284)
    at jrun.servlet.ServletEngineService.dispatch(ServletEngineService.java:543)
    at jrun.servlet.jrpp.JRunProxyService.invokeRunnable(JRunProxyService.java:203)
    at jrunx.scheduler.ThreadPool$ThreadThrottle.invokeRunnable(ThreadPool.java:428)
    at jrunx.scheduler.WorkerThread.run(WorkerThread.java:66)
[0]javax.servlet.ServletException
    at coldfusion.CfmServlet.service(CfmServlet.java:199)
    at coldfusion.bootstrap.BootstrapServlet.service(BootstrapServlet.java:89)
    at jrun.servlet.FilterChain.doFilter(FilterChain.java:86)
    at coldfusion.monitor.event.MonitoringServletFilter.doFilter(MonitoringServletFilter.java:42)
    at coldfusion.bootstrap.BootstrapFilter.doFilter(BootstrapFilter.java:46)
    at jrun.servlet.FilterChain.doFilter(FilterChain.java:94)
    at jrun.servlet.FilterChain.service(FilterChain.java:101)
    at jrun.servlet.ServletInvoker.invoke(ServletInvoker.java:106)
    at jrun.servlet.JRunInvokerChain.invokeNext(JRunInvokerChain.java:42)
    at jrun.servlet.JRunRequestDispatcher.invoke(JRunRequestDispatcher.java:284)
    at jrun.servlet.ServletEngineService.dispatch(ServletEngineService.java:543)
    at jrun.servlet.jrpp.JRunProxyService.invokeRunnable(JRunProxyService.java:203)
    at jrunx.scheduler.ThreadPool$ThreadThrottle.invokeRunnable(ThreadPool.java:428)
    at jrunx.scheduler.WorkerThread.run(WorkerThread.java:66)

Seems like the number of available threads might be the problem.

Update 2: Still trying to get a stack trace... the ColdFusion service isn't running under the local account name, so the instructions at http://kb.adobe.com/selfservice/viewContent.do?externalId=tn_18339&sliceId=1 aren't helping. Trying to get StackTrace on there now, as the applet thing won't load.

Update 3: Well, no luck there... StackTrace ended up dumping a bunch of stacks that weren't ColdFusion. I'm not certain what is going on there. Still no explanation for why the service restarts itself after being stopped.

Update 4: Found that other clustered instances restart after being stopped too... is that normal? I can't find documentation noting that behavior.

Update 5: Tried upping activeHandlingThreads to 100 to buy myself some time (the box has 8 CPUs), but it turned out not to help much. I only got 30 seconds' worth of data out of the server monitor before it locked up again.
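
For reference, that setting lives in the instance's jrun.xml (in a multiserver install that's under JRun4\servers\<instance>\SERVER-INF\). The block I edited looks roughly like the sketch below; I'm writing this from memory, so treat the structure and names as a sketch and check them against your own file:

<!-- sketch only: the JRunProxyService block in jrun.xml; layout is from memory
     and for illustration, not a recommendation -->
<service class="jrun.servlet.jrpp.JRunProxyService" name="ProxyService">
    <!-- the attribute I bumped up to 100 -->
    <attribute name="activeHandlingThreads">100</attribute>
    <!-- other attributes (min/max handler threads, timeouts) left as they were -->
</service>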

Update 6: More false hopes. I changed some of the major queries in the app to use cachedWithin, and after that the server stayed up for about 6 minutes (twice what I was getting before), but now it's back to 3 minutes or less.
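
For anyone curious, the change was just adding a cachedWithin attribute to the heaviest queries, along these lines (the query, datasource, table, and column names here are made up for illustration):

<!--- hypothetical example: cache the resultset for 20 minutes so repeat requests
      reuse the cached data instead of hitting the database every time --->
<cfquery name="getArticles" datasource="myDSN"
         cachedWithin="#createTimeSpan(0, 0, 20, 0)#">
    SELECT   id, title, published
    FROM     articles
    ORDER BY published DESC
</cfquery>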

Update 7: Used the CF Admin API to try to make sure I'd killed off server monitoring on that box:

<cfscript>
// login to the Admin API
admin=createobject("component","CFIDE.adminapi.administrator");

// provide your admin password
admin.login("the_password");

// create the server monitoring object
instancesmon = createobject("component","CFIDE.adminapi.servermonitoring");

instancesmon.stopMonitoring();
</cfscript>

And shortly thereafter, the server tanked again.

Update 8: Ok, so Thing2 is still flipping out. If this were really a bottleneck issue or a traffic issue, shouldn't Thing1 be going down as well? It takes up the slack as soon as Thing2 is told to stop, and I've confirmed that it's been up and running since the main CF instance was started about three days ago.

 

Comments

David: Jon - we have a near-identical setup to yours, and while we don't have this exact problem, we do have some others that are similar.

Every now and then - there really is no rhythm to it - one of our instances "hangs". It just stops taking traffic, and the user trying to get there times out. There are no errors thrown; we have FusionReactor and it doesn't indicate any error. The JRun logs don't record any activity. We have to stop and start the service in order for it to get going again.

I'm not sure what else to say - it's not a site-volume problem; this can happen at 4 a.m. on a Sunday morning. It's not a server resource issue (we've checked), and there are no backups or anything like that running. We've checked the firewall, and there was no DoS attack.

We're perplexed, but are moving to new servers with more memory and CF8.1, so we're hoping and waiting. We invested a lot of money in enterprise with the promise of uptime, so we're getting a black eye on this :-/

I'll let you know if we find anything out.

Cheers,

Davo
marc esher: Is it possible that the server monitor is causing your problems?
Jon Hartmann: Thanks for your comments, Davo... that sounds equally frustrating. At least you're getting more than 3-6 minutes out of the instance though, hah :)
Jon Hartmann: @Marc: It's possible that it didn't help the situation, but it was not running before the problem started. I know that monitoring can cause problems of its own, so I was only trying to use it long enough to diagnose the problem. Monitoring should be off now, but I'm having problems connecting to the admin long enough to double-check.
marc esher: jon, this is a great problem! i mean, not for you, obviously, but to watch how you're going about solving it.

i wonder what would happen if:

1) you changed the login user for instance 2.
2) you deleted and recreated instance 2
3) you created a 3rd instance, added that to the cluster, and removed instance 2 from the cluster
4) you kept instance 2 running but removed it from the cluster, then tried running various things on instance 2 to see how long it stayed up
5) if you have beyond compare or a similar tool, run a file comparison between the instance1 directory and instance2 directory. i know this sounds stupid, but you never know... maybe an xml file got borked while something funky was happening on instance2. I had this happen to me a while back (an xml file got hosed during a server crash)
Jon Hartmann: @Marc:

Ha ha, yeah, if this were happening to someone else I'd be fascinated too :) One of our server guys is supposed to look at the situation this evening, and I'm going to recommend that he do numbers 5, 3, and 4. If he compares them, makes a new instance, say Thing3, swaps it in for Thing2, and then leaves Thing2 for me to take a look at tomorrow, I think that's the best I can do for now. I'm convinced that this is something instance-specific, since if it were traffic- or code-related, it should tank Thing1 as well, as it's the same code base (Windows DFS from a third location). I'll update once I find out more.
Jon Hartmann: Our server guy didn't do anything with it last night. We're attempting to isolate the instance so we can leave it running to take a look at later. We've disabled the forward-facing NIC on the server so that the hardware load balancer will stop sending requests to that machine, and we're removing Thing2 from the cluster. Not sure how soon we'll be able to find out what's going on.
FusionReactor: There is a thread on the FusionReactor groups with someone experiencing the same issue. Take a look and see if this helps you...

http://groups.google.com/group/fusionreactor/brows...
mikeh: Hi. I wanted to let you know I had a very similar problem to yours; in fact, our stack dumps were almost exactly the same. I found that the CF server monitor was running for the CF instance that exhibited the problem. I should have suspected the CF server monitor was involved from this line in the stack dump:

at coldfusion.monitor.event.MonitoringServletFilter.doFilter(MonitoringServletFilter.java:42)

I logged into the CFIDE and turned off all three elements of the server monitor and the problem went away completely. I have a feeling you're in the same boat.

Lots of Luck,

Mike
Jon Hartmann: @Mike: Yeah, the monitor was definitely part of the problem, but the server wouldn't stay stable long enough for me to actually log in and turn off the monitoring. I tried using the snippet I listed above to clear things out, but that didn't seem to help. The decision came down to just axing the server instance and reinstalling rather than trying to figure out what was going wrong.
stephen: Exactly the same problem here. 4 CF8 instances running inside BEA WebLogic 10 servers, all behind load balancers, all separate instances.

I monitor my 503 errors via a script, and we see a gradual increase until eventually the 503s become an issue and we have to restart. It happens approximately every two months.
Jon Hartmann: @Stephen: That's really interesting... it's starting to look like this might be an issue with server stability in general, rather than an issue with a particular configuration; I mean, you're running it on a completely different server than I was, and it's having similar issues.

I'm wondering if this is related to CF Clustering, since I don't see anyone saying they have this problem when running a single instance.
stephen: All 4 of my CF8 servers are standalone and independent, all with the same config, and the load balancer takes care of session persistence, so I'm not using CF clustering - it's probably not that. I am running Enterprise 8,0,0,176276.

Once I get our support info for Adobe, I will log a case. I will let you know what I find.

I have just turned on the logging option to log slow pages that take longer than 30 seconds to process. I am planning on ramping this setting up to try and identify which pages are causing problems.
Jon Hartmann: @Stephen: Now that is interesting... no clustering and still similar problems? Let me know what you find out!
Jason: Hello everyone,

I've had a similar issue take place on our website, but only for about five minutes, and it has not happened again.

There was a very resource-heavy file that we were serving to our users, which we have since taken offline completely and replaced with a lighter version.

Could it be that the functionality of one of your templates is overloading the system and causing the 503 errors?
digerati: While turning off CF monitoring within the CF admin in production is highly recommended, we had an issue like this occur with a combination of a load balancer, cloud instances, and dedicated servers.

Check the following:

Under the advanced DSN settings in the CF admin, there is a "Maintain Connections" option. Make sure this is enabled; if it's disabled, CF will open a new connection for each query, which will kill a production box.

There was also an issue with the connection timeout in relation to the MySQL server's connection time limit: the mysqld timeout was 19 minutes and CF's database connection timeout was set to 20 minutes, causing the timeouts every 20 minutes, a problem mentioned earlier in this thread.
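
If you want to verify the MySQL side of that, something like this (the datasource name is just a placeholder) will show the server's idle timeout in seconds so you can compare it to the CF connection timeout:

<!--- hypothetical check: dump MySQL's wait_timeout to compare against the
      datasource's connection timeout setting in the CF admin --->
<cfquery name="mysqlTimeout" datasource="myDSN">
    SHOW VARIABLES LIKE 'wait_timeout'
</cfquery>
<cfdump var="#mysqlTimeout#">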

Just something to check. Hope this helps someone.

Rock.