Originally posted on Server Fault, where it was suggested this question might be better asked here.
We are using JBoss to run two of our WARs. One is our web app, the other
If you are using JBoss 5.1.0 EAP, there is a known bug in JBoss, and a fix is available. Here is the URL: https://issues.jboss.org/browse/JBPAPP-5193
This typically happens with runaway code or unsafe thread access to HashMaps. A simple thread dump (kill -3, as @disown says, or Ctrl-Break in a Windows console) will reveal this problem.
Since you're unable to reproduce it using tests, I think it smells like a concurrency issue; it's usually hard to make test scripts behave randomly enough to catch issues of this type.
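To illustrate the HashMap point: a plain java.util.HashMap that is written to by several threads without synchronization can end up with corrupted internal state, which on older JVMs can leave a thread spinning at 100% CPU inside get() or put(). A minimal sketch of the usual remedy follows, assuming Java 8 or later. The class name SafeCounter and the key "page" are made up for illustration and are not from the original application.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SafeCounter {
    // ConcurrentHashMap tolerates concurrent writers; a plain HashMap does not.
    private final Map<String, Integer> hits = new ConcurrentHashMap<>();

    // merge() performs the read-modify-write atomically for the given key.
    public void record(String key) {
        hits.merge(key, 1, Integer::sum);
    }

    public int count(String key) {
        return hits.getOrDefault(key, 0);
    }

    // Hypothetical demo: several threads bump the same key concurrently.
    public static int runDemo(int threads, int perThread) {
        SafeCounter counter = new SafeCounter();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    counter.record("page");
                }
            });
            workers[i].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return counter.count("page");
    }

    public static void main(String[] args) {
        System.out.println(runDemo(4, 10000)); // prints 40000
    }
}
```

If a thread dump shows several threads stuck inside HashMap methods on the same shared map, swapping in a concurrent map (or guarding the map with a lock) is the first thing to try.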
I normally try to make it standard operating procedure to do thread-dumps of any JVM that is restarted due to operational anomalies, and it's really a requirement to catch those once-a-month things.
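If sending a signal to the process is awkward to automate, roughly the same information can be collected from inside the JVM through the standard java.lang.management API. The sketch below is one possible way to do that. The class name ThreadDumper is hypothetical, and how you trigger it (a servlet, a scheduled task, a shutdown script) is left open.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumper {
    // Builds a text dump of every live thread: name, id, state and stack frames.
    public static String dump() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        StringBuilder out = new StringBuilder();
        // false/false: skip lock details, which not every JVM supports.
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            out.append('"').append(info.getThreadName()).append('"')
               .append(" id=").append(info.getThreadId())
               .append(" state=").append(info.getThreadState())
               .append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(dump());
    }
}
```

Writing the result to a timestamped file before every restart gives you something to compare across incidents, which is what makes the once-a-month cases tractable.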
There's a quick and dirty way of identifying which threads are using up the CPU time on JBoss. Go to the JMX Console with a browser (usually on http://localhost:8080/jmx-console, but it may be different for you) and look for a bean called ServerInfo. It has an operation called listThreadCpuUtilization, which dumps the actual CPU time used by each active thread in a nice tabular format. If there's one misbehaving, it usually stands out like a sore thumb.
There's also the listThreadDump operation, which dumps the stack for every thread to the browser.
Not as good as a profiler, but a much easier way to get the basic information. For production servers, where it's often bad news to connect a profiler, it's very handy.
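For anyone who cannot reach the JMX Console, a similar per-thread CPU table can be produced with the standard ThreadMXBean. This is only a sketch of the idea, not JBoss's own implementation of listThreadCpuUtilization. The class name ThreadCpuReport and the output format are assumptions made for illustration, and the code assumes Java 8 or later.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

public class ThreadCpuReport {
    // Returns one row per live thread, busiest first, with CPU time in milliseconds.
    public static List<String> report() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> rows = new ArrayList<>();
        if (!mx.isThreadCpuTimeSupported()) {
            return rows; // this JVM cannot measure per-thread CPU time
        }
        if (!mx.isThreadCpuTimeEnabled()) {
            mx.setThreadCpuTimeEnabled(true);
        }
        // Collect {threadId, cpuNanos} pairs; -1 means the thread has already died.
        List<long[]> samples = new ArrayList<>();
        for (long id : mx.getAllThreadIds()) {
            long cpuNanos = mx.getThreadCpuTime(id);
            if (cpuNanos >= 0) {
                samples.add(new long[] {id, cpuNanos});
            }
        }
        samples.sort((a, b) -> Long.compare(b[1], a[1]));
        for (long[] sample : samples) {
            ThreadInfo info = mx.getThreadInfo(sample[0]);
            if (info == null) {
                continue; // thread ended between the two calls
            }
            rows.add(String.format("%-40s %10d ms",
                    info.getThreadName(), sample[1] / 1_000_000));
        }
        return rows;
    }

    public static void main(String[] args) {
        for (String row : report()) {
            System.out.println(row);
        }
    }
}
```

The figures are cumulative since each thread started, so taking two reports a few seconds apart and comparing them shows which thread is burning CPU right now.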
I think you should definitely try to set up a test environment with some load testing in order to reproduce your issue. Profiling would definitely help in order to pinpoint the problem.
A quick fix would be, next time, to send JBoss a kill -3 before restarting it in order to get a thread dump to analyze (the signal makes the JVM print the dump; it does not stop the process). The second thing I would check is that you are running with the -server flag and that your GC settings are sane. You could also run dstat to see what the process is doing during the lockup. But again, it is probably safer to just set up a load testing environment (via EC2 or so) to reproduce this.
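One way to check the flags and GC behaviour of a running JVM, without digging through startup scripts, is to ask the JVM itself. The sketch below uses only the standard management beans. The class name JvmSettings is hypothetical, and whether the VM name mentions "Server" depends on the JVM vendor, so treat that line as a hint rather than proof.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class JvmSettings {
    // Summarises the VM name, the startup arguments and per-collector GC totals.
    public static String describe() {
        StringBuilder out = new StringBuilder();
        // On HotSpot the name usually says "Server VM" or "Client VM".
        out.append("VM: ").append(System.getProperty("java.vm.name")).append('\n');
        // The arguments the JVM was started with, e.g. heap and GC options.
        out.append("Args: ")
           .append(ManagementFactory.getRuntimeMXBean().getInputArguments())
           .append('\n');
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            out.append("GC ").append(gc.getName())
               .append(": ").append(gc.getCollectionCount()).append(" collections, ")
               .append(gc.getCollectionTime()).append(" ms total\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe());
    }
}
```

A collection time that climbs steeply between two snapshots taken during a lockup would point at the garbage collector rather than at application threads.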