Question
We have a Java EE app running on WebLogic against an Oracle 11g DB, using the thin JDBC driver. Recently we had a series of incidents in production where updates and inserts into a certain table got stuck or took much longer than normal, for no apparent reason. This caused the application to use more and more DB connections (normally idle in the connection pool), the DB CPU and concurrency shot up (as seen in OEM), and the whole DB ground to a halt. During these incidents the DBAs could not find any reason for the inserts and updates to be stuck (no DB locks). What they did see were a lot of "SQL*Net message from client" wait events.
Their theory is that the app (the JDBC client) somehow got stuck during insert/update statements, for a reason unrelated to the DB, without acknowledging the DB's responses to those statements. The fact that the app kept issuing more and more of these statements, tying up more and more connections, is what made the CPU and concurrency shoot up and left the DB unresponsive.
I'm not convinced: if all the sessions were busy waiting for clients, how come the CPU was so high? We weren't able to reproduce these incidents consistently, so we are really in the dark here...
Has anyone seen anything like this, or any ideas or suggestions as to what might cause it?
Thanks
Answer 1:
What you're describing is a "connection storm". A badly configured connection pool "handles" slowly responding connections by opening new connections to service the waiting requests. These additional connections place further strain on a server that is already stressed (if it weren't stressed, the initial connections wouldn't be lagging). This starts a cycle in which poor response times spawn additional connections, which eventually kill the server.
You can avoid the connection storm by setting the Maximum Capacity of the data source to something reasonable. The definition of "reasonable" will vary according to the capabilities of your servers, but it is probably lower than you think. The best advice is to set the Maximum Capacity to the same value as Initial Capacity.
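By way of illustration (the data source name and pool sizes below are invented, and exact descriptor element names can differ between WebLogic releases), the connection-pool section of a jdbc-data-source module descriptor would then look roughly like this, with max-capacity pinned to the same value as initial-capacity:

<!-- Illustrative fragment of a WebLogic JDBC data source descriptor,
     e.g. config/jdbc/MyAppDS-jdbc.xml. Name and sizes are examples only. -->
<jdbc-data-source>
  <name>MyAppDS</name>
  <jdbc-connection-pool-params>
    <initial-capacity>15</initial-capacity>
    <!-- Same as initial-capacity: the pool never grows under load,
         so a slow database cannot trigger a flood of new connections. -->
    <max-capacity>15</max-capacity>
  </jdbc-connection-pool-params>
</jdbc-data-source>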
Once you prevent the Connection Storm you can focus on the database process(es) which cause the initial slowdown.
The high number of "SQL*Net message from client" wait events indicates that the client is doing something without contacting the database. That is why your DBAs reason that the problem lies with the app.
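If you want to watch this during an incident, a query along these lines against V$SESSION (it needs access to the dynamic performance views) shows which sessions are sitting in that wait and for how long; adjust the filters to your environment:

-- Sessions currently waiting on the client rather than working in the DB.
-- For these sessions Oracle has nothing to do until the JDBC client
-- reads the result / sends the next request.
SELECT sid,
       serial#,
       username,
       status,
       event,
       seconds_in_wait,
       sql_id,
       prev_sql_id
FROM   v$session
WHERE  type = 'USER'
AND    event = 'SQL*Net message from client'
ORDER  BY seconds_in_wait DESC;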
Answer 2:
I've encountered a similar issue, which I've documented here: Unkillable Oracle session waiting on "SQL*Net message from client" event. In my case, the problem was caused by a bind variable of type CLOB that was bound in a place where CLOBs seem to cause severe issues in Oracle. The following statement reproduces the behaviour you've observed:
CREATE TABLE t (
  v INT,
  s VARCHAR2(400 CHAR)
);

var v_s varchar2(50)
exec :v_s := 'abc'

MERGE INTO t
USING (
  SELECT
    1 v,
    CAST(:v_s AS CLOB) s
  FROM DUAL
) s
ON (t.s = s.s) -- Using a CLOB here causes the bug.
WHEN MATCHED THEN UPDATE SET
  t.v = s.v
WHEN NOT MATCHED THEN INSERT (v, s)
VALUES (s.v, s.s);
There are probably other statements besides MERGE that expose this behaviour and produce zombie sessions as well, as Oracle seems to run some infinite loop that generates the observed CPU load.
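For what it's worth, if the bound value always fits in a VARCHAR2 (as it does in the example above), one way to sidestep the problem, assuming the trigger really is the CLOB comparison in the join predicate, is to keep the bind out of CLOB territory altogether. A sketch of that variant:

MERGE INTO t
USING (
  SELECT
    1 v,
    -- The bind stays a VARCHAR2; no CLOB ever reaches the join predicate.
    CAST(:v_s AS VARCHAR2(400 CHAR)) s
  FROM DUAL
) s
ON (t.s = s.s)
WHEN MATCHED THEN UPDATE SET
  t.v = s.v
WHEN NOT MATCHED THEN INSERT (v, s)
VALUES (s.v, s.s);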
Source: https://stackoverflow.com/questions/14043668/oracle-updates-inserts-stuck-db-cpu-at-100-concurrency-high-sqlnet-wait-mes