Question
I'm running a three-node Galera Cluster under MariaDB. The application is in PHP using the mysqli extension.
Very occasionally I get a deadlock on write. I'm working on improving my application to handle or avoid that kind of failure, but in the meantime I need the cluster to stay up when this happens.
The problem is that as soon as the deadlock occurs, not just one but all three nodes in the cluster crash. The node where the deadlock originates suffers the "MySQL server has gone away" error and, once max_connect_errors is reached, starts refusing connections permanently, requiring a manual server restart.
What I don't get is why the other nodes go down too. They both start erroring with "WSREP has not yet prepared node for application use" which means the entire application crashes with no database nodes accepting connections.
How can I ensure that the rest of the cluster stays up when one node suffers a deadlock, however rare?
Update:
A month later, another deadlock caused a similar problem. Again, one node brought down everything.
The first connection gets a deadlock (at commit phase) so the application tries to replay the transaction. This hangs for almost a minute and fails again.
After the first connection fails to recover, all other connections start failing with (1205) "Lock wait timeout exceeded", rendering the entire cluster useless.
I should add that the application does not use explicit locks. However it got itself tied in a knot, it did so with ordinary transactional queries only.
Answer 1:
I'm answering my own question as I've managed to avoid crashes. However, I still have problems with secondary errors and have started a new thread with the specifics.
My recovery code now handles secondary errors differently. It will retry deadlocks a couple of times, but only while the error is a deadlock. If any other type of error occurs the application will give up.
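The retry policy described above can be sketched roughly as follows. This is not my actual PHP code: it is a minimal, driver-agnostic illustration in Python, and the names `DbError` and `run_with_deadlock_retry` are hypothetical. It assumes the driver surfaces the server error number (1213 for a deadlock, 1205 for the lock wait timeout mentioned in the question) and that the transaction callable is safe to run again from the start.

```python
import time

ER_LOCK_DEADLOCK = 1213  # MySQL/MariaDB server error number for a deadlock


class DbError(Exception):
    """Hypothetical stand-in for a driver exception that carries the server error number."""

    def __init__(self, errno, message=""):
        super().__init__(message)
        self.errno = errno


def run_with_deadlock_retry(transaction, max_retries=2, delay=0.0):
    """Run transaction(); retry only while the failure is a deadlock.

    Any other error (for example 1205, lock wait timeout) is re-raised
    immediately, as is a deadlock once max_retries has been used up.
    """
    attempt = 0
    while True:
        try:
            return transaction()
        except DbError as exc:
            if exc.errno != ER_LOCK_DEADLOCK or attempt >= max_retries:
                raise  # give up: not a deadlock, or out of retries
            attempt += 1
            time.sleep(delay)  # optional pause before replaying
```

With mysqli, the equivalent check would be made against the connection's `errno` property (or the exception code, if exception reporting is enabled) before deciding whether to replay the transaction.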
Although this means disappointed users receiving errors, I haven't had a cluster crash since this change and haven't seen the dreaded "server gone away" error.
Source: https://stackoverflow.com/questions/44659105/how-to-stop-a-deadlock-on-one-node-from-crashing-entire-cluster