How to stop a deadlock on one node from crashing the entire cluster?

Submitted by 大兔子大兔子 on 2019-12-11 02:22:03

Question


I'm running a three-node Galera Cluster under MariaDB. The application is written in PHP and uses the mysqli extension.

Very occasionally I get a deadlock on write. I'm working on improving my application to handle or avoid that kind of failure, but in the meantime I need the cluster to stay up when this happens.

The problem is that as soon as the deadlock occurs, not just one but all three nodes in the cluster crash. The node where the deadlock originates suffers the "MySQL server has gone away" error and, once max_connect_errors is reached, starts refusing connections permanently, so it needs a manual server restart.

What I don't get is why the other nodes go down too. They both start erroring with "WSREP has not yet prepared node for application use", which means the entire application crashes because no database node is accepting connections.

How can I ensure that the rest of the cluster stays up when one node suffers a deadlock, however rare?


Update:

A month later, another deadlock caused a similar problem. Again, one node brought down everything.

The first connection gets a deadlock (at the commit phase), so the application tries to replay the transaction. This hangs for almost a minute and fails again.

After the first connection fails to recover, all other connections start failing with (1205) "Lock wait timeout exceeded", rendering the entire cluster useless.

I should add that the application does not take any explicit locks. However it got itself tied in a knot, it did so using only regular transactional queries.


Answer 1:


I'm answering my own question as I've managed to avoid crashes. However, I still have problems with secondary errors and have started a new thread with the specifics.

My recovery code now handles secondary errors differently. It will retry a deadlocked transaction a couple of times, but only while the error is still a deadlock. If any other type of error occurs, the application gives up.
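The retry rule described above can be sketched roughly as follows. This is an illustration only, not the actual recovery code: the application in question is PHP with mysqli, and Python is used here just to show the control flow. `DbError`, `run_transaction`, `max_retries` and `delay` are hypothetical names, and "a couple of times" is assumed to mean two retries. The one server-side fact relied on is that MariaDB/MySQL report a deadlock as error 1213 (1205 is the lock wait timeout).

```python
# Sketch only: DbError and run_transaction are hypothetical stand-ins
# for whatever the real database driver provides.
import time

ER_LOCK_DEADLOCK = 1213  # MariaDB/MySQL error number for a deadlock


class DbError(Exception):
    """Hypothetical driver error carrying the server error number."""

    def __init__(self, errno, message=""):
        super().__init__(message)
        self.errno = errno


def run_with_deadlock_retry(run_transaction, max_retries=2, delay=0.0):
    """Call run_transaction(); retry only while the failure is a deadlock.

    Any other error, or a deadlock once the retries are used up,
    is re-raised so the caller can give up and report it.
    """
    attempt = 0
    while True:
        try:
            return run_transaction()
        except DbError as err:
            # Give up immediately on anything that is not a deadlock,
            # and on a deadlock once max_retries has been reached.
            if err.errno != ER_LOCK_DEADLOCK or attempt >= max_retries:
                raise
            attempt += 1
            time.sleep(delay)  # optional pause before replaying
```

With this shape, a secondary error such as 1205 during a replay stops the loop at once instead of being retried, which matches the "only while the error is a deadlock" rule.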

Although this means some disappointed users receive errors, I haven't had a cluster crash since this change and haven't seen the dreaded "server has gone away" error.



Source: https://stackoverflow.com/questions/44659105/how-to-stop-a-deadlock-on-one-node-from-crashing-entire-cluster
