How does Consul recover from losing quorum with changing node IPs?

Posted by 自闭症网瘾萝莉.ら on 2021-01-29 08:46:17

Question


I deployed Consul servers using the Helm chart, which gave me a three-node cluster. I could view the IP addresses and IDs of the nodes:

$ consul catalog nodes
Node             ID        Address     DC
consul-server-0  065ab1e4  10.60.1.11  dc1
consul-server-1  46eca681  10.60.0.16  dc1
consul-server-2  fb5fa37d  10.60.2.8   dc1

As a test, I force-deleted the pods for all three of these nodes as follows:

kubectl delete pods -n consul --force --grace-period=0 consul-server-0 consul-server-1 consul-server-2

Three new pods came up with different IPs but the same IDs, joined the cluster and achieved consensus again:

$ consul catalog nodes
Node             ID        Address     DC
consul-server-0  065ab1e4  10.60.1.12  dc1
consul-server-1  46eca681  10.60.2.9   dc1
consul-server-2  fb5fa37d  10.60.0.17  dc1
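To make the before/after comparison explicit, here is a minimal Python sketch that parses the two `consul catalog nodes` listings above and reports what changed per node. The function names (`parse_catalog`, `diff_catalog`) are made up for illustration, and the parser assumes the simple four-column, whitespace-separated layout shown here.

```python
from typing import Dict, Tuple

BEFORE = """\
Node             ID        Address     DC
consul-server-0  065ab1e4  10.60.1.11  dc1
consul-server-1  46eca681  10.60.0.16  dc1
consul-server-2  fb5fa37d  10.60.2.8   dc1
"""

AFTER = """\
Node             ID        Address     DC
consul-server-0  065ab1e4  10.60.1.12  dc1
consul-server-1  46eca681  10.60.2.9   dc1
consul-server-2  fb5fa37d  10.60.0.17  dc1
"""


def parse_catalog(text: str) -> Dict[str, Tuple[str, str]]:
    """Map node name -> (short node ID, address), skipping the header row."""
    rows = {}
    for line in text.strip().splitlines()[1:]:
        node, node_id, address, _dc = line.split()
        rows[node] = (node_id, address)
    return rows


def diff_catalog(before: str, after: str) -> Dict[str, Tuple[bool, bool]]:
    """Map node name -> (id_unchanged, address_changed) for nodes in both listings."""
    old, new = parse_catalog(before), parse_catalog(after)
    return {
        node: (old[node][0] == new[node][0], old[node][1] != new[node][1])
        for node in old
        if node in new
    }


for node, (same_id, moved) in diff_catalog(BEFORE, AFTER).items():
    print(f"{node}: id_unchanged={same_id} address_changed={moved}")
```

For the listings above, every node keeps its name and ID while its address changes.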

What does Consul rely on to recover from this situation? Can it form quorum again because the IDs are the same, and then work out among the servers that the IPs have changed? Or is it also a requirement for automatic recovery that the node names stay consistent?

I see log messages such as:

consul: removed server with duplicate ID: 46eca681-b5d6-21e7-3df5-cf228ffdd02c

So it seems the changed IP address causes a new node to be added to the cluster, and Consul then works out that the old entry needs to be removed. Because of this, I would expect there to be 6 nodes at one point, with 3 of them unavailable, causing the cluster to lose quorum and be unable to recover automatically, but this does not happen.
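The arithmetic behind that expectation can be sketched in Python. Raft needs a majority of voting servers, i.e. floor(N/2) + 1. The helper names below are hypothetical, and the sketch only illustrates the majority rule; it does not model how Consul actually handles servers that rejoin with a duplicate ID.

```python
def quorum_size(voters: int) -> int:
    """Majority of voting servers required by Raft: floor(n / 2) + 1."""
    return voters // 2 + 1


def has_quorum(voters: int, reachable: int) -> bool:
    """True if enough voters are reachable to form a majority."""
    return reachable >= quorum_size(voters)


# (voters, reachable): healthy 3-node cluster, one server down,
# and the feared case of 6 registered servers with only 3 reachable.
for voters, reachable in [(3, 3), (3, 2), (6, 3)]:
    print(
        f"voters={voters} reachable={reachable} "
        f"needed={quorum_size(voters)} quorum={has_quorum(voters, reachable)}"
    )
```

With 6 voters the majority is 4, so 3 reachable servers would not be enough if the stale entries still counted as voters.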


Answer 1:


We also run Consul in Docker Swarm, and recovery after a failure is not a trivial problem, because a failed server is recreated in a new container, obviously with a different IP. Consul emits a lot of errors and messages about Raft, but I have not seen a serious problem result from it. I just filter out this kind of log and do not ship it to the long-lived indexes in Elasticsearch.
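As a rough illustration of that kind of filtering, here is a minimal Python sketch that drops noisy lines before they are shipped anywhere. The patterns are assumptions: "raft" is a guess at what the noisy lines contain, and the duplicate-ID message is taken from the question. The function name `drop_noisy` is hypothetical, and the actual filtering in a log pipeline would likely be done in the shipper's own configuration.

```python
import re
from typing import Iterable, Iterator

# Assumed patterns for recovery-related noise; adjust to the real log output.
NOISY = re.compile(r"raft|removed server with duplicate ID", re.IGNORECASE)


def drop_noisy(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the log lines that do not match the noisy patterns."""
    for line in lines:
        if not NOISY.search(line):
            yield line


sample = [
    "consul: removed server with duplicate ID: 46eca681-b5d6-21e7-3df5-cf228ffdd02c",
    "some other application log line",  # made-up line for illustration
]
for kept in drop_noisy(sample):
    print(kept)
```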

We use the following config for faster server recovery:

{
  "skip_leave_on_interrupt" : true,
  "leave_on_terminate" : true,
  "disable_update_check": true,
  "autopilot" : {
    "cleanup_dead_servers": true,
    "last_contact_threshold": "1s"
  }
}

You can review parameters here



Source: https://stackoverflow.com/questions/54915806/how-does-consul-recover-from-losing-quorum-with-changing-node-ips
