Question
My two-node cluster got into an inconsistent state. On one node (call it node A), nodetool status correctly showed both nodes, while on the other (call it node B) it showed only one node, namely itself. After several attempts I could not fix the issue, so I decommissioned node B. However, nodetool status on node A still showed node B, and in UN state at that. I had to restart Cassandra on node A to make it forget node B.
But this has led to another problem. I am adding a new node (call it node C) to node A's cluster, and the join is taking hours. It has already been six hours, and I am wondering whether it will ever complete successfully.
The debug logs on node C suggest that node B (the decommissioned one) is causing trouble. Node C constantly logs:
DEBUG [GossipTasks:1] 2017-04-29 12:38:40,004 Gossiper.java:337 - Convicting /10.120.8.53 with status removed - alive false
nodetool status on node A shows node C in the joining state, as expected:
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load         Tokens  Owns  Host ID                               Rack
UJ  10.120.8.113  1006.97 MiB  256     ?     f357d8d0-2379-43d8-8ae5-62224191fb6c  rack1
UN  10.120.8.23   5.29 GiB     256     ?     596260a0-785a-435c-a3f3-632f56c5c882  rack1
The load on node C grows only fractionally every couple of hours.
I checked whether system.peers contains node B, but the table has zero rows.
I am using Cassandra 3.7.
What is going wrong? What can I do to avoid losing data on node A and still scale the cluster?
Answer 1:
Run nodetool netstats on node C and check whether streaming is making progress. Also review nodetool compactionstats: note the number of pending compactions and check whether it goes down over time.
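If you would rather not eyeball the numbers, here is a minimal Python sketch of the "does it go down over time" check. It assumes that nodetool compactionstats prints a line of the form "pending tasks: N" (verify this against your version's output), and the helper names pending_compactions, is_draining and sample_pending are made up for illustration.

```python
import re
import subprocess
from typing import List, Optional


def pending_compactions(output: str) -> Optional[int]:
    """Extract N from a 'pending tasks: N' line (assumed output format)."""
    match = re.search(r"pending tasks:\s*(\d+)", output)
    return int(match.group(1)) if match else None


def is_draining(samples: List[int]) -> bool:
    """True if the pending count never rises and ends lower than it started."""
    if len(samples) < 2:
        return False
    never_rises = all(b <= a for a, b in zip(samples, samples[1:]))
    return never_rises and samples[-1] < samples[0]


def sample_pending() -> Optional[int]:
    """Run nodetool compactionstats once on the local node and parse the count."""
    result = subprocess.run(
        ["nodetool", "compactionstats"], capture_output=True, text=True
    )
    return pending_compactions(result.stdout)
```

Call sample_pending() every few minutes on node C, collect the values in a list, and pass the list to is_draining(). A count that stays flat or keeps growing for hours suggests the node is stuck rather than slow.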
If the bootstrapping failed, try restarting the node.
As an alternative, you can remove node C and add it again with the auto_bootstrap setting set to false. Once the node is up, run nodetool rebuild, followed by nodetool repair when the rebuild finishes. This should be faster than a regular bootstrap.
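The order of the post-startup steps matters, so here is a rough Python sketch that runs them in sequence and stops at the first failure. It assumes node C has already been restarted with auto_bootstrap: false in cassandra.yaml and that nodetool is on the PATH; STEPS and run_steps are illustrative names, and you may need to add keyspace or datacenter arguments for your setup.

```python
import subprocess
from typing import Callable, List, Optional

# Steps from the answer, in order. Run on node C only after it has
# come up with auto_bootstrap: false set in cassandra.yaml.
STEPS: List[List[str]] = [
    ["nodetool", "rebuild"],
    ["nodetool", "repair"],
]


def run_steps(
    steps: List[List[str]],
    runner: Optional[Callable[[List[str]], int]] = None,
) -> List[List[str]]:
    """Run each command in order; stop at the first non-zero exit code.

    Returns the list of commands that completed successfully.
    """
    if runner is None:
        # Default runner executes the command and returns its exit code.
        runner = lambda cmd: subprocess.run(cmd).returncode
    completed: List[List[str]] = []
    for cmd in steps:
        if runner(cmd) != 0:
            break
        completed.append(cmd)
    return completed
```

Calling run_steps(STEPS) on node C runs the rebuild first and starts the repair only if the rebuild exited cleanly. The runner parameter exists only so the sequence can be dry-run without touching the cluster.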
Source: https://stackoverflow.com/questions/43696127/cassandra-node-is-taking-hours-to-join