Cassandra node can't complete joining operation

Question

I'm trying to add a new node to an existing C* 2.1.11 cluster. The node appears to have completed the streaming phase of the bootstrap, but it has not moved out of the JOINING state and I can't find an explanation why. The Cassandra logs on all nodes show no errors during the streaming process.

nodetool status, run on any node, reports the new node as UJ, and its load is greater than that of the other nodes:

# nodetool status
Datacenter: us-east-vpc
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns    Host ID                               Rack
UN  xx.xx.xx.78   564.96 GB  256     ?       xxxx-f3c7d9d40e92  1d
UN  xx.xx.xx.110  534.63 GB  256     ?       xxxx-9419faa478ca  1a
UN  xx.xx.xx.171  557.13 GB  256     ?       xxxx-7a5b2723e438  1a
UN  xx.xx.xx.203  406.98 GB  256     ?       xxxx-1331d9c44992  1a
UN  xx.xx.xx.26   579.55 GB  256     ?       xxxx-88b202a8cedc  1c
UN  xx.xx.xx.122  603.39 GB  256     ?       xxxx-b0b81ebabeb2  1d
UN  xx.xx.xx.233  565.3 GB   256     ?       xxxx-a2fa9ad67741  1c
UJ  xx.xx.xx.56   881.91 GB  256     ?       xxxx-9863c7799fad  1d
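
When the same check has to be repeated on every node, the status output can be filtered for nodes that are not Up/Normal. This is only a minimal sketch: the function name non_normal_nodes is hypothetical, and it assumes the column layout shown above, with the two-letter status/state code in the first column and the address in the second.

```shell
# List nodes whose status/state code is not UN (Up/Normal).
# Reads `nodetool status` output on stdin and prints "<code> <address>"
# for every node line that matches; header lines are ignored.
non_normal_nodes() {
    awk '$1 ~ /^[UD][NLJM]$/ && $1 != "UN" { print $1, $2 }'
}

# Usage on a live node:
#   nodetool status | non_normal_nodes
```

Run against the output above, this would print only the line for the joining node.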

nodetool netstats shows no activity on the other nodes. On the new node it still lists the bootstrap session, but with an empty list of files to transfer:

# nodetool netstats
Mode: JOINING
Bootstrap xxxx-8d0c340f238b
    /xx.xx.xx.233
    /xx.xx.xx.122
    /xx.xx.xx.171
    /xx.xx.xx.78
Read Repair Statistics:
Attempted: 0
Mismatch (Blocking): 0
Mismatch (Background): 0
Pool Name                    Active   Pending      Completed
Commands                        n/a         0             50
Responses                       n/a         0          64941

nodetool info throws an error while trying to retrieve the token information:

# nodetool info
ID                     : xxxx-9863c7799fad
Gossip active          : true
Thrift active          : false
Native Transport active: false
Load                   : 881.91 GB
Generation No          : 1475450119
Uptime (seconds)       : 12081
Heap Memory (MB)       : 1480.71 / 1996.00
Off Heap Memory (MB)   : 204.47
Data Center            : us-east-vpc
Rack                   : 1d
Exceptions             : 2
Key Cache              : entries 3262, size 788.43 KB, capacity 99 MB, 43 hits, 3249 requests, 0.013 recent hit rate, 14400 save period in seconds
Row Cache              : entries 0, size 0 bytes, capacity 0 bytes, 0 hits, 0 requests, NaN recent hit rate, 0 save period in seconds
Counter Cache          : entries 0, size 0 bytes, capacity 49 MB, 0 hits, 0 requests, NaN recent hit rate, 7200 save period in seconds
error: null
-- StackTrace --
java.lang.AssertionError
    at org.apache.cassandra.locator.TokenMetadata.getTokens(TokenMetadata.java:474)
    at org.apache.cassandra.service.StorageService.getTokens(StorageService.java:2263)
    at org.apache.cassandra.service.StorageService.getTokens(StorageService.java:2252)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at sun.reflect.misc.Trampoline.invoke(MethodUtil.java:71)
    at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at sun.reflect.misc.MethodUtil.invoke(MethodUtil.java:275)
    at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:112)
    at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:46)
    at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:237)
    at com.sun.jmx.mbeanserver.PerInterface.getAttribute(PerInterface.java:83)
    at com.sun.jmx.mbeanserver.MBeanSupport.getAttribute(MBeanSupport.java:206)
    at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getAttribute(DefaultMBeanServerInterceptor.java:647)
    at com.sun.jmx.mbeanserver.JmxMBeanServer.getAttribute(JmxMBeanServer.java:678)
    at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1445)
    at javax.management.remote.rmi.RMIConnectionImpl.access$300(RMIConnectionImpl.java:76)
    at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1309)
    at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1401)
    at javax.management.remote.rmi.RMIConnectionImpl.getAttribute(RMIConnectionImpl.java:639)
    at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:324)
    at sun.rmi.transport.Transport$1.run(Transport.java:200)
    at sun.rmi.transport.Transport$1.run(Transport.java:197)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.rmi.transport.Transport.serviceCall(Transport.java:196)
    at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:568)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:826)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda$run$0(TCPTransport.java:683)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:682)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Any help will be greatly appreciated.

EDIT Oct 3: It turned out that the instance was running out of disk space; in the end we got an error saying there was not enough space to complete compactions. The partition was expanded and the /data folder cleared to restart the bootstrap from scratch. With the expanded disk the streaming completed, but the node still can't move from UJ to UN. There are no errors in the logs, nodetool tpstats shows no pending tasks, and nodetool netstats shows no pending activity, with the same bootstrap UUID:

# nodetool netstats
Mode: JOINING
Bootstrap xxxx-8d0c340f238b
    /xx.xx.xx.233
    /xx.xx.xx.122
    /xx.xx.xx.171
    /xx.xx.xx.78
Read Repair Statistics:
Attempted: 0
Mismatch (Blocking): 0
Mismatch (Background): 0
Pool Name                    Active   Pending      Completed
Commands                        n/a         0            130
Responses                       n/a         0         256088

There is still the question of why the load on that node increased.

Answer 1:

As no errors were reported and the streaming process had finished, we assumed that the node was ready to join the cluster.

We added the auto_bootstrap: False directive to the cassandra.yaml file and restarted the service on the node, and it joined the cluster.
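
The configuration change can be scripted along these lines. This is a sketch, not the exact procedure used here: the function name set_auto_bootstrap_false is hypothetical, and the path to cassandra.yaml is passed as an argument because its location depends on the installation.

```shell
# Set auto_bootstrap to false in the given cassandra.yaml, keeping a .bak copy.
set_auto_bootstrap_false() {
    yaml="$1"
    cp "$yaml" "$yaml.bak" || return 1
    if grep -q '^auto_bootstrap:' "$yaml"; then
        # The key is already present: rewrite it in place from the backup.
        sed 's/^auto_bootstrap:.*/auto_bootstrap: false/' "$yaml.bak" > "$yaml"
    else
        # The key is absent: append it on a line of its own.
        printf '\nauto_bootstrap: false\n' >> "$yaml"
    fi
}

# Usage (the path is an assumption, adjust it to your installation):
#   set_auto_bootstrap_false /etc/cassandra/cassandra.yaml
# then restart the Cassandra service on that node.
```

With this setting the node skips the bootstrap streaming phase on startup, so it is only reasonable when the data has already been streamed, as was assumed here.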

After the node joined the cluster, a full repair and a cleanup were executed.
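
The follow-up maintenance might look like the sketch below. The answer does not give the exact commands or say which nodes they were run on, so treat this as an assumption: the wrapper post_join_maintenance and the NODETOOL override are hypothetical, and a plain nodetool repair is assumed to perform a full (non-incremental) repair on 2.1.

```shell
# NODETOOL can be overridden, e.g. NODETOOL="echo nodetool" for a dry run.
NODETOOL="${NODETOOL:-nodetool}"

post_join_maintenance() {
    # Repair the node's data; stop here if the repair fails.
    $NODETOOL repair || return 1
    # Remove data for token ranges this node no longer owns.
    $NODETOOL cleanup
}

# Usage:
#   post_join_maintenance
```

Cleanup is normally most useful on the pre-existing nodes, since they are the ones that handed token ranges over to the new node.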

Source: https://stackoverflow.com/questions/39823972/cassandra-node-cant-complete-joining-operation

Tags

cassandra

cassandra-2.1