Question
Note: We are seeing this issue in our Cassandra 2.1.12.1047 (DSE 4.8.4) cluster with 6 nodes across 3 regions (2 in each region).
Trying to update schemas on our cluster recently, we found the updates were failing. We suspected one node in the cluster was not accepting the change.
When checking the system.peers table on one of our servers in us-east-1, we found an anomaly: it had what seemed to be a complete entry for a host that does not exist.
cassandra@cqlsh> SELECT peer, host_id FROM system.peers WHERE peer IN ('54.158.22.187', '54.196.90.253');
peer | host_id
---------------+--------------------------------------
54.158.22.187 | 8ebb7f2c-8f81-44af-814b-a537b84834e0
As that host did not exist, I tried to remove it using nodetool removenode, but that failed:
error: Cannot remove self
-- StackTrace --
java.lang.UnsupportedOperationException: Cannot remove self
We know that the .187 server was abruptly terminated a few weeks ago due to an EC2 issue.
We made numerous attempts to get the server healthy, but in the end we simply terminated the server that was reporting this .187 host in its system.peers table, ran a nodetool removenode from one of the other servers, and then brought a new server online.
The new server came online, and within an hour or so seemed to have caught up on the backlog of activity needed to bring it in line with the other servers (an assumption based purely on CPU monitoring).
However, things are now very odd, because the .187 host that was reported in the system.peers table appears when we run nodetool status from any server in the cluster other than the new one we just brought online.
$ nodetool status
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
DN 54.158.22.187 ? 256 ? null r1
Datacenter: cassandra-ap-southeast-1-A
======================================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 54.255.xx.xx 7.9 GB 256 ? a0c45f3f-8479-4046-b3c0-b2dd19f07b87 ap-southeast-1a
UN 54.255.xx.xx 8.2 GB 256 ? b91c5863-e1e1-4cb6-b9c1-0f24a33b4baf ap-southeast-1b
Datacenter: cassandra-eu-west-1-A
=================================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 176.34.xx.xxx 8.51 GB 256 ? 30ff8d00-1ab6-4538-9c67-a49e9ad34672 eu-west-1b
UN 54.195.xx.xxx 8.4 GB 256 ? f00dfb85-6099-40fa-9eaa-cf1dce2f0cd7 eu-west-1c
Datacenter: cassandra-us-east-1-A
=================================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 54.225.xx.xxx 8.17 GB 256 ? 0e0adf3d-4666-4aa4-ada7-4716e7c49ace us-east-1e
UN 54.224.xx.xxx 3.66 GB 256 ? 1f9c6bef-e479-49e8-a1ea-b1d0d68257c7 us-east-1d
As there is no way I know of to delete a node that does not have a Host ID, I am quite perplexed.
What can I do to get rid of this rogue node?
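For anyone checking several nodes for the same symptom, the row with a null Host ID in the status output above can be picked out mechanically. This is only a sketch: the function name find_null_host_ids is made up for illustration, and it assumes the 2.1-style column layout shown above, where Host ID is the second-to-last column and Rack is the last.

```shell
# Print the address of every node row whose Host ID column is "null".
# Reads `nodetool status` output on stdin.
# Assumption: Host ID is the second-to-last column and Rack is the last,
# as in the Cassandra 2.1 output shown in the question.
find_null_host_ids() {
  # Node rows start with a two-letter status/state code such as UN or DN;
  # header and separator lines do not match and are skipped.
  awk '$1 ~ /^[UD][NLJM]$/ && $(NF-1) == "null" { print $2 }'
}
```

It would be run as `nodetool status | find_null_host_ids` on each node in turn.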
Note: Here is the output of nodetool describecluster:
$ nodetool describecluster
Cluster Information:
Name: XXX
Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch
Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
Schema versions:
d140bc9b-134c-3dbe-929f-7a84c2cd4532: [54.255.17.28, 176.34.207.151, 54.225.11.249, 54.195.174.72, 54.224.182.94, 54.255.64.1]
UNREACHABLE: [54.158.22.187]
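The UNREACHABLE line above is the quickest confirmation that the other nodes still gossip about the dead endpoint. A small sketch for extracting it follows; the function name unreachable_endpoints is hypothetical, and it assumes the bracketed, comma-separated format shown in the output above.

```shell
# Print one unreachable endpoint per line.
# Reads `nodetool describecluster` output on stdin.
# Assumption: the line looks like "UNREACHABLE: [ip1, ip2, ...]".
unreachable_endpoints() {
  sed -n 's/^[[:space:]]*UNREACHABLE:[[:space:]]*\[\(.*\)\][[:space:]]*$/\1/p' |
    tr -d ' ' |
    tr ',' '\n'
}
```

It would be run as `nodetool describecluster | unreachable_endpoints`; no output means no unreachable endpoints were listed.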
Answer 1:
I've never had to do this myself, but probably the only thing left for you to do is to assassinate the endpoint. This was made into a nodetool command (nodetool assassinate) in Cassandra 2.2. But prior to that version, the only way to do it is via JMX. Here's a Gist with detailed instructions (instructions and code by Justen Walker).
Prerequisites
- Log onto an existing live node in the cluster
- Download JMX Term, using wget
$ wget -q -O jmxterm.jar http://downloads.sourceforge.net/cyclops-group/jmxterm-1.0-alpha-4-uber.jar
or curl
$ curl -s -o jmxterm.jar http://downloads.sourceforge.net/cyclops-group/jmxterm-1.0-alpha-4-uber.jar
- Run jmxterm
$ java -jar ./jmxterm.jar
Welcome to JMX terminal. Type "help" for available commands.
$>
Assassinate node
Example bad node: 10.0.0.100
- Connect to the local cluster
- Select the Gossiper MBean
- Run unsafeAssassinateEndpoint with the IP of the bad node
$>open localhost:7199
#Connection to localhost:7199 is opened
$>bean org.apache.cassandra.net:type=Gossiper
#bean is set to org.apache.cassandra.net:type=Gossiper
$>run unsafeAssassinateEndpoint 10.0.0.100
#calling operation unsafeAssassinateEndpoint of mbean org.apache.cassandra.net:type=Gossiper
#operation returns: null
$>quit
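The interactive session above can probably also be scripted, which helps if the same steps have to be repeated. The sketch below only generates the command text. The function name assassinate_script and the IPv4 sanity check are illustrative additions, not part of the gist, and the JMX address localhost:7199 is carried over from the session above.

```shell
# Emit the jmxterm commands that assassinate one endpoint.
# The function name and the IPv4 check are illustrative, not from the gist.
assassinate_script() {
  ip="$1"
  # Refuse anything that does not look like a dotted-quad IPv4 address,
  # so a typo never reaches unsafeAssassinateEndpoint.
  if ! printf '%s\n' "$ip" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'; then
    echo "usage: assassinate_script <ipv4-address>" >&2
    return 1
  fi
  # Same four commands as the interactive session above.
  printf '%s\n' \
    'open localhost:7199' \
    'bean org.apache.cassandra.net:type=Gossiper' \
    "run unsafeAssassinateEndpoint $ip" \
    'quit'
}
```

Assuming your jmxterm build supports the non-interactive -n option (check `java -jar ./jmxterm.jar -h`), the output could be piped in with `assassinate_script 10.0.0.100 | java -jar ./jmxterm.jar -n`. Review the generated commands first, since assassinating an endpoint is not reversible.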
Update 20160308:
"I've never had to do this myself"
Just had to do this myself. Totally looked up and followed the steps in my own answer, too.
Source: https://stackoverflow.com/questions/35751921/cassandra-host-in-cluster-with-null-id