Question
FYI. We are running this test with Cassandra 2.1.12.1047 | DSE 4.8.4
We have a simple table in Cassandra that has 5,000 rows of data in it. Some time back, as a precaution, we added monitoring on each Cassandra instance to ensure that it has 5,000 rows of data, because our replication factor enforces this, i.e. we keep 2 replicas in every region and we have 6 servers in total in our dev cluster.
CREATE KEYSPACE example WITH replication = {'class': 'NetworkTopologyStrategy', 'ap-southeast-1-A': '2', 'eu-west-1-A': '2', 'us-east-1-A': '2'} AND durable_writes = true;
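For reference, the per-instance monitoring check is essentially just a count at LOCAL_ONE against the table in question (a sketch; the table name health_check_data_consistency and its id column are the ones that appear in the queries further down):
cqlsh> CONSISTENCY LOCAL_ONE;
Consistency level set to LOCAL_ONE.
cqlsh> SELECT count(*) FROM health_check_data_consistency;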
We recently forcibly terminated a server to simulate a failure and brought a new one online to see what would happen. We also removed the old node using nodetool removenode, so that in each region we expected all data to exist on every server.
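The removal itself was the standard removenode sequence (a sketch; the host ID shown is a placeholder for the terminated node's ID, not a real value):
$ nodetool status my_keyspace                 # note the Host ID of the terminated (DN) node
$ nodetool removenode <host-id-of-dead-node>
$ nodetool removenode status                  # optionally, check progress of the removal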
Once the new server came online, it joined the cluster and seemingly started replicating the data. We assumed that, because it was in bootstrap mode, it would be responsible for ensuring it got the data it needed from the cluster. CPU finally dropped after around an hour, and we assumed the replication was complete.
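In hindsight, streaming progress during a bootstrap can be checked directly rather than inferred from CPU, for example:
$ nodetool netstats           # shows active streams to and from this node
$ nodetool compactionstats    # shows pending compactions once streaming finishes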
However, our monitors, which intentionally run LOCAL_ONE queries on each server, reported that all servers had 5,000 rows except the new server that was brought online, which was stuck at around 2,600 rows. We assumed that perhaps it was still replicating, so we left it a while, but it stayed at that number.
So we ran nodetool status to check and got the following:
$ nodetool status my_keyspace
Datacenter: ap-southeast-1-A
======================================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 54.255.17.28 7.9 GB 256 100.0% a0c45f3f-8479-4046-b3c0-b2dd19f07b87 ap-southeast-1a
UN 54.255.64.1 8.2 GB 256 100.0% b91c5863-e1e1-4cb6-b9c1-0f24a33b4baf ap-southeast-1b
Datacenter: eu-west-1-A
=================================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 176.34.207.151 8.51 GB 256 100.0% 30ff8d00-1ab6-4538-9c67-a49e9ad34672 eu-west-1b
UN 54.195.174.72 8.4 GB 256 100.0% f00dfb85-6099-40fa-9eaa-cf1dce2f0cd7 eu-west-1c
Datacenter: us-east-1-A
=================================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 54.225.11.249 8.17 GB 256 100.0% 0e0adf3d-4666-4aa4-ada7-4716e7c49ace us-east-1e
UN 54.224.182.94 3.66 GB 256 100.0% 1f9c6bef-e479-49e8-a1ea-b1d0d68257c7 us-east-1d
So if the server is reporting that it owns 100% of the data, why is the LOCAL_ONE query only giving us roughly half the data?
When I ran a LOCAL_QUORUM query it returned 5,000 rows, and from that point forwards it returned 5,000 even for LOCAL_ONE queries.
Whilst LOCAL_QUORUM solved the problem in this instance, we may in future need to do other types of queries on the assumption that each server a) has the data it should have, and b) knows how to satisfy queries when it does not have the data locally, i.e. it knows that the data sits somewhere else on the ring.
FURTHER UPDATE 24 hours later - PROBLEM IS A LOT WORSE
So in the absence of any feedback on this issue, I have proceeded to experiment with this on the cluster by adding more nodes. According to https://docs.datastax.com/en/cassandra/1.2/cassandra/operations/ops_add_node_to_cluster_t.html, I have followed all the steps recommended to add nodes to the cluster and, in effect, add capacity; the steps I followed are sketched below. I believe the premise of Cassandra is that as you add nodes, it is the cluster's responsibility to rebalance the data and, during that time, to fetch the data from wherever it currently sits on the ring if it is not where it should be.
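In outline, the procedure per new node was as follows (a sketch paraphrasing the linked documentation; the service name is a placeholder for however Cassandra/DSE is started in your environment):
$ # in cassandra.yaml on the new node: cluster_name matching the cluster, existing nodes (never the new
$ # node itself) listed as seeds, the same snitch as the rest of the cluster, and auto_bootstrap left at
$ # its default of true so the node streams the ranges it takes ownership of
$ sudo service dse start                  # start the new node and let it bootstrap
$ nodetool status my_keyspace             # wait until the new node shows as UN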
Unfortunately, that rebalancing is not what happened at all. Here is my new ring:
Datacenter: ap-southeast-1-A
======================================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 54.255.xxx.xxx 8.06 GB 256 50.8% a0c45f3f-8479-4046-b3c0-b2dd19f07b87 ap-southeast-1a
UN 54.254.xxx.xxx 2.04 MB 256 49.2% e2e2fa97-80a0-4768-a2aa-2b63e2ab1577 ap-southeast-1a
UN 54.169.xxx.xxx 1.88 MB 256 47.4% bcfc2ff0-67ab-4e6e-9b18-77b87f6b3df3 ap-southeast-1b
UN 54.255.xxx.xxx 8.29 GB 256 52.6% b91c5863-e1e1-4cb6-b9c1-0f24a33b4baf ap-southeast-1b
Datacenter: eu-west-1-A
=================================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 54.78.xxx.xxx 8.3 GB 256 49.9% 30ff8d00-1ab6-4538-9c67-a49e9ad34672 eu-west-1b
UN 54.195.xxx.xxx 8.54 GB 256 50.7% f00dfb85-6099-40fa-9eaa-cf1dce2f0cd7 eu-west-1c
UN 54.194.xxx.xxx 5.3 MB 256 49.3% 3789e2cc-032d-4b26-bff9-b2ee71ee41a0 eu-west-1c
UN 54.229.xxx.xxx 5.2 MB 256 50.1% 34811c15-de8f-4b12-98e7-0b4721e7ddfa eu-west-1b
Datacenter: us-east-1-A
=================================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 54.152.xxx.xxx 5.27 MB 256 47.4% a562226a-c9f2-474f-9b86-46c3d2d3b212 us-east-1d
UN 54.225.xxx.xxx 8.32 GB 256 50.3% 0e0adf3d-4666-4aa4-ada7-4716e7c49ace us-east-1e
UN 52.91.xxx.xxx 5.28 MB 256 49.7% 524320ba-b8be-494a-a9ce-c44c90555c51 us-east-1e
UN 54.224.xxx.xxx 3.85 GB 256 52.6% 1f9c6bef-e479-49e8-a1ea-b1d0d68257c7 us-east-1d
As you will see, I have doubled the size of the ring and the effective ownership is roughly 50% per server, as expected (my replication factor is 2 copies in every region). However, worryingly, you can see that some servers have absolutely no load on them (they are new), whilst others have excessive load on them (they are old and clearly no distribution of data has occurred).
Now this in itself is not the worry as I believe in the powers of Cassandra and its ability to eventually get the data in the right place. The thing that worries me immensely is that my table with exactly 5,000 rows now no longer has 5,000 rows in any of my three regions.
# From ap-southeast-1
cqlsh> CONSISTENCY ONE;
Consistency level set to ONE.
cqlsh> select count(*) from health_check_data_consistency;
count
-------
3891
cqlsh> CONSISTENCY LOCAL_QUORUM;
Consistency level set to LOCAL_QUORUM.
cqlsh> select count(*) from health_check_data_consistency;
count
-------
4633
# From eu-west-1
cqlsh> CONSISTENCY ONE;
Consistency level set to ONE.
cqlsh> select count(*) from health_check_data_consistency;
count
-------
1975
cqlsh> CONSISTENCY LOCAL_QUORUM;
Consistency level set to LOCAL_QUORUM.
cqlsh> select count(*) from health_check_data_consistency;
count
-------
4209
# From us-east-1
cqlsh> CONSISTENCY ONE;
Consistency level set to ONE.
cqlsh> select count(*) from health_check_data_consistency;
count
-------
4435
cqlsh> CONSISTENCY LOCAL_QUORUM;
Consistency level set to LOCAL_QUORUM.
cqlsh> select count(*) from health_check_data_consistency;
count
-------
4870
So seriously, what is going on here? Let's recap:
- My replication factor is 'ap-southeast-1-A': '2', 'eu-west-1-A': '2', 'us-east-1-A': '2', so every region should be able to satisfy a query in full.
- Bringing on new instances should not cause me to have data loss, yet apparently we do, even with LOCAL_QUORUM.
- Every region has a different view of the data, yet I have not introduced any new data, only new servers that then bootstrap automatically.
So then I thought, why not do a QUORUM query across the entire multi-region cluster? Unfortunately that fails completely:
cqlsh> CONSISTENCY QUORUM;
Consistency level set to QUORUM.
cqlsh> select count(*) from health_check_data_consistency;
OperationTimedOut: errors={}, last_host=172.17.0.2
I then turned TRACING ON; and that failed too. All I can see in the logs is the following:
INFO [SlabPoolCleaner] 2016-03-03 19:16:16,616 ColumnFamilyStore.java:1197 - Flushing largest CFS(Keyspace='system_traces', ColumnFamily='events') to free up room. Used total: 0.33/0.00, live: 0.33/0.00, flushing: 0.00/0.00, this: 0.02/0.02
INFO [SlabPoolCleaner] 2016-03-03 19:16:16,617 ColumnFamilyStore.java:905 - Enqueuing flush of events: 5624218 (2%) on-heap, 0 (0%) off-heap
INFO [MemtableFlushWriter:1126] 2016-03-03 19:16:16,617 Memtable.java:347 - Writing Memtable-events@732346653(1.102MiB serialized bytes, 25630 ops, 2%/0% of on/off-heap limit)
INFO [MemtableFlushWriter:1126] 2016-03-03 19:16:16,821 Memtable.java:382 - Completed flushing /var/lib/cassandra/data/system_traces/events/system_traces-events-tmp-ka-3-Data.db (298.327KiB) for commitlog position ReplayPosition(segmentId=1456854950580, position=28100666)
INFO [ScheduledTasks:1] 2016-03-03 19:16:21,210 MessagingService.java:929 - _TRACE messages were dropped in last 5000 ms: 212 for internal timeout and 0 for cross node timeout
This happens on every single server I run the query on.
Checking the cluster, it seems everything is in sync:
$ nodetool describecluster;
Cluster Information:
Name: Ably
Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch
Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
Schema versions:
51e57d47-8870-31ca-a2cd-3d854e449687: [54.78.xxx.xxx, 54.152.xxx.xxx, 54.254.xxx.xxx, 54.255.xxx.xxx, 54.195.xxx.xxx, 54.194.xxx.xxx, 54.225.xxx.xxx, 52.91.xxx.xxx, 54.229.xxx.xxx, 54.169.xxx.xxx, 54.224.xxx.xxx, 54.255.xxx.xxx]
FURTHER UPDATE 1 hour later
Someone suggested that perhaps this was simply down to range queries not working as expected. I thus wrote a simple script that queries each of the 5,000 rows individually (they have IDs in the range 1 to 5,000). Unfortunately the results are as I feared: I have missing data. I have tried this with LOCAL_ONE, LOCAL_QUORUM and even QUORUM.
ruby> (1..5000).each { |id| puts "#{id} missing" if session.execute("select id from health_check_data_consistency where id = #{id}", consistency: :local_quorum).length == 0 }
19 missing, 61 missing, 84 missing, 153 missing, 157 missing, 178 missing, 248 missing, 258 missing, 323 missing, 354 missing, 385 missing, 516 missing, 538 missing, 676 missing, 708 missing, 727 missing, 731 missing, 761 missing, 863 missing, 956 missing, 1006 missing, 1102 missing, 1121 missing, 1161 missing, 1369 missing, 1407 missing, 1412 missing, 1500 missing, 1529 missing, 1597 missing, 1861 missing, 1907 missing, 2005 missing, 2168 missing, 2207 missing, 2210 missing, 2275 missing, 2281 missing, 2379 missing, 2410 missing, 2469 missing, 2672 missing, 2726 missing, 2757 missing, 2815 missing, 2877 missing, 2967 missing, 3049 missing, 3070 missing, 3123 missing, 3161 missing, 3235 missing, 3343 missing, 3529 missing, 3533 missing, 3830 missing, 4016 missing, 4030 missing, 4084 missing, 4118 missing, 4217 missing, 4225 missing, 4260 missing, 4292 missing, 4313 missing, 4337 missing, 4399 missing, 4596 missing, 4632 missing, 4709 missing, 4786 missing, 4886 missing, 4934 missing, 4938 missing, 4942 missing, 5000 missing
As you can see from the above, roughly 1.5% of my data is no longer available.
So I am stumped. I really need some advice here because I was certainly under the impression that Cassandra was specifically designed to handle scaling out horizontally on demand. Any help greatly appreciated.
Answer 1:
Regarding ownership: this is based on token ownership, not actual data, so the reported ownership in each case looks correct regardless of the data volume on each node.
Second, you can't guarantee consistency with two nodes (unless you sacrifice availability and use CL=ALL). QUORUM means a majority of replicas, and you need at least three nodes per DC to truly get a quorum. If consistency is important to you, deploy three nodes per DC and do QUORUM reads and writes.
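To put numbers on that (using the standard formula quorum = floor(RF / 2) + 1, applied to the replication settings from the keyspace above):
QUORUM (RF summed across all three DCs = 6): floor(6 / 2) + 1 = 4 replicas must respond
LOCAL_QUORUM (RF in the local DC = 2): floor(2 / 2) + 1 = 2 replicas must respond, i.e. both local copies
So with only two replicas per DC, a LOCAL_QUORUM read effectively behaves like ALL within that datacenter.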
SELECT count(*) across DCs is going to time out. There's probably several hundred milliseconds of latency between your US and AP datacenters, and SELECT count(*) is an expensive operation on top of that.
When you do a QUORUM read, Cassandra is going to fix inconsistent data with a read repair. That's why your counts are accurate after you run the query at QUORUM.
All that being said, you do seem to have a bootstrap problem because new nodes aren’t getting all of the data. First I’d run a repair on all the nodes and make sure they all have 5,000 records after doing so. That’ll let you know streaming isn’t broken. Then repeat the node replace like you did before. This time monitor with nodetool netstats and watch the logs. Post anything strange. And don’t forget you have to run nodetool cleanup to remove data from the old nodes.
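A minimal sketch of that sequence (the keyspace name is the one from the CREATE KEYSPACE statement above; run the repair on each node in turn):
$ nodetool repair example      # full repair of the keyspace, run on every node
$ nodetool netstats            # on the replacement node, watch streaming progress while it bootstraps
$ nodetool cleanup example     # afterwards, on each pre-existing node, remove data it no longer owns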
Can you describe your hardware config (RAM, CPU, disk, etc.)?
Answer 2:
What I should have said is that you can't guarantee consistency AND availability, since your quorum query is essentially an ALL query. The only way to query when one of the nodes is down would be to lower the CL, and that won't do a read repair if the data on the available node is inconsistent.
After running repair you also need to run cleanup on the old nodes to remove the data they no longer own. Also, repair won't remove deleted/TTLd data until after the gc_grace_seconds period. So if you have any of that, it'll stick around for at least gc_grace_seconds.
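For reference, the current gc_grace_seconds shows up in the table options and can be changed with an ALTER if needed (a sketch, assuming the table lives in the example keyspace from the question; 864000 seconds, i.e. 10 days, is the default):
cqlsh> DESCRIBE TABLE example.health_check_data_consistency;
cqlsh> ALTER TABLE example.health_check_data_consistency WITH gc_grace_seconds = 864000;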
Did you find anything in the logs? Can you share your configuration?
Source: https://stackoverflow.com/questions/35752291/local-one-and-unexpected-data-replication-with-cassandra