Question
I have a very large Cassandra table containing over 1 billion records. Its primary key is (partition_id, cluster_id1, cluster_id2). For several particular partition_id values there are so many records that I can't run a row count on those partition keys without a timeout exception being raised.
What I ran in cqlsh is:
SELECT count(*) FROM relation WHERE partition_id='some_huge_partition';
I got this exception:
ReadTimeout: Error from server: code=1200 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out - received only 0 responses." info={'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'}
I tried setting --connect-timeout and --request-timeout, with no luck. I counted the same data in Elasticsearch; the row count for that partition is approximately 30 million.
My Cassandra is 3.11.2 and cqlsh is 5.0.1. The cluster contains 3 nodes, each with more than 1 TB of HDD (fairly old servers, more than 8 years old).
So in short, my questions are:
- How can I count it? Is it even possible to count a huge partition in Cassandra?
- Can I use the COPY TO command with the partition key as its filter, so I can count the rows in the exported CSV file?
- Is there a way to monitor the insert process before any partition grows too huge?
Big thanks in advance.
Answer 1:
Yes, working with large partitions is difficult with Cassandra. There really isn't a good way to monitor particular partition sizes, although Cassandra will warn about writing large partitions in your system.log. Unbounded partition growth is something you need to address when designing your table, and it involves adding an additional (usually time-based) partition key derived from an understanding of your business use case.
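For illustration only, here is a rough sketch of what such a bucketed table might look like for the schema in the question; the month_bucket column, its format, and the text column types are assumptions, not something from the original post:
CREATE TABLE relation_by_month (
    partition_id text,
    month_bucket text,   -- e.g. '2018-08', computed by the application at write time (assumed convention)
    cluster_id1 text,
    cluster_id2 text,
    PRIMARY KEY ((partition_id, month_bucket), cluster_id1, cluster_id2)
);
Each (partition_id, month_bucket) pair then forms its own, much smaller partition, so no single partition keeps growing without bound.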
The answer here is that you may be able to export the data in the partition using the COPY command. To keep it from timing out, you'll want to use the PAGESIZE and PAGETIMEOUT options, kind of like this:
COPY products TO '/home/aploetz/products.txt'
WITH DELIMITER='|' AND HEADER=true
AND PAGETIMEOUT=40 AND PAGESIZE=20;
That will export the products table to a pipe-delimited file, with a header, at a page size of 20 rows at a time and with a 40-second timeout for each page fetch.
If you still get timeouts, try decreasing PAGESIZE and/or increasing PAGETIMEOUT.
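If the goal is just the row count of the exported data, a minimal sketch in Python (assuming the export path from the example above and that HEADER=true was used) could be:
# Count the data rows in the exported file, subtracting the header line
with open('/home/aploetz/products.txt') as f:
    row_count = sum(1 for _ in f) - 1
print(row_count)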
Answer 2:
I've found that with Spark and the awesome Spark Cassandra Connector library, I can finally count a large table without encountering any of the timeout limitations. The Python Spark code is like this:
# Load the Cassandra table as a DataFrame via the Spark Cassandra Connector
tbl_user_activity = sqlContext.read.format("org.apache.spark.sql.cassandra").options(keyspace='ks1', table='user_activity').load()
# Count only the rows of one partition (filter on the partition key)
tbl_user_activity.where('id = 1').count()
It will run for a while but in the end it works.
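The snippet above assumes a pyspark shell where sqlContext already exists and the Spark Cassandra Connector is on the classpath. A minimal standalone sketch, with an assumed contact point (127.0.0.1) and the same keyspace, table, and filter as above, might look like this:
from pyspark.sql import SparkSession

# Build a session pointed at the Cassandra cluster (contact point is an assumption)
spark = (SparkSession.builder
         .appName("count-huge-partition")
         .config("spark.cassandra.connection.host", "127.0.0.1")
         .getOrCreate())

# Load the table as a DataFrame via the connector
tbl_user_activity = (spark.read.format("org.apache.spark.sql.cassandra")
                     .options(keyspace='ks1', table='user_activity')
                     .load())

# Count only the rows of the single large partition
print(tbl_user_activity.where('id = 1').count())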
Source: https://stackoverflow.com/questions/51744943/is-there-a-way-to-effectively-count-rows-of-a-very-huge-partition-in-cassandra