Is there a way to effectively count rows of a very huge partition in Cassandra?

Submitted by 删除回忆录丶 on 2020-01-05 04:07:12

Question


I have a very large Cassandra table containing over 1 billion records. My primary key is of the form (partition_id, cluster_id1, cluster_id2). For several particular partition_id values there are so many records that I can't run a row count on those partition keys without a timeout exception being raised.

What I ran in cqlsh is:

SELECT count(*) FROM relation WHERE partition_id='some_huge_partition';

I got this exception:

ReadTimeout: Error from server: code=1200 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out - received only 0 responses." info={'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'}

I tried setting --connect-timeout and --request-timeout, but no luck. I counted the same data in Elasticsearch, and the row count for the same partition is approximately 30 million.

My Cassandra is 3.11.2 and cqlsh is 5.0.1. The Cassandra cluster contains 3 nodes, each with more than 1 TB of HDD (fairly old servers, more than 8 years old).

So in short, my questions are:

  1. How can I count it? Is it even possible to count a huge partition in Cassandra?
  2. Can I use the COPY TO command with the partition key as its filter, so I can count the rows in the exported CSV file?
  3. Is there a way to monitor the insert process before any partition gets too large?

Thanks in advance.


Answer 1:


Yes, working with large partitions is difficult with Cassandra. There really isn't a good way to monitor the sizes of particular partitions, although Cassandra will warn about writing large partitions in your system.log. Unbounded partition growth is something you need to address when designing the table, and it usually means adding an additional (often time-based) partition key derived from an understanding of your business use case.
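For illustration, here is a minimal sketch of what a time-bucketed version of a table like the one in the question could look like, using the Python cassandra-driver. The relation_bucketed table name, the month_bucket column, the column types, and the contact point are all assumptions for the example, not part of the original schema:

from datetime import datetime, timezone
from cassandra.cluster import Cluster

# Hypothetical setup: connect to a local node and a keyspace named 'ks1'.
cluster = Cluster(['127.0.0.1'])
session = cluster.connect('ks1')

# Adding a month bucket to the partition key caps how large any single
# partition can grow; 'relation_bucketed' and 'month_bucket' are made-up names.
session.execute("""
    CREATE TABLE IF NOT EXISTS relation_bucketed (
        partition_id text,
        month_bucket text,
        cluster_id1 text,
        cluster_id2 text,
        PRIMARY KEY ((partition_id, month_bucket), cluster_id1, cluster_id2)
    )
""")

# The bucket is derived from the write time, e.g. '2020-01'.
month_bucket = datetime.now(timezone.utc).strftime('%Y-%m')
session.execute(
    "INSERT INTO relation_bucketed (partition_id, month_bucket, cluster_id1, cluster_id2) "
    "VALUES (%s, %s, %s, %s)",
    ('some_huge_partition', month_bucket, 'c1', 'c2'))

With a scheme like this, a count for one logical partition becomes a sum of counts over its buckets, each of which stays small enough to query without timing out.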

As for counting it, you may be able to export the data in the partition using the COPY command. To keep it from timing out, use the PAGESIZE and PAGETIMEOUT options, something like this:

COPY products TO '/home/aploetz/products.txt'
  WITH DELIMITER='|' AND HEADER=true
  AND PAGETIMEOUT=40 AND PAGESIZE=20;

That will export the products table to a pipe-delimited file, with a header, fetching 20 rows per page with a 40-second timeout for each page fetch.

If you still get timeouts, try decreasing PAGESIZE and/or increasing PAGETIMEOUT.
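If that approach is applied to the relation table from the question, the per-partition count can then be done outside Cassandra on the exported file. A minimal sketch, assuming relation was exported with the same DELIMITER and HEADER options to a hypothetical /home/aploetz/relation.txt and that the export contains a partition_id column:

import csv

TARGET = 'some_huge_partition'
count = 0

# HEADER=true in the COPY command puts column names on the first line,
# so DictReader can look rows up by column name.
with open('/home/aploetz/relation.txt', newline='') as f:
    reader = csv.DictReader(f, delimiter='|')
    for row in reader:
        if row['partition_id'] == TARGET:
            count += 1

print(count)

As far as I know, COPY TO does not accept a WHERE filter, so the filtering here happens client-side on the exported file rather than being pushed down to Cassandra.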




Answer 2:


I've found that with Spark and the excellent Spark Cassandra Connector library, I can finally count a large table without hitting any of the timeout limitations. The PySpark code looks like this:

# sqlContext is available in a pyspark shell that has the Spark Cassandra Connector on its classpath
tbl_user_activity = sqlContext.read.format("org.apache.spark.sql.cassandra") \
    .options(keyspace='ks1', table='user_activity').load()
tbl_user_activity.where('id = 1').count()

It will run for a while but in the end it works.
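For completeness, here is a more self-contained sketch of the same idea against the relation table from the question. The SparkSession setup, contact point, and keyspace name are assumptions, and the spark-cassandra-connector package is assumed to be on the classpath:

from pyspark.sql import SparkSession

# Hypothetical setup: adjust the contact point and keyspace for your cluster.
spark = (SparkSession.builder
         .appName('count-huge-partition')
         .config('spark.cassandra.connection.host', '127.0.0.1')
         .getOrCreate())

relation = (spark.read
            .format('org.apache.spark.sql.cassandra')
            .options(keyspace='ks1', table='relation')
            .load())

# An equality predicate on the partition key can be pushed down by the
# connector, so only the target partition is read rather than the whole table.
n = relation.where(relation.partition_id == 'some_huge_partition').count()
print(n)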



Source: https://stackoverflow.com/questions/51744943/is-there-a-way-to-effectively-count-rows-of-a-very-huge-partition-in-cassandra
