Counting columns, very slow CountQuery vs SliceQuery operations

Submitted on 2019-12-11 10:56:35

Question


I've written a "census" program that iterates through all the rows in a Column Family and, within each row, counts the columns, recording the maximum column count and the row key it belongs to. I've been spending more time with the Hector client, but have written a Pelops client as well for comparison.

The basic flow is to use a RangeSlicesQuery to iterate through the rows, and then, for each row, a SliceQuery to iterate through the columns and collect the stats. It works similarly in Pelops, just with different APIs. The downside is having to do the buffering manually, picking buffer sizes for both rows and columns. My current data is 12 million rows, with the largest column count around 25K, so it takes a while; in my current configuration I'm getting >25K rows per second.
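
For reference, here is a stripped-down sketch of that flow with Hector. The cluster, keyspace, and column family names, the String serializers, the buffer sizes, and the keys-only outer query are placeholders/assumptions for illustration, not my actual configuration:

    // Census over all rows: page row keys with RangeSlicesQuery, page columns per
    // row with SliceQuery, and track the row with the most columns.
    import java.util.List;

    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.hector.api.Cluster;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.beans.HColumn;
    import me.prettyprint.hector.api.beans.OrderedRows;
    import me.prettyprint.hector.api.beans.Row;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.query.RangeSlicesQuery;
    import me.prettyprint.hector.api.query.SliceQuery;

    public class ColumnCensus {

        private static final int ROW_BUFFER = 1000;  // rows fetched per range slice (placeholder)
        private static final int COL_BUFFER = 1000;  // columns fetched per slice (placeholder)

        public static void main(String[] args) {
            StringSerializer ss = StringSerializer.get();
            Cluster cluster = HFactory.getOrCreateCluster("TestCluster", "localhost:9160");
            Keyspace ks = HFactory.createKeyspace("MyKeyspace", cluster);

            long maxColumns = 0;
            String maxKey = null;

            String startKey = "";
            boolean firstPage = true;
            while (true) {
                // Outer loop: page through the row keys of the column family.
                RangeSlicesQuery<String, String, String> rowQuery =
                        HFactory.createRangeSlicesQuery(ks, ss, ss, ss)
                                .setColumnFamily("MyCF")
                                .setKeys(startKey, "")
                                .setReturnKeysOnly()   // columns are paged per row below
                                .setRowCount(ROW_BUFFER);
                OrderedRows<String, String, String> rows = rowQuery.execute().get();
                List<Row<String, String, String>> rowList = rows.getList();

                // Range paging re-returns the key the page started on, so skip it
                // on every page after the first.
                for (int i = firstPage ? 0 : 1; i < rowList.size(); i++) {
                    String key = rowList.get(i).getKey();
                    long count = countColumns(ks, ss, key);
                    if (count > maxColumns) {
                        maxColumns = count;
                        maxKey = key;
                    }
                }

                if (rows.getCount() < ROW_BUFFER) {
                    break;  // last page
                }
                startKey = rows.peekLast().getKey();
                firstPage = false;
            }
            System.out.println("max columns: " + maxColumns + " in row " + maxKey);
        }

        // Inner loop: count one row's columns by paging through it with SliceQuery.
        private static long countColumns(Keyspace ks, StringSerializer ss, String key) {
            long count = 0;
            String startCol = "";
            while (true) {
                SliceQuery<String, String, String> colQuery =
                        HFactory.createSliceQuery(ks, ss, ss, ss)
                                .setColumnFamily("MyCF")
                                .setKey(key)
                                .setRange(startCol, "", false, COL_BUFFER);
                List<HColumn<String, String>> cols = colQuery.execute().get().getColumns();
                count += cols.size();
                if (cols.size() < COL_BUFFER) {
                    return count;
                }
                // The next slice starts at (and re-returns) the last column seen,
                // so back one off the running total to compensate.
                startCol = cols.get(cols.size() - 1).getName();
                count--;
            }
        }
    }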

Looking for ways to improve this, I discovered Hector's CountQuery (which I assume uses the Thrift get_count() call). Thinking it would be faster to just iterate over the keys (using RangeSlicesQuery.setReturnKeysOnly()) and then run a CountQuery against each row key, I revised the code.
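
The revision amounts to swapping the inner per-row SliceQuery paging for a single CountQuery per key, roughly like the method below (same assumptions as the sketch above; the Integer.MAX_VALUE ceiling is just there so the whole row gets counted):

    // Needs: import me.prettyprint.hector.api.query.CountQuery;
    // Drop-in replacement for countColumns() in the sketch above: one CountQuery
    // per row key instead of paging the row with SliceQuery.
    private static long countColumnsWithCountQuery(Keyspace ks, StringSerializer ss, String key) {
        CountQuery<String, String> countQuery =
                HFactory.createCountQuery(ks, ss, ss)
                        .setColumnFamily("MyCF")
                        .setKey(key)
                        .setRange("", "", Integer.MAX_VALUE);  // count everything in the row
        return countQuery.execute().get();
    }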

Not only was it slower, it was 30x slower! It processed only ~900 rows per second.

Is there a better way to count columns?


Answer 1:


Not sure what's going on with Hector -- I'd expect it to be roughly 2x slower, not 30x slower.

More generally, keeping a denormalized count using a counter column is probably better than a full CF scan: http://www.datastax.com/dev/blog/whats-new-in-cassandra-0-8-part-2-counters
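For example, something along these lines with Hector's counter support (a rough, untested sketch: the "ColumnCounts" counter column family and the "count" column name are made up, and it assumes Hector's Mutator.incrementCounter / HFactory.createCounterColumnQuery API on Cassandra 0.8+):

    // Needs: me.prettyprint.hector.api.mutation.Mutator and
    //        me.prettyprint.hector.api.query.CounterQuery
    // ks and ss are the Keyspace and StringSerializer from the question's sketch.
    Mutator<String> mutator = HFactory.createMutator(ks, ss);

    // Each time a column is written into a row of MyCF, bump that row's counter:
    mutator.incrementCounter("someRowKey", "ColumnCounts", "count", 1L);

    // Reading the count back is then a single-column lookup instead of a CF scan:
    CounterQuery<String, String> q =
            HFactory.createCounterColumnQuery(ks, ss, ss)
                    .setColumnFamily("ColumnCounts")
                    .setKey("someRowKey")
                    .setName("count");
    long columnCount = q.execute().get().getValue();

The trade-off is that the count has to be maintained on every write, but reads become constant-time.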



Source: https://stackoverflow.com/questions/7406178/counting-columns-very-slow-countquery-vs-slicequery-operations
