Cassandra performance for long rows

情深已故 2021-02-02 03:17

I'm looking at implementing a column family (CF) in Cassandra that has very long rows (hundreds of thousands to millions of columns per row).

Using entirely dummy data, I've inserted
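
As a rough illustration (not the original test code), here is a minimal sketch of loading one very wide row with dummy data, assuming the Thrift-era pycassa client and a CF whose comparator is LongType; the keyspace, CF, and row-key names below are made up:

    from pycassa.pool import ConnectionPool
    from pycassa.columnfamily import ColumnFamily

    # Hypothetical keyspace/CF; assumes the CF comparator is LongType so
    # integer column names are packed correctly by pycassa.
    pool = ConnectionPool('TestKeyspace', ['localhost:9160'])
    cf = ColumnFamily(pool, 'WideRows')

    row_key = 'dummy_row'
    total_columns = 2000000
    chunk = 1000

    # Insert dummy columns in ascending column order, a chunk at a time.
    for start in range(0, total_columns, chunk):
        cols = dict((i, 'x') for i in range(start, start + chunk))
        cf.insert(row_key, cols)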

2 Answers
  • 2021-02-02 03:59

    A good resource on this is Aaron Morton's blog post on Cassandra's Reversed Comparators. From the article:

    Recall from my post on Cassandra Query Plans that once rows get to a certain size they include an index of the columns. And that the entire index must be read whenever any part of the index needs to be used, which is the case when using a Slice Range that specifies start or reversed. So the fastest slice query to run against a row was one that retrieved the first X columns in a row by only specifying a column count.

    If you are mostly reading from the end of a row (for example, if you are storing things by timestamp and you mostly want to look at recent data) you can use the Reversed Comparator, which stores your columns in descending order. This will give you much better (and more consistent) query performance.
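
    For example, with pycassa something like the following would set up and read such a reversed CF. The names are illustrative, and the 'LongType(reversed=true)' comparator string is the Thrift-era syntax, worth double-checking against your Cassandra version:

        from pycassa.system_manager import SystemManager
        from pycassa.pool import ConnectionPool
        from pycassa.columnfamily import ColumnFamily

        # Create a CF whose columns (e.g. timestamps stored as longs) sort newest-first.
        sysm = SystemManager('localhost:9160')
        sysm.create_column_family('TestKeyspace', 'EventsByTimeDesc',
                                  comparator_type='LongType(reversed=true)')

        pool = ConnectionPool('TestKeyspace', ['localhost:9160'])
        events = ColumnFamily(pool, 'EventsByTimeDesc')

        # Because the row is stored in descending order, the most recent N columns
        # are at the front, so a plain count-limited slice (no start column and no
        # reversed flag) is the cheap query.
        latest = events.get('sensor-42', column_count=100)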

    If your read patterns are more random you might be better off partitioning your data across multiple rows.
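
    One common way to partition is to bucket by a time window (or by a hash of the column key), e.g. one row per source per day, so no single row grows without bound. A sketch, with a made-up bucketing scheme:

        from datetime import datetime

        def row_key_for(sensor_id, ts):
            # One row per sensor per day keeps each row to a bounded number of columns.
            day = datetime.utcfromtimestamp(ts).strftime('%Y%m%d')
            return '%s:%s' % (sensor_id, day)

        # Writes and reads both derive the row key from the timestamp, so a read
        # for recent data only touches a handful of day-sized rows.
        key = row_key_for('sensor-42', 1328148000)   # -> 'sensor-42:20120202'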

  • 2021-02-02 04:02

    psanford's comment led me to the answer. It turns out that Cassandra versions before 1.1.0 (1.1.0 was still in beta at the time) have slow slice performance on long rows that are still in memtables (i.e. not yet flushed to disk), but perform well on the same data once it has been flushed to SSTables.

    See the mailing-list thread at http://mail-archives.apache.org/mod_mbox/cassandra-user/201201.mbox/%3CCAA_K6YvZ=vd=Bjk6BaEg41_r1gfjFaa63uNSXQKxgeB-oq2e5A@mail.gmail.com%3E and the ticket at https://issues.apache.org/jira/browse/CASSANDRA-3545.

    With my example, the first ~1.8 million columns had been flushed to disk, so slices over that range were fast, but the last ~200,000 columns had not yet been flushed and were still in memtables. Because memtable slicing is slow on long rows, this is why I saw poor performance at the end of the rows (my data was inserted in column order).
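
    A quick way to see the split is to time slices at different offsets within the row; a sketch along these lines (pycassa, with the same made-up names as above) shows fast reads over the flushed range and slow reads over the columns still sitting in the memtable:

        import time
        from pycassa.pool import ConnectionPool
        from pycassa.columnfamily import ColumnFamily

        pool = ConnectionPool('TestKeyspace', ['localhost:9160'])
        cf = ColumnFamily(pool, 'WideRows')

        def time_slice(start_col):
            t0 = time.time()
            cf.get('dummy_row', column_start=start_col, column_count=1000)
            return time.time() - t0

        # Columns near the start of the row were flushed to an SSTable: fast.
        print('flushed range:  %.3fs' % time_slice(100000))
        # Columns near the end are still in the memtable: slow before 1.1.0.
        print('memtable range: %.3fs' % time_slice(1900000))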

    This can be worked around by manually triggering a flush on the Cassandra nodes. A patch has been applied to 1.1.0 to fix the underlying problem, and I can confirm that it resolves the issue for me.
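
    The manual flush is just nodetool run against each node, substituting your own host, keyspace, and column family names, e.g.:

        nodetool -h <host> flush TestKeyspace WideRows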

    I hope this helps anyone else with the same problem.
