Most efficient way to query cassandra in small time-based chunks

Submitted by 大城市里の小女人 on 2019-12-12 04:25:25

Question


My Cassandra-based application needs to read the rows changed since the last read. For this purpose, we are planning to have a table changed_rows with two columns:

  1. ID - The ID of the changed row and
  2. Updated_Time - The timestamp when it was changed.

What is the best way to read such a table so that it fetches small groups of rows ordered by time? For example, if the table is:

ID   Updated_Time
foo    1000
bar    1200
abc    2000
pqr    2500
zyx    2900
 ...
xyz   901000
 ...

I have shown the IDs as simple 3-letter keys; in reality they are UUIDs. Likewise, the time above is shown as an integer for simplicity, but it's an actual Cassandra timestamp (or Java Date). The Updated_Time column is monotonically increasing.

If I query this data with:

SELECT * FROM changed_rows WHERE Updated_Time < toTimestamp(now());

I get the following error:

Cannot execute this query as it might involve data filtering and 
thus may have unpredictable performance... Use Allow Filtering

But I think ALLOW FILTERING would kill performance in this case. The Cassandra index page warns against indexing high-cardinality columns, and the Updated_Time column above certainly looks high-cardinality.

I do not know the ID column beforehand, because the purpose of the query is to find the IDs updated in a given time interval.

What is the best way to query Cassandra in this case then?
Can I change my table somehow to run the time-chunk query more efficiently?

Note: This may sound similar to the Cassandra CDC feature, but we cannot use it because our solution must work across all Cassandra versions.


Answer 1:


Assuming you know the time intervals you want to query, you need to create another table like the following:

CREATE TABLE modified_records (
    timeslot timestamp,
    updatedtime timestamp,
    recordid uuid,
    PRIMARY KEY (timeslot, updatedtime)
);

Now you can split your "updated record log" into time slices, e.g. 1 hour each, and fill the table like this:

INSERT INTO modified_records (timeslot, updatedtime, recordid) VALUES ('2017-02-27 09:00:00', '2017-02-27 09:36:00', c37d661d-7e61-49ea-96a5-68c34e83db3a);
INSERT INTO modified_records (timeslot, updatedtime, recordid) VALUES ('2017-02-27 09:00:00', '2017-02-27 09:56:00', 5bd8c586-ae44-47a1-97f5-2381bcb2f265);
INSERT INTO modified_records (timeslot, updatedtime, recordid) VALUES ('2017-02-27 10:00:00', '2017-02-27 10:00:13', d2177dd0-eaa2-11de-a572-001b779c76e3);

where a rounded-down part of the updatedtime timestamp serves as the partition key; in this case it is rounded down to the integral hour. You then query by specifying the time slot only, e.g.:

SELECT * FROM modified_records WHERE timeslot = '2017-02-27 09:00:00';
SELECT * FROM modified_records WHERE timeslot = '2017-02-27 10:00:00';
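The partition key can be derived in client code by truncating the update timestamp down to its slice boundary. A minimal Python sketch (the function name and the one-hour default slice are illustrative assumptions, not part of any driver API):

```python
from datetime import datetime, timedelta

def timeslot(ts: datetime, slice_size: timedelta = timedelta(hours=1)) -> datetime:
    """Truncate a timestamp down to the start of its time slice (the partition key)."""
    # Number of whole slices since the epoch of datetime.min, times the slice size.
    return datetime.min + (ts - datetime.min) // slice_size * slice_size

print(timeslot(datetime(2017, 2, 27, 9, 36)))      # 2017-02-27 09:00:00
print(timeslot(datetime(2017, 2, 27, 10, 0, 13)))  # 2017-02-27 10:00:00
```

The same function works for any slice size, e.g. `timedelta(minutes=15)`, so the bucketing policy can be changed without touching the schema.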

Depending on how often your records get updated, you can choose smaller or larger time slices, e.g. every 15 minutes, every 6 hours, or 1 day. This structure is very flexible: you only need to know the timeslot you want to query. If you need to span multiple timeslots, you will need to perform multiple queries.
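Since spanning an interval means one query per slot, a small helper can enumerate the partition keys covering a time range; a hypothetical sketch (assuming the one-hour slices used above):

```python
from datetime import datetime, timedelta

def timeslots_between(start: datetime, end: datetime,
                      slice_size: timedelta = timedelta(hours=1)) -> list:
    """List the partition keys whose slices cover [start, end]."""
    # Truncate the start down to its slot boundary, then step by slice_size.
    slot = datetime.min + (start - datetime.min) // slice_size * slice_size
    slots = []
    while slot <= end:
        slots.append(slot)
        slot += slice_size
    return slots

# Covering 09:30 through 11:10 requires the 09:00, 10:00 and 11:00 slots:
print(timeslots_between(datetime(2017, 2, 27, 9, 30),
                        datetime(2017, 2, 27, 11, 10)))
```

Each returned slot would then be bound as the `timeslot` value of one `SELECT ... WHERE timeslot = ?` query.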



Source: https://stackoverflow.com/questions/42476481/most-efficient-way-to-query-cassandra-in-small-time-based-chunks
