Question
My Cassandra-based application needs to read the rows changed since last read.
For this purpose, we are planning to have a table changed_rows
that will contain two columns -
- ID - The ID of the changed row and
- Updated_Time - The timestamp when it was changed.
What is the best way to read such a table so that it returns small groups of rows ordered by time? For example, if the table is:
ID    Updated_Time
foo   1000
bar   1200
abc   2000
pqr   2500
zyx   2900
...
xyz   901000
...
I have shown the IDs as simple 3-letter keys; in reality they are UUIDs.
Likewise, the times above are shown as integers for simplicity, but the column is an actual Cassandra timestamp (i.e. a Java Date). The Updated_Time column is monotonically increasing.
If I query this data with:
SELECT * FROM changed_rows WHERE Updated_Time < toTimestamp(now())
I get the following error:
Cannot execute this query as it might involve data filtering and
thus may have unpredictable performance... Use Allow Filtering
But I think ALLOW FILTERING would kill performance in this case.
The Cassandra index documentation warns against indexing high-cardinality columns, and Updated_Time certainly looks like a high-cardinality column.
I do not know the ID values beforehand, because the whole purpose of the query is to find the IDs updated within a given time interval.
What is the best way to query Cassandra in this case then?
Can I change my table somehow to run the time-chunk query more efficiently?
Note: This sounds similar to the Cassandra CDC feature, but we cannot use CDC because our solution must work across all Cassandra versions.
Answer 1:
Assuming you know the time intervals you want to query, you need to create another table like the following:
CREATE TABLE modified_records (
    timeslot timestamp,
    updatedtime timestamp,
    recordid timeuuid,
    PRIMARY KEY (timeslot, updatedtime, recordid)
);
Including recordid as a clustering column ensures that two records updated at the exact same timestamp within a slot do not overwrite each other.
Now you can split your "updated record log" into time slices, eg 1 hour, and fill the table like this:
INSERT INTO modified_records (timeslot, updatedtime, recordid) VALUES ('2017-02-27 09:00:00', '2017-02-27 09:36:00', now());
INSERT INTO modified_records (timeslot, updatedtime, recordid) VALUES ('2017-02-27 09:00:00', '2017-02-27 09:56:00', now());
INSERT INTO modified_records (timeslot, updatedtime, recordid) VALUES ('2017-02-27 10:00:00', '2017-02-27 10:00:13', now());
-- now() generates a timeuuid server-side; an integer literal would not typecheck against the timeuuid column
where you use part of your updatedtime timestamp as the partition key; in this case you round down to the whole hour. You then query by specifying the time slot only, e.g.:
SELECT * FROM modified_records WHERE timeslot = '2017-02-27 09:00:00';
SELECT * FROM modified_records WHERE timeslot = '2017-02-27 10:00:00';
Depending on how often your records are updated, you can use smaller or larger time slices, e.g. every 15 minutes, every 6 hours, or one day. This structure is very flexible: you only need to know which timeslot you want to query. If your time range spans multiple timeslots, you have to issue one query per slot.
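Spanning multiple timeslots means enumerating the partition keys between the two bounds on the client and issuing one SELECT per key. A minimal sketch in plain Python (the helper name and 1-hour slot size are illustrative, not from the answer; the actual driver call is left as a comment):

```python
from datetime import datetime, timedelta

def timeslots(start, end, slot=timedelta(hours=1)):
    """Yield every slot boundary (partition key) covering [start, end]."""
    epoch = datetime(1970, 1, 1)
    # Round the start time down to the nearest slot boundary.
    current = start - ((start - epoch) % slot)
    while current <= end:
        yield current
        current += slot

# Three hourly partitions cover the range 09:30 .. 11:05:
slots = list(timeslots(datetime(2017, 2, 27, 9, 30),
                       datetime(2017, 2, 27, 11, 5)))

# For each slot, run one query, optionally narrowing inside the partition:
#   SELECT * FROM modified_records
#   WHERE timeslot = %s AND updatedtime >= %s AND updatedtime <= %s
# (the range on updatedtime is valid CQL because it is a clustering column)
```

The range restriction on updatedtime lets you trim the first and last slots to the exact interval, while the middle slots are read whole.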
Source: https://stackoverflow.com/questions/42476481/most-efficient-way-to-query-cassandra-in-small-time-based-chunks