问题
I have the following 'Tasks' table in Cassandra.
- Task_ID UUID - Partition Key
- Starts_On TIMESTAMP - Clustering Column
- Ends_On TIMESTAMP - Clustering Column
I want to run a CQL query to get the overlapping tasks for a given date range. For example, if I pass in two timestamps (T1 and T2) as parameters to the query, I want to get the all tasks that are applicable with in that range (that is, overlapping records).
What is the best way to do this in Cassandra? I cannot just use two ranges on Starts_On and Ends_On here because to add a range query to Ends_On, I have to have a equality check for Starts_On.
回答1:
Here's another idea (somewhat unconventional). You could create a user defined function to implement the second range filter (in Cassandra 2.2 and newer).
Suppose you define your table like this (shown with ints instead of timestamps to keep the example simple):
CREATE TABLE tasks (
p int,
task_id timeuuid,
start int,
end int,
end_range int static,
PRIMARY KEY(p, start));
Now we create a user defined function to check returned rows based on the end time, and return the task_id of matching rows, like this:
CREATE FUNCTION my_end_range(task_id timeuuid, end int, end_range int)
CALLED ON NULL INPUT RETURNS timeuuid LANGUAGE java AS
'if (end <= end_range) return task_id; else return null;';
Now I'm using a trick there with the third parameter. In an apparent (major?) oversight, it appears you can't pass a constant to a user defined function. So to work around that, we pass a static column (end_range) as our constant.
So first we have to set the end_range we want:
UPDATE tasks SET end_range=15 where p=1;
And let's say we have this data:
SELECT * FROM tasks;
p | start | end_range | end | task_id
---+-------+-----------+-----+--------------------------------------
1 | 1 | 15 | 5 | 2c6e9340-4a88-11e5-a180-433e07a8bafb
1 | 2 | 15 | 7 | 3233a040-4a88-11e5-a180-433e07a8bafb
1 | 4 | 15 | 22 | f98fd9b0-4a88-11e5-a180-433e07a8bafb
1 | 8 | 15 | 15 | 37ec7840-4a88-11e5-a180-433e07a8bafb
Now let's get the task_id's that have start >= 2 and end <= 15:
SELECT start, end, my_end_range(task_id, end, end_range) FROM tasks
WHERE p=1 AND start >= 2;
start | end | test.my_end_range(task_id, end, end_range)
-------+-----+--------------------------------------------
2 | 7 | 3233a040-4a88-11e5-a180-433e07a8bafb
4 | 22 | null
8 | 15 | 37ec7840-4a88-11e5-a180-433e07a8bafb
So that gives you the matching task_id's and you have to ignore the null rows (I haven't figured out a way to drop rows using UDF's). You'll note that the filter of start >= 2 dropped one row before passing it to the UDF.
Anyway not a perfect method obviously, but it might be something you can work with. :)
回答2:
In CQL you can only range query on one clustering column at a time, so you'll probably need to do some kind of client side filtering in your application. So you could range query on starts_on, and as rows are returned, check ends_on in your application and discard rows that you don't want.
回答3:
A while ago I wrote an application that faced a similar problem, in querying events that had both start and end times. For our scenario, I was able to partition on a userID (as queries were for events of a specific user), set a clustering column for type of event, and also for event date. The table structure looked something like this:
CREATE TABLE userEvents (
userid UUID,
eventTime TIMEUUID,
eventType TEXT,
eventDesc TEXT,
PRIMARY KEY ((userid),eventTime,eventType));
With this structure, I can query by userid
and eventtime
:
SELECT userid,dateof(eventtime),eventtype,eventdesc FROM userevents
WHERE userid=dd95c5a7-e98d-4f79-88de-565fab8e9a68
AND eventtime >= mintimeuuid('2015-08-24 00:00:00-0500');
userid | system.dateof(eventtime) | eventtype | eventdesc
--------------------------------------+--------------------------+-----------+-----------
dd95c5a7-e98d-4f79-88de-565fab8e9a68 | 2015-08-24 08:22:53-0500 | End | event1
dd95c5a7-e98d-4f79-88de-565fab8e9a68 | 2015-08-24 11:45:00-0500 | Begin | lunch
dd95c5a7-e98d-4f79-88de-565fab8e9a68 | 2015-08-24 12:45:00-0500 | End | lunch
(3 rows)
That query will give me all event rows for a particular user for today.
NOTES:
- If you need to query by whether or not an event is starting or ending (I did not) you will want to order
eventType
ahead ofeventTime
in the primary key. - You will store each event twice (once for the beginning, and once for the end). Duplication of data usually isn't much of a concern in Cassandra, but I did want to explicitly point that out.
- In your case, you will want to find a good key to partition on, as
Task_ID
will be too unique (high cardinality). This is a must in Cassandra, as you cannot range query on a partition key (only a clustering key).
回答4:
There doesn't seem to be a completely satisfactory way to do this in Cassandra but the following method seems to work well:
I cluster the table on the Starts_On
timestamp in descending order. (Ends_On
is just a regular column.) Then I constrain the query with Starts_On<?
where the parameter is the end of the period of interest - i.e. filter out events that start after our period of interest has finished.
I then iterate through the results until the row Ends_On
is earlier than the start of the period of interest and throw away the rest of the results rows. (Note that this assumes events don't overlap - there are no subsequent results with a later Ends_On
.)
Throwing away the rest of the result rows might seem wasteful, but here's the crucial bit: You can set the paging size sufficiently small that the number of rows to throw away is relatively small, even if the total number of rows is very large.
Ideally you want the paging size just a little bigger than the total number of relevant rows that you expect to receive back. If the paging size is too small the driver ends up retrieving multiple pages, which could hurt performance. If it is too large you end up throwing away a lot of rows and again this could hurt performance by transfering more data than is necessary. In practice you can probably find a good compromise.
来源:https://stackoverflow.com/questions/32178825/cassandra-overlapping-data-ranges