I am doing time series data modelling where I have a start date and end date of events.
I had to solve a similar problem in one of my former positions. This is one way in which you could accomplish this...
I need to query that data model like the following:
Select * from tablename where startdate>'2012-08-09' and enddate<'2012-09-09'
There are two modeling problems preventing this query from working. First, to run a range query, you need to restrict the query to a partition key. With time series data, the best idea is to create something called a time bucket. For this example I'll partition the data by month, with a partition key called monthbucket.
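As a rough sketch of how an application might derive that bucket, the hypothetical helper below (`month_bucket` is my name, not part of the original answer) formats a timestamp as `YYYYMM`. It assumes timestamps are handled consistently in one time zone (UTC here), so that writers and readers agree on the bucket.

```python
from datetime import datetime, timezone


def month_bucket(event_date: datetime) -> str:
    """Return the 'YYYYMM' time bucket (partition key) for a timestamp."""
    return event_date.strftime("%Y%m")


# Example: an event on September 19th, 2015 lands in the '201509' bucket.
print(month_bucket(datetime(2015, 9, 19, tzinfo=timezone.utc)))  # 201509
```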
The other problem is that you can only run a range query on a single column/key value. This becomes problematic when you want to query by both a start and end date. One solution is to store each row in the table twice, and create an additional clustering key to indicate whether the row is the beginning row or the end row. I'll just call this column beginend.
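To illustrate the "store it twice" idea, here is a minimal sketch of how an application could build the two rows for one event before writing them. The function name `begin_end_rows` and the tuple layout are assumptions for illustration; the column order mirrors the events table in this answer. It also assumes that each row goes into the bucket of its own date, so an event spanning two months would have its begin and end rows in different partitions.

```python
import uuid
from datetime import datetime


def begin_end_rows(event_id, name, start, end):
    """Build the two rows (begin and end) stored for a single event.

    Each tuple is (monthbucket, eventdate, beginend, eventid, eventname).
    """
    return [
        (start.strftime("%Y%m"), start, "B", event_id, name),  # beginning row
        (end.strftime("%Y%m"), end, "E", event_id, name),      # end row
    ]


rows = begin_end_rows(
    uuid.uuid4(),
    "Cassandra Summit",
    datetime(2015, 9, 22, 0, 0, 0),
    datetime(2015, 9, 24, 23, 59, 59),
)
for row in rows:
    # Print bucket, date, begin/end marker and name for each row.
    print(row[0], row[1], row[2], row[4])
```

Each tuple would then be bound to an ordinary INSERT statement, one per row.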
Given those notes, I'll create a table that looks like this:
CREATE TABLE events (
monthBucket TEXT,
eventDate TIMESTAMP,
beginEnd TEXT,
eventid UUID,
eventName TEXT,
PRIMARY KEY (monthBucket, eventDate, beginEnd, eventid))
WITH CLUSTERING ORDER BY (eventDate DESC, beginEnd ASC, eventid ASC);
Rows within each partition are clustered by eventDate in DESCending order, and a final clustering key keeps each row unique (eventid in this case).
After INSERTing some rows, let's just query by a partition key of September, 2015:
aploetz@cqlsh:stackoverflow> SELECT * FROM events WHERE monthbucket='201509';
monthbucket | eventdate | beginend | eventid | eventname
-------------+--------------------------+----------+--------------------------------------+------------------------
201509 | 2015-09-25 23:59:59+0000 | E | a223ad16-2afd-4213-bee3-08a2c4dd63e6 | Hobbit Day
201509 | 2015-09-25 00:00:00+0000 | B | a223ad16-2afd-4213-bee3-08a2c4dd63e6 | Hobbit Day
201509 | 2015-09-24 23:59:59+0000 | E | 9cd6a265-6c60-4537-9ea9-b57e7c152db9 | Cassandra Summit
201509 | 2015-09-22 00:00:00+0000 | B | 9cd6a265-6c60-4537-9ea9-b57e7c152db9 | Cassandra Summit
201509 | 2015-09-19 23:59:59+0000 | E | b9fe9668-cef2-464e-beb4-d4f985ef9c47 | Talk Like a Pirate Day
201509 | 2015-09-19 00:00:00+0000 | B | b9fe9668-cef2-464e-beb4-d4f985ef9c47 | Talk Like a Pirate Day
(6 rows)
Similar to your example, let's say that I want to query events between September 18th and September 24th:
aploetz@cqlsh:stackoverflow> SELECT * FROM events WHERE monthbucket='201509' AND eventdate > '2015-09-18' AND eventdate < '2015-09-24';
monthbucket | eventdate | beginend | eventid | eventname
-------------+--------------------------+----------+--------------------------------------+------------------------
201509 | 2015-09-22 00:00:00+0000 | B | 9cd6a265-6c60-4537-9ea9-b57e7c152db9 | Cassandra Summit
201509 | 2015-09-19 23:59:59+0000 | E | b9fe9668-cef2-464e-beb4-d4f985ef9c47 | Talk Like a Pirate Day
201509 | 2015-09-19 00:00:00+0000 | B | b9fe9668-cef2-464e-beb4-d4f985ef9c47 | Talk Like a Pirate Day
(3 rows)
As you can see, I get three rows: a beginning and an end row for "Talk Like a Pirate Day" and a beginning row for the 2015 Cassandra Summit.
As with all data modeling approaches, there are trade-offs to be made. In this case, to model for querying on both dates, the trade-off is that you have to duplicate your rows. And of course, to be able to range query at all, you have to decide on a good partition key (monthbucket) that offers relevant data and the required query flexibility. In any case, give it a try and see if you can make it work for your use case.
Edit to answer questions:
If I want to find all events between 25th Nov, 2015 and 25th Nov, 2016, how could that be possible?
That's where you'd need to figure out the best time bucket for your application. Think about your most common queries, and model off of that. You don't want to store too much in a single partition (bucket), because that will kill your data distribution. So try to find a happy medium between query flexibility and data distribution.
In this particular case with monthBucket, you'd have to execute a query for each individual month. The application that I designed this solution for never looked at an entire year's worth of events at once. If that's a query pattern you need to support, then you'll need to make your time bucket a little bigger.
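As a sketch of what "a query for each individual month" could look like on the application side, the hypothetical helper below (`month_buckets` is an illustrative name) lists every `YYYYMM` bucket touched by a date range. The application would then run the range query once per bucket and combine the results.

```python
from datetime import date


def month_buckets(start: date, end: date) -> list:
    """List every 'YYYYMM' bucket from start's month through end's month."""
    buckets = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        buckets.append(f"{year:04d}{month:02d}")
        month += 1
        if month > 12:  # roll over into January of the next year
            year, month = year + 1, 1
    return buckets


buckets = month_buckets(date(2015, 11, 25), date(2016, 11, 25))
print(len(buckets), buckets[0], buckets[-1])  # 13 201511 201611
```

For the range in the question, that is 13 separate partition queries, which is why a larger bucket may suit that access pattern better.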
Is there any way to remove this duplicate row from the result set only?
Nope. Duplicates would need to be handled/ignored at the application level. Cassandra CQL does have a DISTINCT keyword, but it only functions on partition keys.
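A minimal sketch of that application-level handling, assuming the result rows are available as dict-like objects keyed by column name (the `unique_events` helper and the row shape are illustrative, not tied to any particular driver):

```python
def unique_events(rows):
    """Keep one entry per eventid, preserving the order the rows arrived in."""
    seen = set()
    events = []
    for row in rows:
        if row["eventid"] not in seen:
            seen.add(row["eventid"])
            events.append({"eventid": row["eventid"], "eventname": row["eventname"]})
    return events


# The three rows returned by the September 18th to 24th query above.
rows = [
    {"eventid": "9cd6a265-6c60-4537-9ea9-b57e7c152db9", "beginend": "B", "eventname": "Cassandra Summit"},
    {"eventid": "b9fe9668-cef2-464e-beb4-d4f985ef9c47", "beginend": "E", "eventname": "Talk Like a Pirate Day"},
    {"eventid": "b9fe9668-cef2-464e-beb4-d4f985ef9c47", "beginend": "B", "eventname": "Talk Like a Pirate Day"},
]
for event in unique_events(rows):
    print(event["eventname"])
```

This prints each event name once, in the order the rows were returned.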
Can this type of merging be done at the Cassandra level?
No, Cassandra does not have a way to JOIN tables together. And application-side joins are possible, but don't perform well and are technically an anti-pattern.
Handling data on the application-side (whether joining or filtering) is typically not a good idea. But the key is moderation. If you query 20 events and have to ignore dupes for some of them, that's not too big of a deal. But querying 20,000,000 events and applying an application-side process at that volume is not going to scale well at all. Again, this is where you have to look at the options available, and decide what will work for your application.