We are using Cassandra as the data historian for our fleet management solution. We have a table in Cassandra , which stores the details of journey made by the vehicle. The table
select * from journeydetails where bucketid in('2015-12') and vehicleid in('1234567')
and starttime > '2015-12-1 00:00:00' and starttime < '2015-12-3 23:59:59'
and travelduration > 1800000;
That's not going to work. The reason goes back to how Cassandra stores data on-disk. The idea with Cassandra is that it is very efficient at returning a single row with a precise key, or at returning a continuous range of rows from the disk.
Your rows are partitioned by bucketid
, and then sorted on disk by vehicleid
, starttime
, and travelduration
. Because you are already executing a range query (non-EQ relation) on starttime
, you cannot restrict the key that follows. This is because the travelduration
restriction may disqualify some of the rows in your range condition. This would result in an inefficient, non-continuous read. Cassandra is designed to protect you from writing queries (such as this), which may have unpredictable performance.
Here are two alternatives:
1- If you could restrict all of your key columns prior to travelduration
(with an equals relation), then you could apply a your greater-than condition:
select * from journeydetails where bucketid='2015-12' and vehicleid='1234567'
and starttime='2015-12-1 00:00:00' and travelduration > 1800000;
Of course, restricting on an exact starttime
may not be terribly useful.
2- Another approach would be to omit travelduration
altogether, and then your original query would work.
select * from journeydetails where bucketid='2015-12' and vehicleid='1234567'
and starttime > '2015-12-1 00:00:00' and starttime < '2015-12-3 23:59:59';
Unfortunately, Cassandra does not offer a large degree of query flexibility. Many people have found success using a solution like Spark (alongside Cassandra) to achieve this level of reporting.
And just a side note, but don't use IN
unless you have to. Querying with IN
is similar to using a secondary index, in that Cassandra has to talk to several nodes to satisfy your query. Calling it with a single item probably isn't too big of a deal. But IN
is one of those old RDBMS habits that you should really break before getting too deep into Cassandra.