Cassandra: get all records in a time range

梦毁少年i 2020-11-29 04:22

I have to work with a column family that has (user_id, timestamp) as key. In my query I would like to fetch all records in a given time range independent of the user_id. Thi

3 Answers
  • 2020-11-29 04:58

    In general, this may be an indication that you've not modelled your schema to suit your data query, which is the Cassandra way of doing things (https://docs.datastax.com/en/cql/3.3/cql/ddl/dataModelingApproach.html)...

    So, ideally, you'd model your schema to suit the query. There are some resources around on time series modelling for Cassandra, although e.g. this slideshare seems to describe a schema similar to yours without claiming to support the kind of query you want to do. I don't think I've actually found examples of Cassandra schemas that support "get me all data for a certain time range" queries.

    In any case, for the rest of this answer I'll assume you're stuck with the schema you've got for this iteration.

    You can do this as two queries:

    SELECT DISTINCT user_id FROM userlog;
    

    And then, for each user,

    SELECT * FROM userlog WHERE
      user_id='<user>'
      AND ts >= '2013-01-01 00:00:00+0200'
      AND ts <= '2013-08-13 23:59:00+0200';
    

    If the set of user IDs is small to medium sized, you might be able to get away with using an IN query:

    SELECT * FROM userlog WHERE
      user_id IN ('sampleuser', 'sampleadmin', ...)
      AND ts >= '2013-01-01 00:00:00+0200'
      AND ts <= '2013-08-13 23:59:00+0200';
    

    Note that this works without ALLOW FILTERING.
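
    The two-step approach above (list the user IDs, then run one range query per user) could be driven from client code roughly as follows. This is a minimal sketch, not a definitive implementation: it assumes `session` behaves like the DataStax Python driver's `Session` (an `execute(query, parameters)` method with `%s` placeholders, returning rows that expose columns as attributes), and `fetch_range_for_all_users` is a hypothetical helper name.

```python
def fetch_range_for_all_users(session, start, end):
    """Fan out one range query per user_id and collect all matching rows."""
    # Step 1: enumerate the partition keys (one row per distinct user_id).
    users = session.execute("SELECT DISTINCT user_id FROM userlog")

    rows = []
    for user in users:
        # Step 2: each query targets a single partition, so the range on the
        # clustering column ts needs no ALLOW FILTERING.
        result = session.execute(
            "SELECT * FROM userlog "
            "WHERE user_id = %s AND ts >= %s AND ts <= %s",
            (user.user_id, start, end),
        )
        rows.extend(result)
    return rows
```

    The cost is one round trip per user, so this is only reasonable while the number of users stays modest.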

  • 2020-11-29 05:02

    The timeout is because Cassandra is taking longer than the timeout (default is 10 seconds) to return the data. For your query, Cassandra will attempt to fetch the entire dataset before returning. For more than a few records this can easily take longer than the timeout.

    For queries that produce lots of data you need to page, e.g.:

    SELECT * FROM userlog WHERE ts >= '2013-01-01 00:00:00+0200' AND  ts <= '2013-08-13 23:59:00+0200' AND token(user_id) > previous_token LIMIT 100 ALLOW FILTERING;
    

    where previous_token is the token of the last user_id returned by the previous page. You will also need to page on ts to guarantee you get all the records for the last user_id returned.
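
    A manual paging loop along these lines might look like the sketch below. It rests on several assumptions that are not in the answer: `session` is any object with a DataStax-driver-style `execute(query, parameters)` method, rows expose `user_id` and `ts` as attributes, ts is clustered in ascending order, and the token comparison is written as `token(user_id) > token(<last user_id>)` rather than with a precomputed token. `iter_range_manually` is a hypothetical name.

```python
# CQL strings modelled on the paging query above (assumed schema: userlog
# with partition key user_id and clustering column ts, ascending).
FIRST_PAGE = (
    "SELECT * FROM userlog WHERE ts >= %s AND ts <= %s "
    "LIMIT %s ALLOW FILTERING"
)
NEXT_PAGE = (
    "SELECT * FROM userlog WHERE token(user_id) > token(%s) "
    "AND ts >= %s AND ts <= %s LIMIT %s ALLOW FILTERING"
)
REST_OF_USER = (
    "SELECT * FROM userlog WHERE user_id = %s "
    "AND ts > %s AND ts <= %s LIMIT %s"
)


def iter_range_manually(session, start, end, page_size=100):
    """Yield every row with start <= ts <= end, page_size rows per query."""
    last_user = None
    while True:
        if last_user is None:
            page = list(session.execute(FIRST_PAGE, (start, end, page_size)))
        else:
            page = list(
                session.execute(NEXT_PAGE, (last_user, start, end, page_size))
            )
        yield from page
        if len(page) < page_size:
            return  # a short page means the scan reached the end of the ring

        last_user, last_ts = page[-1].user_id, page[-1].ts
        # A full page may have cut the last user's rows short, so drain the
        # remainder of that partition by paging on ts.
        while True:
            rest = list(
                session.execute(
                    REST_OF_USER, (last_user, last_ts, end, page_size)
                )
            )
            yield from rest
            if len(rest) < page_size:
                break
            last_ts = rest[-1].ts
```

    The inner loop is the "page on ts" step: it finishes the partially returned user before the outer loop moves on to the next token.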

    Alternatively, from Cassandra 2.0.0 onward, paging is done transparently, so your original query should work with no timeout or manual paging.

    The ALLOW FILTERING means Cassandra is reading through all your data, but only returning data within the range specified. This is only efficient if the range is most of the data. If you wanted to find records within e.g. a 5 minute time window, this would be very inefficient.

  • 2020-11-29 05:02

    It appears the preferred way to query by time (or any range) is to make some other column the partition key and the timestamp a clustering column:

    CREATE TABLE postsbyuser (
         userid bigint,
         posttime timestamp,
         postid uuid,
         postcontent text,
         PRIMARY KEY ((userid), posttime)
       ) WITH CLUSTERING ORDER BY (posttime DESC);
    

    Insert some fake data:

      insert into postsbyuser (userid, posttime) values (77, '2013-04-03 07:04:00');
    

    and query (the important part being that it is a "fast" query and ALLOW FILTERING is not required, which is how it should be):

      SELECT * FROM postsbyuser where userid=77 and posttime > '2013-04-03 07:03:00' and posttime < '2013-04-03 08:04:00';
    

    You can also use tricks to group by day (and thus be able to query by day) or what not.

    If you use the "group by day" trick, a secondary index would also be an option (though secondary indexes seem to work only with the equality (=) operator?).
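
    To illustrate the "group by day" trick: if the partition key is a day bucket, a time range query becomes one single-partition query per day in the range. The sketch below assumes a hypothetical table `posts_by_day` with `PRIMARY KEY ((day), posttime)`, where `day` is a 'YYYY-MM-DD' text column written in the same timezone as the datetimes passed in. `session` is assumed to offer a DataStax-driver-style `execute(query, parameters)`, and both function names are made up for the example.

```python
from datetime import datetime, timedelta


def day_buckets(start, end):
    """Return each calendar day touched by [start, end] as 'YYYY-MM-DD'."""
    days = []
    current = start.date()
    while current <= end.date():
        days.append(current.isoformat())
        current += timedelta(days=1)
    return days


def fetch_posts_in_range(session, start, end):
    """Run one single-partition query per day bucket and collect the rows."""
    # posts_by_day is a hypothetical table, not part of the answer above.
    query = (
        "SELECT * FROM posts_by_day "
        "WHERE day = %s AND posttime >= %s AND posttime <= %s"
    )
    rows = []
    for day in day_buckets(start, end):
        rows.extend(session.execute(query, (day, start, end)))
    return rows
```

    The bucket size is a trade-off: daily buckets keep partitions bounded, but a range spanning many days needs as many queries.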
