range query in Cassandra

拟墨画扇 submitted on 2019-12-10 11:18:38

Question


I'm using Cassandra 2.1.2 with the corresponding DataStax Java driver and the Object mapping provided by DataStax.

following table definition:

CREATE TABLE IF NOT EXISTS ses.tim (id text PRIMARY KEY, start bigint, cid int);

the mapping:

@Table(keyspace = "ses", name = "tim")
class MyObj {
    @PartitionKey
    private String id;
    private Long start;
    private int cid;
}

the accessor

@Accessor
interface MyAccessor {
    @Query("SELECT * FROM ses.tim WHERE id = :iid")
    MyObj get(@Param("iid") String id);

    @Query("SELECT * FROM ses.tim WHERE start <= :sstart")
    Result<MyObj> get(@Param("sstart") long start);
}

As indicated in the accessor, I want to run a query that returns every row where 'start' is less than or equal to a specific value.

With this definition of the table it's simply not possible. Therefore I tried creating a secondary index:

CREATE INDEX IF NOT EXISTS myindex ON ses.tim (start);

This does not seem to work either (I have read a lot of explanations of why it was decided not to support this, but I still don't understand why anyone would impose such restrictions; anyhow...).

So, as far as I understand, we need at least one equality condition in the WHERE clause:

@Query("SELECT * FROM ses.tim WHERE cid = :ccid AND start <= :sstart")

CREATE INDEX IF NOT EXISTS myindex2 ON ses.tim (cid);

if this would work I would have to know ALL possible values for cid, and query them separately and do the rest on the client... but the error I get is

Cannot execute this query as it might involve data filtering and thus may have unpredictable performance

then I tried

id text, start bigint, cid int, PRIMARY KEY (id, start, cid)

with

@Table(keyspace = "ses", name = "tim")
class MyObj {
    @PartitionKey
    private String id;
    @ClusteringColumn(0)
    private Long start;
    @ClusteringColumn(1)
    private int cid;
}

but still without luck.

Furthermore, I tried making 'start' the partition key, but then it is again only possible to query it with an equality condition...

what am I missing? how can I achieve getting results for this type of query?

EDIT: version updated to correct one


Answer 1:


You could consider denormalizing your data if you have different query-ability needs for the same set of data. Based on your question, it sounds like you want the following:

  • Query by id
  • Query by start <= X

The first query works fine with your current schema, as you indicated. The second query, however, cannot work as is without a secondary index, which will be slow for reasons you have already investigated (I always point to this blog post with respect to secondary indexes).

You indicated that you did not want to partition on cid since you would need to know all possible values for cid.

Three ideas I can think of:

  • Create a separate table with a dummy partition key so that all of your data is stored in the same partition. This can be problematic, though, if you have many entries, because it creates a super-wide partition and hotspots on whatever nodes hold that data. How many do you plan on having?

    create table if not exists tim (
        dummy int, 
        start bigint, 
        id text, 
        cid int, 
        primary key (dummy, start)
    );
    

    You could then make queries like:

    select * from tim where dummy=0 and start <= 10;
    
  • Another option is to use ALLOW FILTERING on your original table, which will still perform an expensive range scan and filter through the data.

    select * from tim where start <= 10 ALLOW FILTERING;
    
  • Another option is to use something like the spark-connector to create a spark job that makes the query. The connector will break up an expensive range query into smaller tasks and map the data to RDDs, allowing you flexibility to make more complex queries with good performance.
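As a variation on the first idea, the hotspot concern can be softened by spreading rows over a small, fixed number of dummy partitions and issuing one range query per partition. The following is only a sketch under stated assumptions: the class name BucketedRange, the bucket count of 8, and the choice of hashing id to pick a bucket are all hypothetical and not part of the schema above. It only builds the CQL strings; executing them and merging the results is left to the client.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: spreads rows over a fixed number of "dummy" partitions.
class BucketedRange {
    // Assumed bucket count; choose it from the expected data volume.
    static final int BUCKETS = 8;

    // Value to store in the dummy column when writing a row with this id.
    static int bucketFor(String id) {
        return Math.floorMod(id.hashCode(), BUCKETS);
    }

    // One CQL statement per bucket; each statement targets a single partition
    // and uses the clustering column `start` for the range restriction.
    static List<String> rangeQueries(long maxStart) {
        List<String> queries = new ArrayList<>();
        for (int b = 0; b < BUCKETS; b++) {
            queries.add("SELECT * FROM tim WHERE dummy = " + b
                    + " AND start <= " + maxStart);
        }
        return queries;
    }

    public static void main(String[] args) {
        // Print the statements a client would run for "start <= 10".
        rangeQueries(10L).forEach(System.out::println);
    }
}
```

The trade-off is a fixed number of queries per read in exchange for partitions that are roughly BUCKETS times narrower than a single dummy=0 partition.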




Answer 2:


I'm using Cassandra 2.1.3

I don't think 2.1.3 has been released. The project site currently shows 2.1.2 as the highest version.

From what I can see, your main issue here is that your partitioning key id is either unique or has a cardinality that is too high to be useful to you. Currently, you are taking an RDBMS-style approach to storing your data (by unique ID). With Cassandra, you want to store your data in a way that matches how you query it. The first step is to pick a good key to partition your data on.

Therefore I tried creating a secondary index

Another thing you don't want to do here is use a secondary index. I can see that you are tempted to do so, and you should get that idea out of your head right away. Secondary indexes were created for convenience. They were not created for performance, nor were they created as a way to take shortcuts on your data model.

Cannot execute this query as it might involve data filtering and thus
may have unpredictable performance.

Speaking of temptation, when seeing this message, you might think about adding ALLOW FILTERING to your query. Definitely do not do that. The message flat out warns you that doing so will not perform well, and you should heed that warning.

if this would work I would have to know ALL possible values for cid, and query them separately and do the rest on the client.

How unique is cid? If having to obtain and iterate through all of the cids is too cumbersome, then you should consider picking/creating a less-unique value to partition on. However, assuming that cid will work, this is how your table definition should look:

CREATE TABLE IF NOT EXISTS ses.tim 
(cid int,
 start bigint,
 id text,
 PRIMARY KEY ((cid), start));

@Table(keyspace = "ses", name = "tim")
class MyObj {
    @PartitionKey
    private int cid;
    @ClusteringColumn(0)
    private Long start;
    private String id;
}

With this underlying table definition, this query should now work.

@Query("SELECT * FROM ses.tim WHERE cid = :ccid AND start <= :sstart")
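To see why this key layout supports the range restriction, it can help to picture each partition as a map sorted by the clustering column. The sketch below is a toy in-memory model, not the DataStax driver API; the class PartitionModel and its methods are hypothetical names used only for illustration. It also shows the client-side fan-out over known cid values mentioned in the question.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy in-memory model of PRIMARY KEY ((cid), start); not the driver API.
class PartitionModel {
    // partition key (cid) -> rows sorted by clustering column (start) -> id
    private final Map<Integer, NavigableMap<Long, String>> partitions = new HashMap<>();

    void insert(int cid, long start, String id) {
        partitions.computeIfAbsent(cid, k -> new TreeMap<>()).put(start, id);
    }

    // Mirrors: WHERE cid = :ccid AND start <= :sstart
    // One partition lookup, then a contiguous slice of the sorted rows.
    List<String> query(int cid, long maxStart) {
        NavigableMap<Long, String> rows = partitions.get(cid);
        if (rows == null) {
            return new ArrayList<>();
        }
        return new ArrayList<>(rows.headMap(maxStart, true).values());
    }

    // Client-side fan-out: run the same slice query for every known cid.
    List<String> queryAll(Iterable<Integer> knownCids, long maxStart) {
        List<String> out = new ArrayList<>();
        for (int cid : knownCids) {
            out.addAll(query(cid, maxStart));
        }
        return out;
    }
}
```

Note that in this model, as in the table definition above, two rows with the same cid and start share one primary key, so the later write replaces the earlier one.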

Give your data model another look, and (if cid is not very unique) see if you can come up with a better column to group your data by. For more information, read through Patrick McFadin's article Getting Started With Time Series Data Modeling. He discusses a few use cases that are somewhat similar to yours, and might point you in the right direction.
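One grouping commonly used for time series data is a time bucket, such as one partition per day. The sketch below assumes that start holds epoch milliseconds in UTC and that the earliest day containing data is known; neither assumption comes from the question, and the class DayBuckets and its methods are hypothetical. It computes the bucket a row would be written to and the list of buckets a "start <= X" query would need to visit.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for a day-bucketed partition key.
class DayBuckets {
    // Day bucket for a row, assuming `start` is epoch milliseconds (UTC).
    static LocalDate bucketOf(long startMillis) {
        return Instant.ofEpochMilli(startMillis).atZone(ZoneOffset.UTC).toLocalDate();
    }

    // Buckets to query for "start <= maxStartMillis", newest first,
    // going back no further than the assumed earliest day with data.
    static List<LocalDate> bucketsUpTo(long maxStartMillis, LocalDate earliest) {
        List<LocalDate> days = new ArrayList<>();
        for (LocalDate d = bucketOf(maxStartMillis); !d.isBefore(earliest); d = d.minusDays(1)) {
            days.add(d);
        }
        return days;
    }
}
```

Each returned day would then be used as the equality value on the partition key, with start <= X applied as the clustering restriction inside the newest bucket.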



Source: https://stackoverflow.com/questions/28303928/range-query-in-cassandra
