I wish to query on a column which is not part of my key. If I understand correctly, I need to create a secondary index on that column:
CREATE TABLE Table1 (
    KeyA int,
    KeyB int,
    ValueA int,
    PRIMARY KEY (KeyA, KeyB)
);
CREATE INDEX ON Table1 (ValueA);
SELECT * FROM Table1 WHERE ValueA > 3000;
The Cassandra way is to have some partition key and always use that, with a clustering column for ValueA, possibly
PRIMARY KEY ((KeyA, KeyB), ValueA)
and then query like:
SELECT * FROM Table1 WHERE KeyA = 'xx' AND KeyB = 'yy' AND ValueA > 3000;
Note that with a composite partition key, every component of it (here both KeyA and KeyB) must be constrained with equality before you can range over the clustering column.
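Why this layout makes the range query cheap can be sketched in plain Python (a toy model of the storage layout, not the actual Cassandra internals or driver API): each partition keeps its rows sorted by the clustering column, so a range predicate is a binary search plus a slice instead of a scan.

```python
import bisect
from collections import defaultdict

# Toy model of PRIMARY KEY ((KeyA, KeyB), ValueA): each partition key maps
# to rows kept sorted by the clustering column ValueA.
partitions = defaultdict(lambda: ([], []))  # (KeyA, KeyB) -> (sorted ValueA list, rows)

def insert(key_a, key_b, value_a, row):
    values, rows = partitions[(key_a, key_b)]
    i = bisect.bisect_left(values, value_a)  # keep clustering order on write
    values.insert(i, value_a)
    rows.insert(i, row)

def range_query(key_a, key_b, greater_than):
    """SELECT * WHERE KeyA=? AND KeyB=? AND ValueA > ? -- one partition, one slice."""
    values, rows = partitions[(key_a, key_b)]
    return rows[bisect.bisect_right(values, greater_than):]

insert('xx', 'yy', 1000, {'ValueA': 1000})
insert('xx', 'yy', 4000, {'ValueA': 4000})
insert('xx', 'yy', 3500, {'ValueA': 3500})
print(range_query('xx', 'yy', 3000))  # [{'ValueA': 3500}, {'ValueA': 4000}]
```

The point of the sketch: the equality on the full partition key selects one partition, and the inequality on the clustering column is served directly from its sort order.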
In CQL, you are able to apply the WHERE clause to any column once you have created an index for it (i.e., a secondary index). Otherwise, you will get the following error:
Bad Request: No indexed columns present in by-columns clause with Equal operator
Unfortunately, even with secondary indices, CQL requires the WHERE clause to contain at least one EQ comparison on a secondary index, for performance reasons.
Q: Why is it necessary to always have at least one EQ comparison on secondary indices?
A: Inequalities on secondary indices are always evaluated in memory, so without at least one EQ on an indexed column you would be loading every row in the database, which for a massive database isn't a good idea. By requiring at least one EQ on a (secondary) index, you hopefully limit the set of rows that must be read into memory to a manageable size. (Although you can obviously still get into trouble with that as well.)
So basically, if you have anything besides an EQ comparison, Cassandra loads all rows that otherwise match your query and checks them one at a time, which is not allowed by default since it could be slow. (In essence, secondary indexes only index for equality, not for operators like < and > that indexes in a relational database would support.)
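That mechanism can be sketched as follows (a toy model, not how Cassandra is implemented internally; the column names `category` and `score` are made up for illustration): a secondary index is essentially a hash map from value to row keys, so it can only answer equality, and any inequality is then checked row by row in memory on the candidate set the EQ produced.

```python
from collections import defaultdict

# Toy data set: row key -> row; both columns are assumed to carry a
# secondary index.
rows = {
    1: {'category': 'a', 'score': 3},
    2: {'category': 'a', 'score': 6},
    3: {'category': 'b', 'score': 9},
}

# A secondary index is essentially a hash map from column value to row
# keys, which is why it can only answer equality lookups.
def build_index(column):
    index = defaultdict(set)
    for key, row in rows.items():
        index[row[column]].add(key)
    return index

category_index = build_index('category')

# WHERE category = 'a' AND score > 5:
# the EQ hits the index and narrows the candidates; the inequality is then
# checked row by row, in memory, over that (hopefully small) set.
candidates = category_index['a']
result = sorted(k for k in candidates if rows[k]['score'] > 5)
print(result)  # [2]
```

Without the EQ, the in-memory filter would have to run over every row in the table, which is exactly what the restriction is designed to prevent.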
One thing to note is that if your query combines non-EQ conditions with the indexed EQ, you also need to include the ALLOW FILTERING keyword in your query, or else you'll get:
Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING
One simple workaround is to append a dummy column to your table where every row has the same value in that column. The EQ on the dummy column then satisfies the requirement, and you can perform a range query on just your desired column. Do realize that this kind of query on a NoSQL database may be slow and can bog down the system.
Example
cqlsh:demo> desc table table1;
CREATE TABLE table1 (
keya int,
keyb int,
dummyvalue int,
valuea int,
PRIMARY KEY (keya, keyb)
) ....
cqlsh:demo> select * from Table1;
keya | keyb | dummyvalue | valuea
------+------+------------+--------
1 | 2 | 0 | 3
4 | 5 | 0 | 6
7 | 8 | 0 | 9
Create secondary indices on ValueA and DummyValue:
cqlsh:demo> create index table1_valuea on table1 (valuea);
cqlsh:demo> create index table1_dummyvalue on table1 (dummyvalue);
Perform a range query on ValueA with DummyValue = 0:
cqlsh:demo> select * from table1 where dummyvalue = 0 and valuea > 5 allow filtering;
keya | keyb | dummyvalue | valuea
------+------+------------+--------
4 | 5 | 0 | 6
7 | 8 | 0 | 9
Probably the most flexible way to deal with this scenario in Cassandra is to have a separate column family (CF) for each stat, with sentinel values as keys and the stat value in the column name, like this:
CF: StatName {
Key: SomeSentinelValue {
[Value]:[UserID] = ""
}
}
So let's say your stat is NumAnswers and your user IDs are strings:
CF: NumAnswers {
Key: 0 {
150:Joe = ""
200:Bob = ""
500:Sue = ""
}
Key: 1000 {
1020:George = ""
1300:Ringo = ""
1300:Mary = ""
}
}
So you can see that your keys are essentially buckets of values, which can be as coarse- or fine-grained as needed for your data, and your columns are composites of value + user ID. You can now hand Cassandra a known key (or set of keys) for the coarse range you need (the equality), then do a range query on the first component of the column name. Note that the user ID has to be part of the column name rather than the column value; otherwise two users with the same count would overwrite each other.
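The bucketing scheme above can be sketched in plain Python (an illustrative model, with an assumed bucket width of 1000 matching the keys 0 and 1000 in the example):

```python
import bisect
from collections import defaultdict

BUCKET_WIDTH = 1000  # assumed bucket granularity; tune to your data

# CF model: sentinel row key -> sorted list of (stat value, user ID)
# composite "column names".
num_answers = defaultdict(list)

def record(user_id, value):
    bucket = (value // BUCKET_WIDTH) * BUCKET_WIDTH  # sentinel row key
    bisect.insort(num_answers[bucket], (value, user_id))

def users_in_range(low, high):
    """All (value, user) pairs with low <= value < high: enumerate the known
    bucket keys (the equality part), then range-scan columns in each bucket."""
    out = []
    for bucket in range((low // BUCKET_WIDTH) * BUCKET_WIDTH, high, BUCKET_WIDTH):
        cols = num_answers[bucket]
        start = bisect.bisect_left(cols, (low, ''))
        for value, user in cols[start:]:
            if value >= high:
                break
            out.append((value, user))
    return out

for user, value in [('Joe', 150), ('Bob', 200), ('Sue', 500),
                    ('George', 1020), ('Ringo', 1300), ('Mary', 1300)]:
    record(user, value)

print(users_in_range(200, 1300))  # [(200, 'Bob'), (500, 'Sue'), (1020, 'George')]
```

The trade-off mirrors the CF design: a narrower bucket width means fewer columns scanned per query but more bucket keys to enumerate per range.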