问题
In cassandra, it's well known that secondary indexes should be used very sparingly.
If I have a table for example:
User(username, usertype, email, etc..)
Here username is partition key. Now I want to support operation which returns a specific user(username will be given) if and only if usertype is a specific value X.
There are two ways I can do it:
One: Create a secondary index on usertype, possible values ('A', 'B', 'C') and username is partition key.
SELECT * FROM user WHERE username='something' AND usertype='A';
Two:
I can just fetch the row with username to client and then check if usertype is A.
Which approach is better? Please also consider a wide row(not so big, 10s) scenario where not all rows of a partition might have the given value (which requires some client side filtering).
What I'm not clear about secondary indexes is how data is looked up in a particular node.
Ex: SELECT * FROM user WHERE username='something' AND usertype='A'
For example usertype hidden CF has data 'A'-> 'jhon', 'miller', 'chris',...etc, 100 usernames
And the query with partition key is given along with usertype does it scan through all these 100 usernames to match with the username 'something' or does it just fetches by username first and sees the usertype column if it matches with 'A'? How exactly it does that searching? How does the query fares given the index is on low cardinality data and each one is mapped to many rows?
I'm using java as client if that matters.
Update: I understand that I can use clustering (usertype) key for this particular example but I wanted to know the trade off I've asked. My original tables are much more complex.
回答1:
For this example, let's say I create a table to keep track of crew members by ship and id:
CREATE TABLE crewByShip (
ship text,
id int,
firstname text,
lastname text,
gender text,
PRIMARY KEY(ship,id));
And I'll create an index on gender
:
CREATE INDEX crewByShipG_idx ON crewByShip(gender);
After inserting some data, my table looks like this:
ship | id | firstname | gender | lastname
----------+----+-----------+--------+-----------
Serenity | 1 | Hoban | M | Washburne
Serenity | 2 | Zoey | F | Washburne
Serenity | 3 | Malcolm | M | Reynolds
Serenity | 4 | Kaylee | F | Frye
Serenity | 5 | Sheppard | M | Book
Serenity | 6 | Jayne | M | Cobb
Serenity | 7 | Simon | M | Tam
Serenity | 8 | River | F | Tam
Serenity | 9 | Inara | F | Serra
Now I'll turn tracing on, and query a distinct row with the PRIMARY KEY, but also restrict by our index on gender
.
aploetz@cqlsh:stackoverflow2> tracing on;
aploetz@cqlsh:stackoverflow2> SELECT * FROM crewByShip WHERE ship='Serenity' AND id=3 AND gender='M';
ship | id | firstname | gender | lastname
----------+----+-----------+--------+----------
Serenity | 3 | Malcolm | M | Reynolds
(1 rows)
Tracing session: 34ea1840-e8e1-11e4-9cb7-21b264d4c94d
activity | timestamp | source | source_elapsed
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------+----------------+----------------
Execute CQL3 query | 2015-04-22 06:17:48.102000 | 192.168.23.129 | 0
Parsing SELECT * FROM crewByShip WHERE ship='Serenity' AND id=3 AND gender='M'; [SharedPool-Worker-1] | 2015-04-22 06:17:48.114000 | 192.168.23.129 | 3715
Preparing statement [SharedPool-Worker-1] | 2015-04-22 06:17:48.116000 | 192.168.23.129 | 4846
Executing single-partition query on users [SharedPool-Worker-2] | 2015-04-22 06:17:48.118000 | 192.168.23.129 | 5730
Acquiring sstable references [SharedPool-Worker-2] | 2015-04-22 06:17:48.118000 | 192.168.23.129 | 5757
Merging memtable tombstones [SharedPool-Worker-2] | 2015-04-22 06:17:48.119000 | 192.168.23.129 | 5793
Key cache hit for sstable 1 [SharedPool-Worker-2] | 2015-04-22 06:17:48.119000 | 192.168.23.129 | 5848
Seeking to partition beginning in data file [SharedPool-Worker-2] | 2015-04-22 06:17:48.120000 | 192.168.23.129 | 5856
Skipped 0/1 non-slice-intersecting sstables, included 0 due to tombstones [SharedPool-Worker-2] | 2015-04-22 06:17:48.120000 | 192.168.23.129 | 7056
Merging data from memtables and 1 sstables [SharedPool-Worker-2] | 2015-04-22 06:17:48.121000 | 192.168.23.129 | 7080
Read 1 live and 0 tombstoned cells [SharedPool-Worker-2] | 2015-04-22 06:17:48.122000 | 192.168.23.129 | 7143
Computing ranges to query [SharedPool-Worker-1] | 2015-04-22 06:17:48.122000 | 192.168.23.129 | 7578
Candidate index mean cardinalities are CompositesIndexOnRegular{columnDefs=[ColumnDefinition{name=gender, type=org.apache.cassandra.db.marshal.UTF8Type, kind=REGULAR, componentIndex=1, indexName=crewbyshipg_idx, indexType=COMPOSITES}]}:0. Scanning with crewbyship.crewbyshipg_idx. [SharedPool-Worker-1] | 2015-04-22 06:17:48.122000 | 192.168.23.129 | 7742
Submitting range requests on 1 ranges with a concurrency of 1 (0.0 rows per range expected) [SharedPool-Worker-1] | 2015-04-22 06:17:48.122000 | 192.168.23.129 | 7807
Submitted 1 concurrent range requests covering 1 ranges [SharedPool-Worker-1] | 2015-04-22 06:17:48.122000 | 192.168.23.129 | 7851
Executing indexed scan for [Serenity, Serenity] [SharedPool-Worker-2] | 2015-04-22 06:17:48.123000 | 192.168.23.129 | 10848
Candidate index mean cardinalities are CompositesIndexOnRegular{columnDefs=[ColumnDefinition{name=gender, type=org.apache.cassandra.db.marshal.UTF8Type, kind=REGULAR, componentIndex=1, indexName=crewbyshipg_idx, indexType=COMPOSITES}]}:0. Scanning with crewbyship.crewbyshipg_idx. [SharedPool-Worker-2] | 2015-04-22 06:17:48.123000 | 192.168.23.129 | 10936
Candidate index mean cardinalities are CompositesIndexOnRegular{columnDefs=[ColumnDefinition{name=gender, type=org.apache.cassandra.db.marshal.UTF8Type, kind=REGULAR, componentIndex=1, indexName=crewbyshipg_idx, indexType=COMPOSITES}]}:0. Scanning with crewbyship.crewbyshipg_idx. [SharedPool-Worker-2] | 2015-04-22 06:17:48.123000 | 192.168.23.129 | 11007
Executing single-partition query on crewbyship.crewbyshipg_idx [SharedPool-Worker-2] | 2015-04-22 06:17:48.123000 | 192.168.23.129 | 11130
Acquiring sstable references [SharedPool-Worker-2] | 2015-04-22 06:17:48.123000 | 192.168.23.129 | 11139
Merging memtable tombstones [SharedPool-Worker-2] | 2015-04-22 06:17:48.124000 | 192.168.23.129 | 11155
Skipped 0/0 non-slice-intersecting sstables, included 0 due to tombstones [SharedPool-Worker-2] | 2015-04-22 06:17:48.124000 | 192.168.23.129 | 11253
Merging data from memtables and 0 sstables [SharedPool-Worker-2] | 2015-04-22 06:17:48.124000 | 192.168.23.129 | 11262
Read 1 live and 0 tombstoned cells [SharedPool-Worker-2] | 2015-04-22 06:17:48.127000 | 192.168.23.129 | 11281
Executing single-partition query on crewbyship [SharedPool-Worker-2] | 2015-04-22 06:17:48.130000 | 192.168.23.129 | 11369
Acquiring sstable references [SharedPool-Worker-2] | 2015-04-22 06:17:48.131000 | 192.168.23.129 | 11375
Merging memtable tombstones [SharedPool-Worker-2] | 2015-04-22 06:17:48.131000 | 192.168.23.129 | 11383
Skipped 0/0 non-slice-intersecting sstables, included 0 due to tombstones [SharedPool-Worker-2] | 2015-04-22 06:17:48.133000 | 192.168.23.129 | 11409
Merging data from memtables and 0 sstables [SharedPool-Worker-2] | 2015-04-22 06:17:48.134000 | 192.168.23.129 | 11415
Read 1 live and 0 tombstoned cells [SharedPool-Worker-2] | 2015-04-22 06:17:48.138000 | 192.168.23.129 | 11430
Scanned 1 rows and matched 1 [SharedPool-Worker-2] | 2015-04-22 06:17:48.138000 | 192.168.23.129 | 11490
Request complete | 2015-04-22 06:17:48.115679 | 192.168.23.129 | 13679
Now, I'll re-run the same query, but without the superfluous index on gender
.
aploetz@cqlsh:stackoverflow2> SELECT * FROM crewByShip WHERE ship='Serenity' AND id=3;
ship | id | firstname | gender | lastname
----------+----+-----------+--------+----------
Serenity | 3 | Malcolm | M | Reynolds
(1 rows)
Tracing session: 38d7f440-e8e1-11e4-9cb7-21b264d4c94d
activity | timestamp | source | source_elapsed
-------------------------------------------------------------------------------------------------+----------------------------+----------------+----------------
Execute CQL3 query | 2015-04-22 06:17:54.692000 | 192.168.23.129 | 0
Parsing SELECT * FROM crewByShip WHERE ship='Serenity' AND id=3; [SharedPool-Worker-1] | 2015-04-22 06:17:54.695000 | 192.168.23.129 | 87
Preparing statement [SharedPool-Worker-1] | 2015-04-22 06:17:54.696000 | 192.168.23.129 | 246
Executing single-partition query on users [SharedPool-Worker-3] | 2015-04-22 06:17:54.697000 | 192.168.23.129 | 1185
Acquiring sstable references [SharedPool-Worker-3] | 2015-04-22 06:17:54.698000 | 192.168.23.129 | 1197
Merging memtable tombstones [SharedPool-Worker-3] | 2015-04-22 06:17:54.698000 | 192.168.23.129 | 1215
Key cache hit for sstable 1 [SharedPool-Worker-3] | 2015-04-22 06:17:54.700000 | 192.168.23.129 | 1249
Seeking to partition beginning in data file [SharedPool-Worker-3] | 2015-04-22 06:17:54.700000 | 192.168.23.129 | 1278
Skipped 0/1 non-slice-intersecting sstables, included 0 due to tombstones [SharedPool-Worker-3] | 2015-04-22 06:17:54.701000 | 192.168.23.129 | 3309
Merging data from memtables and 1 sstables [SharedPool-Worker-3] | 2015-04-22 06:17:54.701000 | 192.168.23.129 | 3333
Read 1 live and 0 tombstoned cells [SharedPool-Worker-3] | 2015-04-22 06:17:54.702000 | 192.168.23.129 | 3368
Executing single-partition query on crewbyship [SharedPool-Worker-2] | 2015-04-22 06:17:54.702000 | 192.168.23.129 | 4607
Acquiring sstable references [SharedPool-Worker-2] | 2015-04-22 06:17:54.704000 | 192.168.23.129 | 4633
Merging memtable tombstones [SharedPool-Worker-2] | 2015-04-22 06:17:54.704000 | 192.168.23.129 | 4643
Skipped 0/0 non-slice-intersecting sstables, included 0 due to tombstones [SharedPool-Worker-2] | 2015-04-22 06:17:54.705000 | 192.168.23.129 | 4678
Merging data from memtables and 0 sstables [SharedPool-Worker-2] | 2015-04-22 06:17:54.705000 | 192.168.23.129 | 4683
Read 1 live and 0 tombstoned cells [SharedPool-Worker-2] | 2015-04-22 06:17:54.706000 | 192.168.23.129 | 4697
Request complete | 2015-04-22 06:17:54.697676 | 192.168.23.129 | 5676
As you can see, the "source_elapsed" for the query with the secondary index was more than twice what it was for the same query (which returned the same row) without the index.
I think we can definitely say that using a secondary index on a low-cardinality column in a wide-row table will not perform well. Now while I won't say that filtering client-side is a good idea, in this case, with a small result set, it probably would be the better option.
回答2:
The good option here would be to create a composite primary key consisting of username and usertype with username being partition key and usertype a cluster key. You will not even need an index and the query will work.
CREATE TABLE users (
username text,
usertype text,
....
PRIMARY KEY ((username), usertype)
)
来源:https://stackoverflow.com/questions/29659564/validating-row-at-client-side-better-than-secondary-index-with-whole-primary-key