Question
I have heard repeatedly that secondary indexes in Cassandra are only for convenience, not for better performance. The only case where they are recommended is a column with low cardinality (such as a gender column, which has two values, male or female).
Consider this example:
CREATE TABLE users (
userID uuid,
firstname text,
lastname text,
state text,
zip int,
PRIMARY KEY (userID)
);
Right now I cannot run this query unless I create a secondary index on the firstname column of users:
SELECT * FROM users WHERE firstname = 'john';
How do I denormalize this table so that I can run this query? Is using a composite key, as below, the only efficient way? Are there any other alternatives or suggestions?
CREATE TABLE users (
userID uuid,
firstname text,
lastname text,
state text,
zip int,
PRIMARY KEY (firstname,userID)
);
Answer 1:
In order to come up with a good data model, you first need to identify ALL the queries you would like to perform. If you only need to look up users by their firstname (or firstname and userID), then your second design is fine.
If you also need to look up users by their last name, then you could create another table with the same fields but a primary key of (lastname, userID). Obviously you will need to update both tables at the same time. Data duplication is fine in Cassandra.
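As a rough sketch of keeping the two tables in step, a logged batch can write the same row to both. The table name users_by_lastname and the sample values are made up for illustration, and users here means the second design from the question, with PRIMARY KEY (firstname, userID). A logged batch ensures that both inserts are eventually applied, but it is not isolated and a multi-partition batch costs more than separate writes.

```cql
-- Hypothetical second table: same fields, partitioned by lastname.
CREATE TABLE users_by_lastname (
    lastname text,
    userID uuid,
    firstname text,
    state text,
    zip int,
    PRIMARY KEY (lastname, userID)
);

-- Write the same user to both tables in one logged batch.
-- Sample values are invented for the example.
BEGIN BATCH
    INSERT INTO users (userID, firstname, lastname, state, zip)
    VALUES (123e4567-e89b-12d3-a456-426655440000, 'john', 'smith', 'CA', 94105);
    INSERT INTO users_by_lastname (userID, firstname, lastname, state, zip)
    VALUES (123e4567-e89b-12d3-a456-426655440000, 'john', 'smith', 'CA', 94105);
APPLY BATCH;
```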
Still, if you are concerned about the space needed for the two or more tables, you could create a single users table partitioned by user id, and additional tables for the fields you want to query by:
CREATE TABLE users (
userID uuid,
firstname text,
lastname text,
state text,
zip int,
PRIMARY KEY (userID)
);
CREATE TABLE users_by_firstname (
firstname text,
userid uuid,
PRIMARY KEY (firstname, userid)
);
The disadvantage of this solution is that you will need two queries to retrieve users by their first name:
SELECT userid FROM users_by_firstname WHERE firstname = 'Joe';
SELECT * FROM users WHERE userid IN (...);
Hope this helps
Answer 2:
There are a few ways of doing this, all with pros and cons.
Your second table will work, but it's just an index table. A secondary index (http://wiki.apache.org/cassandra/SecondaryIndexes) can be helpful, and if you hit a partition first (which you can't do in your first table), then Cassandra's implementation will save you hassle and keep things "local atomic". Without hitting a partition, though, your first table with the index will not perform well for your query, as it will hit every node in the cluster.
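To make the "hit a partition first" point concrete, here is a minimal sketch. The table users_by_state and both index names are hypothetical and not part of the question. The first query restricts the partition key, so the index lookup stays on the replicas that own that partition. The second has no partition restriction, so the coordinator has to fan out across the cluster.

```cql
-- Hypothetical table partitioned by state, with an index on firstname.
CREATE TABLE users_by_state (
    state text,
    userID uuid,
    firstname text,
    lastname text,
    zip int,
    PRIMARY KEY (state, userID)
);
CREATE INDEX users_by_state_firstname_idx ON users_by_state (firstname);

-- Partition key restricted: only the replicas for 'CA' are consulted.
SELECT * FROM users_by_state WHERE state = 'CA' AND firstname = 'john';

-- First table from the question, indexed on firstname:
-- no partition key in the WHERE clause, so the query is spread over many nodes.
CREATE INDEX users_firstname_idx ON users (firstname);
SELECT * FROM users WHERE firstname = 'john';
```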
You can fully denormalise, but you can also use a lookup table, i.e. your second table can exist only to return the user id. You can then do a second query to fetch information for only the relevant partitions. If you're expecting few results, this can be good. If not, you'll be hitting many partitions across many nodes (which, depending on your cluster size and hotspot avoidance criteria, can be good or bad). Doing many ~1ms queries is usually better than doing one ~1000ms query.
You can do artificial bucketing, and issue n=bucketcount queries. This has extra overhead, but reduces query count and can be a good option.
Your index might be of the first few characters of the firstname. Or it could be a consistent hash into a few buckets. The former can give you "starts with" semantics.
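One possible reading of the prefix variant, as a sketch only: the table name, the two-character prefix length, and the sample names are assumptions, and the client is expected to compute the prefix when writing and reading. For the hash variant, the prefix column would be replaced by a bucket number that the client derives from a hash.

```cql
-- Hypothetical index table bucketed by the first two characters of firstname.
CREATE TABLE users_by_firstname_prefix (
    prefix text,      -- first two characters of firstname, computed by the client
    firstname text,
    userid uuid,
    PRIMARY KEY (prefix, firstname, userid)
);

-- Exact match: the client derives prefix 'jo' from 'john'.
SELECT userid FROM users_by_firstname_prefix
WHERE prefix = 'jo' AND firstname = 'john';

-- "Starts with" semantics: all first names beginning with 'joh'.
SELECT firstname, userid FROM users_by_firstname_prefix
WHERE prefix = 'jo' AND firstname >= 'joh' AND firstname < 'joi';
```

A short prefix keeps the number of partitions small but makes each one larger, so the prefix length is a trade-off between query count and hotspot risk.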
These are just a few options. Going from a logical data model to a physical one requires evaluation of which tradeoffs you wish to make.
Answer 3:
There are also materialized views, which are updated automatically and partition the data on different columns, making reads much faster and avoiding secondary indexes altogether. There are some additional benefits of doing this on your own.
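A minimal sketch of such a view over the first users table from the question (PRIMARY KEY (userID)). The view name is made up, and this assumes Cassandra 3.0 or later. Later releases flag materialized views as experimental, so check what your version supports.

```cql
-- Hypothetical view: the same rows as users, partitioned by firstname.
-- Every primary key column of the view must be restricted with IS NOT NULL.
CREATE MATERIALIZED VIEW users_by_firstname_mv AS
    SELECT userID, firstname, lastname, state, zip
    FROM users
    WHERE firstname IS NOT NULL AND userID IS NOT NULL
    PRIMARY KEY (firstname, userID);

-- Reads go to the view; writes go to users only and Cassandra maintains the view.
SELECT * FROM users_by_firstname_mv WHERE firstname = 'john';
```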
The general idea of avoiding hot partitions still remains.
Then there is also the SASI index, which helps avoid tombstones if you are doing a lot of updates to the columns in the materialized view's primary key.
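For reference, a SASI index on the original users table might look like the sketch below. The index name is hypothetical, and this assumes Cassandra 3.4 or later, where SASI is available (newer versions may require it to be enabled explicitly).

```cql
-- Hypothetical SASI index on firstname; PREFIX mode supports equality and prefix matches.
CREATE CUSTOM INDEX users_firstname_sasi_idx ON users (firstname)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = { 'mode': 'PREFIX' };

-- Exact match on the indexed column.
SELECT * FROM users WHERE firstname = 'john';

-- Prefix match, which a regular secondary index cannot serve.
SELECT * FROM users WHERE firstname LIKE 'jo%';
```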
Source: https://stackoverflow.com/questions/25124993/how-to-avoid-secondary-indexes-in-cassandra