Question
We have a table with about 40k rows, and querying it on a secondary index is slow (30 seconds on production). Our Cassandra version is 1.2.8. The table schema is as follows:
CREATE TABLE usertask (
  tid uuid PRIMARY KEY,
  content text,
  ts int
) WITH
  bloom_filter_fp_chance=0.010000 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.000000 AND
  gc_grace_seconds=864000 AND
  read_repair_chance=0.100000 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  compaction={'class': 'SizeTieredCompactionStrategy'} AND
  compression={'sstable_compression': 'SnappyCompressor'};

CREATE INDEX usertask_ts_idx ON usertask (ts);
When I turn on tracing, I notice a lot of lines like the following:
Executing single-partition query on usertask.usertask_ts_idx
With only 40k rows, there appear to be several thousand queries on usertask_ts_idx. What could be the problem? Thanks.
More investigation
I tried the same query on our test server, and it is much faster (30 seconds on production, 1-2 seconds on the test server). Comparing the tracing logs, the difference is the time spent seeking to the partition indexed section in the data file. On production each seek takes 1000-3000 microseconds; on the dev server it takes about 100 microseconds. My guess is that our production server does not have enough memory to cache the data file, so seeks in the data file are slow.
Answer 1:
I am presuming ts is a timestamp, in which case it is not a good candidate for a secondary index. The reason is that it is a high-cardinality value (i.e. nearly all values are unique). This means you end up with almost one row in the index for each row in usertask, so every index hit needs its own lookup in the base table, which is effectively a join operation. Joins are terribly slow on a distributed database. Since you haven't shown your query I'm not sure exactly what you're doing, but you'll need to rethink your model if you want to query based on time.
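As a rough illustration of what rethinking the model could look like, here is a minimal sketch of a time-bucketed table. It is not taken from the question: the table name usertask_by_day, the day bucket column, and the literal values are all hypothetical, and it assumes ts holds seconds since the epoch and that queries ask for a time range within a known day.

```cql
-- Hypothetical table: one partition per day, rows ordered by ts inside it.
-- "day" is an assumed bucket such as 20131120, computed by the application.
CREATE TABLE usertask_by_day (
  day int,
  ts int,
  tid uuid,
  content text,
  PRIMARY KEY (day, ts, tid)
);

-- A time-range query then reads a slice of a single partition
-- instead of going through usertask_ts_idx (example values only).
SELECT tid, content, ts
FROM usertask_by_day
WHERE day = 20131120
  AND ts >= 1384905600
  AND ts < 1384909200;
```

The trade-off is that the application has to write each task to this table as well and must know which day bucket(s) to query; whether a day is the right bucket size depends on how many tasks arrive per day.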
Source: https://stackoverflow.com/questions/20093181/cassandra-query-on-secondary-index-is-very-slow