Optimize IN clause queries in Cassandra?

Question

I have a table like this in ScyllaDB. To keep it clear I have removed a lot of columns from the table below, but in general this table has ~25 columns in total.

CREATE TABLE testks.client (
    client_id int,
    lmd timestamp,
    cola list<text>,
    colb list<text>,
    colc boolean,
    cold int,
    cole int,
    colf text,
    colg set<frozen<colg>>,
    colh text,
    PRIMARY KEY (client_id, lmd)
) WITH CLUSTERING ORDER BY (lmd DESC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
    AND comment = ''
    AND compaction = {'class': 'TimeWindowCompactionStrategy', 'compaction_window_size': '1', 'compaction_window_unit': 'DAYS'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 172800
    AND max_index_interval = 1024
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';

Now our query pattern is like this. I can have more than 50 clientIds in my IN clause.

select * FROM testks.client WHERE client_id IN ? PER PARTITION LIMIT 1

A few questions:

After reading online, it looks like the IN clause is bad for performance. Is there any way to optimize my table for this query pattern, or is Cassandra/ScyllaDB not a good fit for this use case?
We use the C# driver to execute the above query, and we are seeing performance issues with our data model and query pattern. Is it better to execute a separate async query for each client ID, or should I keep using IN clause queries with all the client IDs in them?

We are running a 6-node cluster, all in one DC, with RF = 3. We read and write at LOCAL_QUORUM.

Answer 1:

When you issue an IN on the partition key, the request is sent to a coordinator node (if I remember correctly, in this case it can be an arbitrary node). The coordinator decomposes the IN into queries for the individual partitions, sends them to the relevant replicas, collects the data, and returns it to the caller. All of this adds round trips between the coordinator and the replicas, and puts additional load on the coordinator.

Usually, the better solution is to issue N asynchronous queries, one for each partition in the IN list, and collect the data on the client side. When you use a prepared statement, the driver is able to use token-aware load balancing and send each query directly to a replica holding the given partition, so you avoid the additional network round trips between coordinator and replicas.
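
As a rough illustration of that fan-out, here is a minimal sketch. The question uses the C# driver; this sketch uses the Python driver's `Session.prepare` and `Session.execute_async` calls instead, and the same pattern should carry over. The function name `fetch_latest_rows` and the `max_in_flight` window size are made up for this example, and the session is assumed to have been created with a token-aware load-balancing policy.

```python
from typing import Any, Dict, Iterable, Optional

# Single-partition form of the query from the question. With only one
# partition in play, LIMIT 1 returns the same row as PER PARTITION LIMIT 1.
LATEST_ROW_CQL = "SELECT * FROM testks.client WHERE client_id = ? LIMIT 1"


def fetch_latest_rows(session: Any, client_ids: Iterable[int],
                      max_in_flight: int = 64) -> Dict[int, Optional[Any]]:
    """Return {client_id: newest row or None}, using one async query per partition.

    `session` is expected to behave like a Python driver Session
    (prepare / execute_async). `max_in_flight` is an illustrative cap,
    not a tuned value.
    """
    prepared = session.prepare(LATEST_ROW_CQL)  # prepare once, reuse for every id
    ids = list(client_ids)
    results: Dict[int, Optional[Any]] = {}
    # Work in windows so a very long id list cannot flood the cluster.
    for start in range(0, len(ids), max_in_flight):
        window = ids[start:start + max_in_flight]
        # Start every query in the window before waiting on any of them.
        futures = [(cid, session.execute_async(prepared, (cid,))) for cid in window]
        for cid, future in futures:
            # result() blocks until this query completes; one() is the first row or None.
            results[cid] = future.result().one()
    return results
```

Because each statement is bound to a single `client_id`, the driver can compute the partition token and pick a replica for it, which is what the IN form prevents. Whether this actually beats the IN query for ~50 IDs depends on the cluster and is worth measuring.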

Source: https://stackoverflow.com/questions/61765467/optimize-in-clause-queries-cassandra

Tags

database-design

cassandra

scylla