High Availability & Performance consideration with secondary index in Cassandra

后端 未结 1 1592
野的像风
野的像风 2021-01-22 11:50

I have a Set up with: 5 Cassandra node cluster with RF =3, I performed a secondary index for a column in the table \'user\',

1) As per my study

1条回答
  •  臣服心动
    2021-01-22 11:58

    Regd your Question 1:

    Does it mean that in the five node cluster only in one node the secondary index will be available?

    The secondary index is available in every node of the cluster, built upon the data in that node and its just local to that node. That is, its aware of only the primary keys in that particular node. You can imagine the secondary index to be a lookup table with references to primary keys on that node.

    So every node builds its own secondary index (in your case all 5), but unaware of each others references.

    If not in the RF =3 for user table, In how many nodes the Secondary Index table will be available?

    There is no replication factor for secondary indexes, since its local to every node. Since your data is already being replicated RF = 3, your secondary indexes in every node will have that indexed.

    Regd your Question 2:

    Query 1 : SELECT * FROM user WHERE user_name = 'test';
    

    This query is going to perform a scatter gather on all nodes in the cluster. Since the secondary indexes are local to each node, every node (in your case all 5) has to execute the query -> perform a secondary index lookup to figure out the partition key -> then fetch the actual results back to coordinator.

    As the table grows bigger, the query often results in timeout. In extreme cases it can bring down the node (just like "select *" without partition key). Hence secondary indexes and this type of query (without partition key) in general are discouraged in Cassandra and better to avoid them

    Query 2 : SELECT * FROM user WHERE user_group = 1 AND user_name = 'test';
    

    This query will perform better compared to the previous one, as it has filter on partition key. In the table definition above there is no clustering column, so this query would just filter on primary key as there is only one row per partition. Hence there isn't much improvement with secondary index. Overall its not a scatter gather type of query and hence perform much better.

    edited to explain query3

    Query 3 : select * from nodestat where uniqueId = 'test' AND flashMode = 'yes'
    

    In this query the secondary index is used in conjunction with partition key. This secondary index would help in case of 1000s of clustering columns exists for a given partition key and we want to quickly narrow down on the resultset. Remember the secondary index stores the entire primary key (partition key + clustering column reference). So in case of a wide partition, this secondary index proves useful when used alongside a partition key.

    For example in your case, say there is only one partition uniqueId = 'test'. But within that partition 'test', say there are 10000 different timestamp values (clustering column). So potentially there could be 10000 different values for "flashMode". This secondary index will help narrow down to the "flashMode" column with value "yes" within the partition 'test' amongst that 10000 matches.

    0 讨论(0)
提交回复
热议问题