Cassandra: choosing a Partition Key

♀尐吖头ヾ submitted on 2019-11-28 04:33:01

Indexing in the documentation you wrote up refers to secondary indexes. In Cassandra there is a difference between primary and secondary indexes. For a secondary index it would indeed be bad to have highly unique values; for the components of a primary key, however, it depends on which component we are focusing on. The primary key has these components:

PRIMARY KEY(partitioning key, clustering key_1 ... clustering key_n)

The partitioning key is used to distribute data across different nodes. If you want your nodes to be balanced (i.e. data spread evenly across the nodes), you want your partitioning key to be as random as possible. That is why the example you have uses UUIDs.
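To make the balancing argument concrete, here is a minimal sketch in Python. It is not Cassandra's actual partitioner: the MD5 hash, the modulo placement, the 4-node cluster and the sample keys are all assumptions chosen only to illustrate how key cardinality affects distribution.

```python
import hashlib
import uuid
from collections import Counter

NODES = 4  # hypothetical cluster size


def node_for(partition_key: str, nodes: int = NODES) -> int:
    """Map a partition key to a node index with a stable hash.

    Stand-in for a real partitioner; MD5 plus modulo is an assumption
    made for illustration only.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % nodes


def rows_per_node(keys):
    """Count how many rows land on each node."""
    counts = Counter(node_for(k) for k in keys)
    return [counts.get(n, 0) for n in range(NODES)]


# High-cardinality keys: one random UUID per row.
uuid_keys = [str(uuid.uuid4()) for _ in range(10_000)]
# Low-cardinality keys: only two distinct values across all rows.
country_keys = ["Italy" if i % 10 else "Spain" for i in range(10_000)]

balanced = rows_per_node(uuid_keys)
skewed = rows_per_node(country_keys)
print("UUID keys:   ", balanced)
print("Country keys:", skewed)
```

With UUID keys every node receives roughly a quarter of the rows, while the two-valued key can occupy at most two nodes no matter how many rows are inserted.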

The clustering key is used to order rows within a partition, so that querying columns with a particular clustering key can be more efficient. That is where you want your values to not be unique and where there would be a performance hit if unique rows were frequent.

The CQL docs have a good explanation of what is going on.

If you use CQL3, given a column family:

CREATE TABLE table1 (
  a1 text,
  a2 text,
  b1 text,
  b2 text,
  c1 text,
  c2 text,
  PRIMARY KEY ( (a1, a2), b1, b2 )
);

Defining the primary key as ( (a1, a2, ...), b1, b2, ... ) implies that:

a1, a2, ... are the fields used to craft the row key (also referred to as the partition key) in order to:

  • determine how the data is partitioned
  • determine what is physically stored in a single row

b1, b2, ... are the column family fields used to cluster a row key (also referred to as the column key or clustering key) in order to:

  • create logical sets inside a single row
  • allow more flexible search schemes such as range queries

All the remaining fields are effectively multiplexed / duplicated for every possible combination of column keys. Below is an example of how composite keys with partition keys and clustering keys work.
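One way to picture this layout is as a nested map. The following Python sketch is only a mental model, not Cassandra's storage engine; the helper names insert and partition_rows and the sample values are hypothetical.

```python
# Mental model: partition key (a1, a2) -> {clustering key (b1, b2) -> remaining columns}
table1 = {}


def insert(a1, a2, b1, b2, c1, c2):
    """Store c1/c2 under the partition (a1, a2) and clustering key (b1, b2)."""
    partition = table1.setdefault((a1, a2), {})
    partition[(b1, b2)] = {"c1": c1, "c2": c2}


def partition_rows(a1, a2):
    """Return one partition's rows ordered by clustering key."""
    return sorted(table1.get((a1, a2), {}).items())


# Hypothetical sample data.
insert("eu", "it", "2019", "12", "p", "q")
insert("eu", "it", "2019", "11", "x", "y")
insert("eu", "es", "2019", "11", "m", "n")

for clustering_key, columns in partition_rows("eu", "it"):
    print(clustering_key, columns)
```

Each (b1, b2) combination inside a partition carries its own copy of c1 and c2, which is what "multiplexed" means in the paragraph above.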

If you want to use range queries, you can use secondary indexes or (starting from CQL3) you can declare those fields as clustering keys. Having them as clustering keys creates a single wide row, which affects speed because one query then fetches multiple clustering key values, such as:

select * from accounts where Country>'Italy' and Country<'Spain'
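Because clustering keys are kept in sorted order, a range like this one amounts to two lookups and a contiguous slice. A rough Python sketch, assuming a single partition with hypothetical Country values and using bisect as a stand-in for the ordered on-disk layout:

```python
import bisect

# Clustering values of one partition, kept sorted (hypothetical data).
countries = sorted(
    ["Sweden", "Italy", "France", "Spain", "Norway", "Germany", "Poland"]
)


def range_scan(sorted_keys, low, high):
    """Return the keys with low < key < high using two binary searches."""
    start = bisect.bisect_right(sorted_keys, low)   # first key strictly above low
    end = bisect.bisect_left(sorted_keys, high)     # first key at or above high
    return sorted_keys[start:end]


result = range_scan(countries, "Italy", "Spain")
print(result)  # ['Norway', 'Poland']
```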

I am sure you have already got your answer, but this may still help you understand it better.

CREATE TABLE table1 (
  a1 text,
  a2 text,
  b1 text,
  b2 text,
  c1 text,
  c2 text,
  PRIMARY KEY ( (a1, a2), b1, b2 )
);

Here the partition key is (a1, a2) and the row keys (clustering keys) are b1, b2.

The combination of partition key and row keys must be unique for each new record; inserting an existing combination overwrites that record.

The above primary key can be represented like this:

Node<key, value>
Node<(a1a2), Map<b1b2, otherColumnValues>>

As we know, the partition key is responsible for distributing data across your nodes.

So if you insert 100 records into table1 with the same partition key and different row keys, the data is stored on the same node but in different columns.

Logically, we can represent it like this:

Node<(a1a2), Map< string1, otherColumnValues>, Map< string2, otherColumnValues> .... Map< string100, otherColumnValues>> 

So the records are stored sequentially within the partition.
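The 100-record case can be sketched in Python as follows. This is an illustration under assumptions, not Cassandra's real placement logic: the 3-node cluster, the MD5-based owner function and the key values are all hypothetical.

```python
import hashlib

NODE_COUNT = 3  # hypothetical cluster size
# Each node maps partition key -> {row key -> other column values}
nodes = [dict() for _ in range(NODE_COUNT)]


def owner(partition_key):
    """Pick the node that owns a partition (illustrative hash, not Murmur3)."""
    digest = hashlib.md5("|".join(partition_key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NODE_COUNT


def insert(a1, a2, b1, b2, **other_columns):
    """Write a record; an existing (a1, a2, b1, b2) is overwritten."""
    pk = (a1, a2)
    partition = nodes[owner(pk)].setdefault(pk, {})
    partition[(b1, b2)] = other_columns


# 100 records: same partition key, different row keys.
for i in range(100):
    insert("a", "b", f"string{i:03d}", "x", c1=str(i), c2="v")

# Re-inserting an existing primary key replaces the old values.
insert("a", "b", "string000", "x", c1="updated", c2="v")

holding = [n for n in nodes if ("a", "b") in n]
partition = holding[0][("a", "b")]
print(len(holding), len(partition))  # 1 100
```

All 100 records end up in one partition on one node, and the repeated insert leaves the count at 100 because the full primary key already existed.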
