Why do many refer to Cassandra as a column-oriented database?

终归单人心 2020-12-07 10:10

Reading several papers and documents on the internet, I found a lot of contradictory information about the Cassandra data model. There are many which identify it as a column oriente

7 answers
  • 2020-12-07 10:54

    IMO that's the wrong term for Cassandra. It is more appropriate to call it a row-partition store. Here are some details:

    Primary Key, Partitioning Key, Clustering Columns, and Data Columns:

    Every table must have a primary key, and the primary key uniquely identifies each row.

    Primary Key = Partition key + Clustering Columns
    
    # Example
    Primary Key: ((col1, col2), col3, col4)     # the primary key uniquely identifies a row;
                                                # choose its partition key and clustering
                                                # columns so that each row can be
                                                # uniquely identified
    Partition Key: (col1, col2)                 # decides which node stores the data;
                                                # the partition key is mandatory and
                                                # can be made up of one or more columns
    Clustering Columns: col3, col4              # decide the arrangement within a partition;
                                                # clustering columns are optional
    

    The partition key is the first component of the primary key. Its hashed value determines which node stores the data. The partition key can be a compound key consisting of multiple columns. We want the data spread almost evenly across nodes, so keep this in mind when choosing the partition key.

    Any fields listed after the partition key in the primary key are called clustering columns. By default, rows are stored in ascending order of the clustering columns within a partition. The clustering columns also help make sure the primary key of each row is unique.
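
    To make the split between partition key and clustering columns concrete, here is a minimal Python sketch. It is not Cassandra code: the table layout reuses the `((col1, col2), col3, col4)` example above, while the `value` column, the sample rows, and the `group_rows` helper are made up for illustration.

```python
from collections import defaultdict

# Hypothetical table with PRIMARY KEY ((col1, col2), col3, col4).
PARTITION_KEY = ("col1", "col2")
CLUSTERING_COLUMNS = ("col3", "col4")


def group_rows(rows):
    """Group rows by partition key, then sort each group by clustering columns."""
    partitions = defaultdict(list)
    for row in rows:
        pk = tuple(row[c] for c in PARTITION_KEY)
        partitions[pk].append(row)
    for pk_rows in partitions.values():
        # Ascending order on (col3, col4), mirroring the default clustering order.
        pk_rows.sort(key=lambda r: tuple(r[c] for c in CLUSTERING_COLUMNS))
    return dict(partitions)


# Illustrative rows; "value" stands in for an ordinary data column.
rows = [
    {"col1": "a", "col2": 1, "col3": 2, "col4": "y", "value": 10},
    {"col1": "a", "col2": 1, "col3": 1, "col4": "z", "value": 20},
    {"col1": "b", "col2": 1, "col3": 5, "col4": "x", "value": 30},
    {"col1": "a", "col2": 1, "col3": 1, "col4": "a", "value": 40},
]
partitions = group_rows(rows)
```

    Rows that share `(col1, col2)` end up in the same group, and within a group they are ordered by `(col3, col4)`.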

    You can use as many clustering columns as you like, but you cannot use them out of order in a SELECT statement. You may omit trailing clustering columns from your SELECT statement; just remember to use the ones you do include in their declared order. Note also that a CQL query cannot restrict a data column or a clustering column unless the clustering columns defined before it are restricted too. For example, if the primary key is (year, artist_name, album_name) and you want to use the city column in your query's WHERE clause, you can do so only if the WHERE clause also uses all of the columns that are part of the primary key.
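
    The ordering rule can be sketched as a prefix check. This is a simplified Python model, not Cassandra's actual query validation: it assumes equality restrictions only and ignores secondary indexes and `ALLOW FILTERING`. The function name `clustering_restrictions_valid` is hypothetical, and the example assumes `year` is the partition key while `artist_name` and `album_name` are the clustering columns.

```python
def clustering_restrictions_valid(clustering_columns, restricted):
    """Return True if the restricted clustering columns form a prefix
    of the declared clustering order (simplified model)."""
    restricted = set(restricted)
    unknown = restricted - set(clustering_columns)
    if unknown:
        raise ValueError(f"not clustering columns: {sorted(unknown)}")
    seen_gap = False
    for col in clustering_columns:
        if col in restricted:
            if seen_gap:
                # An earlier clustering column was skipped.
                return False
        else:
            seen_gap = True
    return True


# Assumed layout: PRIMARY KEY (year, artist_name, album_name).
clustering = ["artist_name", "album_name"]

ok_prefix = clustering_restrictions_valid(clustering, ["artist_name"])
ok_full = clustering_restrictions_valid(clustering, ["artist_name", "album_name"])
ok_none = clustering_restrictions_valid(clustering, [])
bad_skip = clustering_restrictions_valid(clustering, ["album_name"])
```

    Here `bad_skip` is `False` because `album_name` is restricted without `artist_name`, while the other three cases pass.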

    Tokens:

    Cassandra uses tokens to determine which node holds what data. With the default Murmur3Partitioner, a token is a 64-bit integer, and Cassandra assigns ranges of these tokens to nodes so that every possible token is owned by a node. Adding nodes to the cluster or removing old ones redistributes these tokens among the nodes.

    A row's partition key is turned into a token by the configured partitioner (a hash function that computes the token of a partition key), and that token determines which node owns the row.
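
    The lookup from partition key to owning node can be sketched in Python as follows. Several things here are assumptions for illustration only: MD5 stands in for the real partitioner hash (Cassandra's default partitioner uses Murmur3, which is not in the Python standard library), each node owns a single token (real clusters usually assign many tokens per node), and the `TokenRing` class and node names are hypothetical.

```python
import bisect
import hashlib


def token_for(partition_key: str) -> int:
    """Stand-in partitioner: map a key to a signed 64-bit token.

    Assumption: MD5 is used only for illustration, not Cassandra's actual hash.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


class TokenRing:
    """Each node owns the token range that ends at its own token."""

    def __init__(self, node_tokens):
        # node_tokens maps a node name to the single token assigned to it.
        self._ring = sorted((token, node) for node, token in node_tokens.items())
        self._tokens = [token for token, _ in self._ring]

    def owner(self, token: int) -> str:
        # First node whose token is >= the row's token; wrap around past the end.
        idx = bisect.bisect_left(self._tokens, token)
        if idx == len(self._ring):
            idx = 0
        return self._ring[idx][1]


ring = TokenRing({"node-a": -3 * 2**61, "node-b": 0, "node-c": 3 * 2**61})
owner_of_key = ring.owner(token_for("some-partition-key"))
```

    Adding or removing an entry in the mapping passed to `TokenRing` changes which ranges each node owns, which is the redistribution described above.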

    Cassandra is a row-partition store:

    A row is the smallest unit that stores related data in Cassandra.

    Don't think of a Cassandra column family (that is, a table) as an RDBMS table. Think of it as a dict of dicts (where dict is a data structure similar to Python's OrderedDict):

    • the outer dict is keyed by a row key (the primary key), which determines the partition and the row within that partition
    • the inner dict is keyed by a column key (the data columns): it holds the row's data, with column names as keys
    • both dicts are kept sorted by key: the outer dict is sorted by primary key

    This model lets different rows carry different data columns, so you can omit columns or add new ones at any time.
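
    The dict-of-dicts picture can be written down directly in Python. This is a toy model of the description above, not how Cassandra stores data on disk. The `build_table` helper, the sample keys, and the `label` column are invented for illustration.

```python
from collections import OrderedDict


def build_table(rows):
    """Build an outer dict sorted by primary key, each inner dict sorted by column name."""
    table = OrderedDict()
    for primary_key in sorted(rows):
        columns = rows[primary_key]
        table[primary_key] = OrderedDict(sorted(columns.items()))
    return table


# Keys are hypothetical primary keys; each row carries its own set of columns.
rows = {
    (2020, "b-artist"): {"city": "Oslo", "album_name": "Second"},
    (2019, "z-artist"): {"album_name": "First"},
    (2020, "a-artist"): {"album_name": "Third", "label": "Indie", "city": "Lima"},
}
table = build_table(rows)
```

    The row keyed by `(2019, "z-artist")` simply has no `city` entry, while another row carries an extra `label` column; neither case needs a placeholder value.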
