Understanding Cassandra's storage overhead

前端未结

关注

 1  1877

情歌与酒 2021-02-15 14:52

I have been reading this section of the Cassandra docs and found the following a little puzzling:

Determine column overhead:

regular_total_column_size

1条回答

孤城傲影 (楼主)

2021-02-15 15:30

The easiest way to figure out what is going on in situations like this is to check the sstable2json (cassandra/bin) representation of your data. This will show you what ends up actually be saved on disk.

Here is the example for your situation

 [
 {"key": "4b6579","columns": [
       ["rid1:ssid1:","",1401469033325000],
       ["rid1:ssid1:end_date","2004-10-03 00:00:00-0700",1401469033325000],
       ["rid1:ssid1:report_date","2004-10-03 00:00:00-0700",1401469033325000],
       ["rid1:ssid1:start_date","2004-10-03 00:00:00-0700",1401469033325000], 
       ["rid1:ssid1:subset_descr","descr",1401469033325000],
       ["rid1:ssid1:x","1",1401469033325000], 
       ["rid1:ssid1:y","5.5",1401469033325000],
       ["rid1:ssid1:z","1",1401469033325000],
       ["rid2:ssid2:","",1401469938599000],
       ["rid2:ssid2:end_date", "2004-10-03 00:00:00-0700",1401469938599000],
       ["rid2:ssid2:report_date","2004-10-03 00:00:00-0700",1401469938599000],
       ["rid2:ssid2:start_date","2004-10-03 00:00:00-0700",1401469938599000], 
       ["rid2:ssid2:subset_descr","descr",1401469938599000],
       ["rid2:ssid2:x","1",1401469938599000],
       ["rid2:ssid2:y","5.5",1401469938599000],
       ["rid2:ssid2:z","1",1401469938599000]
 }
 ]

The value of the partition key is saved once per partition (per sstable) as you can see above, the column name in this case doesn't matter at all since it is implicit given the table. The column names for the clustering columns are also not present because with C* you aren't allowed to insert without specifying all portions of the key.

Whats left though does have the column name, this is needed incase a partial update to a row is made so it can be saved without the rest of the row information. You could imagine an update to a single column field in a row, to indicate which field this is C* currently uses the column name but there are tickets to change this to a smaller representation. https://issues.apache.org/jira/browse/CASSANDRA-4175

To generate this

cqlsh
CREATE TABLE mykeyspace.mytable(   id text,   report_id text,   subset_id text,   report_date timestamp,   start_date timestamp,   end_date timestamp,   subset_descr text,   x int,   y double,   z int,   PRIMARY KEY (id, report_id, subset_id) );
INSERT INTO mykeyspace.mytable (id, report_id , subset_id , report_date , start_date , end_date , subset_descr ,x, y, z) VALUES ( 'Key', 'rid1','ssid1', '2004-10-03','2004-10-03','2004-10-03','descr',1,5.5,1);
INSERT INTO mykeyspace.mytable (id, report_id , subset_id , report_date , start_date , end_date , subset_descr ,x, y, z) VALUES ( 'Key', 'rid2','ssid2', '2004-10-03','2004-10-03','2004-10-03','descr',1,5.5,1);
exit;
nodetool flush
bin/sstable2json $DATA_DIR/mytable/mykeyspace-mytable-jb-1-Data.db

0 讨论(0)