Cassandra storage internal

I'm trying to understand what exactly happens internally at the storage engine level when a row (columns) is inserted into a CQL-style table.

CREATE TABLE log_date


                      
1 answer

Answered by 没有蜡笔的小新 on 2020-12-17 03:58

  I'm trying to understand what exactly happens internally at the storage engine level when a row (columns) is inserted into a CQL-style table.


Let's say that I build tables with both of your PRIMARY KEYs, and INSERT some data:

aploetz@cqlsh:stackoverflow2> SELECT userid, time, dateof(time), category, subcategory, itemid, count, price FROM log_date1;

 userid | time                                 | dateof(time)             | category | subcategory    | itemid            | count | price
--------+--------------------------------------+--------------------------+----------+----------------+-------------------+-------+-------
   1002 | e2f67ec0-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:48:20-0500 |    Books |         Novels | 678-2-44398-312-9 |     1 |   798
   1002 | 15d0fd20-f589-11e4-ade7-21b264d4c94d | 2015-05-08 08:49:45-0500 |    Audio |     Headphones | 228-5-44343-344-5 |     1 |  4799
   1001 | 32671010-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:43:23-0500 |    Books | Computer Books | 978-1-78398-912-6 |     1 |  2200
   1001 | 74ad4f70-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:45:14-0500 |    Books |         Novels | 678-2-44398-312-9 |     1 |   798
   1001 | a3e1f750-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:46:34-0500 |    Books | Computer Books | 977-8-78998-466-4 |     1 |   599

(5 rows)
aploetz@cqlsh:stackoverflow2> SELECT userid, time, dateof(time), category, subcategory, itemid, count, price FROM log_date2;

 userid | time                                 | dateof(time)             | category | subcategory    | itemid            | count | price
--------+--------------------------------------+--------------------------+----------+----------------+-------------------+-------+-------
   1002 | e2f67ec0-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:48:20-0500 |    Books |         Novels | 678-2-44398-312-9 |     1 |   798
   1002 | 15d0fd20-f589-11e4-ade7-21b264d4c94d | 2015-05-08 08:49:45-0500 |    Audio |     Headphones | 228-5-44343-344-5 |     1 |  4799
   1001 | 32671010-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:43:23-0500 |    Books | Computer Books | 978-1-78398-912-6 |     1 |  2200
   1001 | 74ad4f70-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:45:14-0500 |    Books |         Novels | 678-2-44398-312-9 |     1 |   798
   1001 | a3e1f750-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:46:34-0500 |    Books | Computer Books | 977-8-78998-466-4 |     1 |   599

(5 rows)


Looks pretty much the same via cqlsh.  So let's have a look from the cassandra-cli, and query all rows for userid 1002 in the first table:

RowKey: 1002
=> (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:, value=, timestamp=1431092900008568)
=> (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:category, value=426f6f6b73, timestamp=1431092900008568)
=> (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:count, value=00000001, timestamp=1431092900008568)
=> (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:itemid, value=3637382d322d34343339382d3331322d39, timestamp=1431092900008568)
=> (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:price, value=0000031e, timestamp=1431092900008568)
=> (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:subcategory, value=4e6f76656c73, timestamp=1431092900008568)
=> (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:, value=, timestamp=1431092985326774)
=> (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:category, value=417564696f, timestamp=1431092985326774)
=> (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:count, value=00000001, timestamp=1431092985326774)
=> (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:itemid, value=3232382d352d34343334332d3334342d35, timestamp=1431092985326774)
=> (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:price, value=000012bf, timestamp=1431092985326774)
=> (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:subcategory, value=4865616470686f6e6573, timestamp=1431092985326774)


Simple enough, right?  We see userid 1002 as the RowKey, and our clustering column, time, as the column key.  Following that are all of the columns for each column key (time).  I believe your first table generates 6 columns per CQL row, because that count includes the placeholder for the column key: your PRIMARY KEY could point to an empty value (as your 2nd example key does).
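The values in the cassandra-cli output are printed as raw hex, so it can help to decode a few of them and compare against the cqlsh output. Here is a minimal Python sketch. The helper names are hypothetical, and it assumes that category is a text column, that count and price are 4-byte big-endian ints (inferred from the 8 hex digits shown), and that the cell timestamp is microseconds since the Unix epoch.

```python
from datetime import datetime, timedelta, timezone


def decode_text(hex_value):
    """Decode a hex-printed text value (assumed UTF-8)."""
    return bytes.fromhex(hex_value).decode("utf-8")


def decode_int(hex_value):
    """Decode a hex-printed int value (assumed big-endian, signed)."""
    return int.from_bytes(bytes.fromhex(hex_value), byteorder="big", signed=True)


def decode_write_time(micros):
    """Convert a cell timestamp (assumed microseconds since epoch) to UTC."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return epoch + timedelta(microseconds=micros)


# Values copied from the cassandra-cli output above.
print(decode_text("426f6f6b73"))            # category
print(decode_int("0000031e"))               # price
print(decode_write_time(1431092900008568))  # write time of the first CQL row
```

The decoded write time is in UTC, so it is five hours ahead of the -0500 time that dateof(time) shows in cqlsh.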

But what about the 2nd version for userid 1002?

RowKey: 1002
=> (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:Books:Novels:678-2-44398-312-9:1:798:, value=, timestamp=1431093011349994)
=> (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:Audio:Headphones:228-5-44343-344-5:1:4799:, value=, timestamp=1431093011360402)


Two columns are returned for RowKey 1002, one for each unique combination of our column (clustering) keys, with an empty value (as mentioned above).
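To make the difference between the two layouts concrete, here is a rough Python model of how one CQL row is flattened into cells under this older storage format. It is only a sketch: to_cells is a hypothetical helper, the two key layouts are inferred from the cassandra-cli output above (time alone as the clustering key, versus every remaining column as a clustering key), and real composite names are binary-encoded rather than colon-joined strings.

```python
def to_cells(clustering, regular):
    """Flatten one CQL row into the (cell name, value) pairs of its partition.

    Simplified model: the colons only mirror how cassandra-cli prints
    composite names; the on-disk encoding is binary.
    """
    prefix = ":".join(str(part) for part in clustering)
    cells = [(prefix + ":", "")]  # row marker: empty column name, empty value
    for column in sorted(regular):  # cells are ordered by column name
        cells.append((prefix + ":" + column, regular[column]))
    return cells


TIME = "e2f67ec0-f588-11e4-ade7-21b264d4c94d"
values = {
    "category": "Books",
    "subcategory": "Novels",
    "itemid": "678-2-44398-312-9",
    "count": 1,
    "price": 798,
}

# 1st layout (assumed): time is the only clustering key.
cells_v1 = to_cells([TIME], values)

# 2nd layout (assumed): every remaining column is a clustering key.
cells_v2 = to_cells(
    [TIME, values["category"], values["subcategory"],
     values["itemid"], values["count"], values["price"]],
    {},
)

for name, value in cells_v1 + cells_v2:
    print(name, "=>", repr(value))
```

In this model the first layout yields six cells per CQL row (the marker plus five values), while the second yields a single cell whose name carries all of the data.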

So what does this all mean for you?  Well, a few things:


- This should tell you that PRIMARY KEYs in Cassandra ensure uniqueness.  So if you decide that you need to update key values like category or subcategory (2nd example), you really can't, unless you DELETE and recreate the row.  Although from a logging perspective, that's probably ok.
- Cassandra stores all data for a particular partition/row key (userid) together, sorted by the column (clustering) keys.  If you are concerned about querying and sorting your data, it is important to understand that sort order only applies within a partition, so you would have to query for a specific userid for it to make any difference.
- The biggest issue I see is that right now you are setting yourself up for unbounded column growth.  Partition/row keys can support a maximum of 2 billion columns, so your 2nd example will help you out the most there.  If you think some of your userids might exceed that, you could implement a "date bucket" as an additional partition key (say, if you knew that a userid would never exceed 2 billion columns in a year, or whatever).
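The "date bucket" idea can be sketched in a few lines of Python. This is an illustration under assumptions, not a schema recommendation: partition_key is a hypothetical helper, the bucket is assumed to be the calendar year of the event, and the 2014 event is made up to show the split. In CQL terms the bucket would become part of a composite partition key, for example PRIMARY KEY ((userid, year), time).

```python
from collections import defaultdict
from datetime import datetime, timezone


def partition_key(userid, event_time):
    """Build a bucketed partition key: (userid, year of the event)."""
    return (userid, event_time.year)


events = [
    (1001, datetime(2014, 12, 31, 23, 59, tzinfo=timezone.utc)),  # made-up event
    (1001, datetime(2015, 5, 8, 13, 43, tzinfo=timezone.utc)),
    (1002, datetime(2015, 5, 8, 13, 48, tzinfo=timezone.utc)),
]

# Group events the way the cluster would: one partition per (userid, year).
partitions = defaultdict(list)
for userid, event_time in events:
    partitions[partition_key(userid, event_time)].append(event_time)

for key in sorted(partitions):
    print(key, len(partitions[key]))
```

The trade-off is that a query now has to name both the userid and the bucket, so reading a user's full history means one query per year.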


It looks to me like your 2nd option might be the better choice.  But honestly for what you're doing, either of them will probably work ok.
                                                                        
                                                        
            
            
              
                