Why do many refer to Cassandra as a column-oriented database?

终归单人心 2020-12-07 10:10

Reading several papers and documents on the internet, I found a lot of contradictory information about the Cassandra data model. Many of them identify it as a column oriente

7 answers
  •  有刺的猬
    2020-12-07 10:32

    You both make good points and it can be confusing. In the example where

    apple -> colour  weight  price variety
             "red"   100     40    "Cox"
    

    apple is the key and the column is the data, which contains all four data items. From what was described, it sounds like all four data items are stored together as a single object and then parsed by the application to pull out just the value required. From an IO perspective, I therefore need to read the entire object. IMHO this is inherently row (or object) based, not column based.
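    To make the IO point concrete, here is a minimal sketch of the object-based read as I understand it from the description. The `store` dict, the JSON encoding and the `read_field` helper are my own assumptions for illustration; this is not how Cassandra lays data out internally.

```python
import json

# Hypothetical key-value store: each key maps to ONE serialized object
# holding all four data items (the JSON encoding is an assumption).
store = {
    "apple": json.dumps(
        {"colour": "red", "weight": 100, "price": 40, "variety": "Cox"}
    )
}


def read_field(store, key, field):
    """Fetch the whole stored object, then parse out a single field.

    Returns the field value and the number of characters that had to be
    read to get it.
    """
    blob = store[key]          # the entire object is read, not just one item
    record = json.loads(blob)  # the application parses the object
    return record[field], len(blob)


value, chars_read = read_field(store, "apple", "colour")
print(value, chars_read)
```

    Even though only `colour` is wanted, the read covers the full serialized object, which is why I'd call this row (or object) based.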

    Column-based storage became popular for data warehousing (DW) because it offers extreme compression and reduced IO for full table scans, at the cost of increased IO for OLTP when you need to pull every column (select *). Most queries don't need every column, so a full table scan that touches only a few columns reads far less data, and compression reduces the IO further. Let me provide an example:

    apple -> colour  weight  price variety
             "red"   100     40    "Cox"
    
    grape -> colour  weight  price variety
             "red"   100     40    "Cox"
    

    We have two different fruits, but both have colour = red. If we store colour in a separate disk page (block) from weight, price and variety, so that the page holds nothing but colour values, then compressing the page can achieve extreme compression thanks to heavy de-duplication. Instead of storing 100 rows (hypothetically) in a page, we can store 10,000 colours. Reading everything with the colour red might now take 1 IO instead of thousands, which is really good for warehousing and analytics, but bad for OLTP if I need to update the entire row: the row might have hundreds of columns, so a single update (or insert) could require hundreds of IOs.
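    A rough sketch of the same idea in Python, using the two fruits above. The helper names (`to_columns`, `run_length_encode`, `pages_touched_by_row_write`) are hypothetical, run-length encoding stands in for whatever compression a real columnar engine uses, and "one page per column" is a simplifying assumption.

```python
# Row layout: every fruit is stored as one complete record.
rows = [
    {"name": "apple", "colour": "red", "weight": 100, "price": 40, "variety": "Cox"},
    {"name": "grape", "colour": "red", "weight": 100, "price": 40, "variety": "Cox"},
]


def to_columns(rows):
    """Pivot row records into one list of values per column."""
    return {key: [row[key] for row in rows] for key in rows[0]}


def run_length_encode(values):
    """Collapse consecutive duplicate values into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


def pages_touched_by_row_write(columns):
    """Assuming one page per column, a full-row write touches every column."""
    return len(columns)


columns = to_columns(rows)
colour_page = run_length_encode(columns["colour"])

print(colour_page)                          # duplicates collapse to one entry
print(pages_touched_by_row_write(columns))  # one page per column in the row
```

    Scanning for red fruit only needs `colour_page`, which shrinks as duplicates pile up, while inserting or updating a whole row has to touch one page per column.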

    Unless I'm missing something, I wouldn't call this column based; I'd call it object based. It's still not clear how objects are arranged on disk. Are multiple objects placed into the same disk page? Is there any way of ensuring that objects with the same metadata go together? Given that one fruit might contain different data than another, since it's just metadata or XML or whatever you want to store in the object itself, is there a way to ensure that matching fruit types are stored together to increase efficiency?

    Larry
