Column-family concept and data model

迷失自我 2020-12-30 23:27

I'm investigating the different types of NoSQL databases and I'm trying to wrap my head around the data model of column-family stores, such as Bigtable, HBase and Cas

3 answers
  • 2020-12-30 23:35

    The Cassandra database follows your first model, I think. A ColumnFamily is a collection of rows, each of which can contain any columns, in a sparse fashion (so each row can have a different collection of column names, if desired). The number of columns allowed in a row is almost unlimited (2 billion in Cassandra v0.7).

    A key point is that row keys must be unique within a column family, by definition - but can be re-used in other column families. So you can store unrelated data about the same key in different ColumnFamilies.

    In Cassandra this matters because the data in a particular column family is stored in the same files on disk - so it is more efficient to place data items that are likely to be retrieved together, in the same ColumnFamily. This is partly a practical speed concern, but also a matter of organising your data into a clear schema. This touches upon your second definition - one might consider all the data about a particular key to be a "row", but partitioned by Column Family. However, in Cassandra it is not really a single row, because the data in one ColumnFamily can be changed independently of the data in other ColumnFamilies for the same row key.
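
    A minimal Python sketch of that point, with nested dictionaries standing in for column families. The family names `Profile` and `Orders`, the `put` helper and all the data are made up for illustration, and this is only a mental model, not how Cassandra is implemented:

```python
# Toy model: column family -> row key -> {column name: value}.
# "Profile" and "Orders" are hypothetical column families.
keyspace = {
    "Profile": {"alice": {"email": "alice@example.com", "age": 30}},
    "Orders": {"alice": {"last_order": "book"}},
}

def put(family, row_key, column, value):
    """Set one column of one row; the row is created if it does not exist."""
    keyspace[family].setdefault(row_key, {})[column] = value

# Rows are sparse: a new row may use column names no other row has.
put("Profile", "bob", "nickname", "bobby")

# The row key "alice" exists in both families, and changing it in one
# family leaves the other untouched.
put("Orders", "alice", "last_order", "lamp")

print(keyspace["Profile"]["alice"])
print(keyspace["Orders"]["alice"])
```

    The same row key appears in both families, yet each family holds its own independent set of columns for it.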

  • 2020-12-30 23:40

    Both models you've described are the same.

    Column family is:

    Key -> Key -> (Set of key/value pairs)
    

    Conceptually it becomes:

    Table -> Row -> (Column1/Value1, Column2/Value2, ...)
    

    Think of it as a Map of Map of Key/Value pairs.

    UserProfile = {
        Cassandra = [emailAddress:"cassandra@apache.org", age:20],
        TerryCho = [emailAddress:"terry.cho@apache.org", gender:"male"],
        Cath = [emailAddress:"cath@apache.org", age:20, gender:"female", address:"Seoul"],
    }
    

    The above is an example of a column family. If you were to tabulate it, you'd get a Table called UserProfile which looks like:

    UserName  | Email                | Age  | Gender | Address
    Cassandra | cassandra@apache.org | 20   | null   | null
    TerryCho  | terry.cho@apache.org | null | male   | null
    Cath      | cath@apache.org      | 20   | female | Seoul
    

    The confusing part is that there's not really a column or a row as we're used to thinking of them. There's a bunch of "column families" which are queried by name (the key). Those families contain a bunch of sets of key/value pairs, which are also queried by name (the row key), and finally, each value in the set can be looked up by name also (the column key).

    If you needed a tabular reference point, "column families" would be your "tables". Each "set of k/v pair" inside them would be your "rows". Each "pair of the set" would be the "column names and their values".
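
    The UserProfile example can be written as real nested dictionaries to make the lookups concrete. This is only a sketch: the `tabulate` helper and the `rowKey` header label are hypothetical names introduced here, not part of any database API.

```python
# The UserProfile example as a map of maps: row key -> {column name: value}.
user_profile = {
    "Cassandra": {"emailAddress": "cassandra@apache.org", "age": 20},
    "TerryCho": {"emailAddress": "terry.cho@apache.org", "gender": "male"},
    "Cath": {"emailAddress": "cath@apache.org", "age": 20,
             "gender": "female", "address": "Seoul"},
}

def tabulate(family):
    """Flatten sparse rows into a header and dense rows, using None for gaps."""
    columns = []
    for row in family.values():
        for name in row:
            if name not in columns:
                columns.append(name)  # keep first-seen column order
    rows = [[key] + [row.get(name) for name in columns]
            for key, row in family.items()]
    return ["rowKey"] + columns, rows

header, rows = tabulate(user_profile)
print(header)
for r in rows:
    print(r)

# A single value is reached by row key, then column name.
print(user_profile["Cath"]["address"])
```

    The `None` entries exist only in the tabulated view; the column family itself simply has no entry for a column that a row does not use.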

    Internally, the data inside each column family is stored together, laid out so that the rows come one after the other and, within each row, the columns come one after the other. So you get row1 -> col1/val1, col2/val2, ... , row2 -> col1/val1 ... , ... -> .... In that sense, the data is stored much more like a row-store, and less so like a column-store.
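
    A rough sketch of that ordering, assuming rows are visited in sorted row-key order and columns in sorted name order within each row. Real storage engines use their own file formats and orderings, so the hypothetical `storage_order` function below only illustrates the row-major layout:

```python
def storage_order(family):
    """Yield (row key, column name, value) in row-major order."""
    for row_key in sorted(family):
        for column in sorted(family[row_key]):
            yield row_key, column, family[row_key][column]

# Made-up data; rows and columns are deliberately listed out of order.
family = {
    "row2": {"col1": "c", "col3": "d"},
    "row1": {"col2": "b", "col1": "a"},
}

cells = list(storage_order(family))
for cell in cells:
    print(cell)
```

    All cells of row1 come out before any cell of row2, which is what makes the layout row-oriented.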

    To finish, the choice of words here is just unfortunate and misleading. Columns in Column Families should have been called Attributes. Rows should have been called Attribute Sets. Column Families should have been called Attribute Families. The relation to classic tabular vocabulary is weak and misleading, since the model is actually quite different.

  • 2020-12-30 23:54

    As I understand it, a Cassandra ColumnFamily is not a collection of rows but rather a cluster of columns. Columns are clustered together based on the clustering key. For example, consider the column family below:

    CREATE TABLE store (
      enrollmentId int,
      roleId int,
      name text,
      age int,
      occupation text,
      resume blob,
      PRIMARY KEY ((enrollmentId, roleId), name)
    ) ;
    
    
    INSERT INTO store (enrollmentid, roleid, name, age, occupation, resume)
    values (10293483, 01, 'John Smith', 26, 'Teacher', 0x7b22494d4549);
    

    Fetching the row inserted above with cassandra-cli shows that it is clustered by the clustering key; in this example, "name = John Smith" is the clustering key.

    RowKey: 10293483:1
    => (name=John Smith:, value=, timestamp=1415104618399000)
    => (name=John Smith:age, value=0000001a, timestamp=1415104618399000)
    => (name=John Smith:occupation, value=54656163686572, timestamp=1415104618399000)
    => (name=John Smith:resume, value=7b22494d4549, timestamp=1415104618399000)
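
    The mapping visible in that listing can be mimicked in a few lines of Python. This is a sketch inferred only from the output above, not from Cassandra's source: the `to_cells` helper is hypothetical, and values are kept as plain Python values rather than encoded bytes.

```python
def to_cells(row, partition_cols, clustering_cols):
    """Map one CQL-style row to (row key, {cell name: value}) as in the listing."""
    row_key = ":".join(str(row[c]) for c in partition_cols)
    prefix = ":".join(str(row[c]) for c in clustering_cols)
    cells = {prefix + ":": ""}  # marker cell with an empty value
    for name, value in row.items():
        if name not in partition_cols and name not in clustering_cols:
            cells[prefix + ":" + name] = value
    return row_key, cells

row = {"enrollmentid": 10293483, "roleid": 1, "name": "John Smith",
       "age": 26, "occupation": "Teacher",
       "resume": bytes.fromhex("7b22494d4549")}

row_key, cells = to_cells(row, ["enrollmentid", "roleid"], ["name"])
print(row_key)
for cell_name in cells:
    print(cell_name)

# The hex values shown by cassandra-cli decode back to the inserted data.
print(int("0000001a", 16))
print(bytes.fromhex("54656163686572").decode("ascii"))
```

    The partition key columns form the row key, while the clustering key value becomes a prefix of every cell name in that row.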
    