I'm investigating the different types of NoSQL databases and I'm trying to wrap my head around the data model of column-family stores, such as Bigtable, HBase and Cas
The Cassandra database follows your first model, I think. A ColumnFamily is a collection of rows, each of which can contain any columns, in a sparse fashion (so each row can have a different collection of column names, if desired). The number of columns allowed in a row is almost unlimited (2 billion in Cassandra v0.7).
A key point is that row keys must be unique within a column family, by definition - but can be re-used in other column families. So you can store unrelated data about the same key in different ColumnFamilies.
In Cassandra this matters because the data in a particular ColumnFamily is stored in the same files on disk, so it is more efficient to place data items that are likely to be retrieved together in the same ColumnFamily. This is partly a practical speed concern, but also a matter of organising your data into a clear schema. This touches upon your second definition: one might consider all the data about a particular key to be a "row", but partitioned by ColumnFamily. However, in Cassandra it is not really a single row, because the data in one ColumnFamily can be changed independently of the data in other ColumnFamilies for the same row key.
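To make that concrete, here is a toy sketch in Python. It is not how Cassandra is implemented; the two column families (`user_profile`, `user_activity`) and the `put` helper are made-up names for illustration. Each column family is modelled as its own map, so the same row key can appear in both and be updated independently.

```python
# Toy model: each column family is its own map from row key to a sparse map of columns.
# The names user_profile, user_activity and put are hypothetical, for illustration only.
user_profile = {}
user_activity = {}


def put(column_family, row_key, column, value):
    """Set one column for a row key, creating the row on first write."""
    column_family.setdefault(row_key, {})[column] = value


# The same row key "cath" is used in two column families.
put(user_profile, "cath", "email", "cath@apache.org")
put(user_activity, "cath", "last_login", "2014-11-04")

# Updating one family does not touch the other family's data for that key.
put(user_profile, "cath", "email", "cath@example.org")

print(user_profile["cath"])   # {'email': 'cath@example.org'}
print(user_activity["cath"])  # {'last_login': '2014-11-04'}
```

Within one family the key is unique (the second `put` overwrites the email rather than adding a second "cath" row), while the other family keeps its own entry for the same key.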
Both models you've described are the same.
A column family is:
Key -> Key -> (Set of key/value pairs)
Conceptually it becomes:
Table -> Row -> (Column1/Value1, Column2/Value2, ...)
Think of it as a Map of Map of Key/Value pairs.
UserProfile = {
Cassandra = [emailAddress:"cassandra@apache.org", age:20],
TerryCho = [emailAddress:"terry.cho@apache.org", gender:"male"],
Cath = [emailAddress:"cath@apache.org", age:20, gender:"female", address:"Seoul"],
}
The above is an example of a column family. If you were to tabulate it, you'd get a Table called UserProfile which looks like:
UserName | Email | Age | Gender | Address
Cassandra | cassandra@apache.org | 20 | null | null
TerryCho | terry.cho@apache.org | null | male | null
Cath | cath@apache.org | 20 | female | Seoul
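If it helps, the same tabulation can be sketched in Python, treating the column family as a plain dict of dicts. The `tabulate` helper is a made-up name, and using `None` for missing columns is just how this sketch represents the nulls in the table above.

```python
# The UserProfile column family from above, as a map of maps.
user_profile = {
    "Cassandra": {"emailAddress": "cassandra@apache.org", "age": 20},
    "TerryCho": {"emailAddress": "terry.cho@apache.org", "gender": "male"},
    "Cath": {"emailAddress": "cath@apache.org", "age": 20,
             "gender": "female", "address": "Seoul"},
}


def tabulate(column_family):
    """Return (header, rows); a column that a row lacks becomes None."""
    columns = []
    for row in column_family.values():
        for name in row:
            if name not in columns:  # keep first-seen column order
                columns.append(name)
    rows = [[key] + [row.get(name) for name in columns]
            for key, row in column_family.items()]
    return ["UserName"] + columns, rows


header, rows = tabulate(user_profile)
print(header)
for row in rows:
    print(row)
```

The table only exists after the union of all column names has been computed; the stored rows themselves never contain the nulls.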
The confusing part is that there's not really a column or a row as we're used to thinking of them. There's a bunch of "column families" which are queried by name (the key). Those families contain a bunch of sets of key/value pairs, which are also queried by name (the row key), and finally, each value in the set can be looked up by name also (the column key).
If you needed a tabular reference point, "column families" would be your "tables". Each "set of k/v pair" inside them would be your "rows". Each "pair of the set" would be the "column names and their values".
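As a rough sketch of those three lookups by name (family, then row key, then column key), again using nested Python dicts. The `store` variable and the `lookup` function are hypothetical names, not any database's API.

```python
# family name -> row key -> column key -> value
store = {
    "UserProfile": {
        "Cath": {"emailAddress": "cath@apache.org", "age": 20},
    },
}


def lookup(store, family, row_key, column):
    """Resolve family -> row -> column; return None if any level is absent."""
    return store.get(family, {}).get(row_key, {}).get(column)


email = lookup(store, "UserProfile", "Cath", "emailAddress")
missing = lookup(store, "UserProfile", "Cath", "address")
print(email)    # cath@apache.org
print(missing)  # None
```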
Internally, the data inside each column family is going to be stored together, and it'll be stored such that the rows are one after the other, and in each row, the columns are one after the other. So you get row1 -> col1/val1, col2/val2, ... , row2 -> col1/val1 ... , ... -> ... . So in that sense, the data is stored much more like a row-store, and less so like a column-store.
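A small Python sketch of that row-after-row layout: it flattens one column family into a single sequence of cells. Sorting row keys and column names alphabetically is an assumption made for the sake of a deterministic example; real stores apply their own ordering rules. `row_major_cells` is a made-up name.

```python
def row_major_cells(column_family):
    """Yield (row_key, column, value) so that each row's columns are adjacent."""
    for row_key in sorted(column_family):   # assumed ordering, for illustration
        row = column_family[row_key]
        for column in sorted(row):
            yield row_key, column, row[column]


family = {
    "row2": {"col1": "c"},
    "row1": {"col2": "b", "col1": "a"},
}

cells = list(row_major_cells(family))
for cell in cells:
    print(cell)
# ('row1', 'col1', 'a')
# ('row1', 'col2', 'b')
# ('row2', 'col1', 'c')
```

All the cells of row1 come out before any cell of row2, which is the row-store-like behaviour described above. A true column store would instead group all the col1 values together.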
To finish, the choice of words here is just unfortunate and misleading. Columns in Column Families should have been called Attributes. Rows should have been called Attribute Sets. Column Families should have been called Attribute Families. The relation to classic tabular vocabulary is weak and misleading, since the model is actually pretty different.
As per my understanding, a Cassandra ColumnFamily is not a collection of rows; rather, it is a cluster of columns. Columns are clustered together based on the clustering key. For example, let's consider the column family below:
CREATE TABLE store (
enrollmentId int,
roleId int,
name text,
age int,
occupation text,
resume blob,
PRIMARY KEY ((enrollmentId, roleId), name)
) ;
INSERT INTO store (enrollmentid, roleid, name, age, occupation, resume)
values (10293483, 01, 'John Smith', 26, 'Teacher', 0x7b22494d4549);
Fetching the row inserted above with cassandra-cli shows that it is clustered by the clustering key; in this example, "name = John Smith" is the clustering key.
RowKey: 10293483:1
=> (name=John Smith:, value=, timestamp=1415104618399000)
=> (name=John Smith:age, value=0000001a, timestamp=1415104618399000)
=> (name=John Smith:occupation, value=54656163686572, timestamp=1415104618399000)
=> (name=John Smith:resume, value=7b22494d4549, timestamp=1415104618399000)
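To show how the CQL row lines up with that listing, here is a rough Python sketch that rebuilds the same row key and cell names from the inserted values. It only mimics the output above and is not Cassandra's actual storage code: `to_cells` and `encode` are made-up helpers, timestamps are left out, and the encodings (4-byte big-endian for int, UTF-8 for text) are inferred from the hex values shown.

```python
def encode(value):
    """Hex-encode a value the way the listing above appears to (assumed encodings)."""
    if isinstance(value, int):
        return value.to_bytes(4, "big", signed=True).hex()  # CQL int is 32-bit
    if isinstance(value, str):
        return value.encode("utf-8").hex()
    return bytes(value).hex()  # blobs are shown as raw hex


def to_cells(partition_key, clustering_value, columns):
    """Return (row_key, cells) for one CQL row of the store table."""
    # The composite partition key (enrollmentId, roleId) becomes the row key.
    row_key = ":".join(str(part) for part in partition_key)
    # The first cell is a marker with an empty column name and an empty value.
    cells = [(f"{clustering_value}:", "")]
    # Every other column is prefixed with the clustering key value.
    for name in sorted(columns):
        cells.append((f"{clustering_value}:{name}", encode(columns[name])))
    return row_key, cells


row_key, cells = to_cells(
    (10293483, 1),
    "John Smith",
    {"age": 26, "occupation": "Teacher", "resume": bytes.fromhex("7b22494d4549")},
)
print("RowKey:", row_key)
for name, value in cells:
    print(f"=> (name={name}, value={value})")
```

The partition key columns end up in the row key, and the clustering column value ends up inside each cell name, which is why all of John Smith's attributes sit next to each other within the row.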