Influence of column families on Bigtable performance

We are currently investigating how using multiple column families affects the performance of our Bigtable queries. We found that splitting the columns into multiple column families does not improve performance. Has anyone had similar experiences?

Some more details about our benchmark setup. At the moment, each row in our production table contains around 5 columns, each holding between 0.1 and 1 KB of data. All columns are stored in one column family. When we perform a row key range filter (which returns 340 rows on average) and apply a column regex filter (which returns only 1 column per row), the query takes 23.3 ms on average. We created some test tables in which we increased the number of columns and the amount of data per row by a factor of 5. In test table 1, we kept everything in one column family. As expected, this increased the time of that same query, to 40.6 ms. In test table 2, we kept the original data in one column family and put the extra data into another column family. When querying the column family containing the original data (and thus the same amount of data as the original table), the query took 44.3 ms on average. So performance actually decreased when using more column families.
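To put these numbers side by side, here is a minimal Go sketch that computes the relative change between the three averages reported above. The helper name pctChange is only illustrative; the timings are the ones from the benchmark.

```go
package main

import "fmt"

// pctChange returns the relative change from base to value, in percent.
func pctChange(base, value float64) float64 {
	return (value - base) / base * 100
}

func main() {
	// Average query times in milliseconds, as reported in the question.
	const (
		production = 23.3 // original table, one column family
		table1     = 40.6 // 5x the data, one column family
		table2     = 44.3 // 5x the data, two column families, reading cf1 only
	)

	fmt.Printf("table 1 vs production: %+.1f%%\n", pctChange(production, table1))
	fmt.Printf("table 2 vs production: %+.1f%%\n", pctChange(production, table2))
	fmt.Printf("table 2 vs table 1:    %+.1f%%\n", pctChange(table1, table2))
}
```

In other words, the single-family test table is roughly 74% slower than production, and the two-family table is roughly 9% slower again, even though the queried family holds the same data as the production table.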

This is exactly the opposite of what we expected. For example, the Bigtable docs (https://cloud.google.com/bigtable/docs/schema-design#column_families) state:

Grouping data into column families allows you to retrieve data from a single family, or multiple families, rather than retrieving all of the data in each row. Group data as closely as you can to get just the information that you need, but no more, in your most frequent API calls.

Does anyone have an explanation for our findings?

benchmark results

(edit: added some more details)

The content of a single row:

Table 1:

cf1
- col1
- col2
- ...
- col25

Table 2:

cf1
- col1
- col2
- ...
- col5
cf2
- col6
- col7
- ...
- col25

The benchmark uses the Go client. The code that calls the API looks roughly as follows:

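// Keep only the requested family, the matching columns, and the latest cell version.
filter := bigtable.ChainFilters(
	bigtable.FamilyFilter(request.ColumnFamily),
	bigtable.ColumnFilter(colPattern),
	bigtable.LatestNFilter(1),
)
tbl := bf.Client.Open(table)
rr := bigtable.NewRange(request.RowKeyStart, request.RowKeyEnd)
// The callback returns true to keep reading; the benchmark discards the rows.
err := tbl.ReadRows(c, rr, func(row bigtable.Row) bool { return true }, bigtable.RowFilter(filter))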

If you are retrieving X cells per row, it does not make a major performance difference whether those cells are in X separate column families or in 1 column family with X column qualifiers.

The performance difference arises when you only need the cells of a row that serve some specific purpose: you can then avoid selecting all cells for the row and instead fetch just one column family (by specifying a filter on the ReadRow call).

A more important factor is simply picking a schema that accurately describes your data. If you do this, any gain of the type above will come naturally. You will also avoid hitting the recommended limit of 100 column families.

For example, imagine you are writing leaderboard software and want to store the scores a player has achieved in each game, along with some personal details. Your schema might be:

Row Key: username
Column Family user_info
- Column Qualifier full_name
- Column Qualifier password_hash
Column Family game_scores
- Column Qualifier candy_royale
- Column Qualifier clash_of_tanks

Storing each game as a separate column within the game_scores column family allows all scores for a user to be fetched at once without also fetching user_info, keeps the number of column families manageable, allows a time series of scores to be kept for each game independently, and brings other benefits that come from mirroring the nature of the data.
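As a rough illustration of this schema, the following Go sketch models one row in memory as family -> qualifier -> value and selects a single family from it. This is not the Bigtable client: the Row type, the selectFamily helper and the sample values are all hypothetical stand-ins. With the real Go client, the equivalent narrowing would be done with a filter such as bigtable.FamilyFilter, as in the question's code.

```go
package main

import (
	"fmt"
	"sort"
)

// Row is a hypothetical in-memory stand-in for one Bigtable row:
// column family -> column qualifier -> value.
type Row map[string]map[string]string

// selectFamily returns only the cells of the given column family,
// leaving every other family of the row untouched and unread.
func selectFamily(r Row, family string) map[string]string {
	return r[family]
}

func main() {
	// Sample row for the leaderboard schema; the values are made up.
	row := Row{
		"user_info":   {"full_name": "Ada Example", "password_hash": "<hash>"},
		"game_scores": {"candy_royale": "1200", "clash_of_tanks": "87"},
	}

	scores := selectFamily(row, "game_scores")

	// Sort the qualifiers so the output order is stable.
	quals := make([]string, 0, len(scores))
	for q := range scores {
		quals = append(quals, q)
	}
	sort.Strings(quals)

	for _, q := range quals {
		fmt.Printf("game_scores:%s = %s\n", q, scores[q])
	}
}
```

The point of the sketch is only that a caller interested in scores never has to touch user_info, because the grouping follows how the data is actually used.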

The reason there is no speedup when splitting data over multiple column families is that they are all stored in the same "locality group", i.e. the same file. Internally, Google does have the ability to place different column families in different locality groups, but this is not exposed in the managed Cloud Bigtable service. See the comments on this answer.
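A toy cost model in Go can make this argument concrete. It assumes, as a deliberate simplification, that a read has to scan every cell stored in the same locality group as the requested family. The family struct, the bytesScanned helper, the group names and the 500-byte average cell size are all assumptions made for illustration; they do not describe Bigtable's actual storage engine.

```go
package main

import "fmt"

// family describes a column family in this toy model.
type family struct {
	name          string
	columns       int
	localityGroup string
}

// bytesScanned estimates how much data a read of the target family touches,
// under the simplifying assumption that every cell in the target's locality
// group must be scanned. cellBytes is an assumed average cell size.
func bytesScanned(families []family, target string, cellBytes int) int {
	group, found := "", false
	for _, f := range families {
		if f.name == target {
			group, found = f.localityGroup, true
		}
	}
	if !found {
		return 0
	}
	total := 0
	for _, f := range families {
		if f.localityGroup == group {
			total += f.columns * cellBytes
		}
	}
	return total
}

func main() {
	// Test table 2 from the question: 5 columns in cf1, 20 columns in cf2.
	shared := []family{{"cf1", 5, "default"}, {"cf2", 20, "default"}}
	split := []family{{"cf1", 5, "lg1"}, {"cf2", 20, "lg2"}}

	fmt.Println("one locality group: ", bytesScanned(shared, "cf1", 500), "bytes per row")
	fmt.Println("two locality groups:", bytesScanned(split, "cf1", 500), "bytes per row")
}
```

Under these assumptions, reading cf1 costs as much as reading the whole row when both families share a locality group, which is consistent with test table 2 being no faster than test table 1.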

Source: https://stackoverflow.com/questions/46465762/bigtable-performance-influence-column-families

Tags

bigdata

google-cloud-platform

google-cloud-bigtable