Question
BigTable uses Bloom filters so that point reads can skip SSTables that contain no data for a given row-column pair. Can these Bloom filters also be used to skip SSTables when the query specifies only the row ID and no column ID?
BigTable uses row-column pairs as the keys it inserts into its Bloom filters. This means a query can use these filters for a point read that specifies a row-column pair.
Now suppose we have a query that fetches all columns of a row based only on the row ID. As far as I can tell, this query does not know in advance which columns belong to the row, so it cannot enumerate the possible row-column pairs to probe. As a result, it may not be able to use the Bloom filters at all, which would make it less efficient.
In theory, BigTable could already address this problem by also inserting just the row ID into the Bloom filters, but I can't tell whether the current implementation does this.
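To make the idea concrete, here is a minimal sketch of a per-SSTable Bloom filter that is keyed on row-column pairs and can optionally also take row-only keys. Everything in it is an assumption for illustration: the `BloomFilter` class, the `build_filter` helper, the `rc|` and `r|` key prefixes, and the filter sizing are hypothetical and are not taken from BigTable.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: k hash positions over a fixed-size bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, key):
        # Derive k positions by salting the key with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


def pair_key(row, column):
    return f"rc|{row}|{column}"  # hypothetical encoding of a row-column key


def row_key(row):
    return f"r|{row}"  # hypothetical encoding of a row-only key


def build_filter(cells, include_row_keys):
    """Build the filter for one SSTable from its (row, column) cells."""
    bf = BloomFilter()
    for row, column in cells:
        bf.add(pair_key(row, column))
        if include_row_keys:
            bf.add(row_key(row))  # the extra insertion proposed above
    return bf


cells = [("row1", "cf:a"), ("row1", "cf:b"), ("row2", "cf:a")]
pairs_only = build_filter(cells, include_row_keys=False)
with_rows = build_filter(cells, include_row_keys=True)

# A point read can probe either filter with the full pair.
print(pairs_only.might_contain(pair_key("row1", "cf:a")))
# A row-only read has a meaningful key to probe only in the second filter.
print(with_rows.might_contain(row_key("row1")))
```

With the pairs-only filter, a row-only read has no key it can probe, so it has to open the SSTable regardless. The cost of the second variant is one extra insertion per distinct row, which raises the filter's fill ratio and therefore its false-positive rate for a fixed size.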
This question matters for designing efficient queries to run on BigTable. Any hints would be wonderful.
Answer 1:
HBase Bloom filters support both row checks and row-column checks. HBase was modeled on the BigTable paper, so BigTable most likely does the same.
HBase Bloom Filter is a space-efficient mechanism to test whether a StoreFile contains a specific row or row-col cell.
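As a rough illustration of why the filter granularity matters for the read path, the toy model below counts how many store files a read has to open under each kind of filter. It is only a sketch and not HBase code: `StoreFile`, `can_skip`, and `files_read` are hypothetical names, and an exact Python `set` stands in for the Bloom filter, so there are no false positives here.

```python
from enum import Enum


class BloomType(Enum):
    ROW = "row"        # filter keyed on the row alone
    ROWCOL = "rowcol"  # filter keyed on the (row, column) pair


class StoreFile:
    """Toy store file; a Python set stands in for the Bloom filter."""

    def __init__(self, cells, bloom_type):
        self.cells = dict(cells)  # {(row, column): value}
        self.bloom_type = bloom_type
        if bloom_type is BloomType.ROW:
            self.filter_keys = {row for row, _ in self.cells}
        else:
            self.filter_keys = set(self.cells)

    def can_skip(self, row, column=None):
        """Return True only when the filter proves the file has no match."""
        if self.bloom_type is BloomType.ROW:
            return row not in self.filter_keys
        if column is None:
            return False  # a row-column filter cannot answer a row-only read
        return (row, column) not in self.filter_keys


def files_read(files, row, column=None):
    """Count the files that must be opened for this read."""
    return sum(1 for f in files if not f.can_skip(row, column))


data = [
    {("row1", "a"): 1},
    {("row2", "a"): 2},
    {("row3", "b"): 3},
]
row_files = [StoreFile(d, BloomType.ROW) for d in data]
rowcol_files = [StoreFile(d, BloomType.ROWCOL) for d in data]

print(files_read(row_files, "row1"))          # 1: row filter prunes two files
print(files_read(rowcol_files, "row1"))       # 3: nothing can be pruned
print(files_read(rowcol_files, "row1", "a"))  # 1: point read prunes two files
```

In this model a row-only read benefits only from the row-keyed filter, while the row-column filter helps only when the read names a column.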
Reference: https://learning.oreilly.com/library/view/hbase-administration-cookbook/9781849517140/ch09s11.html
However, the 2006 BigTable paper mentions only row-column lookups when it describes Bloom filters.
https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf
Source: https://stackoverflow.com/questions/54280508/can-bloom-filters-in-bigtable-be-used-to-filter-based-only-on-row-id