Question
I have a table, UNITARCHIVE, partitioned by date and clustered by UNIT, DUID.
The total size of the table is 892 MB.
When I run this query:
SELECT * FROM `test-187010.ReportingDataset.UNITARCHIVE` WHERE duid="RRSF1" and unit="DUNIT"
BigQuery tells me it will process 892 MB. I thought clustering was supposed to reduce the amount of data scanned. I understand that when I filter by date the size is reduced dramatically, but I need the whole date range. Is this by design, or am I doing something wrong?
Answer 1:
To get the most benefits out of clustering, each partition needs to have a certain amount of data.
For example, if the minimum size of a cluster is 100 MB (decided internally by BigQuery), and you have only 100 MB of data per day, then querying 100 days will scan 100 * 100 MB, regardless of the clustering strategy: each daily partition is no larger than a single cluster, so there is nothing within it to skip.
As an alternative with this amount of data, instead of partitioning by day, partition by year. Then you'll get the most benefit out of clustering even with a small amount of data per day.
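To make the arithmetic above concrete, here is a rough toy model in Python. It is only a sketch under stated assumptions, not BigQuery's actual storage logic: it assumes each partition is split into blocks of a fixed minimum size (the 100 MB figure from the example), that rows matching the clustering filter sit next to each other, and that only whole blocks can be skipped. The function name `estimated_scan_mb` and the 1% match rate are made up for illustration.

```python
import math


def estimated_scan_mb(partition_sizes_mb, min_block_mb, matching_fraction):
    """Toy estimate of MB scanned by a filter on the clustering columns.

    Assumptions (illustrative only): every partition is split into blocks of
    roughly min_block_mb, with at least one block per partition; matching rows
    are stored contiguously; a block is either read in full or skipped.
    """
    total = 0.0
    for size_mb in partition_sizes_mb:
        n_blocks = max(1, int(size_mb // min_block_mb))
        block_mb = size_mb / n_blocks
        # Contiguous matching rows span this many blocks (at least one).
        # round() guards against tiny floating-point overshoot before ceil().
        blocks_hit = max(1, math.ceil(round(matching_fraction * n_blocks, 9)))
        total += blocks_hit * block_mb
    return total


daily = [100] * 100    # 100 daily partitions of 100 MB each
yearly = [100 * 100]   # the same data in a single yearly partition

daily_scan = estimated_scan_mb(daily, min_block_mb=100, matching_fraction=0.01)
yearly_scan = estimated_scan_mb(yearly, min_block_mb=100, matching_fraction=0.01)
print(daily_scan, yearly_scan)
```

Under these assumptions the daily layout reads one full block in every partition (10,000 MB in total), while the yearly layout reads only the block that holds the matching rows (about 100 MB). The real numbers depend on how BigQuery sizes and organizes blocks internally.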
See "Partition by week/year/month to get over the partition limit?" for a reference table that shows this off.
Source: https://stackoverflow.com/questions/57966914/how-clustering-works-in-bigquery