问题
When I run the below query on BQ using standard SQL, it says it will process 76.6TB when running
SELECT
event_time, user_id, activity_id,dbm_insertion_order_id, dbm_total_media_cost_usd
FROM `raw.5295.activity_*`
WHERE _PARTITIONTIME >(SELECT * FROM `analytics-dwh.autobidding.activity_list` )
AND timestamp_micros(event_time) > (SELECT timestamp_micros(MAX(event_time)) from `essence-analytics-dwh.ml_for_autobidding.nest_na_4q18_activity_updated_daily`)
AND _TABLE_SUFFIX IN ('25','20')
The table analytics-dwh.autobidding.activity_list has just one column with a unique list of integers
If I remove the subquery from the above table, the query uses less than 500GB when run
SELECT
event_time, user_id, activity_id,dbm_insertion_order_id, dbm_total_media_cost_usd
FROM `raw.5295.activity_*`
WHERE _PARTITIONTIME >TIMESTAMP('2018-12-20')
AND timestamp_micros(event_time) > (SELECT timestamp_micros(MAX(event_time)) from `essence-analytics-dwh.ml_for_autobidding.nest_na_4q18_activity_updated_daily`)
AND _TABLE_SUFFIX IN ('25','20')
Why does this happen when I use a subquery? Is there a workaround?
回答1:
Why does this happen with use of subquery?
In general, partition pruning will reduce query cost when the filters can be evaluated at the outset of the query without requiring any subquery evaluations or data scans.
You can see more about Limiting partitions queried using pseudo columns
So, in your first query (where you use subquery) - the pruning is not happening (It does not limit use of partitions based on the condition that involves the subquery)
In second query you use _PARTITIONTIME >TIMESTAMP('2018-12-20')
so partitions are limited
Bottom line: filters on _PARTITIONTIME that include subqueries can't be used to limit the number of partitions scanned for a partitioned table.
Is there a workaround?
You should split your task to two steps: calculate filter for _PARTITIONTIME using whatever logic you need off of table analytics-dwh.autobidding.activity_list
and then use it (instead of subquery) - using any client of your choice
回答2:
From looking at your query I assume you want to select the MIN or MAX date from this:
SELECT * FROM `analytics-dwh.autobidding.activity_list`
And pass it to the 2nd part of your query
This can be done by using BigQuery API inside a code which will pass the value between the 2 steps (For example Python or Javascript)
来源:https://stackoverflow.com/questions/54135893/using-subquery-for-partitiontime-in-bigquery-does-not-limit-cost