Using subquery for _PARTITIONTIME in bigquery does not limit cost

非 Y 不嫁゛ 提交于 2019-12-11 03:55:39

问题


When I run the below query on BQ using standard SQL, it says it will process 76.6TB when running

SELECT 
event_time, user_id, activity_id,dbm_insertion_order_id, dbm_total_media_cost_usd 
FROM `raw.5295.activity_*`
WHERE _PARTITIONTIME >(SELECT * FROM `analytics-dwh.autobidding.activity_list` )
AND timestamp_micros(event_time)  > (SELECT timestamp_micros(MAX(event_time)) from `essence-analytics-dwh.ml_for_autobidding.nest_na_4q18_activity_updated_daily`)
AND _TABLE_SUFFIX IN ('25','20')

The table analytics-dwh.autobidding.activity_list has just one column with a unique list of integers

If I remove the subquery from the above table, the query uses less than 500GB when run

SELECT 
event_time, user_id, activity_id,dbm_insertion_order_id, dbm_total_media_cost_usd 
FROM `raw.5295.activity_*`
WHERE _PARTITIONTIME >TIMESTAMP('2018-12-20')
AND timestamp_micros(event_time)  > (SELECT timestamp_micros(MAX(event_time)) from `essence-analytics-dwh.ml_for_autobidding.nest_na_4q18_activity_updated_daily`)
AND _TABLE_SUFFIX IN ('25','20')

Why does this happen when I use a subquery? Is there a workaround?


回答1:


Why does this happen with use of subquery?

In general, partition pruning will reduce query cost when the filters can be evaluated at the outset of the query without requiring any subquery evaluations or data scans.

You can see more about Limiting partitions queried using pseudo columns

So, in your first query (where you use subquery) - the pruning is not happening (It does not limit use of partitions based on the condition that involves the subquery)

In second query you use _PARTITIONTIME >TIMESTAMP('2018-12-20') so partitions are limited

Bottom line: filters on _PARTITIONTIME that include subqueries can't be used to limit the number of partitions scanned for a partitioned table.

Is there a workaround?

You should split your task to two steps: calculate filter for _PARTITIONTIME using whatever logic you need off of table analytics-dwh.autobidding.activity_list and then use it (instead of subquery) - using any client of your choice




回答2:


From looking at your query I assume you want to select the MIN or MAX date from this:

SELECT * FROM `analytics-dwh.autobidding.activity_list`

And pass it to the 2nd part of your query

This can be done by using BigQuery API inside a code which will pass the value between the 2 steps (For example Python or Javascript)



来源:https://stackoverflow.com/questions/54135893/using-subquery-for-partitiontime-in-bigquery-does-not-limit-cost

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!