Is there a difference in `BigQueryIO` when you use `fromTable` vs `fromQuery("SELECT * …")` in Dataflow?

Submitted by 狂风中的少年 on 2019-12-08 08:10:29

Question


When you need to read all the data from one or more BigQuery tables in a Dataflow job, there are two approaches. The first is to use BigQueryIO with from, which reads the table directly; the second is to use fromQuery with a query that selects all the data from the same table. So my question is:

  • Is there any cost or performance benefit to using one over the other?

I haven't found anything in the docs about this, but I would really like to know. I imagine that reading with from might be faster, since you don't need to run a query that scans the data, which would make it more like the preview functionality in the BigQuery UI. If that is true, it might also be much cheaper, but it would also make sense if they both cost the same.

So in short, what is the difference between:

BigQueryIO.read(...).from(tableName)

And

BigQueryIO.read(...).fromQuery("SELECT * FROM " + tableName)

Answer 1:


from is both cheaper and faster than fromQuery(SELECT * FROM ...).

  • from directly exports the table and exporting data is free for BigQuery.
  • fromQuery(SELECT * FROM ...) will first scan the entire table (billed at $5/TB at the time of this answer) and then export the result.
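
To make the cost difference concrete, here is a rough back-of-the-envelope sketch. It assumes the $5/TB figure quoted above and counts 1 TB as 2^40 bytes; the class name, the method names and the 2 TB table size are hypothetical and chosen only for illustration. It covers just the BigQuery scan charge and ignores Dataflow worker costs and any other fees.

```java
public class ScanCostEstimate {
    // Assumption: on-demand query price quoted in the answer, in USD per TB scanned.
    static final double USD_PER_TB = 5.0;
    // Assumption: 1 TB is counted as 2^40 bytes.
    static final double BYTES_PER_TB = 1024.0 * 1024 * 1024 * 1024;

    // fromQuery("SELECT * FROM table") scans the whole table, so the full size is billed.
    static double fromQueryCostUsd(long tableBytes) {
        return tableBytes / BYTES_PER_TB * USD_PER_TB;
    }

    // from(table) exports the table directly; per the answer, the export is free.
    static double fromTableCostUsd(long tableBytes) {
        return 0.0;
    }

    public static void main(String[] args) {
        long tableBytes = 2L * 1024 * 1024 * 1024 * 1024; // hypothetical 2 TB table
        System.out.println("from(table) scan cost in USD: " + fromTableCostUsd(tableBytes));
        System.out.println("fromQuery(SELECT *) scan cost in USD: " + fromQueryCostUsd(tableBytes));
    }
}
```

Under these assumptions the scan charge grows linearly with table size, while the direct table read stays at zero, so the gap matters most for large tables that are read repeatedly.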


Source: https://stackoverflow.com/questions/48486338/is-there-a-difference-in-bigqueryio-when-you-use-fromtable-vs-fromqueryse
