Question
When you need to read all the data from one or more BigQuery tables in a Dataflow job, there are, as far as I can tell, two approaches. The first is to use BigQueryIO with from, which reads the table in question. The second is to use fromQuery, where you specify a query that reads all the data from the same table. So my question is:
- Is there any cost or performance benefit to using one over the other?
I haven't found anything in the docs about this, but I would really like to know. I imagine that a plain table read (from) might be faster, since you don't need to run a query that scans the data, which would make it more similar to the preview functionality in the BigQuery UI. If that is true, it might also be much cheaper, but it would also make sense if they both cost the same.
So in short, what is the difference between:
BigQueryIO.read(...).from(tableName)
And
BigQueryIO.read(...).fromQuery("SELECT * FROM " + tableName)
Answer 1:
from is both cheaper and faster than fromQuery(SELECT * FROM ...).
from exports the table directly, and exporting data from BigQuery is free. fromQuery(SELECT * FROM ...) first scans the entire table ($5/TB) and then exports the result.
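To put rough numbers on the difference, here is a minimal sketch that compares the two approaches for a given table size. It assumes the $5/TB on-demand price quoted above (which may have changed since), treats 1 TB as 2^40 bytes, and ignores minimum billing increments and any Dataflow or storage costs. The class and method names are hypothetical and not part of BigQueryIO.

```java
class ReadCostSketch {
    // Assumption: on-demand query price quoted in the answer ($5 per TB scanned).
    static final double USD_PER_TB_SCANNED = 5.0;
    // Assumption: 1 TB is counted as 2^40 bytes.
    static final double BYTES_PER_TB = 1024.0 * 1024.0 * 1024.0 * 1024.0;

    // fromQuery("SELECT * FROM table") scans the whole table, so the
    // query charge grows linearly with the table size.
    static double fromQueryCostUsd(long tableBytes) {
        return tableBytes / BYTES_PER_TB * USD_PER_TB_SCANNED;
    }

    // from(table) exports the table directly; per the answer, the export
    // itself is free, so the BigQuery charge is zero regardless of size.
    static double fromTableCostUsd(long tableBytes) {
        return 0.0;
    }

    public static void main(String[] args) {
        long tableBytes = 2L * 1024 * 1024 * 1024 * 1024; // a 2 TB table
        System.out.println("fromQuery cost (USD): " + fromQueryCostUsd(tableBytes));
        System.out.println("from cost (USD): " + fromTableCostUsd(tableBytes));
    }
}
```

Under these assumptions the query charge is paid on every pipeline run, so the gap widens with both table size and run frequency.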
Source: https://stackoverflow.com/questions/48486338/is-there-a-difference-in-bigqueryio-when-you-use-fromtable-vs-fromqueryse