Question
I have a federated source in BigQuery which is pointing to some CSV files in GCS.
When I try to read the federated BigQuery table as a source for a Dataflow pipeline, it throws the following error:
1226 [main] ERROR com.google.cloud.dataflow.sdk.util.BigQueryTableRowIterator - Error reading from BigQuery table Federated_test_dataflow of dataset CPT_7414_PLAYGROUND : 400 Bad Request
{
"code" : 400,
"errors" : [ {
"domain" : "global",
"message" : "Cannot list a table of type EXTERNAL.",
"reason" : "invalid"
} ],
"message" : "Cannot list a table of type EXTERNAL."
}
Does Dataflow not support federated sources in BigQuery, or am I doing something wrong? I do know that I could read the files from GCS directly into my pipeline, but I'd prefer to work with BigQuery TableRow objects instead due to the design of the application.
PCollection<TableRow> results = pipeline
    .apply("fed-test", BigQueryIO.Read.from("<project_id>:CPT_7414_PLAYGROUND.Federated_test_dataflow"))
    .apply(ParDo.of(new DoFn<TableRow, TableRow>() {
        @Override
        public void processElement(ProcessContext c) throws Exception {
            System.out.println(c.element());
        }
    }));
Answer 1:
As Michael says, BigQuery does not support reading directly from EXTERNAL (federated) tables or VIEWs: even a plain read of such a table effectively requires a query.
To read from these tables in Dataflow, you can instead use
BigQueryIO.Read.fromQuery("SELECT * FROM table_or_view_name")
which will issue the query, save the result to a temporary table, and then begin the read process. Of course, this will incur the costs of querying on BigQuery, so if you wish to read from the same VIEW or EXTERNAL table repeatedly, you may want to materialize it into a regular table yourself.
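As a rough illustration of the query string passed to fromQuery, here is a minimal sketch that turns a "project:dataset.table" spec into a SELECT statement. It assumes the query runs as legacy SQL, where table references are written in square brackets. The class name FederatedQuery, the method selectAllFrom, and the project ID my-project are hypothetical names introduced for this example; the Dataflow call itself appears only in a comment so the snippet runs without the SDK.

```java
// Hypothetical helper, not part of the Dataflow SDK.
final class FederatedQuery {
    private FederatedQuery() {}

    /**
     * Builds a legacy SQL query that selects every column from the given
     * "project:dataset.table" spec (assumption: legacy SQL bracket syntax).
     */
    static String selectAllFrom(String tableSpec) {
        if (tableSpec == null || tableSpec.trim().isEmpty()) {
            throw new IllegalArgumentException("tableSpec must not be empty");
        }
        return "SELECT * FROM [" + tableSpec.trim() + "]";
    }

    public static void main(String[] args) {
        // "my-project" is a placeholder project ID.
        String query = selectAllFrom("my-project:CPT_7414_PLAYGROUND.Federated_test_dataflow");
        System.out.println(query);
        // In the pipeline (Dataflow SDK 1.x, not compiled here):
        // pipeline.apply("fed-test", BigQueryIO.Read.fromQuery(query));
    }
}
```

If the query is run as standard SQL instead, the table reference would need backticks and dotted notation rather than brackets, so the helper would have to change accordingly.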
Answer 2:
The Dataflow BigQuery source was designed to read BigQuery managed tables of type "TABLE". (The type definition can be found at https://cloud.google.com/bigquery/docs/reference/v2/tables#type.) EXTERNAL and VIEW tables are not supported.
The BigQuery "federated table" feature allows BigQuery to directly query data in places like Google Cloud Storage. Dataflow can also read files from Google Cloud Storage, so you should be able to point your Dataflow computation directly at the sources you want to read.
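If the files are read from GCS directly but the rest of the application expects row-like objects, each CSV line could be mapped to named fields before further processing. The sketch below shows one possible shape for that mapping. It uses a plain Map<String, Object> as a stand-in for TableRow so that it runs without the SDK, and it assumes simple CSV with no quoted or escaped commas. The class name CsvRowMapper and the column names id and name are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the Dataflow SDK.
final class CsvRowMapper {
    private CsvRowMapper() {}

    /**
     * Maps one CSV line to a column-name -> value map.
     * Assumption: fields contain no quoted or escaped commas.
     */
    static Map<String, Object> toRow(String[] columns, String line) {
        String[] values = line.split(",", -1); // -1 keeps trailing empty fields
        if (values.length != columns.length) {
            throw new IllegalArgumentException(
                "expected " + columns.length + " fields but got " + values.length);
        }
        Map<String, Object> row = new LinkedHashMap<>();
        for (int i = 0; i < columns.length; i++) {
            row.put(columns[i], values[i].trim());
        }
        return row;
    }

    public static void main(String[] args) {
        // Column names are placeholders for the real CSV schema.
        Map<String, Object> row = toRow(new String[] {"id", "name"}, "1, alice");
        System.out.println(row); // prints {id=1, name=alice}
    }
}
```

In a real pipeline the same loop could populate a TableRow inside a DoFn instead of a LinkedHashMap; real-world CSV with quoting would call for a proper CSV parser rather than String.split.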
Source: https://stackoverflow.com/questions/36193519/reading-bigquery-federated-table-as-source-in-dataflow-throws-an-error