Loading Avro files with different schemas into one BigQuery table

Question

I have a set of Avro files with slightly varying schemas that I'd like to load into one BigQuery table.

Is there a way to do that with a single command? Any automatic way of handling the schema differences would be fine for me.

Here is what I tried so far.

0) If I load everything in the straightforward way, bq fails with an error:

bq load --source_format=AVRO myproject:mydataset.logs gs://mybucket/logs/*
Waiting on bqjob_r4e484dc546c68744_0000015bcaa30f59_1 ... (4s) Current status: DONE   
BigQuery error in load operation: Error processing job 'iow-rnd:bqjob_r4e484dc546c68744_0000015bcaa30f59_1': The Apache Avro library failed to read data with the follwing error: EOF reached

1) A quick search turns up the --schema_update_option=ALLOW_FIELD_ADDITION option, but adding it to the bq load job changes nothing. ALLOW_FIELD_RELAXATION makes no difference either.

2) The schema ID is part of each file name, so the files look like:

gs://mybucket/logs/*_schemaA_*
gs://mybucket/logs/*_schemaB_*

Unfortunately, bq load does not allow more than one asterisk (as the bq manual also states):

bq load --source_format=AVRO myproject:mydataset.logs gs://mybucket/logs/*_schemaA_*
BigQuery error in load operation: Error processing job 'iow-rnd:bqjob_r5e14bb6f3c7b6ec3_0000015bcaa641f3_1': Not found: Uris gs://otishutin-eu/imp/2016-06-27/*_schemaA_*

3) When I try to list the files explicitly, the list turns out to be too long, so bq load does not work either:

bq load --source_format=AVRO myproject:mydataset.logs $(gsutil ls gs://mybucket/logs/*_schemaA_* | xargs | tr ' ' ',')
Too many positional args, still have ['gs://mybucket/logs/log_schemaA_2658.avro,gs://mybucket/logs/log_schemaA_2659.avro,gs://mybucket/logs/log_schemaA_2660.avro,...

4) When I try to use the files as an external table and list them explicitly in the external table definition, I also get a "too many files" error:

BigQuery error in query operation: Table definition may not have more than 500 source_uris
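
For illustration, attempts 2) and 3) could be scripted roughly as follows. This is only a sketch under stated assumptions, not a confirmed solution: the regular expression assumes file names shaped like log_schemaA_2658.avro (as in the error output above), and the function names and the batch_size value are hypothetical. It groups the URIs by the schema ID in the file name, then builds one bq load command per batch, passing each batch as a single comma-separated argument. Building the argument list in a script rather than through shell substitution avoids shell word splitting, which is one plausible cause of the "Too many positional args" error.

```python
import re
from collections import defaultdict

# Assumed naming pattern, inferred from names such as log_schemaA_2658.avro.
SCHEMA_RE = re.compile(r"_(schema[^_/]+)_[^/]*$")


def group_by_schema(uris):
    """Group object URIs by the schema ID embedded in the file name."""
    groups = defaultdict(list)
    for uri in uris:
        match = SCHEMA_RE.search(uri)
        key = match.group(1) if match else "unknown"
        groups[key].append(uri)
    return dict(groups)


def build_load_commands(table, uris, batch_size=100):
    """Return one `bq load` argument list per batch of URIs.

    batch_size is a hypothetical value chosen for illustration only.
    """
    uris = list(uris)
    commands = []
    for start in range(0, len(uris), batch_size):
        batch = uris[start:start + batch_size]
        # All URIs of a batch go into ONE comma-separated positional argument.
        commands.append(
            ["bq", "load", "--source_format=AVRO", table, ",".join(batch)]
        )
    return commands


if __name__ == "__main__":
    sample = [
        "gs://mybucket/logs/log_schemaA_2658.avro",
        "gs://mybucket/logs/log_schemaA_2659.avro",
        "gs://mybucket/logs/log_schemaB_0001.avro",
    ]
    for schema_id, files in group_by_schema(sample).items():
        for cmd in build_load_commands("myproject:mydataset.logs", files):
            # Print only; running the command (e.g. with subprocess.run)
            # is left to the caller.
            print(schema_id, " ".join(cmd))
```

Each generated command would be a separate load job, so this does not by itself resolve the schema differences between groups; it only keeps files of one schema together and keeps each invocation short.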

I understand that I could first copy the files into separate folders and then process them folder by folder, and that is what I'm doing now as a last resort. However, this is only a small part of a data processing pipeline, and copying is not acceptable as a production solution.

Source: https://stackoverflow.com/questions/43746219/loading-avro-files-with-different-schemas-into-one-bigquery-table

Tags

schema

google-bigquery

avro