Question
I'm loading data from Google Cloud Storage to BigQuery using GoogleCloudStorageToBigQueryOperator.
The JSON files may contain more columns than I defined in the schema. In that case I want the load job to continue and simply ignore the unrecognized columns.
I tried the ignore_unknown_values
argument, but it made no difference.
My operator:
def dc():
    return [
        {
            "name": "id",
            "type": "INTEGER",
            "mode": "NULLABLE"
        },
        {
            "name": "storeId",
            "type": "INTEGER",
            "mode": "NULLABLE"
        },
        ...
    ]
gcs_to_bigquery_st = GoogleCloudStorageToBigQueryOperator(
    dag=dag,
    task_id='load_to_BigQuery_stage',
    bucket=GCS_BUCKET_ID,
    destination_project_dataset_table=table_name_template_st,
    source_format='NEWLINE_DELIMITED_JSON',
    source_objects=[gcs_export_uri_template],
    ignore_unknown_values=True,
    schema_fields=dc(),
    create_disposition='CREATE_IF_NEEDED',
    write_disposition='WRITE_APPEND',
    skip_leading_rows=1,
    google_cloud_storage_conn_id=CONNECTION_ID,
    bigquery_conn_id=CONNECTION_ID
)
The error:
u'Error while reading data, error message: JSON parsing error in row starting at position 0: No such field: shippingService.',
which is true: shippingService doesn't exist in my schema and won't be added to the table.
How can I fix this?
Edit:
I removed schema_fields=dc()
from the operator:
gcs_to_bigquery_st = GoogleCloudStorageToBigQueryOperator(
    dag=dag,
    task_id='load_to_BigQuery_stage',
    bucket=GCS_BUCKET_ID,
    destination_project_dataset_table=table_name_template_st,
    source_format='NEWLINE_DELIMITED_JSON',
    source_objects=[gcs_export_uri_template],
    ignore_unknown_values=True,
    create_disposition='CREATE_IF_NEEDED',
    write_disposition='WRITE_APPEND',
    skip_leading_rows=1,
    google_cloud_storage_conn_id=CONNECTION_ID,
    bigquery_conn_id=CONNECTION_ID
)
It still gives the same error. This doesn't make sense... it is explicitly told to ignore unknown values :(
Answer 1:
The only reason I can think of is that you are probably using Airflow 1.9; the ignore_unknown_values parameter was only added in Airflow 1.10.
In Airflow 1.9, however, you can get the same behavior by passing src_fmt_configs={'ignoreUnknownValues': True}
instead:
gcs_to_bigquery_st = GoogleCloudStorageToBigQueryOperator(
    dag=dag,
    task_id='load_to_BigQuery_stage',
    bucket=GCS_BUCKET_ID,
    destination_project_dataset_table=table_name_template_st,
    source_format='NEWLINE_DELIMITED_JSON',
    source_objects=[gcs_export_uri_template],
    src_fmt_configs={'ignoreUnknownValues': True},
    create_disposition='CREATE_IF_NEEDED',
    write_disposition='WRITE_APPEND',
    skip_leading_rows=1,
    google_cloud_storage_conn_id=CONNECTION_ID,
    bigquery_conn_id=CONNECTION_ID
)
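For context (this note is not part of the original answer): src_fmt_configs is passed through to the underlying BigQuery load job, and ignoreUnknownValues is the corresponding field name in BigQuery's jobs.load configuration, which makes the extra JSON keys get dropped instead of failing the row. If you want to confirm the behavior independently of your Airflow version, here is a minimal sketch using the google-cloud-bigquery client; the bucket, project, dataset, and table names are hypothetical placeholders:

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
# Maps to 'ignoreUnknownValues' in the load job configuration:
# fields not present in the table schema are ignored.
job_config.ignore_unknown_values = True
job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND

load_job = client.load_table_from_uri(
    'gs://my-bucket/export/*.json',      # hypothetical source URI
    'my-project.my_dataset.my_table',    # hypothetical destination table
    job_config=job_config,
)
load_job.result()  # wait for completion; raises if rows still fail to parse

If the same file loads cleanly this way, the remaining problem is on the Airflow side (version or parameter plumbing) rather than in BigQuery itself.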
Source: https://stackoverflow.com/questions/52499560/how-to-ignore-an-unknown-column-when-loading-to-bigquery-using-airflow