I am trying to run a query on a 12 GB CSV file loaded into Google BigQuery, but I can't run any query on the dataset. I am not sure if the dataset is loaded correctly. It shows a
job.errors contains the detailed errors for the job.
This doesn't appear to be documented anywhere, but you can see it in the source code: https://googlecloudplatform.github.io/google-cloud-python/0.20.0/_modules/google/cloud/bigquery/job.html (Ctrl+F for _AsyncJob).
So your wait_for_job code could look like this:

    import time

    def wait_for_job(job):
        # Poll until the job reaches the DONE state, then surface any errors.
        while True:
            job.reload()  # refresh the job's state from the BigQuery API
            if job.state == 'DONE':
                if job.error_result:
                    raise RuntimeError(job.errors)
                return
            time.sleep(1)
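For context, here is how you might invoke it after starting a load job. This is a rough sketch against the 0.20-era google-cloud-python API linked above; the dataset, table, job name, and GCS URI are all hypothetical:

    from google.cloud import bigquery

    client = bigquery.Client()
    table = client.dataset('my_dataset').table('my_table')  # hypothetical names
    job = client.load_table_from_storage(
        'my-load-job-001',            # hypothetical job name
        table,
        'gs://my-bucket/data.csv')    # hypothetical source URI
    job.begin()        # submit the load job
    wait_for_job(job)  # raises RuntimeError with job.errors on failure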
To get more info on the errors, try this from the CLI:

    bq show -j <jobid>

It prints the status and/or detailed error information.
To list all the job ids:

    bq ls -j
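Putting the two together, a typical debugging pass looks like this (the job id below is hypothetical; yours will come from the bq ls -j output):

    bq ls -j                            # list recent job ids and their states
    bq show -j bqjob_r1234_00000001     # hypothetical id; prints status and errors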
I had the same issue following the instructions in the GCP docs. It failed on the second bq load, but not the first.
I found that repeating the job in the BigQuery web interface with the ignore unknown values option selected fixed it. I have not spotted any errors in the data yet, but I am just getting started looking at it.
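The same option is available from the command line as a flag on bq load. A minimal sketch, assuming a hypothetical dataset, table, and bucket:

    bq load --ignore_unknown_values --source_format=CSV \
        mydataset.mytable gs://my-bucket/data.csv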
Another trick: if you use CSV files with a header line and want to load them with a defined schema, you need to add the option --skip_leading_rows=1 to the submit command (example: bq load --skip_leading_rows=1 --source_format=CSV ...). Without this option, BigQuery will parse your first row (the header line) as a data row, which may lead to a TYPE MISMATCH error (e.g., your defined schema for a column is FLOAT, but the column's name is a STRING, so bq load tries to parse the column name as a FLOAT value).
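Here is a fuller sketch of such a load command; the dataset, table, bucket, and two-column schema are all hypothetical:

    bq load --skip_leading_rows=1 --source_format=CSV \
        mydataset.mytable \
        gs://my-bucket/data.csv \
        name:STRING,price:FLOAT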
This seems to be a known bug at Google. They have already made the fix, but have not pushed it to production yet: https://code.google.com/p/google-bigquery/issues/detail?id=621
So it looks like you're querying a CSV file that hasn't actually been loaded into BigQuery; it is only referenced by a federated table, and the data itself still lives in Google Cloud Storage.
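For anyone unfamiliar with federated tables, they can be created over a file in GCS with bq mk; a minimal sketch, where the schema and table name are hypothetical and only the GCS URI comes from the error messages below:

    bq mk --external_table_definition=col1:STRING,col2:FLOAT@CSV=gs://syntheticpopulation-storage/Alldatamerged_Allgrps.csv \
        mydataset.my_federated_table

BigQuery then scans the file in place on every query instead of reading data loaded into BigQuery storage.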
It looks like there were errors in the underlying CSV file:
Too many value in row starting at position:11398444388 in file:gs://syntheticpopulation-storage/Alldatamerged_Allgrps.csv
Too many value in row starting at position:9252859186 in file:gs://syntheticpopulation-storage/Alldatamerged_Allgrps.csv
...
Please let me know if this is enough to diagnose the issue. I believe you can see those messages as warnings on the query job if you look at the query history.
I've filed three bugs internally: