How to best process large query results written to an intermediate table in App Engine

情书的邮戳 2021-01-03 08:30

We are running large query jobs where we hit the 128 MB response size limit and BigQuery raises the "Response too large to return. Consider setting allowLargeResults to true in your job configuration" error.

3 Answers
  • 2021-01-03 09:08

    It is not clear what exactly the reason for "chunk processing" is. If you have some complex SQL logic that needs to run against your data and the result happens to be bigger than the current 128 MB limit, just run it with allowLargeResults (which writes to a destination table) and then consume the result the way you need to. Of course you most likely have a reason for chunking, but since it is not explained, answering is problematic.
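    For illustration only, here is a rough sketch of such a query job using the BigQuery v2 REST API via google-api-python-client; all project, dataset and table names are placeholders, not anything from the original question:

    # Sketch only - assumes google-api-python-client and the BigQuery v2 API;
    # every project/dataset/table name below is a placeholder.
    from googleapiclient.discovery import build

    bq = build('bigquery', 'v2')  # authentication omitted for brevity

    job = {
        'configuration': {
            'query': {
                'query': 'SELECT field_a, field_b FROM [my_dataset.my_source_table]',
                'allowLargeResults': True,
                # allowLargeResults requires an explicit destination table
                'destinationTable': {
                    'projectId': 'my-project',
                    'datasetId': 'my_dataset',
                    'tableId': 'my_intermediate_table',
                },
                'writeDisposition': 'WRITE_TRUNCATE',
            }
        }
    }
    bq.jobs().insert(projectId='my-project', body=job).execute()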
    Another suggestion is not to pack many questions into one post; that makes answering very difficult and greatly lowers your chance of getting an answer.

    Finally, my answer to the only question that is relatively clear (to me at least):
    

    The question here is what is the best way to chunk the rows (is there an efficient way to do this, e.g. is there an internal row number we can refer to?). We probably end up scanning the entire table for each chunk so this seems more costly than the export to GCS option

    It depends on how and when your table was created!
    If your table was loaded as one big load, I don't see a way to avoid scanning it again and again.
    If the table was loaded in increments, and recently, you have a chance to take advantage of so-called Table Decorators (specifically, look at Range Decorators).
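    As an illustration (not part of the original answer), a range decorator in legacy SQL limits the scan to rows added within a time window, expressed in milliseconds since the epoch; the table name below is a placeholder:

    # Sketch only: query just the data added to the table in the last hour
    # using a legacy-SQL range decorator (@<start_ms>-<end_ms>).
    import time

    from googleapiclient.discovery import build

    bq = build('bigquery', 'v2')  # authentication omitted for brevity

    now_ms = int(time.time() * 1000)
    hour_ago_ms = now_ms - 3600 * 1000

    query = ('SELECT field_a, field_b '
             'FROM [my_dataset.my_incremental_table@%d-%d]' % (hour_ago_ms, now_ms))
    result = bq.jobs().query(projectId='my-project',
                             body={'query': query}).execute()

    Because only the decorated range is read, each chunk scans that slice rather than the whole table.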

    In the very early era of BigQuery there was an expectation of Partitioned Decorators; these would address a lot of users' needs, but they are still not available and I don't know what the plans for them are.

  • 2021-01-03 09:22

    I understand that https://cloud.google.com/bigquery/docs/reference/v2/tabledata/list will let you read chunks of a table without running a query (and therefore without incurring query processing charges).

    This also lets you read the results of a query in parallel: every query is written to a temporary table, whose id you can pass to this method while supplying different ranges (with startIndex and maxResults).
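    A hedged sketch of that pattern with google-api-python-client; the job id, indexes and all names are placeholders:

    # Sketch only: page through a query's destination table with tabledata.list,
    # which does not run a query and so incurs no query processing charges.
    from googleapiclient.discovery import build

    bq = build('bigquery', 'v2')  # authentication omitted for brevity

    # Every query job records its destination table (an anonymous temporary
    # table if you did not set one yourself).
    job = bq.jobs().get(projectId='my-project', jobId='my-query-job-id').execute()
    dest = job['configuration']['query']['destinationTable']

    # Each worker asks for its own slice via startIndex/maxResults.
    page = bq.tabledata().list(projectId=dest['projectId'],
                               datasetId=dest['datasetId'],
                               tableId=dest['tableId'],
                               startIndex=0,       # this worker's first row
                               maxResults=10000).execute()
    for row in page.get('rows', []):
        values = [cell['v'] for cell in row['f']]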

  • 2021-01-03 09:29

    Note that BigQuery can export data in chunks, and you can request as many chunks as you have workers.

    From https://cloud.google.com/bigquery/exporting-data-from-bigquery#exportingmultiple:

    If you ask to export to:

    ['gs://my-bucket/file-name.json']
    

    you will get an export in one GCS file, as long as it's less than 1GB.

    If you ask to export to:

    ['gs://my-bucket/file-name-*.json']
    

    you will get several files with each having a chunk of the total export. Useful when exporting more than 1GB.

    If you ask to export to:

    ['gs://my-bucket/file-name-1-*.json',
    'gs://my-bucket/file-name-2-*.json',
    'gs://my-bucket/file-name-3-*.json']
    

    you will get exports optimized for 3 workers. Each of these patterns will receive a series of exported chunks, so each worker can focus on its own chunks.
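    For illustration, a sketch of the corresponding extract job configuration via the BigQuery v2 API; the bucket and table names are placeholders:

    # Sketch only: export one table into three series of files, one wildcard
    # pattern per worker.
    from googleapiclient.discovery import build

    bq = build('bigquery', 'v2')  # authentication omitted for brevity

    job = {
        'configuration': {
            'extract': {
                'sourceTable': {
                    'projectId': 'my-project',
                    'datasetId': 'my_dataset',
                    'tableId': 'my_intermediate_table',
                },
                'destinationUris': [
                    'gs://my-bucket/file-name-1-*.json',
                    'gs://my-bucket/file-name-2-*.json',
                    'gs://my-bucket/file-name-3-*.json',
                ],
                'destinationFormat': 'NEWLINE_DELIMITED_JSON',
            }
        }
    }
    bq.jobs().insert(projectId='my-project', body=job).execute()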
