Cannot insert new values into BigQuery table after adding a new column using the streaming API

别跟我提以往 2021-02-15 14:45

I'm seeing some strange behaviour with my BigQuery table. I've just added a new column to the table; it looks good in the interface and when fetching the schema via the API.

2 answers
  • 2021-02-15 15:24

    I was running into this error. It turned out that I was building the insert object as if I were in "raw" mode but had forgotten to set the flag raw: true. This caused BigQuery to take my insert data and nest it again under a json: {} node.

    In other words, I was doing this:

    table.insert({
        insertId: 123,
        json: {
            col1: '1',
            col2: '2',
        }
    });
    

    when I should have been doing this:

    table.insert({
        insertId: 123,
        json: {
            col1: '1',
            col2: '2',
        }
    }, {raw: true});
    

    The Node BigQuery library didn't realize that it was already in raw mode and was then trying to insert this:

    {
        insertId: '<generated value>',
        json: {
            insertId: 123,
            json: {
                col1: '1',
                col2: '2',
            }
        }
    }
    

    So in my case the errors were telling me that the insert expected my schema to have two columns (insertId and json).
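
    For reference, the non-raw form would also have worked. This is only a minimal sketch, assuming the @google-cloud/bigquery Node.js client and hypothetical dataset/table names; without {raw: true} you pass plain row objects and the client builds the insertId/json wrapper for you:

    const {BigQuery} = require('@google-cloud/bigquery');

    async function insertPlainRows() {
        // Hypothetical names; replace with your own dataset and table.
        const table = new BigQuery().dataset('my_dataset').table('my_table');

        // Plain row objects: the client wraps each one as {insertId, json}
        // before sending the streaming insert request.
        await table.insert([{col1: '1', col2: '2'}]);
    }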

  • 2021-02-15 15:40

    Updating this answer since BigQuery's streaming system has seen significant updates since Aug 2014 when this question was originally answered.


    BigQuery's streaming system caches the table schema for up to 2 minutes. When you add a field to the schema and then immediately stream new rows to the table, you may encounter a "no such field" error because the streaming backend is still using the cached (old) schema.

    The best way to avoid this error is to delay streaming rows with the new field for 2 minutes after modifying your table.

    If that's not possible, you have a few other options:

    1. Use the ignoreUnknownValues option. This flag tells the insert operation to ignore unknown fields and accept only the fields it recognizes. Setting this flag lets you start streaming records with the new field immediately while avoiding the "no such field" error during the 2-minute window, but note that the new field values will be silently dropped until the cached table schema updates!

    2. Use the skipInvalidRows option. This flag tells the insert operation to insert as many rows as it can instead of failing the entire operation when a single invalid row is detected. This option is useful if only some of your data contains the new field, since you can continue inserting rows in the old format and decide separately how to handle the failed rows (either with ignoreUnknownValues or by waiting for the 2-minute window to pass). Both options are combined in the sketch after this list.
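
    As a minimal sketch of how the two options can be passed together, assuming the same @google-cloud/bigquery Node.js client used in the other answer and hypothetical dataset/table names; the client forwards both flags to the streaming insert request:

    const {BigQuery} = require('@google-cloud/bigquery');

    async function streamDuringSchemaWindow(rows) {
        // Hypothetical names; replace with your own dataset and table.
        const table = new BigQuery().dataset('my_dataset').table('my_table');

        await table.insert(rows, {
            ignoreUnknownValues: true, // drop fields the cached schema doesn't recognize yet
            skipInvalidRows: true,     // insert the valid rows even if some rows fail
        });
    }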

    If you must capture all values and cannot wait for 2 minutes, you can create a new table with the updated schema and stream to that table. The downside is that you then have to manage the multiple tables this approach generates. Note that you can query these tables conveniently using TABLE_QUERY, and you can run periodic cleanup queries (or table copies) to consolidate your data into a single table.
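
    As a sketch of that consolidation query, assuming hypothetical tables named events_v1, events_v2, ... in my_dataset, and the same Node.js client; TABLE_QUERY is a legacy SQL function, so the query must run with legacy SQL enabled:

    const {BigQuery} = require('@google-cloud/bigquery');

    async function queryAcrossTableVersions() {
        // TABLE_QUERY only exists in legacy SQL, hence useLegacySql: true.
        const [rows] = await new BigQuery().query({
            query:
                'SELECT * FROM (TABLE_QUERY(my_dataset, \'table_id CONTAINS "events_v"\'))',
            useLegacySql: true,
        });
        return rows;
    }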

    Historical note: A previous version of this answer suggested that users stop streaming, move the existing data to another table, re-create the streaming table, and restart streaming. However, due to the complexity of this approach and the shortened window for the schema cache, this approach is no longer recommended by the BigQuery team.
