BigQuery flattens when using field with same name as repeated field

Happy的楠姐 2021-01-26 22:10

Edited to use public dataset

I have a table with the following schema, which you can access here: https://bigquery.cloud.google.com/table/reals

3 Answers
  • 2021-01-26 22:51

    Thanks for sharing a dataset @alan! Let's see how it looks:

    It's an interesting table: It has 3 columns and 3 rows (tiny, but a normal SQL table). The interesting part is that the 3rd column can host nested records. On the first row it has nothing (null), the second row only has 1 value, and the third row has 5 different values nested.

    Things get interesting when you start counting by column:

    SELECT COUNT(*) 
    FROM [realself-main:rs_public.test_count]
    3
    

    That makes sense, the dataset has 3 rows.

    SELECT COUNT(dr_id) 
    FROM [realself-main:rs_public.test_count]
    3
    

    That also makes sense, there are 3 dr_id.

    SELECT COUNT(cover_photos.is_published) 
    FROM [realself-main:rs_public.test_count]
    6
    

    Now things got more interesting. It's 6, because there are 6 values for cover_photos.is_published (the null one doesn't count).

    SELECT COUNT(cover_photos.is_published), COUNT(dr_id)
    FROM [realself-main:rs_public.test_count]
    6   3
    

    This still makes sense: 6 cover_photos.is_published, 3 dr_id.
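The counts so far can be reproduced with a small in-memory model. This is only a sketch in Python, not how BigQuery evaluates the query: the `rows` list and the helper names are hypothetical, and which `dr_id` owns which nested values is an assumption (the answer only states that the rows hold 0, 1 and 5 values).

```python
# Hypothetical in-memory copy of the 3-row table. The pairing of dr_id
# values with nested values is assumed for illustration only.
rows = [
    {"dr_id": 1234, "is_published": []},                                 # nothing nested (null)
    {"dr_id": 4321, "is_published": [True]},                             # 1 nested value
    {"dr_id": 9999, "is_published": [True, True, False, False, False]},  # 5 nested values
]


def count_star(table):
    """Model of COUNT(*) on the unflattened table: one per top-level row."""
    return len(table)


def count_dr_id(table):
    """Model of COUNT(dr_id): one per row with a non-null dr_id."""
    return sum(1 for row in table if row["dr_id"] is not None)


def count_is_published(table):
    """Model of COUNT(cover_photos.is_published): one per nested value."""
    return sum(len(row["is_published"]) for row in table)


print(count_star(rows), count_dr_id(rows), count_is_published(rows))  # prints: 3 3 6
```

The row with no nested values contributes nothing to the third count, which is why it is 6 and not 7.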

    SELECT COUNT(*) 
    FROM (
      SELECT cover_photos.is_published, dr_id
      FROM [realself-main:rs_public.test_count]
    )
    3
    

    This is interesting too: If we do a sub-query, COUNT(*) looks at the number of rows returned. There were 3 rows returned. That still makes sense.

    But then:

    SELECT COUNT(*), COUNT(cover_photos.is_published)
    FROM (
      SELECT cover_photos.is_published, dr_id
      FROM [realself-main:rs_public.test_count]
    )
    7   6   
    

    7 and 6. Seven? Why 7?

    Well, BigQuery had to choose a flatten strategy for the subquery. Look at the table I pasted up there - can you see how it kind of has 7 rows? Those are the seven rows counted.

    Let's look at them explicitly:

    SELECT dr_id, cover_photos.is_published
    FROM (
      SELECT cover_photos.is_published, dr_id
      FROM [realself-main:rs_public.test_count]
    )
    

    See? Those are the seven rows. When a query selects a column that holds nested records (a nice BigQuery feature), BigQuery sometimes needs to flatten the data to process it. The first 2 rows were flattened to exactly 2 rows (one with cover_photos.is_published as null), and the 3rd row was flattened to 5 rows, one for each of its cover_photos.is_published values.
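The flattening described here can be sketched in Python. The `flatten` helper and the `rows` data are hypothetical stand-ins that mirror the description (one output row per nested value, and one null row when there are none); they are not BigQuery's actual implementation.

```python
# Hypothetical in-memory copy of the table; the pairing of dr_id values
# with nested values is assumed for illustration only.
rows = [
    {"dr_id": 1234, "is_published": []},
    {"dr_id": 4321, "is_published": [True]},
    {"dr_id": 9999, "is_published": [True, True, False, False, False]},
]


def flatten(table):
    """Yield one flat row per nested value.

    A row with no nested values still yields a single flat row whose
    is_published is None, matching the null row described in the text.
    """
    for row in table:
        values = row["is_published"] or [None]
        for value in values:
            yield {"dr_id": row["dr_id"], "is_published": value}


flat = list(flatten(rows))

# COUNT(*) over the flattened rows, and COUNT(is_published) skipping nulls.
total_rows = len(flat)
published_count = sum(1 for r in flat if r["is_published"] is not None)

print(total_rows, published_count)  # prints: 7 6
```

Under this model, the 7 is 1 + 1 + 5 flat rows, and the 6 is the same rows minus the single null.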

    The moral of the story is to be careful when working with nested data: some queries will flatten it in ways that are unexpected to the user, but that make a lot of sense to a computer when it's trying to decide how to process the query.


    By request, let's go deeper:

    Look at the difference between these 2 queries:

    SELECT COUNT(*)
    FROM (
      SELECT *
      FROM (
        SELECT * FROM [realself-main:rs_public.test_count]  
        WHERE is_published
      ) 
    )
    
    SELECT COUNT(*)
    FROM (
      SELECT *
      FROM (
        SELECT * FROM [realself-main:rs_public.test_count]  
      ) 
    )
    WHERE is_published
    

    Before looking at the results, can you guess what results each query will give you? No, you can't. Both queries are ambiguous, so in order to get an answer BigQuery will need to make some guesses and optimizations.

    The result for the first query is 7, and for the second one is 3. Go and try.

    What's the difference? Well, from looking at the results of these queries, I can tell that in the second one BigQuery saw that the only column you are interested in is 'is_published', so it optimized the query tree so that only that column is read. BigQuery has a harder time optimizing the first query, so it guesses "maybe they really want *, and * means I need to flatten the table before passing it to the next layer". It flattens the table, so the outermost query later sees 7 rows.

    Are any of these results guaranteed? No - the queries are ambiguous. How do you reduce the ambiguity? Instead of using "SELECT *", tell BigQuery which columns you want, so it doesn't need to guess for you.

  • 2021-01-26 22:57

    If a query can be interpreted in more than one way, BigQuery will do its best to guess your intentions, sometimes producing inconsistent results. This is true of every database, since SQL leaves room for these ambiguities.

    Solution: Eliminate ambiguity from your queries - probably both results are correct, depending on what you are trying to count.

    (Eliminate ambiguity by not using * and by making the prefix explicit; you can also state explicitly how you want the table flattened.)

    I would really like to comment on your specific data and results, but given that you haven't provided a public sample, I can't.

  • 2021-01-26 23:07

    I'm adding a new answer, as you keep adding elements to the question - they all deserve a different answer.

    You say this query surprises you:

    SELECT COUNT(*), COUNT(0)
    FROM (
      SELECT dr_id, cover_photos.is_published
      FROM [realself-main:rs_public.test_count] )
    

    You are surprised because the results are 7 and 3.

    Maybe it will make sense if I try this:

    SELECT COUNT(*), COUNT(0), 
           GROUP_CONCAT(STRING(cover_photos.is_published)),
           GROUP_CONCAT(STRING(dr_id)), 
           GROUP_CONCAT(IFNULL(STRING(cover_photos.is_published),'null')),
           GROUP_CONCAT("0")
    FROM (
      SELECT dr_id, cover_photos.is_published
      FROM [realself-main:rs_public.test_count] 
    )
    

    See? It's the same query, plus 4 different aggregations of the same sub-columns, one of which consists of nested repeated data, and that also has a null value in one row.

    The results of the query are:

    7   3   1,1,1,0,0,0 1234,4321,9999  null,1,1,1,0,0,0    0,0,0
    

    The 7 comes from the full expansion of the nested data into 7 rows, as the 5th column hints. The 3 comes from just evaluating "0" three times, as can be seen on the 6th column.
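The difference between the 3rd and 5th result columns can be modeled in Python. The `group_concat` helper below is a hypothetical stand-in that skips nulls, which is what the output above suggests; the order of `flat_values` is taken from the 5th result column, and none of this is BigQuery's actual implementation.

```python
def group_concat(values):
    """Join values with commas, skipping None (assumed null handling)."""
    return ",".join(str(v) for v in values if v is not None)


# The 7 flattened is_published values, with the null row first.
flat_values = [None, 1, 1, 1, 0, 0, 0]

# Model of GROUP_CONCAT(STRING(cover_photos.is_published)): the null is dropped.
without_nulls = group_concat(flat_values)

# Model of GROUP_CONCAT(IFNULL(STRING(...), 'null')): the null becomes visible.
with_nulls = group_concat("null" if v is None else v for v in flat_values)

# Model of GROUP_CONCAT("0"): the constant evaluated once per original row.
constants = group_concat(["0"] * 3)

print(without_nulls)  # prints: 1,1,1,0,0,0
print(with_nulls)     # prints: null,1,1,1,0,0,0
print(constants)      # prints: 0,0,0
```

Only the version that replaces nulls shows all 7 flattened rows, which is why the 5th column is the one that explains the COUNT(*) of 7.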

    These subtleties all come from working with nested repeated data. I'd advise you not to work with nested repeated data until you are ready to accept that they can happen.
