BigQuery deduplication query example explanation

遇见更好的自我 2021-01-22 07:00

Can anybody explain this BigQuery query for deduplication? Why do we need to use [OFFSET(0)]? I think it is used to take the first element of the aggregated array, right? Isn't th

1 answer
  • 2021-01-22 07:32

    Let's start with some data we want to de-duplicate:

    WITH table AS (SELECT * FROM UNNEST([STRUCT('001' AS id, 1 AS a, 2 AS b), ('002', 3,5), ('001', 1, 4)]))
    
    SELECT *
    FROM table t
    

    Now, instead of *, I'm going to use t to refer to the whole row:

    SELECT t
    FROM table t
    

    What happens if I group these rows by their id?

    SELECT t.id, ARRAY_AGG(t) tt
    FROM table t
    GROUP BY 1
    

    Now I have all the rows with the same id grouped together. But let me choose only one:

    SELECT t.id, ARRAY_AGG(t LIMIT 1) tt
    FROM table t
    GROUP BY 1
    

    That might look good, but it's still one row wrapped inside an array. How can I get only the row, and not an array? Index into the array with [OFFSET(0)], which returns its first (zero-based) element:

    SELECT t.id, ARRAY_AGG(t LIMIT 1)[OFFSET(0)] tt
    FROM table t
    GROUP BY 1
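
    To see what [OFFSET(0)] does in isolation, here is a minimal sketch on a literal array (the values are made up for illustration). OFFSET counts from zero and ORDINAL counts from one, so both expressions below read the first element:

    ```sql
    SELECT
      [10, 20, 30][OFFSET(0)]  AS first_by_offset,   -- zero-based index
      [10, 20, 30][ORDINAL(1)] AS first_by_ordinal   -- one-based index
    ```

    Both columns come back as 10. In the de-duplication query the array holds exactly one row because of LIMIT 1, so [OFFSET(0)] simply unwraps that row.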
    

    And if I want to get the row back without the separate grouping id column or the tt prefix:

    SELECT tt.*
    FROM (
      SELECT t.id, ARRAY_AGG(t LIMIT 1)[OFFSET(0)] tt
      FROM table t
      GROUP BY 1
    )
    

    And that's how you de-duplicate rows based on their ids.
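
    Putting the steps together, a self-contained version of the final query with the sample data inlined might look like the sketch below. Note that without an ORDER BY inside ARRAY_AGG, which of the two '001' rows survives is not guaranteed.

    ```sql
    -- Sample data: id '001' appears twice.
    WITH table AS (
      SELECT * FROM UNNEST([
        STRUCT('001' AS id, 1 AS a, 2 AS b),
        ('002', 3, 5),
        ('001', 1, 4)
      ])
    )
    -- Keep one arbitrary row per id, then expand it back into columns.
    SELECT tt.*
    FROM (
      SELECT t.id, ARRAY_AGG(t LIMIT 1)[OFFSET(0)] tt
      FROM table t
      GROUP BY 1
    )
    ```

    This should return two rows, one per distinct id.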

    If you need to choose a particular row, for example the newest one according to a timestamp column, order the aggregation, as in ARRAY_AGG(t ORDER BY timestamp DESC LIMIT 1).
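
    As a sketch of that ordered variant, assume the rows carry a timestamp column. The column name ts and the sample values here are hypothetical, chosen only for illustration:

    ```sql
    -- Sample data: id '001' appears twice with different timestamps.
    WITH table AS (
      SELECT * FROM UNNEST([
        STRUCT('001' AS id, 2 AS b, TIMESTAMP '2021-01-01 00:00:00+00' AS ts),
        ('002', 5, TIMESTAMP '2021-01-02 00:00:00+00'),
        ('001', 4, TIMESTAMP '2021-01-03 00:00:00+00')
      ])
    )
    -- Keep the newest row per id: sort each group by ts descending, take the first.
    SELECT tt.*
    FROM (
      SELECT t.id, ARRAY_AGG(t ORDER BY t.ts DESC LIMIT 1)[OFFSET(0)] tt
      FROM table t
      GROUP BY 1
    )
    ```

    With this data, the row kept for id '001' should be the one with b = 4, since it has the later timestamp.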
