Can anybody explain this BigQuery query for deduplication? Why do we need to use [OFFSET(0)]? I think it is used to take the first element in the aggregation array, right? Isn't th
Let's start with some data we want to de-duplicate:
WITH table AS (SELECT * FROM UNNEST([STRUCT('001' AS id, 1 AS a, 2 AS b), ('002', 3,5), ('001', 1, 4)]))
SELECT *
FROM table t
Now, instead of *, I'm going to use t to refer to the whole row:
SELECT t
FROM table t
What happens if I group each of these rows by their id:
SELECT t.id, ARRAY_AGG(t) tt
FROM table t
GROUP BY 1
Now I have all the rows with the same id grouped together. But let me choose only one:
SELECT t.id, ARRAY_AGG(t LIMIT 1) tt
FROM table t
GROUP BY 1
That might look good, but that's still one row inside one array. How can I get only the row, and not an array:
SELECT t.id, ARRAY_AGG(t LIMIT 1)[OFFSET(0)] tt
FROM table t
GROUP BY 1
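To see what the [OFFSET(0)] subscript does in isolation, here is a minimal sketch on a plain array. The array literal and the column aliases are made up for illustration and are not part of the original question. OFFSET is zero-based, ORDINAL is one-based, and the SAFE_ variants return NULL instead of raising an error when the index is out of range.

```sql
-- Illustrative only: the array values and aliases below are assumptions.
SELECT
  arr[OFFSET(0)]      AS first_by_offset,   -- zero-based: first element (10)
  arr[ORDINAL(1)]     AS first_by_ordinal,  -- one-based: the same first element (10)
  arr[SAFE_OFFSET(5)] AS out_of_range       -- index past the end: NULL, not an error
FROM (SELECT [10, 20, 30] AS arr)
```

In the de-duplication query the array always holds exactly one element per group because of LIMIT 1, so [OFFSET(0)] is simply the way to unwrap it.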
And if I want to get back a row without the grouping id, nor the tt prefix:
SELECT tt.*
FROM (
SELECT t.id, ARRAY_AGG(t LIMIT 1)[OFFSET(0)] tt
FROM table t
GROUP BY 1
)
And that's how you de-duplicate rows based on their ids.
If you need to choose a particular row, for example the newest one given a timestamp, just order the aggregation, as in ARRAY_AGG(t ORDER BY timestamp DESC LIMIT 1). Without an ORDER BY, which of the duplicate rows is kept is not guaranteed.
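Putting the pieces together, a self-contained version of the "keep the newest row" variant might look like the sketch below. The ts column and its values are assumptions added for illustration, since the sample data above has no timestamp.

```sql
-- Sample data; the ts column is hypothetical, added only to have something to order by.
WITH table AS (
  SELECT * FROM UNNEST([
    STRUCT('001' AS id, 1 AS a, 2 AS b, TIMESTAMP '2020-01-01 00:00:00' AS ts),
    ('002', 3, 5, TIMESTAMP '2020-01-02 00:00:00'),
    ('001', 1, 4, TIMESTAMP '2020-01-03 00:00:00')
  ])
)
SELECT tt.*
FROM (
  -- Per id: sort the rows newest first, keep one, and unwrap it from the array.
  SELECT t.id, ARRAY_AGG(t ORDER BY t.ts DESC LIMIT 1)[OFFSET(0)] tt
  FROM table t
  GROUP BY 1
)
```

With this data, id 001 should come back with a = 1 and b = 4 (its later row), and id 002 with its only row. The order of the output rows themselves is not guaranteed unless you add an outer ORDER BY.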