BigQuery deduplication query example explanation

Can anybody explain this BigQuery query for deduplication? Why do we need to use [OFFSET(0)]? I think it is used to take the first element of the aggregated array, right? Isn't th

1 answer

Happy的楠姐

2021-01-22 07:32
Let's start with some data we want to de-duplicate:
```
WITH table AS (SELECT * FROM UNNEST([STRUCT('001' AS id, 1 AS a, 2 AS b), ('002', 3,5), ('001', 1, 4)]))

SELECT *
FROM table t
```
Now, instead of *, I'm going to use t to refer to the whole row:
```
SELECT t
FROM table t
```
What happens if I group each of these rows by their id:
```
SELECT t.id, ARRAY_AGG(t) tt
FROM table t
GROUP BY 1
```
Now I have all the rows with the same id grouped together. But let me choose only one:
```
SELECT t.id, ARRAY_AGG(t LIMIT 1) tt
FROM table t
GROUP BY 1
```
That might look good, but that's still one row inside one array. How can I get only the row, and not an array:
```
SELECT t.id, ARRAY_AGG(t LIMIT 1)[OFFSET(0)] tt
FROM table t
GROUP BY 1
```
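To see what [OFFSET(0)] does in isolation, here is a minimal sketch on a literal array (the values and column aliases are made up for illustration). OFFSET is zero-based and ORDINAL is one-based, so both expressions below pick the first element:

```sql
-- Illustrative only: index into a literal array.
SELECT
  [10, 20, 30][OFFSET(0)]  AS first_zero_based,  -- 10
  [10, 20, 30][ORDINAL(1)] AS first_one_based    -- 10
```

Because ARRAY_AGG(t LIMIT 1) always yields a one-element array for each group, [OFFSET(0)] simply unwraps that single row.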
And if I want to get back the row itself, without the separate grouping id column or the tt prefix:
```
SELECT tt.*
FROM (
  SELECT t.id, ARRAY_AGG(t LIMIT 1)[OFFSET(0)] tt
  FROM table t
  GROUP BY 1
)
```
And that's how you de-duplicate rows based on their ids.

If you need to choose a particular row, for example the newest one given a timestamp, just order the aggregation, as in ARRAY_AGG(t ORDER BY timestamp DESC LIMIT 1).
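Putting it together, a self-contained sketch of the ordered variant might look like this. The ts column and its values are assumptions added to the sample data for illustration; they are not part of the original example.

```sql
-- Sample data from above, extended with a hypothetical ts column.
WITH table AS (
  SELECT * FROM UNNEST([
    STRUCT('001' AS id, 1 AS a, 2 AS b, TIMESTAMP '2021-01-01 00:00:00+00' AS ts),
    ('002', 3, 5, TIMESTAMP '2021-01-02 00:00:00+00'),
    ('001', 1, 4, TIMESTAMP '2021-01-03 00:00:00+00')
  ])
)

-- Keep only the newest row per id.
SELECT tt.*
FROM (
  SELECT ARRAY_AGG(t ORDER BY t.ts DESC LIMIT 1)[OFFSET(0)] tt
  FROM table t
  GROUP BY t.id
)
```

With this data, the query should return one row per id, and for id '001' it should be the row with b = 4, since that row has the later ts. Without the ORDER BY, which of the two '001' rows survives is not guaranteed.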