Question
I have a table in BigQuery with the following structure:
datetime | event | value
==========================
1 | add | 1
---------+--------+-------
2 | remove | 1
---------+--------+-------
6 | add | 2
---------+--------+-------
8 | add | 3
---------+--------+-------
11 | add | 4
---------+--------+-------
23 | remove | 3
---------+--------+-------
I'm trying to build a view that adds a list column to each row, containing the current state of the array. The array will never contain duplicate items. This should be the result:
datetime | event | value | list
===================================
1 | add | 1 | [1]
---------+--------+-------+--------
2 | remove | 1 | []
---------+--------+-------+--------
6 | add | 2 | [2]
---------+--------+-------+--------
8 | add | 3 | [2,3]
---------+--------+-------+--------
11 | add | 4 | [2,3,4]
---------+--------+-------+--------
23 | remove | 3 | [2,4]
---------+--------+-------+--------
I tried using analytic functions, but it didn't work out. The API for working with arrays is quite limited. I think I would succeed if I could use recursive WITH clauses; unfortunately, this is not possible in BigQuery.
I'm using BigQuery with standard SQL enabled.
Answer 1:
The version below is for BigQuery Standard SQL and uses pure SQL only (no JS UDF):
#standardSQL
WITH `project.dataset.events` AS (
SELECT 1 dt,'add' event,'1' value UNION ALL
SELECT 2, 'remove', '1' UNION ALL
SELECT 6, 'add', '2' UNION ALL
SELECT 8, 'add', '3' UNION ALL
SELECT 11, 'add', '4' UNION ALL
SELECT 23, 'remove', '3'
), cum AS (
SELECT dt, event, value,
SUM(IF(event = 'add', 1, -1)) OVER(PARTITION BY value ORDER BY dt) state
FROM `project.dataset.events`
), pre AS (
SELECT
a.dt, a.event, a.value, a.state, b.value AS b_value,
ARRAY_AGG(b.state ORDER BY b.dt DESC)[SAFE_OFFSET(0)] b_state,
MAX(b.dt) b_dt
FROM cum a
JOIN cum b ON b.dt <= a.dt
GROUP BY a.dt, a.event, a.value, a.state, b.value
)
SELECT dt, event, value,
SPLIT(IFNULL(STRING_AGG(IF(b_state = 1, b_value, NULL) ORDER BY b_dt), '')) list_as_array,
CONCAT('[', IFNULL(STRING_AGG(IF(b_state = 1, b_value, NULL) ORDER BY b_dt), ''), ']') list_as_string
FROM pre
GROUP BY dt, event, value
ORDER BY dt
The result is, "surprisingly" :o), exactly the same as in the JS UDF version that I posted previously:
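Row | dt | event  | value | list_as_array | list_as_string
==========================================================
1   | 1  | add    | 1     | 1             | [1]
2   | 2  | remove | 1     |               | []
3   | 6  | add    | 2     | 2             | [2]
4   | 8  | add    | 3     | 2, 3          | [2,3]
5   | 11 | add    | 4     | 2, 3, 4       | [2,3,4]
6   | 23 | remove | 3     | 2, 4          | [2,4]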
Note: I think the above might be a little over-engineered, but I didn't have time to polish or optimize it. That should be doable, and I'm leaving it up to the owner of the question.
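To make the idea behind the query easier to follow, here is a minimal JavaScript sketch of the same bookkeeping outside BigQuery. It is an illustration only, not part of the answer: the helper name listsByRunningState and the events array are hypothetical. It keeps a running add/remove balance per value (the state column) together with the time of that value's latest event (the b_dt column), and lists the values whose balance is 1.

```javascript
// Illustrative sample rows mirroring the test data in the query above.
const events = [
  { dt: 1, event: 'add', value: '1' },
  { dt: 2, event: 'remove', value: '1' },
  { dt: 6, event: 'add', value: '2' },
  { dt: 8, event: 'add', value: '3' },
  { dt: 11, event: 'add', value: '4' },
  { dt: 23, event: 'remove', value: '3' },
];

// Hypothetical helper (not a BigQuery API): for each row, update the running
// balance of its value, then collect every value whose balance is 1,
// ordered by the time of that value's latest event.
function listsByRunningState(rows) {
  const sorted = [...rows].sort((a, b) => a.dt - b.dt);
  const latest = new Map(); // value -> { state, dt }
  return sorted.map((row) => {
    const prev = latest.get(row.value);
    const delta = row.event === 'add' ? 1 : -1;
    latest.set(row.value, { state: (prev ? prev.state : 0) + delta, dt: row.dt });
    const list = [...latest.entries()]
      .filter(([, s]) => s.state === 1)
      .sort((a, b) => a[1].dt - b[1].dt)
      .map(([value]) => value);
    return { dt: row.dt, list };
  });
}

for (const { dt, list } of listsByRunningState(events)) {
  console.log(dt, '[' + list.join(',') + ']');
}
```

The sketch keeps only the latest state per value, whereas the SQL version rebuilds it for every row through the self-join on b.dt <= a.dt, which is where most of its cost is likely to come from.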
Answer 2:
Below is a version for BigQuery Standard SQL that uses a JS UDF (just one of potentially many options):
#standardSQL
CREATE TEMP FUNCTION CUST_ARRAY_AGG(arr ARRAY<STRUCT<event STRING, value STRING>>)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
var result = [];
for (var i = 0; i < arr.length; i++) {
if (arr[i].event == 'add') {
result.push(arr[i].value);
} else {
var index = result.indexOf(arr[i].value);
if (index > -1) {
result.splice(index, 1);
}
}
}
return result;
""";
WITH `project.dataset.events` AS (
SELECT 1 dt, 'add' event, '1' value UNION ALL
SELECT 2, 'remove', '1' UNION ALL
SELECT 6, 'add', '2' UNION ALL
SELECT 8, 'add', '3' UNION ALL
SELECT 11, 'add', '4' UNION ALL
SELECT 23, 'remove', '3'
)
SELECT dt, event, value,
CUST_ARRAY_AGG(arr) list_as_arr,
(SELECT CONCAT('[', IFNULL(STRING_AGG(item), ''), ']') FROM UNNEST(CUST_ARRAY_AGG(arr)) item) list_as_string
FROM (
SELECT dt, event, value,
ARRAY_AGG(STRUCT<event STRING, value STRING>(event, value)) OVER(ORDER BY dt) arr
FROM `project.dataset.events`
)
The result is as below:
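Row | dt | event  | value | list_as_arr | list_as_string
========================================================
1   | 1  | add    | 1     | 1           | [1]
2   | 2  | remove | 1     |             | []
3   | 6  | add    | 2     | 2           | [2]
4   | 8  | add    | 3     | 2, 3        | [2,3]
5   | 11 | add    | 4     | 2, 3, 4     | [2,3,4]
6   | 23 | remove | 3     | 2, 4        | [2,4]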
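Since the UDF body is plain JavaScript, its logic can be checked locally before it is pasted into a query. The following is a rough sketch under that assumption: custArrayAgg and sampleEvents are illustrative names, and the running prefixes built with slice only emulate what ARRAY_AGG(...) OVER(ORDER BY dt) passes to the UDF for each row.

```javascript
// Same add/remove replay as the body of CUST_ARRAY_AGG, wrapped in a
// plain function so it can run outside BigQuery (illustrative name).
function custArrayAgg(arr) {
  const result = [];
  for (let i = 0; i < arr.length; i++) {
    if (arr[i].event === 'add') {
      result.push(arr[i].value);
    } else {
      const index = result.indexOf(arr[i].value);
      if (index > -1) {
        result.splice(index, 1); // drop the first matching item only
      }
    }
  }
  return result;
}

// Sample rows copied from the test data in the query, already ordered by dt.
const sampleEvents = [
  { event: 'add', value: '1' },
  { event: 'remove', value: '1' },
  { event: 'add', value: '2' },
  { event: 'add', value: '3' },
  { event: 'add', value: '4' },
  { event: 'remove', value: '3' },
];

// Row i of the query receives the first i + 1 events as its `arr` argument.
const perRow = sampleEvents.map((_, i) => custArrayAgg(sampleEvents.slice(0, i + 1)));
console.log(JSON.stringify(perRow));
// prints [["1"],[],["2"],["2","3"],["2","3","4"],["2","4"]]
```

Each row replays the full prefix of events, so the total work grows quadratically with the number of rows; that is inherent in the window-plus-UDF approach rather than specific to this sketch.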
Answer 3:
Although this is already solved, I like the problem. The idea in this solution is to first collect the full history with a window function that takes all preceding rows into account, and then to drop the items that appear in a remove event:
#standardSQL
-- stolen test table ;)
WITH test AS (
SELECT 1 dt,'add' event,'1' value UNION ALL
SELECT 2, 'remove', '1' UNION ALL
SELECT 6, 'add', '2' UNION ALL
SELECT 8, 'add', '3' UNION ALL
SELECT 11, 'add', '4' UNION ALL
SELECT 23, 'remove', '3'
)
, windowing as (
SELECT *,
-- add history using window function
ARRAY_AGG(STRUCT(event, value)) OVER (ORDER BY dt ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as history
FROM test)
SELECT
dt,
event,
value,
--history, -- for testing
-- Get all added items that are not removed items
-- This sub-select runs within the row only, treating history-array as (sub-)table
(SELECT ARRAY_AGG(value) value FROM unnest(t.history) l
WHERE l.event = 'add'
AND l.value NOT IN
(SELECT l.value FROM unnest(t.history) l WHERE l.event = 'remove')
) AS list
FROM windowing t
I haven't compared performance, but I would be interested in the results!
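One caveat follows from the NOT IN filter: a value that is removed and later added again stays excluded, because the filter looks at every remove event in the history, not only at the latest event for that value. The JavaScript sketch below mirrors the sub-select to show this; addedMinusRemoved and the sample arrays are hypothetical names used for illustration only.

```javascript
// Mirrors the sub-select: keep the added values that never appear in a
// remove event anywhere in the history (illustrative helper name).
function addedMinusRemoved(history) {
  const removed = new Set(
    history.filter((e) => e.event === 'remove').map((e) => e.value)
  );
  return history
    .filter((e) => e.event === 'add' && !removed.has(e.value))
    .map((e) => e.value);
}

// Full history for the last row of the sample data (dt = 23).
const historyRows = [
  { event: 'add', value: '1' },
  { event: 'remove', value: '1' },
  { event: 'add', value: '2' },
  { event: 'add', value: '3' },
  { event: 'add', value: '4' },
  { event: 'remove', value: '3' },
];
console.log(addedMinusRemoved(historyRows)); // [ '2', '4' ]

// Made-up case that is not in the sample data: '1' is added again after
// being removed, yet it is still filtered out.
const reAdded = [
  { event: 'add', value: '1' },
  { event: 'remove', value: '1' },
  { event: 'add', value: '1' },
];
console.log(addedMinusRemoved(reAdded)); // []
```

For the sample data the two behaviours coincide, so the query returns the expected lists; the difference only matters if the real table can contain re-added values.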
Answer 4:
All the original answers are great (especially mine, lol) and are mostly set-based, which is the best way to deal with big data but can get complex enough in such cases for users who don't work with SQL regularly.
Luckily, support for Scripting and Stored Procedures is now in beta (as of October 2019).
You can now submit multiple statements separated by semicolons, and BigQuery will run them. The needed logic is much easier to express procedurally than in a set-based way (at the price of performance, obviously), so more users can benefit.
The script below implements the logic described in the question:
DECLARE arr ARRAY<STRUCT<dt INT64, event STRING, value STRING>>;
DECLARE result ARRAY<STRUCT<dt INT64, event STRING, value STRING, list STRING>> DEFAULT [STRUCT(NULL, '', '', '')];
DECLARE list ARRAY<STRING> DEFAULT [];
DECLARE i, m INT64 DEFAULT -1;
SET arr = (
WITH t AS (
SELECT 1 dt,'add' event,'1' value UNION ALL
SELECT 2, 'remove', '1' UNION ALL
SELECT 6, 'add', '2' UNION ALL
SELECT 8, 'add', '3' UNION ALL
SELECT 11, 'add', '4' UNION ALL
SELECT 23, 'remove', '3'
)
SELECT ARRAY_AGG(t ORDER BY t.dt) FROM t
);
SET m = ARRAY_LENGTH(arr);
LOOP
SET i = i + 1;
IF i >= m THEN LEAVE;
ELSE
IF arr[OFFSET(i)].event = 'add' THEN
SET list = (
SELECT ARRAY_CONCAT(list, [arr[OFFSET(i)].value])
);
ELSE
SET list = ARRAY(
SELECT item
FROM UNNEST(list) item
WHERE item != arr[OFFSET(i)].value
);
END IF;
SET result = (
SELECT ARRAY_CONCAT(
result,
[(arr[OFFSET(i)].dt, arr[OFFSET(i)].event, arr[OFFSET(i)].value, ARRAY_TO_STRING(list, ','))]
)
);
END IF;
END LOOP;
SELECT * FROM UNNEST(result) WHERE NOT dt IS NULL;
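For the sample data, this should return the same rows as the expected output in the question, with list as a comma-separated string.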
Source: https://stackoverflow.com/questions/48793854/build-array-based-on-add-and-remove-event-rows-in-bigquery