Build array based on add and remove event rows in BigQuery

Question

I have a table in BigQuery with the following structure:

datetime | event  | value
==========================
1        | add    | 1
---------+--------+-------
2        | remove | 1
---------+--------+-------
6        | add    | 2
---------+--------+-------
8        | add    | 3
---------+--------+-------
11       | add    | 4
---------+--------+-------
23       | remove | 3
---------+--------+-------

I'm trying to build a view which adds a list column to each row containing the current state of the array. The array will never contain duplicate items. This should be the result:

datetime | event  | value | list
===================================
1        | add    | 1     | [1]
---------+--------+-------+--------
2        | remove | 1     | []
---------+--------+-------+--------
6        | add    | 2     | [2]
---------+--------+-------+--------
8        | add    | 3     | [2,3]
---------+--------+-------+--------
11       | add    | 4     | [2,3,4]
---------+--------+-------+--------
23       | remove | 3     | [2,4]
---------+--------+-------+--------

I tried using analytic functions, but it didn't work out. The API for working with arrays is quite limited. I think I would succeed if I could use recursive WITH clauses, but unfortunately these are not available in BigQuery.

I'm using BigQuery with standard SQL enabled.

Answer 1:

The version below is for BigQuery Standard SQL and uses pure SQL only (no JS UDF):

#standardSQL
WITH `project.dataset.events` AS (
  SELECT 1 dt,'add' event,'1' value UNION ALL
  SELECT 2,   'remove',   '1' UNION ALL
  SELECT 6,   'add',      '2' UNION ALL
  SELECT 8,   'add',      '3' UNION ALL
  SELECT 11,  'add',      '4' UNION ALL
  SELECT 23,  'remove',   '3' 
), cum AS (
  SELECT dt, event, value,
    SUM(IF(event = 'add', 1, -1)) OVER(PARTITION BY value ORDER BY dt) state
  FROM `project.dataset.events`
), pre AS (
  SELECT 
    a.dt, a.event, a.value, a.state, b.value AS b_value,
    ARRAY_AGG(b.state ORDER BY b.dt DESC)[SAFE_OFFSET(0)] b_state, 
    MAX(b.dt) b_dt 
  FROM cum a
  JOIN cum b ON b.dt <= a.dt
  GROUP BY a.dt, a.event, a.value, a.state, b.value
)
SELECT dt, event, value, 
  SPLIT(IFNULL(STRING_AGG(IF(b_state = 1, b_value, NULL) ORDER BY b_dt), '')) list_as_array,
  CONCAT('[', IFNULL(STRING_AGG(IF(b_state = 1, b_value, NULL) ORDER BY b_dt), ''), ']') list_as_string
FROM pre
GROUP BY dt, event, value
ORDER BY dt

The result is, "surprisingly" :o), exactly the same as in the JS UDF version I posted previously:

Row dt  event   value   list_as_array list_as_string
1   1   add     1       1           [1]  
2   2   remove  1                   []   
3   6   add     2       2           [2]  
4   8   add     3       2           [2,3]    
                        3        
5   11  add     4       2           [2,3,4]  
                        3        
                        4        
6   23  remove  3       2           [2,4]    
                        4

Note: the above is probably somewhat over-engineered. I did not have time to polish or optimize it, though that should be doable; I leave that to the owner of the question.
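
To make the three steps of the query easier to follow, here is a rough JavaScript sketch of the same logic. It is an illustration, not the query itself: the function name `listsViaRunningState` and the in-memory event objects are assumptions, while `cum`, `state`, and the `b.dt <= a.dt` condition mirror the names used in the SQL.

```javascript
// Sketch of the query's logic: cum -> pre -> final SELECT.
function listsViaRunningState(events) {
  const sorted = [...events].sort((a, b) => a.dt - b.dt);

  // cum: running sum of +1 (add) / -1 (remove) per value, ordered by dt.
  const running = new Map();
  const cum = sorted.map((e) => {
    const state = (running.get(e.value) || 0) + (e.event === 'add' ? 1 : -1);
    running.set(e.value, state);
    return { ...e, state };
  });

  // pre + final SELECT: for each row a, look at every row b with b.dt <= a.dt,
  // keep the latest state per value, then list the values whose state is 1.
  return cum.map((a) => {
    const latest = new Map(); // value -> latest { value, state, dt } seen so far
    for (const b of cum) {
      if (b.dt <= a.dt) {
        latest.set(b.value, { value: b.value, state: b.state, dt: b.dt });
      }
    }
    const list = [...latest.values()]
      .filter((v) => v.state === 1)
      .sort((x, y) => x.dt - y.dt) // same role as ORDER BY b_dt
      .map((v) => v.value);
    return { dt: a.dt, event: a.event, value: a.value, list };
  });
}

// Sample data from the question.
const events = [
  { dt: 1, event: 'add', value: '1' },
  { dt: 2, event: 'remove', value: '1' },
  { dt: 6, event: 'add', value: '2' },
  { dt: 8, event: 'add', value: '3' },
  { dt: 11, event: 'add', value: '4' },
  { dt: 23, event: 'remove', value: '3' },
];

console.log(listsViaRunningState(events).map((r) => '[' + r.list.join(',') + ']'));
```

The inner loop over all earlier rows corresponds to the self-join `cum a JOIN cum b ON b.dt <= a.dt`, which is where most of the cost of this approach is likely to sit.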

Answer 2:

Below is a version for BigQuery Standard SQL that uses a JavaScript UDF (just one of potentially many options):

#standardSQL
CREATE TEMP FUNCTION CUST_ARRAY_AGG(arr ARRAY<STRUCT<event STRING, value STRING>>)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
  var result = [];  
  for (var i = 0; i < arr.length; i++) {
    if (arr[i].event == 'add') {
      result.push(arr[i].value);
    } else {
      var index = result.indexOf(arr[i].value);
      if (index > -1) {
        result.splice(index, 1);
      }
    }
  }
  return result;
""";
WITH `project.dataset.events` AS (
  SELECT 1 dt, 'add' event, '1' value UNION ALL
  SELECT 2, 'remove', '1' UNION ALL
  SELECT 6, 'add', '2' UNION ALL
  SELECT 8, 'add', '3' UNION ALL
  SELECT 11, 'add', '4' UNION ALL
  SELECT 23, 'remove', '3' 
)
SELECT dt, event, value, 
  CUST_ARRAY_AGG(arr) list_as_arr,
  (SELECT CONCAT('[', IFNULL(STRING_AGG(item), ''), ']') FROM UNNEST(CUST_ARRAY_AGG(arr)) item) list_as_string
FROM (
  SELECT dt, event, value,
    ARRAY_AGG(STRUCT<event STRING, value STRING>(event, value)) OVER(ORDER BY dt) arr
  FROM `project.dataset.events`
)

with the result below:

Row dt  event   value   list_as_arr list_as_string   
1   1   add     1       1           [1]  
2   2   remove  1                   []   
3   6   add     2       2           [2]  
4   8   add     3       2           [2,3]    
                        3        
5   11  add     4       2           [2,3,4]  
                        3        
                        4        
6   23  remove  3       2           [2,4]    
                        4
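
The way the window and the UDF fit together can be sketched outside BigQuery as follows. This is only an illustration under assumptions: `replay` copies the body of `CUST_ARRAY_AGG`, while `listsViaPrefixReplay` is a hypothetical stand-in for `ARRAY_AGG(...) OVER(ORDER BY dt)` and assumes that `dt` values are unique.

```javascript
// Same logic as the body of CUST_ARRAY_AGG: replay one prefix of events.
function replay(arr) {
  const result = [];
  for (let i = 0; i < arr.length; i++) {
    if (arr[i].event === 'add') {
      result.push(arr[i].value);
    } else {
      const index = result.indexOf(arr[i].value);
      if (index > -1) {
        result.splice(index, 1);
      }
    }
  }
  return result;
}

// Stand-in for the analytic ARRAY_AGG: row k receives events 0..k, ordered by dt,
// and the UDF logic is applied to that prefix.
function listsViaPrefixReplay(events) {
  const sorted = [...events].sort((a, b) => a.dt - b.dt);
  return sorted.map((row, k) => ({
    ...row,
    list: replay(sorted.slice(0, k + 1)),
  }));
}

// Sample data from the question.
const events = [
  { dt: 1, event: 'add', value: '1' },
  { dt: 2, event: 'remove', value: '1' },
  { dt: 6, event: 'add', value: '2' },
  { dt: 8, event: 'add', value: '3' },
  { dt: 11, event: 'add', value: '4' },
  { dt: 23, event: 'remove', value: '3' },
];

console.log(listsViaPrefixReplay(events).map((r) => '[' + r.list.join(',') + ']'));
```

Because every row replays its whole prefix, the total work in this sketch grows roughly quadratically with the number of rows, which may matter for long event histories.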

Answer 3:

Although this is already solved, I like the problem. The idea in this solution is to first collect the full history with a window function that takes all preceding rows into account, and then to drop the items that have been removed:

#standardSQL
-- stolen test table ;)
WITH test AS (
  SELECT 1 dt,'add' event,'1' value UNION ALL
  SELECT 2,   'remove',   '1' UNION ALL
  SELECT 6,   'add',      '2' UNION ALL
  SELECT 8,   'add',      '3' UNION ALL
  SELECT 11,  'add',      '4' UNION ALL
  SELECT 23,  'remove',   '3' 
)

, windowing as (
SELECT *,
  -- add history using window function; ORDER BY dt keeps the frame in event order
  ARRAY_AGG(STRUCT(event, value)) OVER (ORDER BY dt ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS history
FROM test)

SELECT 
  dt,
  event,
  value,
  --history, -- for testing

  -- Get all added items that are not removed items
  -- This sub-select runs within the row only, treating history-array as (sub-)table
  (SELECT ARRAY_AGG(l.value) FROM UNNEST(t.history) l
    WHERE l.event = 'add'
    AND l.value NOT IN
      -- r is a separate alias so the inner scan does not shadow l
      (SELECT r.value FROM UNNEST(t.history) r WHERE r.event = 'remove')
  ) AS list
FROM windowing t

I didn't compare performance, but I would be interested in the results!
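
One thing worth checking before relying on this query: the filter keeps every added value that never appears in a remove row, so it ignores the order of events. The sketch below compares that filter with an order-aware replay. The function names and the `reAdded` data are hypothetical; the question does not say whether a removed value can be added again.

```javascript
// Filter used by the query: added values that never appear in a remove event.
function addsMinusRemoves(history) {
  const removed = new Set(
    history.filter((e) => e.event === 'remove').map((e) => e.value)
  );
  return history
    .filter((e) => e.event === 'add' && !removed.has(e.value))
    .map((e) => e.value);
}

// Order-aware replay, for comparison.
function replayInOrder(history) {
  let list = [];
  for (const e of history) {
    if (e.event === 'add') {
      list.push(e.value);
    } else {
      list = list.filter((item) => item !== e.value);
    }
  }
  return list;
}

// Full history from the question: both approaches agree.
const sample = [
  { dt: 1, event: 'add', value: '1' },
  { dt: 2, event: 'remove', value: '1' },
  { dt: 6, event: 'add', value: '2' },
  { dt: 8, event: 'add', value: '3' },
  { dt: 11, event: 'add', value: '4' },
  { dt: 23, event: 'remove', value: '3' },
];

// Hypothetical history in which a removed value is added again.
const reAdded = [
  { dt: 1, event: 'add', value: '1' },
  { dt: 2, event: 'remove', value: '1' },
  { dt: 3, event: 'add', value: '1' },
];

console.log(addsMinusRemoves(sample), replayInOrder(sample));   // both give ['2', '4']
console.log(addsMinusRemoves(reAdded), replayInOrder(reAdded)); // [] versus ['1']
```

If re-adds cannot occur in the data, the two approaches give the same lists and the simpler filter is sufficient.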

Answer 4:

All the original answers are great (especially mine, lol) and mostly rely on set-based processing. That is the best way to deal with big data, but in cases like this it can get complex enough to put off non-SQL users!

Luckily, support for Scripting and Stored Procedures is now in beta (as of October 2019).

You can now submit multiple statements separated by semicolons and BigQuery will run them. The needed logic is much easier to express procedurally than in a set-based way (at some cost in performance, obviously), so more users can benefit.

The script below implements the logic described in the question:

DECLARE arr ARRAY<STRUCT<dt INT64, event STRING, value STRING>>;
DECLARE result ARRAY<STRUCT<dt INT64, event STRING, value STRING, list STRING>> DEFAULT [STRUCT(NULL, '', '', '')];
DECLARE list ARRAY<STRING> DEFAULT [];
DECLARE i, m INT64 DEFAULT -1; 

SET arr = (
  WITH t AS (
    SELECT 1 dt,'add' event,'1' value UNION ALL
    SELECT 2,   'remove',   '1' UNION ALL
    SELECT 6,   'add',      '2' UNION ALL
    SELECT 8,   'add',      '3' UNION ALL
    SELECT 11,  'add',      '4' UNION ALL
    SELECT 23,  'remove',   '3' 
  )
  SELECT ARRAY_AGG(t ORDER BY dt) FROM t  -- ORDER BY so the loop sees events in dt order
);

SET m = ARRAY_LENGTH(arr);

LOOP
  SET i = i + 1;
  IF i >= m THEN LEAVE;
  ELSE
    IF arr[OFFSET(i)].event = 'add' THEN 
      SET list = (
        SELECT ARRAY_CONCAT(list, [arr[OFFSET(i)].value])
      );    
    ELSE
      SET list = ARRAY(
        SELECT item
        FROM UNNEST(list) item 
        WHERE item != arr[OFFSET(i)].value
      );    
    END IF;

    SET result = (
      SELECT ARRAY_CONCAT(
        result, 
        [(arr[OFFSET(i)].dt, arr[OFFSET(i)].event, arr[OFFSET(i)].value, ARRAY_TO_STRING(list, ','))]
      )
    );
  END IF;
END LOOP;

SELECT * FROM UNNEST(result) WHERE dt IS NOT NULL;

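The final SELECT returns one row per event, with list holding the current state as a comma-separated string, for example 2,3,4 for the row with dt = 11.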
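
For readers who want to trace the loop outside BigQuery, the same single-pass idea can be sketched in JavaScript. This is an illustration only: `listsSinglePass` is a hypothetical name, and the comments point at the script statements each line is meant to mirror.

```javascript
// One pass over the events, carrying the list forward from row to row.
function listsSinglePass(events) {
  const arr = [...events].sort((a, b) => a.dt - b.dt); // events in dt order
  let list = [];
  const result = [];
  for (const e of arr) {
    if (e.event === 'add') {
      list = list.concat([e.value]);                    // ARRAY_CONCAT(list, [value])
    } else {
      list = list.filter((item) => item !== e.value);   // WHERE item != value
    }
    result.push({
      dt: e.dt,
      event: e.event,
      value: e.value,
      list: list.join(','),                             // ARRAY_TO_STRING(list, ',')
    });
  }
  return result;
}

// Sample data from the question.
const events = [
  { dt: 1, event: 'add', value: '1' },
  { dt: 2, event: 'remove', value: '1' },
  { dt: 6, event: 'add', value: '2' },
  { dt: 8, event: 'add', value: '3' },
  { dt: 11, event: 'add', value: '4' },
  { dt: 23, event: 'remove', value: '3' },
];

for (const row of listsSinglePass(events)) {
  console.log(row.dt, row.event, row.value, '[' + row.list + ']');
}
```

Unlike the set-based versions, each row here only updates the list left by the previous row, which is what makes the procedural form easy to read.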

Source: https://stackoverflow.com/questions/48793854/build-array-based-on-add-and-remove-event-rows-in-bigquery

Tags

sql

google-cloud-platform

google-bigquery

bigquery-standard-sql