Question
I have a table storing array elements by the array they belong to and their index in the array. It seemed smart because the arrays were expected to be sparse, and have their elements updated individually. Let's say this is the table:
CREATE TABLE "values" (  -- "values" must be quoted: VALUES is a reserved word
pk TEXT,
i INTEGER,
value REAL,
PRIMARY KEY (pk, i)
);
pk | i | value
----+---+-------
A | 0 | 17.5
A | 1 | 32.7
A | 3 | 5.3
B | 1 | 13.5
B | 2 | 4.8
B | 4 | 89.1
Now I would like to get these as real arrays, i.e. {17.5, 32.7, NULL, 5.3} for A and {NULL, 13.5, 4.8, NULL, 89.1} for B.
I would have expected this to be easily possible with a grouping query and an appropriate aggregate function. However, it turns out there is no such function that puts elements into an array by their index (or subscript, as Postgres calls it). It would have been much simpler if the elements were consecutive - I could just have used array_agg with ORDER BY i. But I want the NULL values in the result arrays.
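To make the gap-dropping concrete, here is a quick illustration of my own (not from the original post), assuming the sample table above: a plain array_agg simply packs the rows that exist, so the hole at i = 2 for A disappears.

```sql
-- Plain array_agg aggregates only the rows that exist, so gaps vanish:
SELECT pk, array_agg(value ORDER BY i) FROM "values" GROUP BY pk;
--  A → {17.5,32.7,5.3}    (desired: {17.5,32.7,NULL,5.3})
--  B → {13.5,4.8,89.1}    (desired: {NULL,13.5,4.8,NULL,89.1})
```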
What I ended up with was this monster:
SELECT
pk,
ARRAY( SELECT
( SELECT value
FROM "values" innervals
WHERE innervals.pk = outervals.pk AND i = generate_series
)
FROM generate_series(0, MAX(i))
ORDER BY generate_series -- is this really necessary?
)
FROM "values" outervals
GROUP BY pk;
Having to SELECT … FROM values twice is ugly, and the query planner doesn't seem to be able to optimise this.
Is there a simple way to refer to the grouped rows as a relation in a subquery, so that I could just SELECT value FROM generate_series(0, MAX(i)) LEFT JOIN ??? ?
Would it be more appropriate to solve this by defining a custom aggregate function?
Edit: It seems what I was looking for is possible with multiple-argument unnest and array_agg, although it is not particularly elegant:
SELECT
pk,
ARRAY( SELECT val
FROM generate_series(0, MAX(i)) AS series (series_i)
LEFT OUTER JOIN
unnest( array_agg(value ORDER BY i),
array_agg(i ORDER BY i) ) AS arr (val, arr_i)
ON arr_i = series_i
ORDER BY series_i
)
FROM "values"
GROUP BY pk;
The query planner even seems to realise that it can do a sorted merge for the JOIN on the sorted series_i and arr_i, although I need to put some more effort into really understanding the EXPLAIN output.
Edit 2: It's actually a hash join between series_i and arr_i; only the outer group aggregation uses a "sorted" strategy.
Answer 1:
Not sure if this qualifies as "simpler" - I personally find it easier to follow though:
with idx as (
select pk,
generate_series(0, max(i)) as i
from "values"
group by pk
)
select idx.pk,
array_agg(v.value order by idx.i) as vals
from idx
left join "values" v on v.i = idx.i and v.pk = idx.pk
group by idx.pk;
The CTE idx generates all possible index values for each pk, and the outer query then uses that to aggregate the values.
Online example
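For the sample data, the idx CTE expands each pk to its full index range (my own illustration of how the answer works):

```sql
-- What the idx CTE produces: one row per (pk, i) from 0 to each group's max:
SELECT pk, generate_series(0, max(i)) AS i FROM "values" GROUP BY pk;
--  A → i = 0, 1, 2, 3
--  B → i = 0, 1, 2, 3, 4
-- The left join then leaves value NULL wherever no matching row exists,
-- and array_agg picks those NULLs up in index order.
```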
Answer 2:
Would it be more appropriate to solve this by defining a custom aggregate function?
It does at least simplify the query significantly:
SELECT pk, array_by_subscript(i+1, value)
FROM "values"
GROUP BY pk;
Using
CREATE FUNCTION array_set(arr anyarray, index int, val anyelement) RETURNS anyarray
AS $$
BEGIN
arr[index] = val;
RETURN arr;
END
$$ LANGUAGE plpgsql STRICT;
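As a sanity check on array_set (my own illustration, not part of the original answer): Postgres enlarges an array when you assign past its end, filling any skipped positions with NULL, but an array whose first assignment was to subscript 2 keeps 2 as its lower bound - which is exactly the case the final function below exists to normalise.

```sql
-- Assigning past the end enlarges the array; skipped slots become NULL:
SELECT array_set(ARRAY[17.5, 32.7]::real[], 4, 5.3);
--  → {17.5,32.7,NULL,5.3}

-- A first assignment at subscript 2 leaves the lower bound at 2
-- (Postgres displays such arrays with explicit bounds, e.g. [2:2]={13.5}):
SELECT array_set(ARRAY[]::real[], 2, 13.5);
```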
CREATE FUNCTION array_fillup(arr anyarray) RETURNS anyarray
AS $$
BEGIN
-- necessary for nice to_json conversion of arrays that don't start at subscript 1
IF array_lower(arr, 1) > 1 THEN
arr[1] = NULL;
END IF;
RETURN arr;
END
$$ LANGUAGE plpgsql STRICT;
CREATE AGGREGATE array_by_subscript(int, anyelement) (
sfunc = array_set,
stype = anyarray,
initcond = '{}',
finalfunc = array_fillup
);
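Worked through for pk = 'B' (a sketch of my understanding of the aggregate; the result matches the array asked for in the question, and is the same whatever order the rows arrive in, since gap-filling and overwriting commute):

```sql
-- B has i = 1, 2, 4, so array_set assigns subscripts 2, 3 and 5:
--   arr[2] := 13.5   -- lower bound becomes 2
--   arr[3] := 4.8
--   arr[5] := 89.1   -- position 4 is gap-filled with NULL
-- array_fillup then sets arr[1] := NULL, restoring lower bound 1.
SELECT pk, array_by_subscript(i + 1, value)
FROM "values"
WHERE pk = 'B'
GROUP BY pk;
--  → {NULL,13.5,4.8,NULL,89.1}
```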
Online example. It also has a nice query plan that does a simple linear scan over values; I'll have to benchmark how efficiently array_set grows the array.
This is in fact the fastest solution, according to an EXPLAIN ANALYZE benchmark on a reasonably-sized test data set: it took 55 ms, compared to about 80 ms for the ARRAY + unnest solution, and is considerably faster than the 160 ms of the join against the common table expression.
Answer 3:
I think this qualifies as a solution (much better than my original attempt), so I'll post it as an answer. From this answer I realised that I can indeed put multiple values into array_agg by using the record syntax; it only forces me to declare the types in the column definition list:
SELECT
pk,
ARRAY( SELECT val
FROM generate_series(0, MAX(i)) AS series (series_i)
LEFT OUTER JOIN
unnest(array_agg( (value, i) )) AS arr (val real, arr_i integer)
-- ^^^^^^^^^^ ^^^^ ^^^^^^^
ON arr_i = series_i
ORDER BY series_i
)
FROM "values"
GROUP BY pk;
It still uses a hash left join followed by a sort, rather than a sort followed by a merge join, but perhaps the query planner knows better than my naive assumptions.
Source: https://stackoverflow.com/questions/58026300/how-to-get-arrays-from-a-normalised-table-that-stores-array-elements-by-index