Question
I have a table storing array elements by the array they belong to and their index in the array. It seemed smart because the arrays were expected to be sparse, and have their elements updated individually. Let's say this is the table:
CREATE TABLE "values" (  -- "values" must be quoted: VALUES is a reserved word
pk TEXT,
i INTEGER,
value REAL,
PRIMARY KEY (pk, i)
);
pk | i | value
----+---+-------
A | 0 | 17.5
A | 1 | 32.7
A | 3 | 5.3
B | 1 | 13.5
B | 2 | 4.8
B | 4 | 89.1
Now I would like to get these as real arrays, i.e. {17.5, 32.7, NULL, 5.3} for A and {NULL, 13.5, 4.8, NULL, 89.1} for B.
I would have expected this to be easily possible with a grouping query and an appropriate aggregate function. However, it turns out there is no such function that puts elements into an array by their index (or subscript, as Postgres calls it). It would have been much simpler if the elements were consecutive - I could just have used array_agg with ORDER BY i. But I want the NULL values in the result arrays.
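To make the gap-dropping concrete, here is a quick illustration of my own (not from the original post), assuming the sample table above: a plain array_agg simply packs the rows that exist, so the hole at i = 2 for A disappears.

```sql
-- Plain array_agg aggregates only the rows that exist, so gaps vanish:
SELECT pk, array_agg(value ORDER BY i) FROM "values" GROUP BY pk;
--  A → {17.5,32.7,5.3}    (desired: {17.5,32.7,NULL,5.3})
--  B → {13.5,4.8,89.1}    (desired: {NULL,13.5,4.8,NULL,89.1})
```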
What I ended up with was this monster:
SELECT
pk,
ARRAY( SELECT
( SELECT value
FROM "values" innervals
WHERE innervals.pk = outervals.pk AND i = generate_series
)
FROM generate_series(0, MAX(i))
ORDER BY generate_series -- is this really necessary?
)
FROM "values" outervals
GROUP BY pk;
Having to SELECT … FROM values twice is ugly, and the query planner doesn't seem to be able to optimise this.
Is there a simple way to refer to the grouped rows as a relation in a subquery, so that I could just SELECT value FROM generate_series(0, MAX(i)) LEFT JOIN ??? ?
Would it be more appropriate to solve this by defining a custom aggregate function?
Edit: It seems what I was looking for is possible with multiple-argument unnest and array_agg, although it is not particularly elegant:
SELECT
pk,
ARRAY( SELECT val
FROM generate_series(0, MAX(i)) AS series (series_i)
LEFT OUTER JOIN
unnest( array_agg(value ORDER BY i),
array_agg(i ORDER BY i) ) AS arr (val, arr_i)
ON arr_i = series_i
ORDER BY series_i
)
FROM "values"
GROUP BY pk;
The query planner even seems to realise that it can do a sorted merge for the JOIN on the sorted series_i and arr_i, although I need to put some more effort into really understanding the EXPLAIN output.
Edit 2: It's actually a hash join between series_i and arr_i; only the outer group aggregation uses a "sorted" strategy.
Answer 1:
Not sure if this qualifies as "simpler" - I personally find it easier to follow though:
with idx as (
select pk,
generate_series(0, max(i)) as i
from "values"
group by pk
)
select idx.pk,
array_agg(v.value order by idx.i) as vals
from idx
left join "values" v on v.i = idx.i and v.pk = idx.pk
group by idx.pk;
The CTE idx generates all possible index values for each pk, and the outer query then uses that to aggregate the values.
Online example
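For the sample data, the idx CTE expands each pk to its full index range (my own illustration of how the answer works):

```sql
-- What the idx CTE produces: one row per (pk, i) from 0 to each group's max:
SELECT pk, generate_series(0, max(i)) AS i FROM "values" GROUP BY pk;
--  A → i = 0, 1, 2, 3
--  B → i = 0, 1, 2, 3, 4
-- The left join then leaves value NULL wherever no matching row exists,
-- and array_agg picks those NULLs up in index order.
```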
Answer 2:
Would it be more appropriate to solve this by defining a custom aggregate function?
It does at least simplify the query significantly:
SELECT pk, array_by_subscript(i+1, value)
FROM "values"
GROUP BY pk;
Using
CREATE FUNCTION array_set(arr anyarray, index int, val anyelement) RETURNS anyarray
AS $$
BEGIN
arr[index] = val;
RETURN arr;
END
$$ LANGUAGE plpgsql STRICT;
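As a sanity check on array_set (my own illustration, not part of the original answer): Postgres enlarges an array when you assign past its end, filling any skipped positions with NULL, but an array whose first assignment was to subscript 2 keeps 2 as its lower bound - which is exactly the case the final function below exists to normalise.

```sql
-- Assigning past the end enlarges the array; skipped slots become NULL:
SELECT array_set(ARRAY[17.5, 32.7]::real[], 4, 5.3);
--  → {17.5,32.7,NULL,5.3}

-- A first assignment at subscript 2 leaves the lower bound at 2
-- (Postgres displays such arrays with explicit bounds, e.g. [2:2]={13.5}):
SELECT array_set(ARRAY[]::real[], 2, 13.5);
```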
CREATE FUNCTION array_fillup(arr anyarray) RETURNS anyarray
AS $$
BEGIN
-- necessary for nice to_json conversion of arrays that don't start at subscript 1
IF array_lower(arr, 1) > 1 THEN
arr[1] = NULL;
END IF;
RETURN arr;
END
$$ LANGUAGE plpgsql STRICT;
CREATE AGGREGATE array_by_subscript(int, anyelement) (
sfunc = array_set,
stype = anyarray,
initcond = '{}',
finalfunc = array_fillup
);
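Worked through for pk = 'B' (a sketch of my understanding of the aggregate; the result matches the array asked for in the question, and is the same whatever order the rows arrive in, since gap-filling and overwriting commute):

```sql
-- B has i = 1, 2, 4, so array_set assigns subscripts 2, 3 and 5:
--   arr[2] := 13.5   -- lower bound becomes 2
--   arr[3] := 4.8
--   arr[5] := 89.1   -- position 4 is gap-filled with NULL
-- array_fillup then sets arr[1] := NULL, restoring lower bound 1.
SELECT pk, array_by_subscript(i + 1, value)
FROM "values"
WHERE pk = 'B'
GROUP BY pk;
--  → {NULL,13.5,4.8,NULL,89.1}
```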
Online example. It also has a nice query plan that does a simple linear scan over values; I'll have to benchmark how efficiently array_set grows the array.
This is in fact the fastest solution, according to an EXPLAIN ANALYZE benchmark on a reasonably-sized test data set: it took 55 ms, compared to about 80 ms for the ARRAY + unnest solution, and is considerably faster than the 160 ms of the join against the common table expression.
Answer 3:
I think this qualifies as a solution (much better than my original attempt), so I'll post it as an answer. From this answer I realised that I can indeed put multiple values into array_agg by using the record syntax; it only forces me to declare the types in the column definition list:
SELECT
pk,
ARRAY( SELECT val
FROM generate_series(0, MAX(i)) AS series (series_i)
LEFT OUTER JOIN
unnest(array_agg( (value, i) )) AS arr (val real, arr_i integer)
-- ^^^^^^^^^^ ^^^^ ^^^^^^^
ON arr_i = series_i
ORDER BY series_i
)
FROM "values"
GROUP BY pk;
It still uses a hash left join followed by a sort, rather than a sort followed by a merge join, but perhaps the query planner knows better than my naive assumptions.
Source: https://stackoverflow.com/questions/58026300/how-to-get-arrays-from-a-normalised-table-that-stores-array-elements-by-index