I have a data set in the following form:
id | attribute
-----------------
1 | a
2 | b
2 | a
2 | a
3 | c
Desired output:
I haven't tested this query, but these functions are supported in Redshift:
select id, array_to_string(array(select attribute from mydataset m where m.id=d.id),',')
from mydataset d
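For comparison, Redshift also provides a LISTAGG aggregate function. The following is only a minimal sketch: it assumes the table is named mydataset as in the query above, and that a comma-separated list of attributes per id is what is wanted, which is a guess because the desired output is not shown.

```sql
-- Sketch only: one row per id, attributes joined with commas.
-- The ORDER BY inside WITHIN GROUP makes the order of the list deterministic.
SELECT id,
       LISTAGG(attribute, ',') WITHIN GROUP (ORDER BY attribute) AS attributes
FROM mydataset
GROUP BY id
ORDER BY id;
```

Duplicates are kept as written; LISTAGG(DISTINCT attribute, ',') should drop them if only unique values are needed.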
This is an answer to a related question. That question is closed, so I am posting the answer here.
Here is a method to aggregate a column into a string:
select * from temp;
attribute
-----------
a
c
b
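1) Give a unique rank to each row (rank() is unique only when the attribute values are distinct; use row_number() if the column can contain duplicates)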
with sub_table as(select attribute, rank() over (order by attribute) rnk from temp)
select * from sub_table;
attribute | rnk
-----------+-----
a | 1
b | 2
c | 3
2) Use the concatenation operator || to combine the rows into one string
with sub_table as(select attribute, rank() over (order by attribute) rnk from temp)
select (select attribute from sub_table where rnk = 1)||
(select attribute from sub_table where rnk = 2)||
(select attribute from sub_table where rnk = 3) res_string;
res_string
------------
abc
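This only works for a fixed, known number of rows (X) in that column, because each row needs its own subquery. It can be the first X rows ordered by some attribute in the "order by" clause. I'm guessing this is expensive.
A CASE expression can be used to deal with the NULLs that occur when a certain rank does not exist. Without it, the whole result becomes NULL, because concatenating NULL with || yields NULL.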
with sub_table as(select attribute, rank() over (order by attribute) rnk from temp)
select (select attribute from sub_table where rnk = 1)||
(select attribute from sub_table where rnk = 2)||
(select attribute from sub_table where rnk = 3)||
(case when (select attribute from sub_table where rnk = 4) is NULL then ''
else (select attribute from sub_table where rnk = 4) end) as res_string;
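The CASE expression above can probably be shortened with COALESCE, which returns its first non-NULL argument. This is a sketch under the same assumptions as the query above (the temp table and the sub_table CTE), not a tested replacement:

```sql
-- Sketch only: COALESCE(x, '') turns a missing rank into an empty string,
-- so the concatenation does not collapse to NULL.
WITH sub_table AS (
    SELECT attribute, rank() OVER (ORDER BY attribute) AS rnk
    FROM temp
)
SELECT COALESCE((SELECT attribute FROM sub_table WHERE rnk = 1), '') ||
       COALESCE((SELECT attribute FROM sub_table WHERE rnk = 2), '') ||
       COALESCE((SELECT attribute FROM sub_table WHERE rnk = 3), '') ||
       COALESCE((SELECT attribute FROM sub_table WHERE rnk = 4), '') AS res_string;
```

Wrapping every rank, not just the last one, keeps the query safe when the table has fewer rows than expected.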
I found a way to pick a random attribute for each id, but it is rather convoluted. I don't think it's a good approach, but it works.
SQL:
-- (1) uniq dataset
WITH uniq_dataset as (select * from dataset group by id, attr)
SELECT
uds.id, rds.attr
FROM
-- (2) generate random rank for each id
(select id, round((random() * ((select count(*) from uniq_dataset iuds where iuds.id = ouds.id) - 1))::numeric, 0) + 1 as random_rk from (select distinct id from uniq_dataset) ouds) uds,
-- (3) rank table
(select rank() over(partition by id order by attr) as rk, id ,attr from uniq_dataset) rds
WHERE
uds.id = rds.id
AND
uds.random_rk = rds.rk
ORDER BY
uds.id;
Result:
id | attr
----+------
1 | a
2 | a
3 | c
OR
id | attr
----+------
1 | a
2 | b
3 | c
Here are the intermediate tables used in this SQL:
-- dataset (original table)
id | attr
----+------
1 | a
2 | b
2 | a
2 | a
3 | c
-- (1) uniq dataset
id | attr
----+------
1 | a
2 | a
2 | b
3 | c
-- (2) generate random rank for each id
id | random_rk
----+----
1 | 1
2 | 1 <- 1 or 2
3 | 1
-- (3) rank table
rk | id | attr
----+----+------
1 | 1 | a
1 | 2 | a
2 | 2 | b
1 | 3 | c
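One caveat about step (2): round(random() * (n - 1)) + 1 picks the first and last ranks only half as often as the ranks in between once an id has three or more distinct attributes. A possible uniform variant is sketched below. The counts CTE is a hypothetical helper name introduced here for illustration, and the sketch assumes random() returns a value in [0, 1).

```sql
-- Sketch only: floor(random() * n) is an integer in 0 .. n-1,
-- so adding 1 gives each rank 1 .. n the same probability.
WITH uniq_dataset AS (
    SELECT DISTINCT id, attr FROM dataset
),
counts AS (                      -- hypothetical helper: distinct attrs per id
    SELECT id, count(*) AS n
    FROM uniq_dataset
    GROUP BY id
)
SELECT id, floor(random() * n)::int + 1 AS random_rk
FROM counts
ORDER BY id;
```

The result has the same shape as table (2) above and could be joined to the rank table (3) in the same way.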
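This solution, inspired by Masashi's answer, is simpler and selects a random element from each group in Redshift.
SELECT id, attribute
FROM (
-- pick the first attribute of each id after shuffling that id's rows randomly
SELECT id, FIRST_VALUE(attribute)
OVER(PARTITION BY id ORDER BY random()
ROWS BETWEEN unbounded preceding AND unbounded following) AS attribute
FROM dataset
) t
-- every row of an id carries the same picked value, so GROUP BY collapses them to one row
GROUP BY id, attribute ORDER BY id;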
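A common alternative is to number the rows of each id in random order and keep only the first one. This is a sketch, assuming the same dataset table with columns id and attribute; the names rn and t are arbitrary aliases chosen for illustration.

```sql
-- Sketch only: ROW_NUMBER() over a random ordering assigns 1 to exactly
-- one row per id, so filtering on rn = 1 keeps one random row per group.
SELECT id, attribute
FROM (
    SELECT id,
           attribute,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY random()) AS rn
    FROM dataset
) t
WHERE rn = 1
ORDER BY id;
```

Like the FIRST_VALUE query, this picks a random row rather than a random distinct value, so an attribute that appears twice for an id is twice as likely to be chosen. Selecting from a deduplicated subquery first would give each distinct attribute the same chance.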