Pick a random attribute from group in Redshift

攒了一身酷 2021-01-23 04:37

I have a data set in the following form:

id  |   attribute
-----------------
1   |   a
2   |   b
2   |   a
2   |   a
3   |   c

Desired output: one row per id, with the attribute picked at random from that id's group.

4 Answers
  • 2021-01-23 05:03

    I haven't tested this query, but these functions are supported in Redshift:

    select id, array_to_string(array(select attribute from mydataset m where m.id=d.id),',') from mydataset d

  • 2021-01-23 05:04

    This is an answer for the related question here. That question is closed, so I am posting the answer here.

    Here is a method to aggregate a column into a string:

    select * from temp;
     attribute 
    -----------
     a
     c
     b
    

    1) Give a unique rank to each row

    with sub_table as(select attribute, rank() over (order by attribute) rnk from temp)
    select * from sub_table;
    
     attribute | rnk 
    -----------+-----
     a         |   1
     b         |   2
     c         |   3
    

    2) Use concat operator || to combine in one line

    with sub_table as(select attribute, rank() over (order by attribute) rnk from temp)
    select (select attribute from sub_table where rnk = 1)||
           (select attribute from sub_table where rnk = 2)||
           (select attribute from sub_table where rnk = 3) res_string;
    
     res_string 
    ------------
     abc
    

    This only works for a fixed, known number of rows (X) in that column, because each rank needs its own subquery. X can be the first X rows ordered by whatever appears in the "order by" clause. I'm guessing this is expensive.

    A CASE expression can be used to deal with the NULLs that occur when a certain rank does not exist (concatenating a NULL with || makes the whole result NULL).

    with sub_table as(select attribute, rank() over (order by attribute) rnk from temp)
    select (select attribute from sub_table where rnk = 1)||
           (select attribute from sub_table where rnk = 2)||
           (select attribute from sub_table where rnk = 3)||
           (case when (select attribute from sub_table where rnk = 4) is NULL then '' 
                 else (select attribute from sub_table where rnk = 4) end) as res_string;
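
    As a hedged alternative to chaining one subquery per rank, and assuming the cluster provides Redshift's LISTAGG aggregate, the same concatenation can be sketched without fixing the number of rows in advance. The table temp is the one from this answer, and the alias res_string is reused only for comparison.

    ```sql
    -- Sketch only: concatenates the values of temp.attribute in sorted order.
    -- No delimiter is passed, so the values are joined directly.
    SELECT LISTAGG(attribute) WITHIN GROUP (ORDER BY attribute) AS res_string
    FROM temp;
    ```

    A delimiter such as ',' can be passed as a second argument to LISTAGG if the values should be separated.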
    
  • 2021-01-23 05:06

    I found a way to pick a random attribute for each id, but it's rather convoluted. I don't think it's a good approach, but it works.

    SQL:

    -- (1) uniq dataset: de-duplicate the (id, attr) pairs
    WITH uniq_dataset as (select distinct id, attr from dataset)
    SELECT 
      uds.id, rds.attr
    FROM
    -- (2) generate random rank for each id
      (select id, round((random() * ((select count(*) from uniq_dataset iuds where iuds.id = ouds.id) - 1))::numeric, 0) + 1 as random_rk from (select distinct id from uniq_dataset) ouds) uds,
    -- (3) rank table
      (select rank() over(partition by id order by attr) as rk, id ,attr from uniq_dataset) rds
    WHERE
      uds.id = rds.id
    AND 
      uds.random_rk = rds.rk
    ORDER BY
      uds.id;
    

    Result:

     id | attr
    ----+------
      1 | a
      2 | a
      3 | c
    
    OR
    
     id | attr
    ----+------
      1 | a
      2 | b
      3 | c
    

    Here are the original table and the intermediate tables used in this SQL:

    -- dataset (original table)
     id | attr
    ----+------
      1 | a
      2 | b
      2 | a
      2 | a
      3 | c
    
    -- (1) uniq dataset
     id | attr
    ----+------
      1 | a
      2 | a
      2 | b
      3 | c
    
    -- (2) generate random rank for each id
     id | random_rk
    ----+-----------
      1 |         1
      2 |         1   <- 1 or 2
      3 |         1
    
    -- (3) rank table
     rk | id | attr
    ----+----+------
      1 |  1 | a
      1 |  2 | a
      2 |  2 | b
      1 |  3 | c
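
    One caveat with step (2): round(random() * (n - 1)) + 1 is only uniform when an id has at most two distinct attributes. With n = 3, for example, ranks 1 and 3 each come up about a quarter of the time and rank 2 about half the time, because rounding gives the two end values half-width intervals. Below is a minimal sketch of a uniform replacement for step (2), assuming the same uniq_dataset CTE; the names counts and n are introduced here for illustration only.

    ```sql
    -- Sketch only: replacement for the derived table uds in step (2).
    -- random() returns a value in [0, 1), so floor(random() * n) is an integer
    -- in 0 .. n-1, and adding 1 gives each rank 1 .. n the same probability.
    select id,
           floor(random() * n)::int + 1 as random_rk
    from (select id, count(*) as n
          from uniq_dataset
          group by id) counts
    ```

    The result has the same two columns (id, random_rk) as the original derived table, so the join against the rank table in step (3) stays unchanged.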
    
  • 2021-01-23 05:06

    This solution, inspired by Masashi's answer, is simpler and selects a random element from each group in Redshift.

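    SELECT id, attribute
    FROM (SELECT id,
                 -- every row of an id gets the same randomly chosen value
                 FIRST_VALUE(attribute) OVER (PARTITION BY id ORDER BY random()
                     ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS attribute
          FROM dataset) AS picked
    GROUP BY id, attribute  -- collapse the repeated rows to one per id
    ORDER BY id;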
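
    The same idea can also be sketched with ROW_NUMBER(), which avoids the GROUP BY: number the rows of each id in a random order and keep the first one. This is a minimal sketch assuming the table and column names used above (dataset, id, attribute); rn and shuffled are names introduced here for illustration.

    ```sql
    -- Sketch only: one randomly chosen row per id.
    SELECT id, attribute
    FROM (
        SELECT id,
               attribute,
               -- number the rows of each id in a random order
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY RANDOM()) AS rn
        FROM dataset
    ) AS shuffled
    WHERE rn = 1   -- keep exactly one row per id
    ORDER BY id;
    ```

    Because duplicates stay in dataset, an attribute that appears more often for an id is proportionally more likely to be picked. Selecting from a de-duplicated set of (id, attribute) pairs first would give each distinct attribute the same chance.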
    