Redshift: count distinct customers over window partition

花落未央 2021-02-15 01:49

Redshift doesn't support DISTINCT aggregates in its window functions. AWS documentation for COUNT states this, and distinct isn't supported for any o

2 answers
  •  谎友^ (original poster)
    2021-02-15 02:10

    A blog post from 2016 calls out this problem and provides a rudimentary workaround, so thank you, Mark D. Adams. Strangely, I could find very little else about it anywhere on the web, so I'm sharing my (tested) solution.

    The key insight is that dense_rank(), ordered by the item in question, assigns the same rank to identical items, so the highest rank within a partition is also the count of unique items in that partition. It does get messy, because you have to swap in something like the following for each partition you want:

    dense_rank() over(partition by order_month, traffic_channel order by customer_id)
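
To see why the highest rank equals the distinct count, here is a minimal sketch on toy data. It assumes SQLite (3.25 or newer, via Python's sqlite3 module) as a local stand-in for Redshift, and the orders table and its rows are made up for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for Redshift (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("create table orders (traffic_channel text, customer_id integer)")
conn.executemany(
    "insert into orders values (?, ?)",
    [
        ("email", 1), ("email", 1), ("email", 2),
        ("search", 2), ("search", 3), ("search", 3), ("search", 4),
    ],
)

# Repeated customer_ids share a rank, so ranks only advance on new customers.
ranked = conn.execute("""
    select traffic_channel
           , customer_id
           , dense_rank() over(partition by traffic_channel order by customer_id) as rnk
    from orders
    order by traffic_channel, customer_id
""").fetchall()

# Highest rank seen per channel.
max_rank = {}
for channel, _customer_id, rnk in ranked:
    max_rank[channel] = max(max_rank.get(channel, 0), rnk)

# Reference answer from a plain grouped count(distinct ...).
distinct_counts = dict(conn.execute("""
    select traffic_channel, count(distinct customer_id)
    from orders
    group by traffic_channel
""").fetchall())

print(ranked)
print(max_rank, distinct_counts)
```

In this toy data the email rows get ranks 1, 1, 2 and the search rows get 1, 2, 2, 3, so the per-channel maximum matches count(distinct customer_id).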
    

    Since you need the highest rank, you have to wrap the rankings in a subquery and select the max value of each one in the outer query. It's important that each max() in the outer query uses the same partition as the corresponding dense_rank() in the subquery.

    /* multigrain windowed distinct count, additional grains are one dense_rank and one max over() */
    select distinct
           order_month
           , traffic_channel
           , max(tc_mth_rnk) over(partition by order_month, traffic_channel) customers_by_channel_and_month
           , max(tc_rnk) over(partition by traffic_channel)  ytd_customers_by_channel
           , max(mth_rnk) over(partition by order_month)  monthly_customers_all_channels
           , max(cust_rnk) over()  ytd_total_customers
    
    from (
           select order_month
                  , traffic_channel
                  , dense_rank() over(partition by order_month, traffic_channel order by customer_id)  tc_mth_rnk
                  , dense_rank() over(partition by traffic_channel order by customer_id)  tc_rnk
                  , dense_rank() over(partition by order_month order by customer_id)  mth_rnk
                  , dense_rank() over(order by customer_id)  cust_rnk
    
           from orders_traffic_channels
    
           where to_char(order_month, 'YYYY') = '2017'
         )
    
    order by order_month, traffic_channel
    ;
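
One way to gain confidence in a query of this shape is to cross-check it against plain GROUP BY counts on a small sample. The sketch below is an assumption-laden illustration, not the original setup: it uses SQLite (3.25 or newer) through Python's sqlite3 as a stand-in for Redshift, stores order_month as text, uses made-up rows, and keeps only three of the grains.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table orders_traffic_channels "
    "(order_month text, traffic_channel text, customer_id integer)"
)
conn.executemany(
    "insert into orders_traffic_channels values (?, ?, ?)",
    [
        ("2017-01", "email", 1), ("2017-01", "email", 2), ("2017-01", "search", 2),
        ("2017-02", "email", 1), ("2017-02", "search", 3), ("2017-02", "search", 3),
    ],
)

# Same pattern as above: dense_rank() inside, max() over the same partition outside.
windowed = conn.execute("""
    select distinct
           order_month
           , traffic_channel
           , max(tc_mth_rnk) over(partition by order_month, traffic_channel) as by_channel_and_month
           , max(mth_rnk) over(partition by order_month) as by_month
           , max(cust_rnk) over() as total
    from (
           select order_month
                  , traffic_channel
                  , dense_rank() over(partition by order_month, traffic_channel order by customer_id) as tc_mth_rnk
                  , dense_rank() over(partition by order_month order by customer_id) as mth_rnk
                  , dense_rank() over(order by customer_id) as cust_rnk
           from orders_traffic_channels
         ) ranked
    order by order_month, traffic_channel
""").fetchall()

# Reference counts computed the ordinary way, one grain per query.
grouped = conn.execute("""
    select order_month, traffic_channel, count(distinct customer_id)
    from orders_traffic_channels
    group by order_month, traffic_channel
    order by order_month, traffic_channel
""").fetchall()
by_month = dict(conn.execute("""
    select order_month, count(distinct customer_id)
    from orders_traffic_channels
    group by order_month
""").fetchall())

for row in windowed:
    print(row)
```

If a max() were given a different partition from its dense_rank(), the windowed columns would stop agreeing with the grouped counts, which makes this kind of check a cheap way to catch a mismatch.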
    

    Notes

    • The partition of each max() must match the partition of its dense_rank().
    • dense_rank() also ranks null values. All nulls share a single rank, and it is the highest one, because nulls sort last in ascending order by default. If you don't want to count nulls, wrap the rank in a case expression (case when customer_id is not null then dense_rank() ... end), or subtract one from the max() if you know the partition contains nulls.
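
The "subtract one" option can be made conditional so it only applies to partitions that actually contain a null. The sketch below is one possible variant, not part of the tested solution above: the null-flag expression is my own addition, the table and rows are invented, and SQLite (3.25 or newer, via Python's sqlite3) stands in for Redshift. SQLite sorts nulls first rather than last, but the subtraction does not depend on where the null rank falls.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table orders (traffic_channel text, customer_id integer)")
conn.executemany(
    "insert into orders values (?, ?)",
    [
        ("email", 1), ("email", 1), ("email", 2), ("email", None),
        ("search", 3), ("search", 4),
    ],
)

rows = conn.execute("""
    select distinct
           traffic_channel
           -- raw max rank: nulls take up one rank of their own
           , max(rnk) over(partition by traffic_channel) as customers_incl_null
           -- subtract 1 only when the partition has at least one null customer_id
           , max(rnk) over(partition by traffic_channel)
             - max(case when customer_id is null then 1 else 0 end)
                 over(partition by traffic_channel) as customers_excl_null
    from (
           select traffic_channel
                  , customer_id
                  , dense_rank() over(partition by traffic_channel order by customer_id) as rnk
           from orders
         ) ranked
    order by traffic_channel
""").fetchall()

print(rows)
```

Here the email partition has two real customers plus a null, so the raw maximum is 3 and the adjusted value is 2, while the search partition has no nulls and is left unchanged at 2.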
