Count distinct in window functions

Question

I am trying to count the unique values of column b for each c without doing a group by. I know this can be done with a join. How can I compute count(distinct b) over (partition by c) without resorting to a join? Why is count distinct not supported in window functions? Thank you in advance. Given this data frame:

val df= Seq(("a1","b1","c1"),
                ("a2","b2","c1"),
                ("a3","b3","c1"),
                ("a31",null,"c1"),
                ("a32",null,"c1"),
                ("a4","b4","c11"),
                ("a5","b5","c11"),
                ("a6","b6","c11"),
                ("a7","b1","c2"),
                ("a8","b1","c3"),
                ("a9","b1","c4"),
                ("a91","b1","c5"),
                ("a92","b1","c5"),
                ("a93","b1","c5"),
                ("a95","b2","c6"),
                ("a96","b2","c6"),
                ("a97","b1","c6"),
                ("a977",null,"c6"),
                ("a98",null,"c8"),
                ("a99",null,"c8"),
                ("a999",null,"c8")
                ).toDF("a","b","c");

Answer 1:

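Some databases do support count(distinct) as a window function. Where it is not supported, there are two alternatives. One is the sum of dense ranks: the ascending rank counts the distinct values up to and including the current row's b, the descending rank counts those from the current b onward, and the current value is counted twice, hence the - 1. Note that dense_rank() ranks NULL as a value of its own, so a partition containing NULLs in b reports one more than count(distinct b) would: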

select (dense_rank() over (partition by c order by b asc) +
        dense_rank() over (partition by c order by b desc) -
        1
       ) as count_distinct
from t;
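
The dense-rank sum can be tried against a few rows from the question. This is a minimal sketch, assuming a local Spark session and a temporary view named t (the table name used in the query above); the extra columns a, b, c in the select list are added only to make the output readable.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation; in spark-shell the existing `spark` is reused.
val spark = SparkSession.builder().master("local[*]").appName("dense-rank-distinct").getOrCreate()
import spark.implicits._

// A subset of the question's rows; partition c6 contains a null in b.
val df = Seq(
  ("a4", "b4", "c11"), ("a5", "b5", "c11"), ("a6", "b6", "c11"),
  ("a91", "b1", "c5"), ("a92", "b1", "c5"), ("a93", "b1", "c5"),
  ("a95", "b2", "c6"), ("a96", "b2", "c6"), ("a97", "b1", "c6"), ("a977", null, "c6")
).toDF("a", "b", "c")
df.createOrReplaceTempView("t")

// Ascending rank + descending rank - 1 equals the number of distinct b values in the partition.
val counted = spark.sql("""
  select a, b, c,
         dense_rank() over (partition by c order by b asc) +
         dense_rank() over (partition by c order by b desc) - 1 as count_distinct
  from t
""")

// Every row keeps its own a and b; count_distinct is 3 for c11, 1 for c5,
// and 3 for c6 (b1, b2, and the null, which dense_rank treats as a value).
counted.orderBy("c", "a").show()
```

For c6, count(distinct b) would give 2, so this variant fits best when b has no NULLs or when NULL should count as a value.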

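The second uses a subquery that numbers the rows within each (c, b) pair, so that exactly one row per distinct b has seqnum = 1:

-- partition by c, b: each distinct b within a c restarts the numbering
select sum(case when seqnum = 1 then 1 else 0 end) over (partition by c)
from (select t.*, row_number() over (partition by c, b order by b) as seqnum
      from t
     ) t;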
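
A sketch of the row-numbering idea on a few rows from the question, assuming a local Spark session and a temporary view named t. The rows are numbered within each (c, b) pair so that exactly one row per distinct b gets seqnum = 1. The `b is not null` condition and the `order by a` tie-breaker are additions made here, not part of the answer; the first makes NULLs be ignored the way count(distinct b) ignores them.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation; in spark-shell the existing `spark` is reused.
val spark = SparkSession.builder().master("local[*]").appName("row-number-distinct").getOrCreate()
import spark.implicits._

// A subset of the question's rows, including partitions with nulls in b.
val df = Seq(
  ("a91", "b1", "c5"), ("a92", "b1", "c5"), ("a93", "b1", "c5"),
  ("a95", "b2", "c6"), ("a96", "b2", "c6"), ("a97", "b1", "c6"), ("a977", null, "c6"),
  ("a98", null, "c8"), ("a99", null, "c8"), ("a999", null, "c8")
).toDF("a", "b", "c")
df.createOrReplaceTempView("t")

// Inner query: number rows within each (c, b) pair.
// Outer query: count the first row of each pair, skipping null b (an addition, see above).
val counted = spark.sql("""
  select a, b, c,
         sum(case when seqnum = 1 and b is not null then 1 else 0 end)
           over (partition by c) as count_distinct
  from (select t.*,
               row_number() over (partition by c, b order by a) as seqnum
        from t) t
""")

// count_distinct is 1 for c5, 2 for c6, and 0 for c8 (all b values are null).
counted.orderBy("c", "a").show()
```

Every original row is kept, which is what distinguishes this from a group by.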

Answer 2:

count of unique column b for each c without doing group by.

A typical SQL workaround is to use a subquery that selects the distinct (b, c) tuples, and then a window count in the outer query (the result has one row per distinct tuple, not one per original row):

SELECT c, COUNT(*) OVER(PARTITION BY c) cnt
FROM (SELECT DISTINCT b, c FROM mytable) x
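
To see the shape of the result, the query can be run on a few rows from the question. This is a minimal sketch, assuming a local Spark session and a temporary view named mytable (the name used in the query above); the rows are chosen without NULLs in b, since a (NULL, c) tuple would also be counted as a distinct tuple.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation; in spark-shell the existing `spark` is reused.
val spark = SparkSession.builder().master("local[*]").appName("distinct-subquery").getOrCreate()
import spark.implicits._

// A subset of the question's rows: three distinct b values in c11, one in c5.
val df = Seq(
  ("a4", "b4", "c11"), ("a5", "b5", "c11"), ("a6", "b6", "c11"),
  ("a91", "b1", "c5"), ("a92", "b1", "c5"), ("a93", "b1", "c5")
).toDF("a", "b", "c")
df.createOrReplaceTempView("mytable")

// The subquery collapses duplicates; the window count then runs over distinct (b, c) tuples.
val counted = spark.sql("""
  SELECT c, COUNT(*) OVER (PARTITION BY c) AS cnt
  FROM (SELECT DISTINCT b, c FROM mytable) x
""")

// Four rows come back: three for c11 (cnt = 3) and one for c5 (cnt = 1).
counted.orderBy("c").show()
```

Column a is gone after the DISTINCT, so attaching cnt to every original row would still need a join back to mytable.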

Source: https://stackoverflow.com/questions/58349076/count-distinct-in-window-functions

Tags

sql

apache-spark

apache-spark-sql