The difference between countDistinct and distinct.count

借酒劲吻你 2021-01-15 20:09

Why do I get different outputs for ..agg(countDistinct("member_id") as "count") and ..distinct.count? Is the difference the same as between

3 answers
  • 2021-01-15 21:00

    Why do I get different outputs for ..agg(countDistinct("member_id") as "count") and ..distinct.count?

    Because .distinct.count is equivalent to:

    SELECT COUNT(*) FROM (SELECT DISTINCT member_id FROM table)
    

    while ..agg(countDistinct("member_id") as "count") is

    SELECT COUNT(DISTINCT member_id) FROM table
    

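    and COUNT(*) follows different rules than COUNT(column) when nulls are encountered: COUNT(*) counts every row, including a row whose member_id is NULL, while COUNT(DISTINCT member_id) skips NULLs.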
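
    A minimal sketch of the null case, assuming a local SparkSession and a made-up single-column DataFrame. The object name, the app name, and the sample values are illustrative and do not come from the question:

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.countDistinct

    // Illustrative object name; the data below is made up for this sketch.
    object NullCountSketch {
      // Returns (countDistinct result, distinct.count result)
      // for a member_id column holding "m1", "m1" and null.
      def counts(spark: SparkSession): (Long, Long) = {
        import spark.implicits._
        val df = Seq("m1", "m1", null).toDF("member_id")

        // COUNT(DISTINCT member_id): the null is skipped, only "m1" is counted.
        val viaAgg = df.agg(countDistinct("member_id") as "count").first().getLong(0)

        // DISTINCT keeps one null row, and COUNT(*) then counts it.
        val viaDistinct = df.distinct().count()

        (viaAgg, viaDistinct)
      }

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[1]")
          .appName("null-count-sketch")
          .getOrCreate()
        try println(counts(spark)) // prints (1,2)
        finally spark.stop()
      }
    }
    ```

    With a null present, the two counts differ by one even though the DataFrame has a single column.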

  • 2021-01-15 21:05

    First command:

    DF.agg(countDistinct("member_id") as "count")
    

    returns the same result as SELECT COUNT(DISTINCT member_id) FROM DF.

    Second command:

    DF.distinct.count
    

    first removes all duplicate records from the DF, keeping only the distinct ones, and then counts what is left.

  • 2021-01-15 21:08
    df.agg(countDistinct("member_id") as "count")
    

    returns the number of distinct values of the member_id column, ignoring all other columns, while

    df.distinct.count
    

    counts the number of distinct records in the DataFrame, where two records are duplicates only if they have identical values in all columns.

    So, for example, the DataFrame:

    +-----------+---------+
    |member_name|member_id|
    +-----------+---------+
    |          a|        1|
    |          b|        1|
    |          b|        1|
    +-----------+---------+
    

    has only one distinct member_id value but two distinct records, so the agg option would return 1 while the latter would return 2.
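
    The example above can be reproduced with a short sketch, assuming a local SparkSession. The object name and app name are illustrative. The third value, which selects member_id before calling distinct, is an addition not in the original answer, included to show how to make the second form count only that column:

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.countDistinct

    // Illustrative object name; the rows mirror the table shown above.
    object DistinctVsCountDistinct {
      // Returns (distinct member_id values, distinct rows, distinct rows after selecting member_id).
      def counts(spark: SparkSession): (Long, Long, Long) = {
        import spark.implicits._
        val df = Seq(("a", 1), ("b", 1), ("b", 1)).toDF("member_name", "member_id")

        // Looks at member_id only, ignoring member_name.
        val distinctIds = df.agg(countDistinct("member_id") as "count").first().getLong(0)

        // Compares whole rows: (a, 1) and (b, 1) are different records.
        val distinctRows = df.distinct().count()

        // Narrowing to one column first makes distinct operate on member_id alone.
        val distinctIdRows = df.select("member_id").distinct().count()

        (distinctIds, distinctRows, distinctIdRows)
      }

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[1]")
          .appName("distinct-vs-countdistinct")
          .getOrCreate()
        try println(counts(spark)) // prints (1,2,1)
        finally spark.stop()
      }
    }
    ```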
