The difference between countDistinct and distinct.count

借酒劲吻你 2021-01-15 20:09

Why do I get different outputs for ..agg(countDistinct("member_id") as "count") and ..distinct.count? Is the difference the same as between

3 Answers
  •  一整个雨季
    2021-01-15 21:08

    df.agg(countDistinct("member_id") as "count")
    

    returns the number of distinct values of the member_id column, ignoring all other columns, while

    df.distinct.count
    

    counts the number of distinct records in the DataFrame, where two records are treated as duplicates only if they have identical values in all columns.

    So, for example, the DataFrame:

    +-----------+---------+
    |member_name|member_id|
    +-----------+---------+
    |          a|        1|
    |          b|        1|
    |          b|        1|
    +-----------+---------+
    

    has only one distinct member_id value but two distinct records, so the agg call would return 1 while df.distinct.count would return 2.
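
The example above can be reproduced with a small sketch along these lines. It assumes Spark SQL is on the classpath and that a local SparkSession is acceptable; the object name `DistinctCountDemo` and the helper `counts` are made up for illustration and are not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.countDistinct

// Hypothetical demo object; the names here are illustrative only.
object DistinctCountDemo {

  // Returns (distinct member_id values, distinct member_id rows, distinct full rows).
  def counts(spark: SparkSession): (Long, Long, Long) = {
    import spark.implicits._

    // Same three rows as the table in the answer.
    val df = Seq(("a", 1), ("b", 1), ("b", 1)).toDF("member_name", "member_id")

    // Aggregates over one column only; other columns are ignored.
    val distinctIds = df.agg(countDistinct("member_id") as "count").first().getLong(0)

    // Restricting to the column first gives the same number here
    // (this data contains no nulls).
    val distinctIdRows = df.select("member_id").distinct.count

    // Deduplicates whole rows, comparing every column.
    val distinctRows = df.distinct.count

    (distinctIds, distinctIdRows, distinctRows)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for this demo.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("distinct-count-demo")
      .getOrCreate()

    val (ids, idRows, rows) = counts(spark)
    println(s"countDistinct(member_id)         = $ids")
    println(s"select(member_id).distinct.count = $idRows")
    println(s"distinct.count                   = $rows")

    spark.stop()
  }
}
```

If the two numbers differ on real data, selecting only the columns of interest before calling distinct is one way to check which extra columns make the rows differ.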
