Why do I get different outputs for ..agg(countDistinct(\"member_id\") as \"count\")
and ..distinct.count
?
Is the difference the same as between
Why do I get different outputs for ..agg(countDistinct("member_id") as "count") and ..distinct.count?
Because .distinct.count
is the same:
SELECT COUNT(*) FROM (SELECT DISTINCT member_id FROM table)
while ..agg(countDistinct("member_id") as "count")
is
SELECT COUNT(DISTINCT member_id) FROM table
and COUNT(*)
uses different rules than COUNT(column) when nulls are encountered.
1st command :
DF.agg(countDistinct("member_id") as "count")
return the same as that of select count distinct(member_id) from DF
.
2nd command :
DF.distinct.count
is actually getting distinct records or removing al duplicates from the DF and then taking the count.
df.agg(countDistinct("member_id") as "count")
returns the number of distinct values of the member_id
column, ignoring all other columns, while
df.distinct.count
will count the number of distinct records in the DataFrame - where "distinct" means identical in values of all columns.
So, for example, the DataFrame:
+-----------+---------+
|member_name|member_id|
+-----------+---------+
| a| 1|
| b| 1|
| b| 1|
+-----------+---------+
has only one distinct member_id
value but two distinct records, so the agg
option would return 1 while the latter would return 2.