The difference between countDistinct and distinct.count

Why do I get different outputs for ..agg(countDistinct("member_id") as "count") and ..distinct.count? Is the difference the same as between

3 Answers

深忆病人

2021-01-15 21:00
Why do I get different outputs for ..agg(countDistinct("member_id") as "count") and ..distinct.count?

Because, on a DataFrame that contains only the member_id column, .distinct.count is the same as:
```
SELECT COUNT(*) FROM (SELECT DISTINCT member_id FROM table)
```
while ..agg(countDistinct("member_id") as "count") is
```
SELECT COUNT(DISTINCT member_id) FROM table
```
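and COUNT(*) uses different rules than COUNT(column) when nulls are encountered: COUNT(*) counts every row, including the single NULL row that SELECT DISTINCT keeps, whereas COUNT(DISTINCT member_id) ignores NULLs entirely. The two results therefore differ by one whenever member_id contains nulls.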
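The null behaviour can be checked with a small local job. This is only a sketch: it assumes Spark SQL is on the classpath and runs with a local master, and the object name `NullCountSketch`, the helper `counts`, and the sample rows are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.countDistinct

// Hypothetical demo object; not part of the original question.
object NullCountSketch {
  // Returns (distinct.count, countDistinct) for a member_id column holding 1, 1, null.
  def counts(spark: SparkSession): (Long, Long) = {
    import spark.implicits._

    // Illustrative data: the Option field becomes a nullable integer column.
    val df = Seq[(String, Option[Int])](
      ("a", Some(1)),
      ("b", Some(1)),
      ("c", None)
    ).toDF("member_name", "member_id")

    // SELECT COUNT(*) FROM (SELECT DISTINCT member_id FROM table):
    // the null survives as one distinct row and is counted.
    val viaDistinct = df.select("member_id").distinct().count()

    // SELECT COUNT(DISTINCT member_id) FROM table: nulls are ignored.
    val viaAgg = df.agg(countDistinct("member_id") as "count").first().getLong(0)

    (viaDistinct, viaAgg)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("null-count-sketch")
      .getOrCreate()
    val (viaDistinct, viaAgg) = counts(spark)
    println(s"distinct.count = $viaDistinct") // 2 (the values 1 and null)
    println(s"countDistinct  = $viaAgg")      // 1 (null is skipped)
    spark.stop()
  }
}
```

Note that the sketch selects member_id before calling distinct; without that select, distinct would compare whole rows instead of a single column.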

一整个雨季

2021-01-15 21:05
First command:
```
DF.agg(countDistinct("member_id") as "count")
```
returns the same result as SELECT COUNT(DISTINCT member_id) FROM DF.

Second command:
```
DF.distinct.count
```
first removes all duplicate records (entire rows) from the DF and then counts the rows that remain.

一整个雨季

2021-01-15 21:08
```
df.agg(countDistinct("member_id") as "count")
```
returns the number of distinct values of the member_id column, ignoring all other columns, while
```
df.distinct.count
```
will count the number of distinct records in the DataFrame, where two records count as duplicates only if they have identical values in all columns.

So, for example, the DataFrame:
```
+-----------+---------+
|member_name|member_id|
+-----------+---------+
|          a|        1|
|          b|        1|
|          b|        1|
+-----------+---------+
```
has only one distinct member_id value but two distinct records, so the agg option would return 1 while the latter would return 2.
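The example above can be reproduced with a short local job. This is a sketch under stated assumptions: Spark SQL is on the classpath, the job runs with a local master, and the object name `DistinctRowsSketch` and its `counts` helper are hypothetical names chosen for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.countDistinct

// Hypothetical demo object rebuilding the three-row DataFrame shown above.
object DistinctRowsSketch {
  // Returns (countDistinct on member_id, distinct.count over whole rows).
  def counts(spark: SparkSession): (Long, Long) = {
    import spark.implicits._

    val df = Seq(("a", 1), ("b", 1), ("b", 1)).toDF("member_name", "member_id")

    // Looks only at member_id: the single value 1.
    val distinctIds = df.agg(countDistinct("member_id") as "count").first().getLong(0)

    // Compares all columns: (a, 1) and (b, 1) remain after deduplication.
    val distinctRows = df.distinct().count()

    (distinctIds, distinctRows)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("distinct-rows-sketch")
      .getOrCreate()
    val (distinctIds, distinctRows) = counts(spark)
    println(s"countDistinct(member_id) = $distinctIds")  // 1
    println(s"distinct.count           = $distinctRows") // 2
    spark.stop()
  }
}
```

To make the two calls comparable on a multi-column DataFrame, one option is to narrow it first with df.select("member_id").distinct.count, keeping in mind that this form still counts a null as one distinct value.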