How to count unique IDs after groupBy in PySpark

Submitted by て烟熏妆下的殇ゞ on 2020-04-05 15:41:49

Question


I'm using the following code to aggregate students per year. The purpose is to know the total number of students for each year.

from pyspark.sql.functions import col
import pyspark.sql.functions as fn
gr = Df2.groupby(['Year'])
df_grouped = gr.agg(fn.count(col('Student_ID')).alias('total_student_by_year'))

The result is:

[screenshot: students by year]

The problem I discovered is that many IDs are repeated, so the result is wrong and inflated.

I want to aggregate the students by year, count the total number of students per year, and avoid counting repeated IDs.

I hope the question is clear. I'm a new member. Thanks!


Answer 1:


Use the countDistinct function:

from pyspark.sql.functions import countDistinct

# Sample data: (year, student id) pairs with deliberate duplicates
x = [("2001","id1"),("2002","id1"),("2002","id1"),("2001","id1"),("2001","id2"),("2001","id2"),("2002","id2")]
y = spark.createDataFrame(x, ["year", "id"])

# Count each id at most once per year
gr = y.groupBy("year").agg(countDistinct("id"))
gr.show()

Output:

+----+------------------+
|year|count(DISTINCT id)|
+----+------------------+
|2002|                 2|
|2001|                 2|
+----+------------------+
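
If you want the result column to carry a readable name, such as the total_student_by_year alias from the question, you can attach an alias to the aggregate (a minimal sketch reusing the y DataFrame defined above):

from pyspark.sql.functions import countDistinct

# Same aggregation, with a readable name for the distinct count
gr = y.groupBy("year").agg(countDistinct("id").alias("total_student_by_year"))
gr.show()

+----+---------------------+
|year|total_student_by_year|
+----+---------------------+
|2002|                    2|
|2001|                    2|
+----+---------------------+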



Answer 2:


You can also do:

y.groupBy("year", "id").count().groupBy("year").count()

Run against the original DataFrame, this returns the unique students per year: the inner groupBy collapses duplicate IDs within each year, and the outer count tallies what remains.
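
Written out step by step against the example DataFrame y (a minimal sketch reusing the data from Answer 1):

# Step 1: one row per (year, id) pair, however often the pair occurred
per_student = y.groupBy("year", "id").count()

# Step 2: count the remaining rows per year -> unique students per year
per_student.groupBy("year").count().show()

+----+-----+
|year|count|
+----+-----+
|2002|    2|
|2001|    2|
+----+-----+

An equivalent formulation is y.dropDuplicates(["year", "id"]).groupBy("year").count(), which makes the deduplication step explicit.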



Source: https://stackoverflow.com/questions/46421677/how-count-unique-id-after-groupby-in-pyspark
