Question
I'm using the following code to aggregate students per year. The goal is to get the total number of students for each year.
from pyspark.sql.functions import col
import pyspark.sql.functions as fn
gr = Df2.groupby(['Year'])
df_grouped = gr.agg(fn.count(col('Student_ID')).alias('total_student_by_year'))
The result is:
(screenshot: students by year)
The problem I discovered is that many IDs are repeated, so the result is wrong and inflated.
I want to aggregate the students by year and count the total number of students per year while avoiding the repeated IDs.
I hope the question is clear. I'm a new member, thanks.
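For context, a minimal sketch that reproduces the inflated count (the sample rows below are made up; only the names Df2, Year and Student_ID come from the question):

from pyspark.sql import SparkSession
import pyspark.sql.functions as fn

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data: id1 appears twice in 2001.
Df2 = spark.createDataFrame(
    [("2001", "id1"), ("2001", "id1"), ("2001", "id2")],
    ["Year", "Student_ID"],
)

# A plain count() counts rows, so 2001 gets 3 instead of 2 unique students.
Df2.groupby("Year").agg(fn.count("Student_ID").alias("total_student_by_year")).show()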
Answer 1:
Use the countDistinct function:
from pyspark.sql.functions import countDistinct
x = [("2001","id1"),("2002","id1"),("2002","id1"),("2001","id1"),("2001","id2"),("2001","id2"),("2002","id2")]
y = spark.createDataFrame(x,["year","id"])
gr = y.groupBy("year").agg(countDistinct("id"))
gr.show()
Output:
+----+------------------+
|year|count(DISTINCT id)|
+----+------------------+
|2002|                 2|
|2001|                 2|
+----+------------------+
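Applied back to the question's DataFrame (assuming Df2 really has Year and Student_ID columns as described), the same idea keeps the original alias:

from pyspark.sql.functions import countDistinct

# Count each Student_ID at most once per year.
df_grouped = Df2.groupby('Year').agg(
    countDistinct('Student_ID').alias('total_student_by_year')
)
df_grouped.show()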
Answer 2:
You can also do it without countDistinct, by grouping twice on the original DataFrame (here y from the example above):
y.groupBy("year", "id").count().groupBy("year").count()
The first groupBy collapses the duplicates to one row per (year, id) pair, and the second groupBy counts those rows, which gives the number of unique students per year.
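If you want a descriptive column name with this double-groupBy approach, here is a small sketch (reusing the sample DataFrame y from the first answer and the alias from the question):

import pyspark.sql.functions as fn

unique_per_year = (
    y.groupBy("year", "id").count()   # one row per distinct (year, id) pair
     .groupBy("year")
     .agg(fn.count("*").alias("total_student_by_year"))  # count those pairs per year
)
unique_per_year.show()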
Source: https://stackoverflow.com/questions/46421677/how-count-unique-id-after-groupby-in-pyspark