Calculate the mode of a PySpark DataFrame column?

前端未结

关注

 4  559

再見小時候 2021-01-05 15:52

Ultimately what I want is the mode of a column, for all the columns in the DataFrame. For other summary statistics, I see a couple of options: use DataFrame aggregation, or

4条回答

一生所求 (楼主)

2021-01-05 16:33
A problem with mode is pretty much the same as with median. While it is easy to compute, computation is rather expensive. It can be done either using sort followed by local and global aggregations or using just-another-wordcount and filter:
```
import numpy as np
np.random.seed(1)

df = sc.parallelize([
    (int(x), ) for x in np.random.randint(50, size=10000)
]).toDF(["x"])

cnts = df.groupBy("x").count()
mode = cnts.join(
    cnts.agg(max("count").alias("max_")), col("count") == col("max_")
).limit(1).select("x")
mode.first()[0]
## 0
```
Either way it may require a full shuffle for each column.
0 讨论(0)

查看其它4个回答
发布评论:

提交评论
- 加载中...