Ultimately what I want is the mode of a column, for all the columns in the DataFrame. For other summary statistics, I see a couple of options, one of which is DataFrame aggregation.
The problem with mode is much the same as with median: it is easy to express but rather expensive to compute. It can be done either with a sort followed by local and global aggregations, or with a just-another-word-count plus a filter:
from pyspark.sql.functions import col, max as max_
import numpy as np

np.random.seed(1)

# sc is an existing SparkContext; build 10,000 random integers in [0, 50)
df = sc.parallelize([
    (int(x), ) for x in np.random.randint(50, size=10000)
]).toDF(["x"])

# Word-count style: count each value, then keep a value whose count
# equals the global maximum
cnts = df.groupBy("x").count()
mode = cnts.join(
    cnts.agg(max_("count").alias("max_")), col("count") == col("max_")
).limit(1).select("x")

mode.first()[0]
## 0
Either way it may require a full shuffle for each column.
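A minimal sketch of applying the same count-and-join approach to every column (assuming df contains only groupable columns; each column triggers its own shuffle-heavy job):

from pyspark.sql.functions import col, max as max_

def column_mode(df, c):
    # Count each distinct value, then keep a value whose count equals the global maximum.
    cnts = df.groupBy(c).count()
    return cnts.join(
        cnts.agg(max_("count").alias("max_")), col("count") == col("max_")
    ).limit(1).select(c).first()[0]

# One job (and one full shuffle) per column.
modes = {c: column_mode(df, c) for c in df.columns}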
>>> df = newdata.groupBy('columnName').count()
>>> mode = df.orderBy(df['count'].desc()).collect()[0][0]
See my result:
>>> newdata.groupBy('var210').count().show()
+------+-----+
|var210|count|
+------+-----+
| 3av_| 64|
| 7A3j| 509|
| g5HH| 1489|
| oT7d| 109|
| DM_V| 149|
| uKAI|44883|
+------+-----+
# store the above result in df
>>> df=newdata.groupBy('var210').count()
>>> df.orderBy(df['count'].desc()).collect()
[Row(var210='uKAI', count=44883),
Row(var210='g5HH', count=1489),
Row(var210='7A3j', count=509),
Row(var210='DM_V', count=149),
Row(var210='oT7d', count=109),
Row(var210='3av_', count=64)]
# get the first value using collect()
>>> mode = df.orderBy(df['count'].desc()).collect()[0][0]
>>> mode
'uKAI'
Using the groupBy() function, get the count of each category in the column. df is my result DataFrame, which has two columns, var210 and count. Using orderBy() on the 'count' column in descending order puts the maximum count in the first row of the DataFrame. collect()[0][0] is used to get the first field of that first Row.
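Note that collect() pulls every grouped row back to the driver just to read the first one. A lighter sketch of the same idea (assuming the same newdata DataFrame) lets Spark return only a single Row:

>>> # only one Row is returned to the driver
>>> mode_row = newdata.groupBy('var210').count().orderBy('count', ascending=False).first()
>>> mode = mode_row['var210']   # 'uKAI', same result as above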
You can calculate a column's mode in Java as follows:
case MODE:
    // Count each distinct value in the column, then keep the rows whose
    // count equals the global maximum.
    Dataset<Row> cnts = ds.groupBy(column).count();
    Dataset<Row> dsMode = cnts.join(
        cnts.agg(functions.max("count").alias("max_")),
        functions.col("count").equalTo(functions.col("max_")));
    Dataset<Row> mode = dsMode.limit(1).select(column);
    // Extract the mode value from the first (and only remaining) row.
    replaceValue = ((GenericRowWithSchema) mode.first()).values()[0];
    ds = replaceWithValue(ds, column, replaceValue);
    break;

private static Dataset<Row> replaceWithValue(Dataset<Row> ds, String column, Object replaceValue) {
    // Replace nulls in the column with the computed mode.
    return ds.withColumn(column,
        functions.coalesce(functions.col(column), functions.lit(replaceValue)));
}
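For reference, the same null-replacement step translates directly to PySpark (a minimal sketch, assuming a DataFrame ds, a column name column, and a mode value replace_value computed as above):

from pyspark.sql import functions as F

# Replace nulls in `column` with the precomputed mode, mirroring replaceWithValue above.
ds = ds.withColumn(column, F.coalesce(F.col(column), F.lit(replace_value)))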
This line will give you the mode of column "col" in the Spark DataFrame df:
df.groupby("col").count().orderBy("count", ascending=False).first()[0]
For a list of modes for all columns in df use:
[df.groupby(i).count().orderBy("count", ascending=False).first()[0] for i in df.columns]
To add names identifying which mode belongs to which column, make a 2D list:
[[i,df.groupby(i).count().orderBy("count", ascending=False).first()[0]] for i in df.columns]
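As a follow-up sketch (assuming the goal is imputation and each column's mode is non-null), the same per-column modes can be collected into a dict and passed to fillna:

# Map each column name to its mode, then replace nulls with it.
modes = {c: df.groupby(c).count().orderBy("count", ascending=False).first()[0]
         for c in df.columns}
df_filled = df.fillna(modes)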