Question
I have a Spark DataFrame:
spark_df = spark.createDataFrame(
    [(1, 7, 'foo'),
     (2, 6, 'bar'),
     (3, 4, 'foo'),
     (4, 8, 'bar'),
     (5, 1, 'bar')],
    ['v1', 'v2', 'id']
)
Expected Output
   id   avg(v1)   avg(v2)  min(v1)  min(v2)  0.25(v1)    0.25(v2)    0.5(v1)     0.5(v2)
0  bar  3.666667  5.0      2        1        some-value  some-value  some-value  some-value
1  foo  2.000000  5.5      1        4        some-value  some-value  some-value  some-value
So far I can compute the basic stats like avg, min, and max, but not the quantiles. I know this is easy in Pandas, but I have not been able to get it done in PySpark.
I also know about approxQuantile, but I am not able to combine the basic stats with the quantiles in PySpark.
Here is how I get the basic stats like mean and min using agg; I want the quantiles in the same DataFrame:
from pyspark.sql import functions as F

func = [F.mean, F.min]
NUMERICAL_FEATURE_LIST = ['v1', 'v2']
GROUP_BY_FIELDS = ['id']
exp = [f(F.col(c)) for f in func for c in NUMERICAL_FEATURE_LIST]
df_fin = spark_df.groupby(*GROUP_BY_FIELDS).agg(*exp)
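To illustrate the approxQuantile problem mentioned above: it returns plain Python lists rather than aggregation expressions, so it cannot go into the same agg (a minimal sketch, assuming the spark_df above):
# approxQuantile(cols, probabilities, relativeError) runs over the whole
# DataFrame and returns [[q25(v1), q50(v1)], [q25(v2), q50(v2)]] as plain
# lists; there is no per-group ('id') variant.
quantiles = spark_df.approxQuantile(['v1', 'v2'], [0.25, 0.5], 0.01)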
Answer 1:
Perhaps this is helpful (the example below is in Scala):
import spark.implicits._ // required for Seq(...).toDF

val spark_df = Seq((1, 7, "foo"),
(2, 6, "bar"),
(3, 4, "foo"),
(4, 8, "bar"),
(5, 1, "bar")
).toDF("v1", "v2", "id")
spark_df.show(false)
spark_df.printSchema()
spark_df.summary() // default= "count", "mean", "stddev", "min", "25%", "50%", "75%", "max"
.show(false)
/**
* +---+---+---+
* |v1 |v2 |id |
* +---+---+---+
* |1 |7 |foo|
* |2 |6 |bar|
* |3 |4 |foo|
* |4 |8 |bar|
* |5 |1 |bar|
* +---+---+---+
*
* root
* |-- v1: integer (nullable = false)
* |-- v2: integer (nullable = false)
* |-- id: string (nullable = true)
*
* +-------+------------------+------------------+----+
* |summary|v1 |v2 |id |
* +-------+------------------+------------------+----+
* |count |5 |5 |5 |
* |mean |3.0 |5.2 |null|
* |stddev |1.5811388300841898|2.7748873851023217|null|
* |min |1 |1 |bar |
* |25% |2 |4 |null|
* |50% |3 |6 |null|
* |75% |4 |7 |null|
* |max |5 |8 |foo |
* +-------+------------------+------------------+----+
*/
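The same summary call is available from PySpark as well; a minimal sketch, assuming the spark_df from the question (pass statistic names to restrict the output):
spark_df.summary("mean", "min", "25%", "50%").show()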
If you need the output in the format you described, use the answer below.
Answer 2:
I think a syntax like this is what you're looking for:
spark_df.createOrReplaceTempView("spark_table")
spark.sql("SELECT id, AVG(v1) AS avg_v1, AVG(v2) AS avg_v2, \
           MIN(v1) AS min_v1, MIN(v2) AS min_v2, \
           percentile_approx(v1, 0.25) AS p25_v1, percentile_approx(v2, 0.25) AS p25_v2, \
           percentile_approx(v1, 0.5) AS p50_v1, percentile_approx(v2, 0.5) AS p50_v2 \
           FROM spark_table GROUP BY id").show(5)
It helps to create aliases because unformatted column names are a pain to work with.
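The same aggregation can also be written with the DataFrame API, routing percentile_approx through F.expr so the quantiles land in the same agg as the basic stats (a sketch, assuming the spark_df and F import from the question; the alias names are just illustrative):
from pyspark.sql import functions as F

df_fin = spark_df.groupby('id').agg(
    F.avg('v1').alias('avg_v1'),
    F.avg('v2').alias('avg_v2'),
    F.min('v1').alias('min_v1'),
    F.min('v2').alias('min_v2'),
    # percentile_approx is a SQL function; F.expr makes it usable inside agg
    F.expr('percentile_approx(v1, 0.25)').alias('p25_v1'),
    F.expr('percentile_approx(v2, 0.25)').alias('p25_v2'),
    F.expr('percentile_approx(v1, 0.5)').alias('p50_v1'),
    F.expr('percentile_approx(v2, 0.5)').alias('p50_v2'),
)
df_fin.show()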
Answer 3:
The describe method computes statistics such as count, mean, stddev, min, and max for the columns of the DataFrame:
spark_df.describe().show()
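On the spark_df from the question this prints the following (the values match the summary output in Answer 1; note that describe reports count, mean, stddev, min, and max, but no percentiles):
+-------+------------------+------------------+----+
|summary|v1                |v2                |id  |
+-------+------------------+------------------+----+
|count  |5                 |5                 |5   |
|mean   |3.0               |5.2               |null|
|stddev |1.5811388300841898|2.7748873851023217|null|
|min    |1                 |1                 |bar |
|max    |5                 |8                 |foo |
+-------+------------------+------------------+----+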
Source: https://stackoverflow.com/questions/62366103/pyspark-how-to-get-basic-stats-mean-min-max-along-with-quantiles-25-50