In PySpark, I have a column containing a variable-length array of doubles for which I would like to find the mean. However, the built-in average function requires a single numeric type.
Is there a way to compute the mean of an array column?
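For reference, a minimal setup that reproduces the situation might look like the following (the column name longitude and the sample values are illustrative placeholders, loosely based on the output shown further down):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Each row holds a variable-length array of doubles.
df = spark.createDataFrame(
    [([-80.9, -82.9],), ([-82.92, -82.93, -82.94],), ([-82.93, -82.93],)],
    ["longitude"],
)

df.printSchema()
# root
#  |-- longitude: array (nullable = true)
#  |    |-- element: double (containsNull = true)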
In recent Spark versions (2.4 or later), the most efficient solution is to use the aggregate higher-order function:
# Sum the array elements with aggregate(), then divide by the array size to get the mean.
query = """aggregate(
    `{col}`,
    CAST(0.0 AS double),
    (acc, x) -> acc + x,
    acc -> acc / size(`{col}`)
) AS `avg_{col}`""".format(col="longitude")

df.selectExpr("*", query).show()
+--------------------+------------------+
| longitude| avg_longitude|
+--------------------+------------------+
| [-80.9, -82.9]| -81.9|
|[-82.92, -82.93, ...|-82.93166666666667|
| [-82.93, -82.93]| -82.93|
+--------------------+------------------+
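If you are on Spark 3.1 or later, the same logic can also be expressed through the DataFrame API's aggregate function instead of a SQL string. This is a sketch of the equivalent call, assuming the column is named longitude as above:

from pyspark.sql import functions as F

df.select(
    "*",
    F.aggregate(
        "longitude",
        F.lit(0.0).cast("double"),              # initial accumulator value
        lambda acc, x: acc + x,                  # merge: add each element to the running sum
        lambda acc: acc / F.size("longitude"),   # finish: divide the sum by the array length
    ).alias("avg_longitude"),
).show()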
See also Spark Scala row-wise average by handling null