In PySpark, I have a column containing a variable-length array of doubles for which I would like to find the mean. However, the built-in average function requires a single numeric type.
Is there a way to compute the mean of an array column?
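For reference, a minimal setup that reproduces the situation might look like the following (the column name longitude and the sample values are illustrative placeholders, loosely based on the output shown further down):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Each row holds a variable-length array of doubles.
df = spark.createDataFrame(
    [([-80.9, -82.9],), ([-82.92, -82.93, -82.94],), ([-82.93, -82.93],)],
    ["longitude"],
)

df.printSchema()
# root
#  |-- longitude: array (nullable = true)
#  |    |-- element: double (containsNull = true)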
In recent Spark versions (2.4 or later), the most efficient solution is to use the aggregate higher-order function:
# Sum the array elements with aggregate(), then divide by the array size to get the mean.
query = """aggregate(
    `{col}`,
    CAST(0.0 AS double),
    (acc, x) -> acc + x,
    acc -> acc / size(`{col}`)
) AS `avg_{col}`""".format(col="longitude")

df.selectExpr("*", query).show()
+--------------------+------------------+
| longitude| avg_longitude|
+--------------------+------------------+
| [-80.9, -82.9]| -81.9|
|[-82.92, -82.93, ...|-82.93166666666667|
| [-82.93, -82.93]| -82.93|
+--------------------+------------------+
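If you are on Spark 3.1 or later, the same logic can also be expressed through the DataFrame API's aggregate function instead of a SQL string. This is a sketch of the equivalent call, assuming the column is named longitude as above:

from pyspark.sql import functions as F

df.select(
    "*",
    F.aggregate(
        "longitude",
        F.lit(0.0).cast("double"),              # initial accumulator value
        lambda acc, x: acc + x,                  # merge: add each element to the running sum
        lambda acc: acc / F.size("longitude"),   # finish: divide the sum by the array length
    ).alias("avg_longitude"),
).show()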
See also Spark Scala row-wise average by handling null