How to obtain the element-wise average of an array-type column in Scala Spark over all rows?


Question


I have an array column with 512 double elements and want the element-wise average across all rows. Take an array column of length 3 as an example:

// Assumes a spark-shell or an active SparkSession named `spark`.
import org.apache.spark.sql.functions.split
import spark.implicits._

val x = Seq("2 4 6", "0 0 0").toDF("value")
  .withColumn("value", split($"value", " "))
x.printSchema()
x.show()


root
 |-- value: array (nullable = true)
 |    |-- element: string (containsNull = true)

+---------+
|    value|
+---------+
|[2, 4, 6]|
|[0, 0, 0]|
+---------+
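
Note that split yields an array<string>, as the schema above shows. If you want actual doubles in the column (as with the 512-element case), a minimal sketch that casts the whole array, assuming all entries are parseable numbers:

// Cast the array<string> produced by split to array<double>.
// avg would cast element-by-element anyway, but this makes the schema explicit.
val xd = x.withColumn("value", $"value".cast("array<double>"))
xd.printSchema()  // element: double (containsNull = true)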

The following result is desired:

x.select(..... as "avg_value").show()

------------
|avg_value |
------------
|[1,2,3]   |
------------

Answer 1:


Treat each array element as a column, compute the average of each, and then assemble those averages back into an array:

import org.apache.spark.sql.functions.{array, avg}

val array_size = 3
// One avg aggregate per array position.
val avgAgg = for (i <- 0 until array_size) yield avg($"value".getItem(i))
x.select(array(avgAgg: _*).alias("avg_value")).show(false)

Gives:

+---------------+
|avg_value      |
+---------------+
|[1.0, 2.0, 3.0]|
+---------------+
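
A hard-coded array_size is awkward for the 512-element case in the question. A minimal sketch that reads the length from the data instead, assuming every row's array has the same length:

import org.apache.spark.sql.functions.size

// Derive the array length from the first row rather than hard-coding it.
val array_size = x.select(size($"value")).first().getInt(0)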



Answer 2:


This should do the trick for a constant-sized array:

from pyspark.sql.functions import col, avg, array

# Three rows: [0, 1, 2], [1, 2, 3], [2, 3, 4]
df = spark.createDataFrame([[[x, x + 1, x + 2]] for x in range(3)], ['value'])
# Read the array length from the first row rather than hard-coding it.
num_array_elements = len(df.select("value").first()[0])
df.agg(array(*[avg(col("value")[i]) for i in range(num_array_elements)])
      .alias("avgValuesPerElement")).show()

returns:

+-------------------+
|avgValuesPerElement|
+-------------------+
|    [1.0, 2.0, 3.0]|
+-------------------+

I thought the question asked for PySpark; leaving this here for PySpark users.
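
For a long array such as the 512-element one in the question, generating one aggregate column per position can get unwieldy. As an alternative not shown in either answer, a hedged Scala sketch that explodes each array with its positions, averages per position, and reassembles the ordered result; sorting the collected (pos, avg) structs keeps the output array in position order:

import org.apache.spark.sql.functions.{avg, collect_list, posexplode, sort_array, struct}

// Explode each array into (pos, elem) rows and average per position.
val avgByPos = x
  .select(posexplode($"value").as(Seq("pos", "elem")))
  .groupBy($"pos")
  .agg(avg($"elem").as("avg"))

// Gather the per-position averages back into one array, ordered by pos.
avgByPos
  .agg(sort_array(collect_list(struct($"pos", $"avg"))).as("pairs"))
  .select($"pairs.avg".as("avg_value"))
  .show(false)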



Source: https://stackoverflow.com/questions/59532225/how-to-obtain-the-average-of-an-array-type-column-in-scala-spark-over-all-row-en
