How to find the average of an array column based on index in PySpark


Question


I have data as shown below

-----------------------------
place  | key        | weights
----------------------------
amazon | lion       | [ 34, 23, 56 ]
north  | bear       | [ 90, 45]
amazon | lion       | [ 38, 30, 50 ]
amazon | bear       | [ 45 ]
amazon | bear       | [ 40 ]

I am trying to get a result like the one below

-----------------------------
place  | key        | average
----------------------------
amazon | lion1      | 36.0      #(34 + 38)/2
amazon | lion2      | 26.5      #(23 + 30)/2
amazon | lion3      | 53.0      #(50 + 56)/2
north  | bear1      | 90        #(90)/1
north  | bear2      | 45        #(45)/1
amazon | bear1      | 42.5      #(45 + 40)/2

I understand that first I have to group by the place and key columns, and then take the average of the array elements by index. For example, lion1 is the average of the first elements of the arrays [ 34, 23, 56 ] and [ 38, 30, 50 ].

I already have a solution using posexplode, but the problem is that in the real data the weights array column is very large. Because posexplode adds more rows, the data size increases enormously, from 10 million rows to 1.2 billion, and it cannot be computed in a reliable time on the present cluster.
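
For reference, a minimal sketch of that posexplode-based approach (assuming the place, key and weights columns shown above; the names exploded and averaged are just illustrative) looks roughly like this:

    from pyspark.sql import functions as F

    # Explode each weight together with its 0-based position, then average per index.
    # Every input row becomes one row per array element, which is what inflates
    # the row count so badly on the real data.
    exploded = df.selectExpr('place', 'key', 'posexplode(weights) as (pos, w)')
    averaged = (exploded
                .groupby('place', 'key', 'pos')
                .agg(F.mean('w').alias('average'))
                .select('place',
                        F.concat('key', (F.col('pos') + 1).cast('string')).alias('key'),
                        'average'))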

I think it is better to add more columns than rows and then unpivot the columns, but I have no idea how to achieve that using PySpark or Spark SQL 2.2.1.


Answer 1:


You can find the maximum number of elements in the array column with functions.size() and then expand that column:

  1. Set up the data

    from pyspark.sql import functions as F
    
    df = spark.createDataFrame([    
          ('amazon', 'lion', [ 34, 23, 56 ])
        , ('north',  'bear', [ 90, 45])
        , ('amazon', 'lion', [ 38, 30, 50 ])
        , ('amazon', 'bear', [ 45 ])    
        , ('amazon', 'bear', [ 40 ])
    ], ['place', 'key', 'average'])
    
  2. Find the max number of elements in the array field 'average'

    n = df.select(F.max(F.size('average')).alias('n')).first().n
    
    >>> n
    3
    
  3. Convert the array column into n columns

    df1 = df.select('place', 'key', *[F.col('average')[i].alias('val_{}'.format(i+1)) for i in range(n)])
    
    >>> df1.show()
    +------+----+-----+-----+-----+
    | place| key|val_1|val_2|val_3|
    +------+----+-----+-----+-----+
    |amazon|lion|   34|   23|   56|
    | north|bear|   90|   45| null|
    |amazon|lion|   38|   30|   50|
    |amazon|bear|   45| null| null|
    |amazon|bear|   40| null| null|
    +------+----+-----+-----+-----+
    
  4. Calculate the mean of the new columns within each group

    df2 = df1.groupby('place', 'key').agg(*[ F.mean('val_{}'.format(i+1)).alias('average_{}'.format(i+1)) for i in range(n)])
    
    >>> df2.show()
    +------+----+---------+---------+---------+
    | place| key|average_1|average_2|average_3|
    +------+----+---------+---------+---------+
    |amazon|bear|     42.5|     null|     null|
    | north|bear|     90.0|     45.0|     null|
    |amazon|lion|     36.0|     26.5|     53.0|
    +------+----+---------+---------+---------+
    
  5. Unpivot the columns using select + union + reduce (an alternative using stack() is sketched after these steps)

    from functools import reduce
    
    df_new = reduce(lambda x,y: x.union(y), [
        df2.select('place', F.concat('key', F.lit(i+1)).alias('key'), F.col('average_{}'.format(i+1)).alias('average')) \
           .dropna(subset=['average']) for i in range(n)
    ])
    
    >>> df_new.show()
    +------+-----+-------+
    | place|  key|average|
    +------+-----+-------+
    |amazon|bear1|   42.5|
    | north|bear1|   90.0|
    |amazon|lion1|   36.0|
    | north|bear2|   45.0|
    |amazon|lion2|   26.5|
    |amazon|lion3|   53.0|
    +------+-----+-------+
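
As a possible variation on step 5, and assuming the stack() generator function is available in your Spark SQL version, the same unpivot can be written as a single selectExpr instead of a union of selects (df_new2 here is just an illustrative name):

    # Build a stack() expression that pairs each generated key (key1, key2, ...)
    # with its corresponding average column, covering all n value columns (n = 3 here).
    stack_expr = 'stack({}, {}) as (key, average)'.format(
        n,
        ', '.join("concat(key, '{0}'), average_{0}".format(i + 1) for i in range(n))
    )

    df_new2 = df2.selectExpr('place', stack_expr).where('average is not null')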
    



Answer 2:


One option is to merge all the arrays for a given place, key combination into an array of arrays. On this array of arrays, you can use a UDF that computes the desired averages, and finally posexplode to get the desired result. Since posexplode now runs only on the already aggregated rows (one per place, key group), the row count is bounded by the number of groups times the array length rather than by exploding every one of the original rows.

from pyspark.sql.functions import collect_list, udf, posexplode, concat
from pyspark.sql.types import ArrayType, DoubleType

# Group by place, key to get an array of arrays
grouped_df = df.groupBy(df.place, df.key).agg(collect_list(df.weights).alias('all_weights'))

# Define a UDF that averages the collected arrays element-wise
zip_mean = udf(lambda args: [sum(i) / len(i) for i in zip(*args)], ArrayType(DoubleType()))

# Apply the UDF on the array-of-arrays column
res = grouped_df.select('*', zip_mean(grouped_df.all_weights).alias('average'))

# posexplode to explode the average values and get the position for key concatenation
res = res.select('*', posexplode(res.average))

# Final result
res.select(res.place, concat(res.key, res.pos + 1).alias('key'), res.col.alias('average')).show()


Source: https://stackoverflow.com/questions/56695395/how-to-find-average-of-a-array-column-based-on-index-in-pyspark
