I have what seems like a simple problem, but I keep banging my head against the wall with no success. I am essentially trying to do the same thing as this post, except that I need to do it in PySpark rather than Scala.
I think you have to cast the vector column to an array before you can aggregate it.
from pyspark.ml.linalg import Vectors
from pyspark.sql import functions as F
from pyspark.sql import types as T

# UDF that converts an ML vector into a plain Python list of floats
def vec2array(v):
    v = Vectors.dense(v)
    return [float(x) for x in v]

vec2array_udf = F.udf(vec2array, T.ArrayType(T.FloatType()))

df = df.withColumn('Vec', vec2array_udf('Vec'))

# the length of the first vector tells us how many element positions to sum
n = len(df.select('Vec').first()[0])
bla = df.agg(F.array(*[F.sum(F.col("Vec")[i]) for i in range(n)]).alias("sum"))
bla.show(truncate=False)
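As an aside, on Spark 3.0+ you can skip the UDF entirely, since pyspark.ml.functions.vector_to_array does the same conversion natively. A minimal sketch, assuming the same 'Vec' column as above:

from pyspark.ml.functions import vector_to_array
from pyspark.sql import functions as F

# vector_to_array (Spark 3.0+) replaces the hand-rolled UDF
df = df.withColumn('Vec', vector_to_array('Vec'))

n = len(df.select('Vec').first()[0])
df.agg(F.array(*[F.sum(F.col('Vec')[i]) for i in range(n)]).alias('sum')).show(truncate=False)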
I eventually figured this out (I'm lying, one of my coworkers figured it out for me), so I'll post the answer here in case anyone else has the same issue.
You can use fold, similar to how it's done in the Scala example linked in the original question. The syntax in PySpark looks like this:
# find out how many Xs we're iterating over to establish the range below
vec_df = df.select('Vec')
num_cols = len(vec_df.first().Vec)

# pull the vector out of each Row, then fold over the rows, summing element-wise
vec_sums = vec_df.rdd.map(lambda row: row.Vec).fold(
    [0] * num_cols,
    lambda a, b: [x + y for x, y in zip(a, b)]
)
Brief explanation: rdd.fold() takes two arguments. The first is a zero value, in this case [0]*num_cols, which is just a list of zeros the same length as the vectors. The second is a function that combines the running result with each element of the RDD: for every row's vector it runs lambda a, b: [x + y for x, y in zip(a, b)], which adds that vector element-wise to what has been accumulated so far. Note the map step before the fold: df.rdd yields Row objects, so you have to pull the vector out of each Row first, otherwise the zip pairs the accumulator against the Row itself instead of the vector's elements. Spark also reuses the zero value when merging per-partition results, which is harmless here because it's all zeros.
You can use my code in the original question to generate a toy dataframe to test this on. Hope that's helpful to someone.
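In case you don't have the original question handy, here's a minimal made-up DataFrame (hypothetical values, not the original's) to sanity-check the fold approach:

from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# toy data: three rows of 3-element vectors (made-up values)
df = spark.createDataFrame(
    [(Vectors.dense([1.0, 2.0, 3.0]),),
     (Vectors.dense([4.0, 5.0, 6.0]),),
     (Vectors.dense([7.0, 8.0, 9.0]),)],
    ['Vec']
)

vec_df = df.select('Vec')
num_cols = len(vec_df.first().Vec)
vec_sums = vec_df.rdd.map(lambda row: row.Vec).fold(
    [0] * num_cols,
    lambda a, b: [x + y for x, y in zip(a, b)]
)
print(vec_sums)  # expected: [12.0, 15.0, 18.0]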