pyspark - aggregate (sum) vector element-wise

南方客 · 2021-01-19 15:56

I have what seems like a simple problem, but I keep banging my head against the wall with no success. I am essentially trying to do the same thing as this post, except that I …
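For reference, here is a minimal, hypothetical version of the setup (the question text above is truncated, so the exact data is an assumption): a DataFrame with a column Vec of ML vectors whose elements should be summed position by position across all rows.

    # Hypothetical example data; the question's actual DataFrame is not shown.
    from pyspark.sql import SparkSession
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [(Vectors.dense([1.0, 2.0, 3.0]),),
         (Vectors.dense([4.0, 5.0, 6.0]),)],
        ["Vec"],
    )
    # Desired element-wise sum across rows: [5.0, 7.0, 9.0]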

2 Answers
    隐瞒了意图╮ · 2021-01-19 16:32

    I think you have to cast the vector column to an array before you can aggregate it.

    from pyspark.ml.linalg import Vectors
    from pyspark.sql import functions as F
    from pyspark.sql import types as T

    def vec2array(v):
        # Ensure a DenseVector, then unpack it into a plain Python list of floats
        v = Vectors.dense(v)
        return [float(x) for x in v]

    vec2array_udf = F.udf(vec2array, T.ArrayType(T.FloatType()))

    # Replace the vector column with an equivalent array column
    df = df.withColumn('Vec', vec2array_udf('Vec'))

    # Number of elements per vector, taken from the first row
    n = len(df.select('Vec').first()[0])

    # Sum each position across all rows and collect the per-position sums into one array
    bla = df.agg(F.array(*[F.sum(F.col("Vec")[i]) for i in range(n)]).alias("sum"))
    bla.show(truncate=False)
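
    With the hypothetical two-row DataFrame sketched under the question, this should print roughly:

        +---------------+
        |sum            |
        +---------------+
        |[5.0, 7.0, 9.0]|
        +---------------+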
    
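    As a side note beyond the original answer: on Spark 3.0+ the hand-rolled UDF can likely be replaced by the built-in pyspark.ml.functions.vector_to_array, which avoids Python UDF overhead. A minimal sketch under that assumption, starting again from the vector-typed df:

    # Sketch assuming Spark >= 3.0, where pyspark.ml.functions.vector_to_array exists
    from pyspark.ml.functions import vector_to_array
    from pyspark.sql import functions as F

    df_arr = df.withColumn('Vec', vector_to_array(F.col('Vec')))
    n = len(df_arr.select('Vec').first()[0])
    df_arr.agg(F.array(*[F.sum(F.col('Vec')[i]) for i in range(n)]).alias("sum")) \
          .show(truncate=False)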
