Issue with UDF on a column of Vectors in PySpark DataFrame

Backend | unresolved | 1 answer | 1686 views
Asked by 庸人自扰 on 2021-01-14 21:10

I am having trouble using a UDF on a column of Vectors in a PySpark DataFrame, as illustrated here:

from pyspark import S
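The snippet above is cut off, so here is a minimal sketch of the kind of setup that runs into this problem (the data, column names, and the Spark 1.x-style mllib vectors are assumptions for illustration, not the asker's original code):

from pyspark import SparkContext
from pyspark.sql import SQLContext, Row
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
from pyspark.mllib.linalg import Vectors

sc = SparkContext()
sqlContext = SQLContext(sc)

# Hypothetical example data: an id plus a 'features' column of dense vectors.
df = sqlContext.createDataFrame([
    Row(id=0, features=Vectors.dense([9.7, 1.0, -3.2])),
    Row(id=1, features=Vectors.dense([2.25, -11.1, 123.2])),
])

# Summing the vector directly inside the UDF is what trips the error.
vector_udf = udf(lambda vector: sum(vector), DoubleType())
df.withColumn('feature_sums', vector_udf(df.features)).first()
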
1 Answer
  • 2021-01-14 22:06

    In Spark SQL, vectors are represented as a (type, size, indices, values) tuple.

    You can use a UDF on a vector column in PySpark; just adjust the code to pull the values out of that representation (index 3):

    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    # Field 3 of the serialized vector is its values array.
    vector_udf = udf(lambda vector: sum(vector[3]), DoubleType())
    df.withColumn('feature_sums', vector_udf(df.features)).first()
    

    For the details of this representation, see Vectors.scala in the Spark source:
    https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala
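
    If the tuple indexing feels opaque, a sketch of an alternative (assuming a newer PySpark where the UDF receives the vector object itself rather than the serialized tuple) sums the vector's values attribute and casts to a plain float for DoubleType:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    # Assumption: the UDF receives a DenseVector/SparseVector, whose .values
    # holds the stored entries as a NumPy array.
    sum_udf = udf(lambda v: float(v.values.sum()), DoubleType())
    df.withColumn('feature_sums', sum_udf(df.features)).first()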
