Spark Dataframe of WrappedArray to Dataframe[Vector]

Backend · Unresolved · 1 answer · 1839 views
北荒 2021-01-24 05:16

I have a Spark DataFrame df with the following schema:

root
 |-- features: array (nullable = true)
 |    |-- element: double (containsNull = true)


1 Answer
  • 2021-01-24 06:17

    It's because .rdd has to deserialize objects from the internal in-memory format, which is very time-consuming.

    It's fine to use .toArray here — it operates at the row level, so you are not collecting everything to the driver node.

    You can do this very easily with a UDF:

    import org.apache.spark.ml.linalg._
    import org.apache.spark.sql.functions.udf

    // Convert each array<double> row into a dense ml Vector
    val convertUDF = udf((array: Seq[Double]) => {
      Vectors.dense(array.toArray)
    })

    val withVector = dataset
      .withColumn("features", convertUDF('features))
    

    Code is from this answer: Convert ArrayType(FloatType,false) to VectorUTD

    Note, however, that the author of that question didn't ask about the performance differences between the approaches.
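
    If you are on Spark 3.1 or later, there is a built-in alternative that avoids writing a UDF at all: org.apache.spark.ml.functions.array_to_vector. A minimal, self-contained sketch — the sample data and the "features" column name mirror the question's schema, and master("local[*]") is assumed purely for a local demonstration:

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.ml.functions.array_to_vector
    import org.apache.spark.ml.linalg.Vectors

    val spark = SparkSession.builder()
      .master("local[*]")              // assumed: local run for illustration only
      .appName("array-to-vector")
      .getOrCreate()
    import spark.implicits._

    // Sample data mirroring the question's schema: features is array<double>
    val df = Seq(Tuple1(Seq(1.0, 2.0, 3.0))).toDF("features")

    // Replace the array column with an ml Vector column, no UDF needed
    val withVector = df.withColumn("features", array_to_vector($"features"))
    withVector.printSchema()
    ```

    array_to_vector was added in Spark 3.1; on older versions the UDF above remains the standard route.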
