I have a Spark DataFrame df with the following schema:
root
 |-- features: array (nullable = true)
 |    |-- element: double (containsNull = false)
It's because .rdd has to deserialize objects from Spark's internal in-memory format, and that is very time-consuming. It's fine to use .toArray here: you are operating at the row level, not collecting everything to the driver node.
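For contrast, here is a minimal sketch of the .rdd route described above (the name viaRdd is mine, and it assumes the df from the question with an array-of-double features column):

import org.apache.spark.ml.linalg.Vectors

val viaRdd = df.rdd.map { row =>
  // Each Row must first be deserialized out of the internal binary format,
  // which is what makes this approach slow
  Vectors.dense(row.getAs[Seq[Double]]("features").toArray)
}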
You can do this very easily with a UDF:
import org.apache.spark.ml.linalg._
import org.apache.spark.sql.functions.{col, udf}

// Wrap the array-to-Vector conversion in a UDF so it runs row by row
val convertUDF = udf((array: Seq[Double]) => {
  Vectors.dense(array.toArray)
})

val withVector = dataset
  .withColumn("features", convertUDF(col("features")))
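If you want to sanity-check the result (a sketch, using the withVector value from above), the column should now show up as a vector in the schema:

withVector.printSchema()
// root
//  |-- features: vector (nullable = true)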
The code is from this answer: Convert ArrayType(FloatType,false) to VectorUTD. However, the author of that question didn't ask about the differences between the two approaches.