Question
I want to combine multiple columns into a single vector column using VectorAssembler, but the output is compressed by default (some rows come back as sparse vectors), and there seems to be no option to turn this off.
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions.col
val arr2 = Array((1,2,0,0,0),(1,2,3,0,0),(1,2,4,5,0),(1,2,2,5,6))
val df = sc.parallelize(arr2).toDF("a","b","c","e","f")
val colNames = Array("a","b","c","e","f")
val assembler = new VectorAssembler()
  .setInputCols(colNames)
  .setOutputCol("newCol")
val transDF = assembler.transform(df).select(col("newCol"))
transDF.show(false)
The input is:
+---+---+---+---+---+
|  a|  b|  c|  e|  f|
+---+---+---+---+---+
|  1|  2|  0|  0|  0|
|  1|  2|  3|  0|  0|
|  1|  2|  4|  5|  0|
|  1|  2|  2|  5|  6|
+---+---+---+---+---+
The result is:
+---------------------+
|newCol               |
+---------------------+
|(5,[0,1],[1.0,2.0])  |
|[1.0,2.0,3.0,0.0,0.0]|
|[1.0,2.0,4.0,5.0,0.0]|
|[1.0,2.0,2.0,5.0,6.0]|
+---------------------+
My expected result is:
+---------------------+
|newCol               |
+---------------------+
|[1.0,2.0,0.0,0.0,0.0]|
|[1.0,2.0,3.0,0.0,0.0]|
|[1.0,2.0,4.0,5.0,0.0]|
|[1.0,2.0,2.0,5.0,6.0]|
+---------------------+
What should I do to get my expected result?
Answer 1:
If you really want to coerce all vectors to their dense representation, you can do it with a user-defined function (UDF):
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.udf
val toDense = udf((v: Vector) => v.toDense)
transDF.select(toDense($"newCol")).show
+--------------------+
|         UDF(newCol)|
+--------------------+
|[1.0,2.0,0.0,0.0,...|
|[1.0,2.0,3.0,0.0,...|
|[1.0,2.0,4.0,5.0,...|
|[1.0,2.0,2.0,5.0,...|
+--------------------+
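For context on why only the first row came back sparse: as far as I can tell, VectorAssembler builds each row and then calls `compressed` on it, which keeps whichever representation takes less space. The sketch below re-implements that size check in plain Scala, with no Spark dependency. The object name `CompressionSketch` and the helper `prefersSparse` are hypothetical, and the `1.5 * (nnz + 1) < size` rule reflects my reading of the Spark source, so treat it as an assumption rather than a documented contract.

```scala
// Hypothetical sketch: mimics the size heuristic that Vector.compressed
// appears to use. Not part of the Spark API.
object CompressionSketch {
  // Assumed rule: a dense vector costs about 8 * size bytes and a sparse one
  // about 12 * nnz bytes (plus overhead), so sparse wins when
  // 1.5 * (nnz + 1) < size.
  def prefersSparse(values: Array[Double]): Boolean = {
    val nnz = values.count(_ != 0.0) // number of non-zero entries
    1.5 * (nnz + 1.0) < values.length
  }

  def main(args: Array[String]): Unit = {
    // The four rows from the question, as doubles.
    val rows = Seq(
      Array(1.0, 2.0, 0.0, 0.0, 0.0),
      Array(1.0, 2.0, 3.0, 0.0, 0.0),
      Array(1.0, 2.0, 4.0, 5.0, 0.0),
      Array(1.0, 2.0, 2.0, 5.0, 6.0)
    )
    rows.foreach { r =>
      val repr = if (prefersSparse(r)) "sparse" else "dense"
      println(s"${r.mkString("[", ",", "]")} -> $repr")
    }
  }
}
```

Only the first row, with two non-zero values out of five, passes the check, which matches the output in the question. Both representations hold the same values, so downstream ML stages should treat them the same way; the UDF above is mainly useful when you need a uniform display or export format.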
Source: https://stackoverflow.com/questions/48517220/how-to-make-vectorassembler-do-not-compress-data