How to make VectorAssembler not compress data?

Submitted by 僤鯓⒐⒋嵵緔 on 2021-01-28 05:32:51

Question


I want to combine multiple columns into one column using VectorAssembler, but by default the output is compressed (some rows are stored as sparse vectors), and I cannot find an option to turn this off.

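import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions.col

// Sample data: rows with a varying number of trailing zeros
val arr2 = Array((1,2,0,0,0),(1,2,3,0,0),(1,2,4,5,0),(1,2,2,5,6))
val df = sc.parallelize(arr2).toDF("a","b","c","e","f")
val colNames = Array("a","b","c","e","f")
// Assemble all five columns into a single vector column
val assembler = new VectorAssembler()
  .setInputCols(colNames)
  .setOutputCol("newCol")
val transDF = assembler.transform(df).select(col("newCol"))
transDF.show(false)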

The input is:

  +---+---+---+---+---+
  |  a|  b|  c|  e|  f|
  +---+---+---+---+---+
  |  1|  2|  0|  0|  0|
  |  1|  2|  3|  0|  0|
  |  1|  2|  4|  5|  0|
  |  1|  2|  2|  5|  6|
  +---+---+---+---+---+

The result is:

+---------------------+
|newCol               |
+---------------------+
|(5,[0,1],[1.0,2.0])  |
|[1.0,2.0,3.0,0.0,0.0]|
|[1.0,2.0,4.0,5.0,0.0]|
|[1.0,2.0,2.0,5.0,6.0]|
+---------------------+

My expected result is:

+---------------------+
|newCol               |
+---------------------+
|[1.0,2.0,0.0,0.0,0.0]|
|[1.0,2.0,3.0,0.0,0.0]|
|[1.0,2.0,4.0,5.0,0.0]|
|[1.0,2.0,2.0,5.0,6.0]|
+---------------------+

What should I do to get the expected result?


Answer 1:


If you really want to coerce all vectors to their dense representation, you can do it with a user-defined function (UDF):

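import org.apache.spark.sql.functions.udf
// Convert every vector (sparse or dense) to its dense form
val toDense = udf((v: org.apache.spark.ml.linalg.Vector) => v.toDense)
transDF.select(toDense($"newCol")).show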

+--------------------+
|         UDF(newCol)|
+--------------------+
|[1.0,2.0,0.0,0.0,...|
|[1.0,2.0,3.0,0.0,...|
|[1.0,2.0,4.0,5.0,...|
|[1.0,2.0,2.0,5.0,...|
+--------------------+
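
For context, here is a minimal sketch of why only the first row comes out sparse, and why converting it back loses nothing. It assumes that VectorAssembler ends by calling `compressed` on the assembled vector, which as far as I can tell is how it decides between the two formats. It uses only local vectors from `org.apache.spark.ml.linalg`, so no SparkSession is needed. The object name `CompressedVsDense` is just an illustrative choice.

```scala
import org.apache.spark.ml.linalg.{DenseVector, SparseVector, Vector, Vectors}

object CompressedVsDense {
  def main(args: Array[String]): Unit = {
    // First row of the question's data, written out as a dense vector.
    val row1: Vector = Vectors.dense(1.0, 2.0, 0.0, 0.0, 0.0)

    // `compressed` returns whichever representation needs less storage.
    // With only two non-zero entries out of five, that is the sparse form,
    // printed as (size, [indices], [values]).
    val stored: Vector = row1.compressed
    println(s"sparse? ${stored.isInstanceOf[SparseVector]}  value: $stored")

    // `toDense` restores the explicit form; this is what the UDF above does per row.
    val dense: DenseVector = stored.toDense
    println(s"dense value: $dense")

    // Both representations hold exactly the same numbers.
    assert(stored.toArray.sameElements(dense.toArray))
  }
}
```

Since the two forms are numerically identical, downstream Spark ML estimators accept either one. The conversion is mainly useful when the column has to be displayed or exported in a uniform layout.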


Source: https://stackoverflow.com/questions/48517220/how-to-make-vectorassembler-do-not-compress-data
