Spark pipeline VectorAssembler: drop other columns?

Submitted by 回眸只為那壹抹淺笑 on 2019-12-25 04:38:12

Question


A Spark VectorAssembler (http://spark.apache.org/docs/latest/ml-features.html#vectorassembler) produces the following output:

id | hour | mobile | userFeatures     | clicked | features
----|------|--------|------------------|---------|-----------------------------
 0  | 18   | 1.0    | [0.0, 10.0, 0.5] | 1.0     | [18.0, 1.0, 0.0, 10.0, 0.5]

As you can see, the last column contains all the previous features. Is it better or more performant to remove the other columns, so that only the label/id and features columns are retained? Or is that unnecessary overhead, and the full DataFrame can simply be fed into the estimator?

What happens when the VectorAssembler is used in a pipeline? Will only the final features column be used, or will it introduce collinearity (duplicate columns) if the original columns are not removed manually?


Answer 1:


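Please read the documentation carefully. Every classifier is parameterized by a features column (featuresCol) and takes its input features only from that column. It doesn't consider any other column or the order of the columns, so leaving the original columns in the DataFrame does not introduce collinearity.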



Source: https://stackoverflow.com/questions/40536335/spark-pipeline-vector-assembler-drop-other-columns
