Question
A Spark VectorAssembler (http://spark.apache.org/docs/latest/ml-features.html#vectorassembler) produces the following output:
id | hour | mobile | userFeatures | clicked | features
----|------|--------|------------------|---------|-----------------------------
0 | 18 | 1.0 | [0.0, 10.0, 0.5] | 1.0 | [18.0, 1.0, 0.0, 10.0, 0.5]
As you can see, the last column contains all of the preceding features. Is it better (more performant) to remove the other columns, so that only the label/id and features are retained? Or is that unnecessary overhead, and is it enough to just feed the label/id and features into the estimator?
What happens when the VectorAssembler is used in a pipeline? Will only the final features column be used, or will it introduce collinearity (duplicate columns) if the original columns are not removed manually?
Answer 1:
Please read the documentation carefully. Every classifier is parametrized by the features column (featuresCol). It reads its features only from that column; it doesn't consider any other column or the order of columns.
Source: https://stackoverflow.com/questions/40536335/spark-pipeline-vector-assembler-drop-other-columns