Dealing with dynamic columns with VectorAssembler

不问归期 submitted on 2019-12-04 21:05:25

If I understand your question right, the answer is quite straightforward: you just need to use `.getOutputCol` from the previous transformer.

Example (from the official documentation):

// Prepare training documents from a list of (id, text, label) tuples.
val training = spark.createDataFrame(Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0)
)).toDF("id", "text", "label")

// Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol) // <==== Using the tokenizer output column
  .setOutputCol("features")
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.001)
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))

Now let's apply this to a VectorAssembler, considering another hypothetical column alpha:

val assembler = new VectorAssembler()
  .setInputCols(Array("alpha", tokenizer.getOutputCol)) // <==== Chaining on the tokenizer output column
  .setOutputCol("features")

I created a custom VectorAssembler (a 1:1 copy of the original) and then changed it to include all columns except those that are passed in to be excluded.

Edit

To make it a bit clearer:

def setInputColsExcept(value: Array[String]): this.type = set(inputCols, value)

specifies which columns should be excluded. And then

val remainingColumns = dataset.columns.filter(!$(inputCols).contains(_))

in the transform method filters for the desired columns.
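Putting the two snippets together, the core of the approach can be sketched as a plain function, assuming the custom assembler's transform delegates to a filter like this (the standalone helper and its name are illustrative, not part of Spark's API):

```scala
// Minimal sketch of the exclusion logic: given all columns of the dataset
// and the columns the caller wants excluded, return the ones to assemble.
def remainingColumns(allColumns: Array[String], excluded: Array[String]): Array[String] =
  allColumns.filter(col => !excluded.contains(col))

// Example: exclude the id and label columns from a hypothetical schema.
val cols = Array("id", "text", "label", "alpha")
remainingColumns(cols, Array("id", "label")) // keeps "text" and "alpha"
```

The custom assembler then feeds `remainingColumns` into the usual assembly step, so new feature columns are picked up automatically without listing them by hand.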
