Handle unseen categorical string Spark CountVectorizer

只谈情不闲聊 posted on 2020-01-24 16:32:21

Question


I have seen that StringIndexer has problems with unseen labels (see here).

My questions are:

  1. Does CountVectorizer have the same limitation? How does it treat a string not in the vocabulary?

  2. Moreover, is the vocabulary size affected by the input data or is it fixed according to the vocabulary size parameter?

  3. Lastly, from an ML point of view, assuming a simple classifier such as Logistic Regression, shouldn't an unseen category be encoded as a row of zeros, so that it is treated as "unknown" and gets some sort of default prediction?


Answer 1:


Does CountVectorizer have the same limitation? How does it treat a string not in the vocabulary?

No. CountVectorizer simply ignores values that are not in the vocabulary; they contribute nothing to the output vector.

is the vocabulary size affected by the input data or is it fixed according to the vocabulary size parameter?

The size of the vector cannot exceed the vocabulary size parameter and is further limited by the number of distinct values in the input.
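
As a rough illustration of the two points above, here is a plain-Scala sketch of the selection and counting rules. It is a simplified model, not Spark's implementation: `VocabSketch`, `buildVocab` and `encode` are hypothetical names, and the alphabetical tie-break is an assumption added only to make the sketch deterministic.

```scala
// Simplified model of the behaviour described above; NOT Spark's code.
object VocabSketch {
  // Keep at most vocabSize terms, most frequent first.
  // Ties are broken alphabetically here (an assumption of this sketch).
  def buildVocab(docs: Seq[Seq[String]], vocabSize: Int): Vector[String] =
    docs.flatten
      .groupBy(identity)
      .map { case (term, occurrences) => term -> occurrences.size }
      .toVector
      .sortBy { case (term, count) => (-count, term) }
      .take(vocabSize)
      .map(_._1)

  // Count only terms that are in the vocabulary; unseen terms are skipped,
  // so a document with no known terms becomes a vector of zeros.
  def encode(doc: Seq[String], vocab: Vector[String]): Vector[Double] = {
    val counts = doc.groupBy(identity).map { case (t, occ) => t -> occ.size.toDouble }
    vocab.map(term => counts.getOrElse(term, 0.0))
  }
}

// Usage:
//   val train = Seq(Seq("foo"), Seq("bar"))
//   VocabSketch.buildVocab(train, 1000)   // two terms: limited by distinct values
//   VocabSketch.buildVocab(train, 1)      // one term: limited by vocabSize
//   VocabSketch.encode(Seq("foobar"), VocabSketch.buildVocab(train, 1000))  // all zeros
```

The vector length always equals the vocabulary length, which is the smaller of the size parameter and the number of distinct terms.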

shouldn't an unseen category be encoded as a row of zeros so to be treated as "unknown" so to get some sort of default prediction?

This is exactly what happens. The problem is slightly more complicated, though. StringIndexer is typically paired with OneHotEncoder, which by default encodes the base category as a vector of zeros to avoid the dummy variable trap. So using the same approach with indexing would be ambiguous: a vector of zeros could mean either the base category or an unseen one.

To illustrate all the points, consider the following example:

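import org.apache.spark.ml.feature.CountVectorizer
// toDF needs spark.implicits._, which spark-shell imports automatically

val train = Seq(Seq("foo"), Seq("bar")).toDF("text")
val test = Seq(Seq("foo"), Seq("foobar")).toDF("text")

// No output column is set, so a generated name (cntVec_..._output) is used
val vectorizer = new CountVectorizer().setInputCol("text")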

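/* Vocabulary is limited by the number of distinct values:
only two terms even though VocabSize is 1000 */

vectorizer.setVocabSize(1000).fit(train).vocabulary
// Array[String] = Array(foo, bar)

/* Vocabulary size is truncated to the value
provided by the VocabSize Param */

vectorizer.setVocabSize(1).fit(train).vocabulary
// Array[String] = Array(bar)
// (foo and bar are tied on count, so which term is kept, and the
// term order and indices in the other outputs, may vary between fits)

/* Unseen values are ignored, and if there are no known values
we get a vector of zeros ((2,[],[])) */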

vectorizer.setVocabSize(1000).fit(train).transform(test).show
// +--------+---------------------------+
// |    text|cntVec_0a49b1315206__output|
// +--------+---------------------------+
// |   [foo]|              (2,[1],[1.0])|
// |[foobar]|                  (2,[],[])|
// +--------+---------------------------+
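
For comparison, StringIndexer can also be told how to treat unseen labels through its `handleInvalid` Param. The following is a minimal sketch, assuming Spark 2.2 or later (where the `"keep"` option is available); the object name, method name and column names are made up for illustration.

```scala
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession

object StringIndexerUnseen {
  // Returns label -> index for the test rows (hypothetical helper).
  def indexWithKeep(spark: SparkSession): Map[String, Double] = {
    import spark.implicits._

    val train = Seq("foo", "bar").toDF("category")
    val test = Seq("foo", "foobar").toDF("category")

    // The default, "error", makes transform fail on the unseen label "foobar".
    // "skip" drops such rows; "keep" puts them in an extra bucket
    // whose index equals the number of labels seen during fit.
    val indexer = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("categoryIndex")
      .setHandleInvalid("keep")

    indexer.fit(train).transform(test)
      .collect()
      .map(row => row.getString(0) -> row.getDouble(1))
      .toMap
  }
}

// Usage (spark-shell already provides `spark`):
//   StringIndexerUnseen.indexWithKeep(spark)
```

With `"keep"`, the unseen label gets its own index rather than a vector of zeros, which sidesteps the ambiguity with the base category described in the answer.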


Source: https://stackoverflow.com/questions/39546671/handle-unseen-categorical-string-spark-countvectorizer
