Handle unseen categorical string Spark CountVectorizer

Question

I have seen StringIndexer has problems with unseen labels (see here).

My questions are:

Does CountVectorizer have the same limitation? How does it treat a string not in the vocabulary?
Moreover, is the vocabulary size affected by the input data or is it fixed according to the vocabulary size parameter?
Finally, from an ML point of view, assuming a simple classifier such as Logistic Regression, shouldn't an unseen category be encoded as a row of zeros, so that it is treated as "unknown" and yields some sort of default prediction?

Answer 1:

Does CountVectorizer have the same limitation? How does it treat a string not in the vocabulary?

No. CountVectorizer simply ignores values that are not in its vocabulary; they do not contribute to the output vector.

is the vocabulary size affected by the input data or is it fixed according to the vocabulary size parameter?

The size of the vector cannot exceed the vocabulary size parameter, and it is further limited by the number of distinct values in the input data.
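As a rough illustration of both answers so far, here is a minimal plain-Scala sketch of count vectorization. It is not Spark's implementation: the names `CountVectorizerSketch`, `fitVocabulary` and `transform` are made up for this example, and the alphabetical tie-breaking is an assumption added only to make the sketch deterministic.

```scala
object CountVectorizerSketch {
  // Build a vocabulary from the most frequent terms, capped at vocabSize.
  // The result can be shorter than vocabSize if there are fewer distinct terms.
  def fitVocabulary(docs: Seq[Seq[String]], vocabSize: Int): Vector[String] =
    docs.flatten
      .groupBy(identity)
      .toSeq
      .map { case (term, occurrences) => (term, occurrences.size) }
      // Assumption: ties are broken alphabetically to keep this sketch deterministic.
      .sortBy { case (term, count) => (-count, term) }
      .take(vocabSize)
      .map(_._1)
      .toVector

  // Count the known terms of one document; terms that are not in the
  // vocabulary are skipped, so a document with only unseen terms gives all zeros.
  def transform(vocabulary: Vector[String], doc: Seq[String]): Vector[Double] = {
    val index = vocabulary.zipWithIndex.toMap
    val counts = Array.fill(vocabulary.length)(0.0)
    doc.foreach(term => index.get(term).foreach(i => counts(i) += 1.0))
    counts.toVector
  }
}
```

With the two training documents `Seq("foo")` and `Seq("bar")`, `fitVocabulary(docs, 1000)` yields two terms and `fitVocabulary(docs, 1)` yields one, while `transform` maps `Seq("foobar")` to a vector of zeros whose length equals the vocabulary length.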

shouldn't an unseen category be encoded as a row of zeros, so that it is treated as "unknown" and yields some sort of default prediction?

This is exactly what happens. The problem is slightly more complicated, though. StringIndexer is typically paired with OneHotEncoder, which by default encodes the base category as a vector of zeros to avoid the dummy variable trap. Encoding unseen labels as zeros as well would therefore be ambiguous: an unknown value could not be told apart from the base category.
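The ambiguity can be sketched in plain Scala. This is only an illustrative model, not Spark code: `OneHotAmbiguitySketch`, `oneHotDropLast` and `encode` are hypothetical names, and the sketch assumes the category without its own slot is the last one, as with OneHotEncoder's `dropLast` setting.

```scala
object OneHotAmbiguitySketch {
  // One-hot encode an index into numCategories - 1 slots: the last category
  // has no slot of its own and therefore becomes the all-zeros vector.
  def oneHotDropLast(index: Int, numCategories: Int): Vector[Double] =
    Vector.tabulate(numCategories - 1)(i => if (i == index) 1.0 else 0.0)

  // Hypothetical policy: known labels get their one-hot vector,
  // unseen labels are encoded as all zeros.
  def encode(labels: Vector[String], label: String): Vector[Double] = {
    val idx = labels.indexOf(label)
    if (idx >= 0) oneHotDropLast(idx, labels.length)
    else Vector.fill(labels.length - 1)(0.0)
  }
}
```

With `labels = Vector("a", "b", "c")`, both `encode(labels, "c")` and `encode(labels, "zzz")` return `Vector(0.0, 0.0)`, so a downstream model cannot distinguish the unseen label from the dropped category.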

To illustrate all of these points, consider the following example:

import org.apache.spark.ml.feature.CountVectorizer
// toDF below needs spark.implicits._, which spark-shell imports automatically.

val train = Seq(Seq("foo"), Seq("bar")).toDF("text")
val test = Seq(Seq("foo"), Seq("foobar")).toDF("text")

val vectorizer = new CountVectorizer().setInputCol("text")

vectorizer.setVocabSize(1000).fit(train).vocabulary
// Array[String] = Array(foo, bar)

/* The vocabulary is truncated to the size
given by the vocabSize Param */

vectorizer.setVocabSize(1).fit(train).vocabulary
// Array[String] = Array(bar)

/* Unseen values are ignored, and if a row contains no known values
we get a vector of zeros ((2,[],[])) */

vectorizer.setVocabSize(1000).fit(train).transform(test).show
// +--------+---------------------------+
// |    text|cntVec_0a49b1315206__output|
// +--------+---------------------------+
// |   [foo]|              (2,[1],[1.0])|
// |[foobar]|                  (2,[],[])|
// +--------+---------------------------+

Source: https://stackoverflow.com/questions/39546671/handle-unseen-categorical-string-spark-countvectorizer

Tags

apache-spark

pyspark

categorical-data