Question
I have seen that StringIndexer has problems with unseen labels (see here).
My questions are:
Does CountVectorizer have the same limitation? How does it treat a string not in the vocabulary?
Moreover, is the vocabulary size affected by the input data, or is it fixed according to the vocabulary size parameter?
Last, from an ML point of view, assuming a simple classifier such as Logistic Regression, shouldn't an unseen category be encoded as a row of zeros so as to be treated as "unknown" and get some sort of default prediction?
Answer 1:
Does CountVectorizer have the same limitation? How does it treat a string not in the vocabulary?
No. CountVectorizer ignores values that are not in its vocabulary.
is the vocabulary size affected by the input data or is it fixed according to the vocabulary size parameter?
The size of the vector cannot exceed the vocabulary size parameter and is further limited by the number of distinct values in the input data.
shouldn't an unseen category be encoded as a row of zeros so as to be treated as "unknown" and get some sort of default prediction?
This is exactly what happens. The problem is slightly more complicated, though. StringIndexer is typically paired with OneHotEncoder, which by default encodes the base category as a vector of zeros to avoid the dummy variable trap. Using the same all-zeros encoding for unseen labels during indexing would therefore be ambiguous.
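A minimal sketch of that ambiguity in plain Scala, with no Spark involved. OneHotSketch and encode are hypothetical names made up for illustration, and the sketch assumes the last index plays the role of the base category; it is not how Spark implements OneHotEncoder.

```scala
// Simplified model of one-hot encoding with the base category dropped.
// Illustration only: not Spark's OneHotEncoder implementation.
object OneHotSketch {
  // Encode an optional category index into a vector of length numCategories - 1.
  // Assumption: the last index is the base category and maps to all zeros.
  // An unknown label (None) also maps to all zeros.
  def encode(index: Option[Int], numCategories: Int): Vector[Double] = {
    val size = numCategories - 1
    index match {
      case Some(i) if i >= 0 && i < size =>
        Vector.tabulate(size)(j => if (j == i) 1.0 else 0.0)
      case _ =>
        Vector.fill(size)(0.0)
    }
  }

  def main(args: Array[String]): Unit = {
    val labels = Map("a" -> 0, "b" -> 1, "c" -> 2)     // hypothetical fitted index
    val base   = encode(labels.get("c"), labels.size)  // base category
    val unseen = encode(labels.get("z"), labels.size)  // label never seen during fit
    println(base)
    println(unseen)
    println(base == unseen) // the two encodings cannot be told apart
  }
}
```

In this model the base category and an unseen label produce the same vector, so a downstream model has no way to distinguish them.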
To illustrate all these points, consider the following example:
import org.apache.spark.ml.feature.CountVectorizer
// Needed for toDF outside spark-shell; assumes a SparkSession named spark
import spark.implicits._

val train = Seq(Seq("foo"), Seq("bar")).toDF("text")
val test = Seq(Seq("foo"), Seq("foobar")).toDF("text")

val vectorizer = new CountVectorizer().setInputCol("text")

/* Only two distinct values in the input, so the vocabulary
has two entries even though VocabSize is 1000 */
vectorizer.setVocabSize(1000).fit(train).vocabulary
// Array[String] = Array(foo, bar)

/* Vocabulary size is truncated to the value
provided by VocabSize Param. Both terms occur once here,
so which one is kept, and the index order, may vary between fits */
vectorizer.setVocabSize(1).fit(train).vocabulary
// Array[String] = Array(bar)

/* Unseen values are ignored and if there are no known values
we get vector of zeros ((2,[],[])) */
vectorizer.setVocabSize(1000).fit(train).transform(test).show
// +--------+---------------------------+
// |    text|cntVec_0a49b1315206__output|
// +--------+---------------------------+
// |   [foo]|              (2,[1],[1.0])|
// |[foobar]|                  (2,[],[])|
// +--------+---------------------------+
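For anyone who wants to reason about this without a Spark session, here is a rough, Spark-free model of the same behaviour. MiniCountVectorizer, fit and transform are hypothetical names, the alphabetical tie-break is an assumption added only to make the sketch deterministic, and options such as minimum document frequency are left out. It is a sketch of the idea, not Spark's implementation.

```scala
// Simplified, Spark-free model of the behaviour discussed above.
// MiniCountVectorizer is a hypothetical name, not part of Spark.
object MiniCountVectorizer {
  // Keep at most vocabSize terms, most frequent first.
  // Ties are broken alphabetically here only to keep the sketch deterministic.
  def fit(docs: Seq[Seq[String]], vocabSize: Int): Vector[String] =
    docs.flatten
      .groupBy(identity)
      .map { case (term, occurrences) => (term, occurrences.size) }
      .toVector
      .sortBy { case (term, count) => (-count, term) }
      .take(vocabSize)
      .map(_._1)

  // Count known terms; terms outside the vocabulary are dropped silently.
  def transform(vocab: Vector[String], doc: Seq[String]): Vector[Double] = {
    val index = vocab.zipWithIndex.toMap
    val counts = Array.fill(vocab.size)(0.0)
    doc.foreach(term => index.get(term).foreach(i => counts(i) += 1.0))
    counts.toVector
  }

  def main(args: Array[String]): Unit = {
    val train = Seq(Seq("foo"), Seq("bar"))
    val vocab = fit(train, vocabSize = 1000)
    println(vocab)                           // two terms, although vocabSize is 1000
    println(fit(train, vocabSize = 1))       // truncated to a single term
    println(transform(vocab, Seq("foo")))    // one non-zero entry
    println(transform(vocab, Seq("foobar"))) // all zeros: the term is unseen
  }
}
```

The vector length is always the size of the fitted vocabulary, which is the smaller of the requested size and the number of distinct terms, and a document made up only of unseen terms comes out as all zeros.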
Source: https://stackoverflow.com/questions/39546671/handle-unseen-categorical-string-spark-countvectorizer