I'm a little confused about how to use n-grams in the scikit-learn library in Python; specifically, how the ngram_range
argument works in a CountVectorizer.
Setting the vocabulary explicitly means no vocabulary is learned from the data. If you don't set it, you get:
>>> from pprint import pprint
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> v = CountVectorizer(ngram_range=(1, 2))
>>> pprint(v.fit(["an apple a day keeps the doctor away"]).vocabulary_)
{u'an': 0,
u'an apple': 1,
u'apple': 2,
u'apple day': 3,
u'away': 4,
u'day': 5,
u'day keeps': 6,
u'doctor': 7,
u'doctor away': 8,
u'keeps': 9,
u'keeps the': 10,
u'the': 11,
u'the doctor': 12}
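To see where those 13 entries come from, here is a rough pure-Python sketch of what ngram_range=(1, 2) does. It is not scikit-learn's actual implementation: `word_ngrams` and `build_vocabulary` are hypothetical helper names, and the regular expression is assumed to match CountVectorizer's default token_pattern (words of two or more characters).

```python
import re

# Assumption: same pattern as CountVectorizer's default token_pattern,
# which keeps only tokens of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def word_ngrams(text, ngram_range=(1, 1)):
    """Return every word n-gram of `text` with min_n <= n <= max_n."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    min_n, max_n = ngram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(min_n, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def build_vocabulary(text, ngram_range=(1, 1)):
    """Map each distinct n-gram to its position in sorted order."""
    terms = sorted(set(word_ngrams(text, ngram_range)))
    return {term: index for index, term in enumerate(terms)}


doc = "an apple a day keeps the doctor away"
print(word_ngrams(doc, (2, 2)))       # bigrams only
print(build_vocabulary(doc, (1, 2)))  # unigrams and bigrams together
```

So ngram_range=(min_n, max_n) means "extract all n-grams with min_n <= n <= max_n": (1, 1) gives unigrams only, (2, 2) bigrams only, and (1, 2) both.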
An explicit vocabulary restricts the terms that will be extracted from text; the vocabulary is not changed:
>>> v = CountVectorizer(ngram_range=(1, 2), vocabulary={"keeps", "keeps the"})
>>> v.fit_transform(["an apple a day keeps the doctor away"]).toarray()
array([[1, 1]]) # unigram and bigram found
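The interaction between a fixed vocabulary and ngram_range can be sketched in plain Python as well. This is an illustration under assumptions, not the library's code: `count_fixed_vocabulary` is a hypothetical helper, the token pattern is assumed to be the CountVectorizer default, and columns are assumed to follow the sorted order of the vocabulary set.

```python
import re
from collections import Counter


def count_fixed_vocabulary(text, vocabulary, ngram_range=(1, 1)):
    """Count only the n-grams listed in `vocabulary`, in sorted term order."""
    # Assumption: default-style tokenization (lowercase, words of 2+ characters).
    tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())
    min_n, max_n = ngram_range
    counts = Counter(
        " ".join(tokens[i:i + n])
        for n in range(min_n, max_n + 1)
        for i in range(len(tokens) - n + 1)
    )
    # N-grams outside the vocabulary are ignored; absent terms count as 0.
    return [counts[term] for term in sorted(vocabulary)]


doc = "an apple a day keeps the doctor away"
vocab = {"keeps", "keeps the"}
print(count_fixed_vocabulary(doc, vocab, (1, 2)))  # [1, 1]
print(count_fixed_vocabulary(doc, vocab, (1, 1)))  # [1, 0]
```

The second call shows why ngram_range still matters with an explicit vocabulary: with (1, 1) no bigrams are ever generated, so "keeps the" can never be counted even though it is in the vocabulary.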
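(Note that tokens are filtered before n-gram extraction: the default token_pattern keeps only words of two or more
characters, so "a" is dropped, hence "apple day". Stop word filtering, when enabled, is likewise applied before the n-grams are built.)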
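The ordering matters because any token removed before n-gram extraction leaves its neighbours adjacent. A small sketch of that effect, where `bigrams` is a hypothetical helper, the stop list {"the"} is made up for illustration, and the token pattern is again assumed to be the default one that keeps only words of two or more characters:

```python
import re


def bigrams(text, stop_words=frozenset()):
    """Build word bigrams after dropping short tokens and stop words."""
    # Assumption: default-style tokenization, so the single letter "a" never
    # becomes a token in the first place.
    tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())
    # Filter first, then pair up whatever neighbours remain.
    tokens = [t for t in tokens if t not in stop_words]
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


doc = "an apple a day keeps the doctor away"
print(bigrams(doc))                      # contains "apple day", not "apple a"
print(bigrams(doc, stop_words={"the"}))  # contains "keeps doctor"
```

With the stop list applied, "keeps the" and "the doctor" disappear and "keeps doctor" takes their place, for the same reason "apple day" appears in the learned vocabulary.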