Understanding the `ngram_range` argument in a CountVectorizer in sklearn

野的像风 2021-01-30 21:36

I'm a little confused about how to use n-grams in the scikit-learn library in Python, specifically how the `ngram_range` argument works in a `CountVectorizer`.

1 Answer
  • 2021-01-30 21:53

    Setting the vocabulary explicitly means no vocabulary is learned from data. If you don't set it, you get:

    >>> from pprint import pprint
    >>> from sklearn.feature_extraction.text import CountVectorizer
    >>> # ngram_range=(1, 2) extracts both unigrams and bigrams
    >>> v = CountVectorizer(ngram_range=(1, 2))
    >>> pprint(v.fit(["an apple a day keeps the doctor away"]).vocabulary_)
    {'an': 0,
     'an apple': 1,
     'apple': 2,
     'apple day': 3,
     'away': 4,
     'day': 5,
     'day keeps': 6,
     'doctor': 7,
     'doctor away': 8,
     'keeps': 9,
     'keeps the': 10,
     'the': 11,
     'the doctor': 12}
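
To make the meaning of the `(min_n, max_n)` pair concrete, here is a minimal plain-Python sketch of word n-gram extraction. It is not scikit-learn's implementation: `word_ngrams` and `build_vocabulary` are hypothetical helpers, and the regular expression is an assumption chosen to imitate the default tokenization (lowercase, tokens of two or more word characters).

```python
import re


def word_ngrams(text, ngram_range=(1, 1)):
    """Return word n-grams for every n in the inclusive range (min_n, max_n)."""
    min_n, max_n = ngram_range
    # Assumption: imitate the default tokenization by lowercasing and keeping
    # only tokens of two or more word characters (so "a" is dropped).
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n consecutive tokens over the token list.
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams


def build_vocabulary(text, ngram_range=(1, 1)):
    """Map each distinct n-gram to an index, in alphabetical order."""
    terms = sorted(set(word_ngrams(text, ngram_range)))
    return {term: index for index, term in enumerate(terms)}


sentence = "an apple a day keeps the doctor away"
unigrams = build_vocabulary(sentence, (1, 1))  # single words only
bigrams = build_vocabulary(sentence, (2, 2))   # word pairs only
both = build_vocabulary(sentence, (1, 2))      # words and word pairs
print(len(unigrams), len(bigrams), len(both))
```

Under these assumptions, `(1, 1)` yields 7 terms, `(2, 2)` yields 6, and `(1, 2)` yields the 13 terms listed above.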
    

    An explicit vocabulary restricts the terms that will be extracted from text; the vocabulary is not changed:

    >>> v = CountVectorizer(ngram_range=(1, 2), vocabulary={"keeps", "keeps the"})
    >>> v.fit_transform(["an apple a day keeps the doctor away"]).toarray()
    array([[1, 1]])  # unigram and bigram found
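
The interaction between a fixed vocabulary and `ngram_range` can be sketched without scikit-learn. The helper `count_with_fixed_vocabulary` below is hypothetical, the tokenizing regex is an assumption imitating the default behaviour, and `"pear"` is added only to show a term that never occurs. The point it illustrates: a bigram in the vocabulary can only be counted if the range actually generates bigrams.

```python
import re
from collections import Counter


def count_with_fixed_vocabulary(text, vocabulary, ngram_range=(1, 2)):
    """Count only the n-grams that appear in a predefined vocabulary."""
    min_n, max_n = ngram_range
    # Assumption: lowercase and keep tokens of two or more word characters.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    counts = Counter(
        " ".join(tokens[i:i + n])
        for n in range(min_n, max_n + 1)
        for i in range(len(tokens) - n + 1)
    )
    # A set has no order, so sort it to get stable column positions.
    columns = sorted(vocabulary)
    # Terms outside the vocabulary are ignored; missing terms count as 0.
    return columns, [counts[term] for term in columns]


sentence = "an apple a day keeps the doctor away"
vocabulary = {"keeps", "keeps the", "pear"}

columns, row = count_with_fixed_vocabulary(sentence, vocabulary, (1, 2))
_, unigram_row = count_with_fixed_vocabulary(sentence, vocabulary, (1, 1))
print(columns, row, unigram_row)
```

With `(1, 1)` the column for "keeps the" stays at zero, because no bigrams are produced to match it.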
    

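    (Note that "a" is dropped before n-gram extraction, hence "apple day": the default token_pattern only keeps tokens of two or more characters. Stopword filtering, when enabled, is likewise applied before n-grams are built.)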
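
The effect of that ordering can be illustrated in plain Python. This is only a sketch: `dropped` is a hypothetical stand-in for whatever removes "a" from the token stream, and the `bigrams` helper is not part of scikit-learn.

```python
def bigrams(tokens):
    """Pair each token with its right-hand neighbour."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


tokens = "an apple a day keeps the doctor away".split()
dropped = {"a"}  # hypothetical set of tokens removed from the text

# Order 1: drop tokens first, then build bigrams from what is left.
filter_first = bigrams([t for t in tokens if t not in dropped])

# Order 2: build bigrams first, then discard any that contain a dropped token.
ngrams_first = [g for g in bigrams(tokens) if not (set(g.split()) & dropped)]

print("apple day" in filter_first)  # True: neighbours after removal
print("apple day" in ngrams_first)  # False: never adjacent in the raw text
```

Filtering first lets a bigram span the removed word, which is why "apple day" shows up in the learned vocabulary.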
