CountVectorizer: “I” not showing up in vectorized text

长情又很酷 2021-02-04 10:02

I'm new to scikit-learn and currently studying Naïve Bayes (multinomial). Right now I'm vectorizing text with sklearn.feature_extraction.text, and for some reason the word "I" never shows up in the vectorized output.

2 Answers
  •  北海茫月
    2021-02-04 10:50

    This is caused by the default token_pattern for CountVectorizer, which removes tokens of a single character:

    >>> vectorizer_train
    CountVectorizer(analyzer=u'word', binary=False, charset=None,
            charset_error=None, decode_error=u'strict',
            dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
            lowercase=True, max_df=1.0, max_features=None, min_df=0,
            ngram_range=(1, 1), preprocessor=None, stop_words=None,
            strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
            tokenizer=None, vocabulary=None)
    >>> pattern = re.compile(vectorizer_train.token_pattern, re.UNICODE)
    >>> print(pattern.match("I"))
    None
    

    To retain "I", use a different pattern, e.g.

    >>> vectorizer_train = CountVectorizer(min_df=0, token_pattern=r"\b\w+\b")
    >>> vectorizer_train.fit(x_train)
    CountVectorizer(analyzer=u'word', binary=False, charset=None,
            charset_error=None, decode_error=u'strict',
            dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
            lowercase=True, max_df=1.0, max_features=None, min_df=0,
            ngram_range=(1, 1), preprocessor=None, stop_words=None,
            strip_accents=None, token_pattern='\\b\\w+\\b', tokenizer=None,
            vocabulary=None)
    >>> vectorizer_train.get_feature_names()
    [u'a', u'am', u'hacker', u'i', u'like', u'nigerian', u'puppies']
    

    Note that the non-informative word "a" is now also retained.
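    If such words are unwanted, they can be filtered after tokenization; CountVectorizer exposes the same idea through its `stop_words` parameter. A minimal stdlib sketch (the stop list and sentence are assumed examples):

```python
import re

# Keep single-character tokens, then drop chosen stop words afterwards.
# (With CountVectorizer, pass stop_words=[...] for the equivalent effect.)
STOP_WORDS = {"a"}  # assumed example stop list

tokens = re.findall(r"(?u)\b\w+\b", "I am a hacker".lower())
kept = [t for t in tokens if t not in STOP_WORDS]
print(kept)  # ['i', 'am', 'hacker'] -- "i" survives, "a" is filtered out
```

    This keeps meaningful single-character words like "I" while still removing uninformative ones.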
