Exactly replicating R text preprocessing in Python

悲&欢浪女 2021-02-06 12:59

I would like to preprocess a corpus of documents using Python in the same way that I can in R. For example, given an initial corpus, corpus, I would like to end up with a preprocessed corpus whose documents are lowercased, stripped of punctuation and stopwords, and stemmed, matching what the equivalent R pipeline produces.

2 Answers
  •  悲哀的现实
    2021-02-06 13:22

    CountVectorizer and TfidfVectorizer can be customized as described in the docs. In particular, you'll want to write a custom tokenizer, which is a function that takes a document and returns a list of terms. Using NLTK:

    import re

    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    # Build the stemmer and stopword set once, not on every term.
    stemmer = PorterStemmer()
    stop_words = set(stopwords.words('english'))

    def smart_tokenizer(doc):
        # Lowercase, extract word tokens, drop stopwords, then stem.
        terms = re.findall(r'\w+', doc.lower(), re.UNICODE)
        return [stemmer.stem(term)
                for term in terms
                if term not in stop_words]
    

    Demo, where doc is an example document string:

    >>> from sklearn.feature_extraction.text import CountVectorizer
    >>> v = CountVectorizer(tokenizer=smart_tokenizer)
    >>> v.fit_transform([doc]).toarray()
    array([[1, 1, 1, 2, 1, 1, 1, 1, 1]])
    >>> from pprint import pprint
    >>> pprint(v.vocabulary_)
    {u'amaz': 0,
     u'appl': 1,
     u'best': 2,
     u'ear': 3,
     u'ever': 4,
     u'headphon': 5,
     u'pod': 6,
     u'sound': 7,
     u've': 8}
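
    If you want tf-idf weights instead of raw counts, the same tokenizer plugs into TfidfVectorizer. A minimal sketch, assuming doc is the same example document as above:

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Same custom tokenizer; only the term weighting changes.
    tfidf = TfidfVectorizer(tokenizer=smart_tokenizer)
    X = tfidf.fit_transform([doc])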
    

    (The example I linked to actually uses a class to cache the lemmatizer, but a function works too.)
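
    For reference, here is a minimal sketch of that class-based approach, caching a Porter stemmer rather than a lemmatizer (the name SmartTokenizer is made up for illustration):

    import re

    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    class SmartTokenizer:
        """Callable tokenizer that builds the stemmer and stopword set once."""

        def __init__(self):
            self.stemmer = PorterStemmer()
            self.stop_words = set(stopwords.words('english'))

        def __call__(self, doc):
            terms = re.findall(r'\w+', doc.lower(), re.UNICODE)
            return [self.stemmer.stem(t) for t in terms
                    if t not in self.stop_words]

    # An instance is callable, so it can be passed wherever a tokenizer
    # function is expected:
    # v = CountVectorizer(tokenizer=SmartTokenizer())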
