Exactly replicating R text preprocessing in python

悲&欢浪女 2021-02-06 12:59

I would like to preprocess a corpus of documents using Python in the same way that I can in R. For example, given an initial corpus, corpus, I would like to end up

2 Answers
  • 2021-02-06 13:22

    CountVectorizer and TfidfVectorizer can be customized as described in the docs. In particular, you'll want to write a custom tokenizer, which is a function that takes a document and returns a list of terms. Using NLTK:

    import re

    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    # Build the stemmer and stopword set once instead of on every call.
    stemmer = PorterStemmer()
    stop_words = set(stopwords.words('english'))

    def smart_tokenizer(doc):
        # Lowercase, split into word tokens, drop stopwords, then stem.
        doc = doc.lower()
        terms = re.findall(r'\w+', doc, re.UNICODE)
        return [stemmer.stem(term)
                for term in terms
                if term not in stop_words]
    

    Demo:

    >>> from sklearn.feature_extraction.text import CountVectorizer
    >>> v = CountVectorizer(tokenizer=smart_tokenizer)
    >>> v.fit_transform([doc]).toarray()  # doc: the example document from the question
    array([[1, 1, 1, 2, 1, 1, 1, 1, 1]])
    >>> from pprint import pprint
    >>> pprint(v.vocabulary_)
    {u'amaz': 0,
     u'appl': 1,
     u'best': 2,
     u'ear': 3,
     u'ever': 4,
     u'headphon': 5,
     u'pod': 6,
     u'sound': 7,
     u've': 8}
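
    The same tokenizer drops straight into TfidfVectorizer if you want tf-idf weights instead of raw counts; a minimal sketch (doc is again whatever example document you are vectorizing):

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Same custom tokenizer, tf-idf weighting instead of raw counts.
    tv = TfidfVectorizer(tokenizer=smart_tokenizer)
    X = tv.fit_transform([doc])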
    

    (The example I linked to actually uses a class to cache the lemmatizer, but a function works too.)
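
    If you prefer the class-based approach, so the lemmatizer and stopword set are built once rather than on every call, here is a minimal sketch along the lines of the scikit-learn docs example, using NLTK's WordNetLemmatizer:

    import re
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from sklearn.feature_extraction.text import CountVectorizer

    class LemmaTokenizer(object):
        """Callable tokenizer that caches the lemmatizer and the stopword set."""
        def __init__(self):
            # Requires the NLTK wordnet and stopwords data to be downloaded.
            self.lemmatizer = WordNetLemmatizer()
            self.stop = set(stopwords.words('english'))

        def __call__(self, doc):
            terms = re.findall(r'\w+', doc.lower(), re.UNICODE)
            return [self.lemmatizer.lemmatize(t) for t in terms if t not in self.stop]

    v = CountVectorizer(tokenizer=LemmaTokenizer())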

  • 2021-02-06 13:25

    It seems tricky to get the preprocessing steps exactly the same between nltk and tm, so I think the best approach is to use rpy2 to run the preprocessing in R and pull the results into Python:

    import rpy2.robjects as ro
    preproc = [x[0] for x in ro.r('''
    tweets = read.csv("tweets.csv", stringsAsFactors=FALSE)
    library(tm)
    library(SnowballC)
    corpus = Corpus(VectorSource(tweets$Tweet))
    corpus = tm_map(corpus, tolower)
    corpus = tm_map(corpus, removePunctuation)
    corpus = tm_map(corpus, removeWords, c("apple", stopwords("english")))
    corpus = tm_map(corpus, stemDocument)''')]
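
    Note that newer releases of tm (0.6 and later) refuse to apply base R functions such as tolower directly through tm_map, so the snippet above may error on a current install. In that case, wrapping the call in content_transformer is the usual fix; a sketch of the adjusted rpy2 call (untested here, and the trailing sapply is just one way to pull the processed text back out):

    import rpy2.robjects as ro

    # Same pipeline, with tolower wrapped in content_transformer as newer tm requires;
    # sapply(corpus, as.character) returns the processed documents as a character vector.
    preproc = list(ro.r('''
    library(tm)
    library(SnowballC)
    tweets = read.csv("tweets.csv", stringsAsFactors=FALSE)
    corpus = Corpus(VectorSource(tweets$Tweet))
    corpus = tm_map(corpus, content_transformer(tolower))
    corpus = tm_map(corpus, removePunctuation)
    corpus = tm_map(corpus, removeWords, c("apple", stopwords("english")))
    corpus = tm_map(corpus, stemDocument)
    sapply(corpus, as.character)'''))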
    

    Then, you can load it into scikit-learn -- the only thing you'll need to do to get things to match between the CountVectorizer and the DocumentTermMatrix is to remove terms of length less than 3, since tm's DocumentTermMatrix drops words shorter than three characters by default:

    from sklearn.feature_extraction.text import CountVectorizer
    def mytokenizer(x):
        return [y for y in x.split() if len(y) > 2]
    
    # Full document-term matrix
    cv = CountVectorizer(tokenizer=mytokenizer)
    X = cv.fit_transform(preproc)
    X
    # <1181x3289 sparse matrix of type '<type 'numpy.int64'>'
    #   with 8980 stored elements in Compressed Sparse Column format>
    
    # Sparse terms removed
    cv2 = CountVectorizer(tokenizer=mytokenizer, min_df=0.005)
    X2 = cv2.fit_transform(preproc)
    X2
    # <1181x309 sparse matrix of type '<type 'numpy.int64'>'
    #   with 4669 stored elements in Compressed Sparse Column format>
    

    Let's verify this matches with R:

    tweets = read.csv("tweets.csv", stringsAsFactors=FALSE)
    library(tm)
    library(SnowballC)
    corpus = Corpus(VectorSource(tweets$Tweet))
    corpus = tm_map(corpus, tolower)
    corpus = tm_map(corpus, removePunctuation)
    corpus = tm_map(corpus, removeWords, c("apple", stopwords("english")))
    corpus = tm_map(corpus, stemDocument)
    dtm = DocumentTermMatrix(corpus)
    dtm
    # A document-term matrix (1181 documents, 3289 terms)
    # 
    # Non-/sparse entries: 8980/3875329
    # Sparsity           : 100%
    # Maximal term length: 115 
    # Weighting          : term frequency (tf)
    
    sparse = removeSparseTerms(dtm, 0.995)
    sparse
    # A document-term matrix (1181 documents, 309 terms)
    # 
    # Non-/sparse entries: 4669/360260
    # Sparsity           : 99%
    # Maximal term length: 20 
    # Weighting          : term frequency (tf)
    

    As you can see, the number of stored elements and terms now matches exactly between the two approaches.
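
    If you want to go beyond the dimensions, you can also compare the vocabularies themselves. A hedged sketch using rpy2, assuming you build dtm and sparse in the same R session as the preprocessing above (Terms() is tm's accessor for the columns of a DocumentTermMatrix):

    import rpy2.robjects as ro

    # Build the R-side matrices in the same rpy2 session used for preprocessing.
    ro.r('dtm = DocumentTermMatrix(corpus)')
    ro.r('sparse = removeSparseTerms(dtm, 0.995)')

    r_terms = set(ro.r('Terms(dtm)'))
    r_terms_sparse = set(ro.r('Terms(sparse)'))

    # CountVectorizer exposes its terms as the keys of vocabulary_.
    py_terms = set(cv.vocabulary_)
    py_terms_sparse = set(cv2.vocabulary_)

    print(r_terms == py_terms)                # should be True if the full vocabularies agree
    print(r_terms_sparse == py_terms_sparse)  # should be True after the sparsity filtering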
