Efficient Term Document Matrix with NLTK

温柔的废话 2020-12-24 04:15

I am trying to create a term document matrix with NLTK and pandas. I wrote the following function:

def fnDTM_Corpus(xCorpus):
    import pandas as pd
    ...
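
For reference, here is a minimal sketch of one way a term-document matrix can be built with nltk.FreqDist and pandas. It assumes xCorpus-style input, i.e. a list of documents that have already been tokenized into lists of strings; the helper name and the details below are illustrative and not necessarily what the original function did.

import nltk
import pandas as pd

def term_document_matrix(corpus):
    """Term-document matrix from a list of tokenized documents.

    corpus: list of documents, each a list of token strings.
    Returns a DataFrame with terms as rows and documents as columns.
    """
    # One frequency distribution (term -> count) per document
    freq_dists = [nltk.FreqDist(doc) for doc in corpus]
    # Rows: documents, columns: terms; terms missing from a document become 0
    dtm = pd.DataFrame(freq_dists).fillna(0).astype(int)
    # Transpose so terms are rows and documents are columns
    return dtm.T

# Tiny illustrative example
docs = [["the", "cat", "sat"], ["the", "dog", "sat", "sat"]]
print(term_document_matrix(docs))

Because FreqDist behaves like a dict of term counts, a list of them drops straight into pd.DataFrame, which fills in NaN wherever a term is absent from a document.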


        
3 Answers

时光说笑 · 2020-12-24 05:02

    An alternative approach using tokens and a DataFrame:

    import nltk
    # nltk.download()  # uncomment once to fetch the tokenizer data needed by word_tokenize
    import pandas as pd
    from urllib import request

    url = "http://www.gutenberg.org/files/2554/2554-0.txt"
    response = request.urlopen(url)
    raw = response.read().decode('utf8')
    type(raw)  # str
    
    tokens = nltk.word_tokenize(raw)
    type(tokens)  # list
    
    tokens[1:10]
    ['Project',
     'Gutenberg',
     'EBook',
     'of',
     'Crime',
     'and',
     'Punishment',
     ',',
     'by']
    
    tokens2 = pd.DataFrame(tokens)
    tokens2.columns = ['Words']
    tokens2.head()
    
    
    Words
    0   The
    1   Project
    2   Gutenberg
    3   EBook
    4   of
    
    tokens2.Words.value_counts().head()
    ,                 16178
    .                  9589
    the                7436
    and                6284
    to                 5278
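
    The value_counts call above gives term frequencies for a single document. As a rough sketch, the same DataFrame approach could be extended to several documents by building one counts Series per document and letting pandas align them on the term index; the two short documents below are made up for illustration and are not part of the original answer.

    import nltk
    import pandas as pd

    # Illustrative documents (not from the original answer)
    docs = {
        "doc1": "the cat sat on the mat",
        "doc2": "the dog sat",
    }

    # One token-count Series per document; requires the NLTK tokenizer data
    counts = {
        name: pd.Series(nltk.word_tokenize(text)).value_counts()
        for name, text in docs.items()
    }
    # pandas aligns the term index across documents; absent terms become 0
    tdm = pd.DataFrame(counts).fillna(0).astype(int)
    print(tdm)  # terms as rows, one column per document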
    
