Efficient Term-Document Matrix with NLTK

温柔的废话 2020-12-24 04:15

I am trying to create a term document matrix with NLTK and pandas. I wrote the following function:

def fnDTM_Corpus(xCorpus):
    import pandas as pd
    ...

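In outline, the idea is to build one nltk.FreqDist per document and assemble them into a DataFrame. A minimal runnable sketch along these lines (assuming each document in xCorpus is already a list of tokens; the fillna and transpose steps are my guess at the intent, not the original code):

import nltk
import pandas as pd

def fnDTM_Corpus(xCorpus):
    # one frequency distribution (term -> count) per document
    fd_list = [nltk.FreqDist(doc) for doc in xCorpus]
    # rows are documents, columns are the union of all terms; gaps become NaN
    dtm = pd.DataFrame(fd_list)
    # zero-fill the gaps and transpose so terms are rows, documents are columns
    return dtm.fillna(0).astype(int).T

corpus = [['john', 'and', 'bob', 'are', 'brothers'],
          ['john', 'went', 'to', 'the', 'store']]
print(fnDTM_Corpus(corpus))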
3 Answers
  • 2020-12-24 04:51

    I know the OP wanted to create a tdm in NLTK, but the textmining package (pip install textmining) makes it dead simple:

    import textmining
        
    # Create some very short sample documents
    doc1 = 'John and Bob are brothers.'
    doc2 = 'John went to the store. The store was closed.'
    doc3 = 'Bob went to the store too.'
    
    # Initialize class to create term-document matrix
    tdm = textmining.TermDocumentMatrix()
    
    # Add the documents
    tdm.add_doc(doc1)
    tdm.add_doc(doc2)
    tdm.add_doc(doc3)
    
    # Write matrix file -- cutoff=1 means words in 1+ documents are retained
    tdm.write_csv('matrix.csv', cutoff=1)
    
    # Instead of writing the matrix, access its rows directly
    for row in tdm.rows(cutoff=1):
        print(row)
    

    Output:

    ['and', 'the', 'brothers', 'to', 'are', 'closed', 'bob', 'john', 'was', 'went', 'store', 'too']
    [1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0]
    [0, 2, 0, 1, 0, 1, 0, 1, 1, 1, 2, 0]
    [0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1]
    
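    Since the goal was a pandas DataFrame, the CSV written above can be read straight back in (a small follow-up sketch; 'matrix.csv' is the file produced by write_csv above):

    import pandas as pd

    # load the term-document matrix written by textmining into a DataFrame
    df = pd.read_csv('matrix.csv')
    print(df.head())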

    Alternatively, one can use pandas and sklearn [source]:

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer
    
    docs = ['why hello there', 'omg hello pony', 'she went there? omg']
    vec = CountVectorizer()
    X = vec.fit_transform(docs)
    df = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())  # get_feature_names() in older scikit-learn
    print(df)
    

    Output:

       hello  omg  pony  she  there  went  why
    0      1    0     0    0      1     0    1
    1      1    1     1    0      0     0    0
    2      0    1     0    1      1     1    0
    
  • 2020-12-24 05:02

    An alternative approach using tokens and a DataFrame:

    import nltk
    import pandas as pd
    # nltk.download('punkt')  # uncomment once to fetch the tokenizer models
    from urllib import request
    url = "http://www.gutenberg.org/files/2554/2554-0.txt"
    response = request.urlopen(url)
    raw = response.read().decode('utf8')
    type(raw)
    
    tokens = nltk.word_tokenize(raw)
    type(tokens)
    
    tokens[1:10]
    ['Project',
     'Gutenberg',
     'EBook',
     'of',
     'Crime',
     'and',
     'Punishment',
     ',',
     'by']
    
    tokens2 = pd.DataFrame(tokens)
    tokens2.columns = ['Words']
    tokens2.head()
    
    
    Words
    0   The
    1   Project
    2   Gutenberg
    3   EBook
    4   of
    
    tokens2.Words.value_counts().head()
    ,                 16178
    .                  9589
    the                7436
    and                6284
    to                 5278
    
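    The counts above are for a single text. To get an actual term-document matrix for several tokenized documents with the same tools, one could stack (document, token) pairs and pivot with pd.crosstab (a sketch with made-up documents; any nltk.word_tokenize output would do):

    import pandas as pd

    # hypothetical tokenized documents
    docs = {'doc1': ['john', 'and', 'bob'],
            'doc2': ['john', 'went', 'to', 'the', 'store']}

    # one row per (document, token) occurrence
    pairs = pd.DataFrame([(name, tok) for name, toks in docs.items() for tok in toks],
                         columns=['Doc', 'Word'])

    # crosstab counts occurrences: terms as rows, documents as columns
    print(pd.crosstab(pairs.Word, pairs.Doc))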
  • 2020-12-24 05:11

    Thanks to Radim and Larsmans. My objective was to have a DTM like the one you get in R's tm package. I decided to use scikit-learn, partly inspired by this blog entry. This is the code I came up with.

    I post it here in the hope that someone else will find it useful.

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer 
    
    def fn_tdm_df(docs, xColNames=None, **kwargs):
        '''Create a term-document matrix as a pandas DataFrame.
        **kwargs are passed through to CountVectorizer;
        if xColNames is given, the DataFrame gets those column names.'''

        # initialize the vectorizer
        vectorizer = CountVectorizer(**kwargs)
        x1 = vectorizer.fit_transform(docs)
        # create the DataFrame: terms as rows, documents as columns
        df = pd.DataFrame(x1.toarray().transpose(),
                          index=vectorizer.get_feature_names_out())  # get_feature_names() in older scikit-learn
        if xColNames is not None:
            df.columns = xColNames

        return df
    

    To use it on a list of texts in a directory:

    DIR = 'C:/Data/'
    
    def fn_CorpusFromDIR(xDIR):
        '''Create a corpus from a directory.
        Input:  a directory path
        Output: a dictionary with
                the shortened file names under ['ColNames']
                the text of each file under ['docs']'''
        import os
        Res = dict(docs=[open(os.path.join(xDIR, f)).read() for f in os.listdir(xDIR)],
                   ColNames=['P_' + f[0:6] for f in os.listdir(xDIR)])  # a list, not a lazy map
        return Res
    

    To create the DataFrame:

    d1 = fn_tdm_df(docs=fn_CorpusFromDIR(DIR)['docs'],
                   xColNames=fn_CorpusFromDIR(DIR)['ColNames'],
                   stop_words=None, decode_error='replace')  # charset_error in very old scikit-learn
    