Efficient term-document matrix with NLTK

温柔的废话 2020-12-24 04:15

I am trying to create a term document matrix with NLTK and pandas. I wrote the following function:

def fnDTM_Corpus(xCorpus):
    import pandas as pd
    ...


        
3 answers
  •  醉梦人生
    2020-12-24 04:51

    I know the OP wanted to create a tdm in NLTK, but the textmining package (pip install textmining) makes it dead simple:

    import textmining
        
    # Create some very short sample documents
    doc1 = 'John and Bob are brothers.'
    doc2 = 'John went to the store. The store was closed.'
    doc3 = 'Bob went to the store too.'
    
    # Initialize class to create term-document matrix
    tdm = textmining.TermDocumentMatrix()
    
    # Add the documents
    tdm.add_doc(doc1)
    tdm.add_doc(doc2)
    tdm.add_doc(doc3)
    
    # Write matrix file -- cutoff=1 means words in 1+ documents are retained
    tdm.write_csv('matrix.csv', cutoff=1)
    
    # Instead of writing the matrix, access its rows directly
    for row in tdm.rows(cutoff=1):
        print(row)
    

    Output:

    ['and', 'the', 'brothers', 'to', 'are', 'closed', 'bob', 'john', 'was', 'went', 'store', 'too']
    [1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0]
    [0, 2, 0, 1, 0, 1, 0, 1, 1, 1, 2, 0]
    [0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1]
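
If installing textmining is not an option, the same kind of matrix can be sketched with the standard library alone. This is a minimal sketch, not the textmining implementation: the helper name term_document_matrix and the regex tokenizer (lowercase, alphabetic runs only) are assumptions made for illustration, and the columns come out in alphabetical order rather than in the order shown above.

```python
import re
from collections import Counter


def term_document_matrix(docs):
    """Return (sorted vocabulary, one row of term counts per document)."""
    # Assumed tokenizer: lowercase the text and keep alphabetic runs only.
    counts = [Counter(re.findall(r"[a-z]+", doc.lower())) for doc in docs]
    # The vocabulary is every term seen in any document, sorted alphabetically.
    vocab = sorted(set().union(*counts))
    # A Counter returns 0 for missing terms, so absent words need no special case.
    rows = [[c[term] for term in vocab] for c in counts]
    return vocab, rows


docs = [
    'John and Bob are brothers.',
    'John went to the store. The store was closed.',
    'Bob went to the store too.',
]
vocab, rows = term_document_matrix(docs)
print(vocab)
for row in rows:
    print(row)
```

The rows are plain lists, so they can be handed to pandas with pd.DataFrame(rows, columns=vocab) if a labelled table is wanted.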
    

    Alternatively, one can use pandas and sklearn [source]:

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer
    
    docs = ['why hello there', 'omg hello pony', 'she went there? omg']
    vec = CountVectorizer()
    X = vec.fit_transform(docs)
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    df = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
    print(df)
    

    Output:

       hello  omg  pony  she  there  went  why
    0      1    0     0    0      1     0    1
    1      1    1     1    0      0     0    0
    2      0    1     0    1      1     1    0
    
