Question
I am trying to create a term document matrix with NLTK and pandas. I wrote the following function:
def fnDTM_Corpus(xCorpus):
    '''to create a Term Document Matrix from an NLTK corpus'''
    import pandas as pd
    fd_list = []
    for x in range(0, len(xCorpus.fileids())):
        fd_list.append(nltk.FreqDist(xCorpus.words(xCorpus.fileids()[x])))
    DTM = pd.DataFrame(fd_list, index=xCorpus.fileids())
    DTM.fillna(0, inplace=True)
    return DTM.T
to run it
import nltk
from nltk.corpus import PlaintextCorpusReader
corpus_root = 'C:/Data/'
newcorpus = PlaintextCorpusReader(corpus_root, '.*')
x = fnDTM_Corpus(newcorpus)
It works well for a few small files in the corpus, but it gives me a MemoryError when I try to run it on a corpus of 4,000 files (about 2 kB each).
Am I missing something?
I am using 32-bit Python (on Windows 7, 64-bit OS, Core Quad CPU, 8 GB RAM). Do I really need 64-bit Python for a corpus of this size?
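(A rough back-of-the-envelope estimate, assuming, say, 50,000 unique terms across the 4,000 files: a dense float64 DataFrame alone needs about 4,000 × 50,000 × 8 bytes ≈ 1.6 GB, which is already close to the roughly 2 GB of address space a 32-bit Python process typically gets on Windows, before counting the intermediate list of FreqDist objects.)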
Answer 1:
I know the OP wanted to create a tdm in NLTK, but the textmining package (pip install textmining) makes it dead simple:
import textmining

def termdocumentmatrix_example():
    # Create some very short sample documents
    doc1 = 'John and Bob are brothers.'
    doc2 = 'John went to the store. The store was closed.'
    doc3 = 'Bob went to the store too.'
    # Initialize class to create term-document matrix
    tdm = textmining.TermDocumentMatrix()
    # Add the documents
    tdm.add_doc(doc1)
    tdm.add_doc(doc2)
    tdm.add_doc(doc3)
    # Write out the matrix to a csv file. Note that setting cutoff=1 means
    # that words which appear in 1 or more documents will be included in
    # the output (i.e. every word will appear in the output). The default
    # for cutoff is 2, since we usually aren't interested in words which
    # appear in a single document. For this example we want to see all
    # words however, hence cutoff=1.
    tdm.write_csv('matrix.csv', cutoff=1)
    # Instead of writing out the matrix you can also access its rows directly.
    # Let's print them to the screen.
    for row in tdm.rows(cutoff=1):
        print(row)

termdocumentmatrix_example()
Output:
['and', 'the', 'brothers', 'to', 'are', 'closed', 'bob', 'john', 'was', 'went', 'store', 'too']
[1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0]
[0, 2, 0, 1, 0, 1, 0, 1, 1, 1, 2, 0]
[0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1]
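Since the question asks for a pandas DataFrame, the rows that textmining returns can be fed straight into one. A small sketch (not part of the original answer), reusing the sample documents above; as the output shows, the first row holds the terms:

import pandas as pd
import textmining

tdm = textmining.TermDocumentMatrix()
for doc in ['John and Bob are brothers.',
            'John went to the store. The store was closed.',
            'Bob went to the store too.']:
    tdm.add_doc(doc)

rows = list(tdm.rows(cutoff=1))        # first row holds the terms, the rest are per-document counts
df_tdm = pd.DataFrame(rows[1:], columns=rows[0])
print(df_tdm)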
Alternatively, one can use pandas and sklearn [source]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
docs = ['why hello there', 'omg hello pony', 'she went there? omg']
vec = CountVectorizer()
X = vec.fit_transform(docs)
df = pd.DataFrame(X.toarray(), columns=vec.get_feature_names())
print(df)
Output:
   hello  omg  pony  she  there  went  why
0      1    0     0    0      1     0    1
1      1    1     1    0      0     0    0
2      0    1     0    1      1     1    0
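The question's fnDTM_Corpus returns terms as rows and documents as columns (as in R's tm); the same layout falls out of the DataFrame above with a simple transpose:

print(df.T)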
Answer 2:
Thanks to Radim and Larsmans. My objective was to have a DTM like the one you get in R tm. I decided to use scikit-learn, partly inspired by this blog entry. This is the code I came up with.
I post it here in the hope that someone else will find it useful.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def fn_tdm_df(docs, xColNames=None, **kwargs):
    ''' create a term document matrix as pandas DataFrame
    with **kwargs you can pass arguments of CountVectorizer
    if xColNames is given the dataframe gets column names'''
    # initialize the vectorizer
    vectorizer = CountVectorizer(**kwargs)
    x1 = vectorizer.fit_transform(docs)
    # create DataFrame
    df = pd.DataFrame(x1.toarray().transpose(), index=vectorizer.get_feature_names())
    if xColNames is not None:
        df.columns = xColNames
    return df
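A quick sanity check on a small in-memory list of documents (a hypothetical example, not from the original post), reusing the sample texts from the previous answer:

docs = ['why hello there', 'omg hello pony', 'she went there? omg']
print(fn_tdm_df(docs, xColNames=['d1', 'd2', 'd3']))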
to use it on the texts in a directory
DIR = 'C:/Data/'

def fn_CorpusFromDIR(xDIR):
    ''' function to create a corpus from a directory
    Input: Directory
    Output: A dictionary with
        Names of files ['ColNames']
        the text in corpus ['docs']'''
    import os
    Res = dict(docs=[open(os.path.join(xDIR, f)).read() for f in os.listdir(xDIR)],
               ColNames=['P_' + x[0:6] for x in os.listdir(xDIR)])
    return Res
to create the dataframe
d1 = fn_tdm_df(docs=fn_CorpusFromDIR(DIR)['docs'],
               xColNames=fn_CorpusFromDIR(DIR)['ColNames'],
               stop_words=None, charset_error='replace')
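Note that in newer scikit-learn releases the CountVectorizer keyword charset_error was renamed to decode_error, and get_feature_names() (used inside fn_tdm_df) was replaced by get_feature_names_out(). A sketch of the equivalent call on a recent version, under those assumptions:

# assumes a recent scikit-learn; also swap get_feature_names() for
# get_feature_names_out() inside fn_tdm_df
d1 = fn_tdm_df(docs=fn_CorpusFromDIR(DIR)['docs'],
               xColNames=fn_CorpusFromDIR(DIR)['ColNames'],
               stop_words=None, decode_error='replace')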
Answer 3:
An alternative approach using tokens and a DataFrame
import nltk
import pandas as pd
# nltk.download()   # run this once to fetch the tokenizer models
from urllib import request

url = "http://www.gutenberg.org/files/2554/2554-0.txt"
response = request.urlopen(url)
raw = response.read().decode('utf8')
type(raw)
tokens = nltk.word_tokenize(raw)
type(tokens)
tokens[1:10]
['Project',
'Gutenberg',
'EBook',
'of',
'Crime',
'and',
'Punishment',
',',
'by']
tokens2=pd.DataFrame(tokens)
tokens2.columns=['Words']
tokens2.head()
       Words
0        The
1    Project
2  Gutenberg
3      EBook
4         of
tokens2.Words.value_counts().head()
,      16178
.       9589
the     7436
and     6284
to      5278
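The value_counts above covers a single document; to turn the same token-and-DataFrame idea into a term-document matrix over several documents, one possible sketch (not part of the original answer) uses pandas.crosstab:

docs = {'doc1': 'John and Bob are brothers.',
        'doc2': 'John went to the store. The store was closed.',
        'doc3': 'Bob went to the store too.'}
rows = [(name, tok) for name, text in docs.items()
        for tok in nltk.word_tokenize(text.lower())]
long_df = pd.DataFrame(rows, columns=['Doc', 'Words'])
dtm = pd.crosstab(long_df['Words'], long_df['Doc'])   # terms as rows, documents as columns
print(dtm)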
Source: https://stackoverflow.com/questions/15899861/efficient-term-document-matrix-with-nltk