Question
I want to try tf-idf with scikit-learn (or NLTK, or I am open to other suggestions). The data I have is a relatively large collection of discussion forum posts (~65k) that we have scraped and stored in MongoDB. Each post has a post title, date and time of the post, text of the post message (or a "re:" if it is a reply to an existing post), user name, message ID, and whether it is a child or parent post (in a thread you have the original post, then replies to that OP, or nested replies, i.e. the tree).
I figure each post would be a separate document, and, similar to the 20newsgroups data, each document would have the fields I mentioned above at the top and the text of the message post at the bottom, which I would extract out of Mongo and write into the required format for each text file.
For loading the data into scikit, I know of:
http://scikit-learn.org/dev/modules/generated/sklearn.datasets.load_files.html (but my data is not categorized)
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html - For the input, I know I would be using filenames, but because I would have a large number of files (one per post), is there a way to have the filenames read from a text file? Or is there an example implementation someone could point me towards?
Also, any advice on structuring the filenames for each of these discussion forum posts, so I can identify them later when I get the tf-idf vectors and the cosine similarity array?
Thanks
Answer 1:
You can pass a Python generator or a generator expression of either filenames or string objects instead of a list, and thus lazily load the data from the drive as you go. Here is a toy example of a CountVectorizer taking a generator expression as its argument:
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> CountVectorizer().fit_transform(('a' * i for i in xrange(100)))
<100x98 sparse matrix of type '<type 'numpy.int64'>'
with 98 stored elements in Compressed Sparse Column format>
Note that generator support makes it possible to vectorize the data directly from a MongoDB query result iterator rather than going through filenames.
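For example, here is a minimal sketch of vectorizing straight from a pymongo cursor; the database/collection names and the 'text' field are assumptions for illustration, not from the original post:
from pymongo import MongoClient
from sklearn.feature_extraction.text import TfidfVectorizer

coll = MongoClient()['forum']['posts']  # hypothetical database and collection names
texts = (post['text'] for post in coll.find({}, {'text': 1}))  # lazy generator over the cursor; 'text' field is assumed
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(texts)  # consumes the cursor once, no intermediate files on disk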
Also, a list of 65k filenames of 10 chars each is just 650kB in memory (plus the overhead of the Python list), so it should not be a problem to load all the filenames ahead of time anyway.
any advice on structuring the filenames for each of these discussion forum posts, for later identifying them when I get the tf-idf vectors and cosine similarity array
Just use a deterministic ordering to be able to sort the list of filenames before feeding them to the vectorizer.
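As a hedged sketch of what that mapping looks like (assuming a filenames list as above), row i of the tf-idf matrix and of the cosine similarity array then corresponds to filenames[i]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

filenames = sorted(filenames)  # deterministic ordering before vectorizing
tfidf = TfidfVectorizer(input='filename', stop_words='english').fit_transform(filenames)
similarities = cosine_similarity(tfidf)  # similarities[i, j] compares filenames[i] and filenames[j]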
Answer 2:
I was able to get these tasks done. In case it is helpful, below is the code for specifying the set of text files you want to use, how to set the flags, and how to pass the filenames:
import os
from sklearn.feature_extraction.text import CountVectorizer

path = "/wherever/yourfolder/oftextfiles/are"
filenames = os.listdir(path)
filenames.sort()
try:
    filenames.remove('.DS_Store')  # Because I am on a Mac
except ValueError:
    pass  # or scream: thing not in some_list!
except AttributeError:
    pass  # call security, some_list not quacking like a list!

# Prepend the directory so CountVectorizer(input='filename') can open each file
filenames = [os.path.join(path, name) for name in filenames]

vectorizer = CountVectorizer(input='filename', analyzer='word', strip_accents='unicode', stop_words='english')
X = vectorizer.fit_transform(filenames)
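To go from the raw counts in X to the tf-idf vectors and the cosine similarity array the question asks about, one option (a sketch, not part of the original answer) is to chain a TfidfTransformer onto X:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics.pairwise import cosine_similarity

tfidf = TfidfTransformer().fit_transform(X)  # reweight the term counts in X as tf-idf
similarities = cosine_similarity(tfidf)  # similarities[i, j] pairs filenames[i] with filenames[j]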
The MongoDB part is basic, but for what it's worth (find all entries with boardid 10 and sort by messageid in ascending order):
cursor = coll.find({'boardid': 10}).sort('messageid', 1)
Source: https://stackoverflow.com/questions/19419245/tf-idf-on-a-somewhat-large-65k-amount-of-text-files