tf-idf on a somewhat large (65k) number of text files

十年热恋 submitted on 2021-02-08 04:45:37

Question


I want to try tf-idf with scikit-learn (or nltk; I am open to other suggestions). The data I have is a relatively large number of discussion forum posts (~65k) that we have scraped and stored in MongoDB. Each post has a post title, the date and time of the post, the text of the post message (or a "re:" if it is a reply to an existing post), a user name, a message ID, and whether it is a child or parent post (posts form a thread tree: an original post, replies to that OP, and nested replies).

I figure each post would be a separate document, and, similar to the 20newsgroups dataset, each document would have the fields I mentioned at the top and the text of the message post at the bottom, which I would extract out of Mongo and write into the required format, one text file per post.

For loading the data into scikit-learn, I know of:

http://scikit-learn.org/dev/modules/generated/sklearn.datasets.load_files.html (but my data is not categorized)

http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html - for the input, I know I would be using filenames, but because I would have a large number of files (one per post), is there a way to have the filenames read from a text file? Or is there an example implementation someone could point me towards?

Also, any advice on structuring the filenames for each of these discussion forum posts, so I can identify them later when I get the tf-idf vectors and the cosine similarity array?

Thanks


Answer 1:


You can pass a Python generator or a generator expression of either filenames or string objects instead of a list, and thus lazily load the data from the drive as you go. Here is a toy example of a CountVectorizer taking a generator expression as its argument:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> CountVectorizer().fit_transform('a' * i for i in range(100))
<100x98 sparse matrix of type '<class 'numpy.int64'>'
    with 98 stored elements in Compressed Sparse Row format>

Note that generator support makes it possible to vectorize the data directly from a MongoDB query result iterator rather than going through filenames.
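For instance, here is a minimal sketch of that idea, assuming a pymongo collection whose documents keep the post body in a 'text' field (the database, collection, and field names are hypothetical):

from pymongo import MongoClient
from sklearn.feature_extraction.text import TfidfVectorizer

coll = MongoClient()['forum']['posts']           # hypothetical database/collection names
cursor = coll.find({}, {'text': 1})              # stream documents lazily from MongoDB
texts = (doc.get('text', '') for doc in cursor)  # generator: no full copy of the corpus in RAM

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(texts)              # rows follow the cursor's iteration order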

Also, a list of 65k filenames of 10 characters each is just 650 kB in memory (plus the overhead of the Python list), so it should not be a problem to load all the filenames ahead of time anyway.

"any advice on structuring the filenames for each of these discussion forum posts, so I can identify them later when I get the tf-idf vectors and the cosine similarity array"

Just use a deterministic ordering, for example by sorting the list of filenames before feeding them to the vectorizer.
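One possible way to structure the filenames (a sketch, assuming each post document has 'boardid' and 'messageid' fields) is to zero-pad the ids so that a lexicographic sort of the filenames is also a deterministic, meaningful order, and row i of the resulting matrices maps straight back to a post:

import os

def post_filename(post):
    # zero-padded ids so that sorting the filenames also sorts the posts
    return "%05d_%010d.txt" % (post['boardid'], post['messageid'])

path = "/path/to/post/files"              # placeholder
filenames = sorted(os.listdir(path))      # deterministic order
# row i of the tf-idf matrix (and of the cosine similarity array) corresponds to filenames[i]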




Answer 2:


I was able to get these tasks done. In case it is helpful, below is the code for specifying the set of text files you want to use, and then how to set the flags and pass the filenames:

path = "/wherever/yourfolder/oftextfiles/are"
filenames = os.listdir(path)
filenames.sort()

try:
    filenames.remove('.DS_Store') #Because I am on a MAC
except ValueError:
    pass # or scream: thing not in some_list!
except AttributeError:
    pass # call security, some_list not quacking like a list!

vectorizer = CountVectorizer(input='filename', analyzer='word', strip_accents='unicode', stop_words='english') 
X=vectorizer.fit_transform(filenames)

The MongoDB part is basic, but for what it's worth (find all entries with boardid 10 and sort by messageid in ascending order):

cursor = coll.find({'boardid': 10}).sort('messageid', 1)
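Building on that cursor, here is a short sketch of going from the post bodies to the tf-idf matrix and the cosine similarity array the question asks about (the 'message' field name is an assumption about the document schema):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = [doc['message'] for doc in cursor]   # post bodies, in messageid order
tfidf = TfidfVectorizer(stop_words='english').fit_transform(texts)
similarities = cosine_similarity(tfidf)      # similarities[i, j] between posts i and j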



Source: https://stackoverflow.com/questions/19419245/tf-idf-on-a-somewhat-large-65k-amount-of-text-files
