tf-idf on a somewhat large (65k) number of text files

十年热恋 submitted on 2021-02-08 04:45:37

Question


I want to try tf-idf with scikit-learn (or nltk; I am open to other suggestions). The data I have is a relatively large number of discussion forum posts (~65k) that we have scraped and stored in MongoDB. Each post has a post title, the date and time of the post, the text of the post message (or a "re:" if it is a reply to an existing post), a user name, a message ID, and whether it is a child or parent post (posts form a thread tree: an original post, replies to that OP, and nested replies).

I figure each post would be a separate document, and, similar to the 20newsgroups dataset, each document would have the fields I mentioned at the top and the text of the message post at the bottom, which I would extract out of Mongo and write into the required format, one text file per post.

For loading the data into scikit-learn, I know of:

http://scikit-learn.org/dev/modules/generated/sklearn.datasets.load_files.html (but my data is not categorized)

http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html - for the input, I know I would be using filenames, but because I would have a large number of files (one per post), is there a way to have the filenames read from a text file? Or is there an example implementation someone could point me towards?

Also, any advice on structuring the filenames for each of these discussion forum posts, so I can identify them later when I get the tf-idf vectors and the cosine similarity array?

Thanks


Answer 1:


You can pass a Python generator or a generator expression of either filenames or string objects instead of a list, and thus lazily load the data from the drive as you go. Here is a toy example of a CountVectorizer taking a generator expression as its argument:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> CountVectorizer().fit_transform('a' * i for i in range(100))
<100x98 sparse matrix of type '<class 'numpy.int64'>'
    with 98 stored elements in Compressed Sparse Row format>

Note that generator support makes it possible to vectorize the data directly from a MongoDB query result iterator rather than going through filenames.
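For instance, here is a minimal sketch of that idea, assuming a pymongo collection whose documents keep the post body in a 'text' field (the database, collection, and field names are hypothetical):

from pymongo import MongoClient
from sklearn.feature_extraction.text import TfidfVectorizer

coll = MongoClient()['forum']['posts']           # hypothetical database/collection names
cursor = coll.find({}, {'text': 1})              # stream documents lazily from MongoDB
texts = (doc.get('text', '') for doc in cursor)  # generator: no full copy of the corpus in RAM

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(texts)              # rows follow the cursor's iteration order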

Also, a list of 65k filenames of 10 characters each is just 650 kB in memory (plus the overhead of the Python list), so it should not be a problem to load all the filenames ahead of time anyway.

"any advice on structuring the filenames for each of these discussion forum posts, so I can identify them later when I get the tf-idf vectors and the cosine similarity array"

Just use a deterministic ordering, for example by sorting the list of filenames before feeding them to the vectorizer.
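One possible way to structure the filenames (a sketch, assuming each post document has 'boardid' and 'messageid' fields) is to zero-pad the ids so that a lexicographic sort of the filenames is also a deterministic, meaningful order, and row i of the resulting matrices maps straight back to a post:

import os

def post_filename(post):
    # zero-padded ids so that sorting the filenames also sorts the posts
    return "%05d_%010d.txt" % (post['boardid'], post['messageid'])

path = "/path/to/post/files"              # placeholder
filenames = sorted(os.listdir(path))      # deterministic order
# row i of the tf-idf matrix (and of the cosine similarity array) corresponds to filenames[i]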




Answer 2:


I was able to get these tasks done. In case it is helpful, below is the code for specifying the set of text files you want to use, and then how to set the flags and pass the filenames:

path = "/wherever/yourfolder/oftextfiles/are"
filenames = os.listdir(path)
filenames.sort()

try:
    filenames.remove('.DS_Store') #Because I am on a MAC
except ValueError:
    pass # or scream: thing not in some_list!
except AttributeError:
    pass # call security, some_list not quacking like a list!

vectorizer = CountVectorizer(input='filename', analyzer='word', strip_accents='unicode', stop_words='english') 
X=vectorizer.fit_transform(filenames)

The MongoDB part is basic, but for what it's worth (find all entries with boardid 10 and sort by messageid in ascending order):

cursor = coll.find({'boardid': 10}).sort('messageid', 1)
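Building on that cursor, here is a short sketch of going from the post bodies to the tf-idf matrix and the cosine similarity array the question asks about (the 'message' field name is an assumption about the document schema):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = [doc['message'] for doc in cursor]   # post bodies, in messageid order
tfidf = TfidfVectorizer(stop_words='english').fit_transform(texts)
similarities = cosine_similarity(tfidf)      # similarities[i, j] between posts i and j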



Source: https://stackoverflow.com/questions/19419245/tf-idf-on-a-somewhat-large-65k-amount-of-text-files
