Sklearn TFIDF on large corpus of documents

Posted by 流过昼夜 on 2020-01-05 05:31:10

Question


In the context of an internship project, I have to perform a TF-IDF analysis over a large set of files (~18,000). I am trying to use the TF-IDF vectorizer from sklearn, but I'm facing the following issue: how can I avoid loading all the files into memory at once? According to what I read in other posts, it seems feasible to use an iterable, but if I use, for instance, [open(file) for file in os.listdir(path)] as the raw_documents input to the fit_transform() function, I get a 'too many open files' error. Thanks in advance for your suggestions! Cheers! Paul


Answer 1:


Have you tried the input='filename' parameter of TfidfVectorizer? Something like this:

import os
from sklearn.feature_extraction.text import TfidfVectorizer

# List of the file paths of all the files (path is the corpus directory from the question)
raw_docs_filepaths = [os.path.join(path, fname) for fname in os.listdir(path)]

tfidf_vectorizer = TfidfVectorizer(input='filename')
tfidf_data = tfidf_vectorizer.fit_transform(raw_docs_filepaths)

This should work because, with this setting, the vectorizer opens only one file at a time, while it is being processed. This can be confirmed by checking the source code here:

def decode(self, doc):
    ...
    if self.input == 'filename':
        with open(doc, 'rb') as fh:
            doc = fh.read()
    ...
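
Alternatively, if you prefer to keep the default input='content', a generator that reads one file at a time also avoids the 'too many open files' error, because each file is closed before the next is opened and fit_transform() consumes the documents in a single pass. A minimal sketch, assuming plain-text files and a hypothetical corpus directory path/to/corpus:

import os
from sklearn.feature_extraction.text import TfidfVectorizer

def iter_documents(corpus_dir):
    # Yield the text of one file at a time; each file is closed before the next is opened
    for fname in os.listdir(corpus_dir):
        with open(os.path.join(corpus_dir, fname), encoding='utf-8') as fh:
            yield fh.read()

corpus_dir = 'path/to/corpus'  # hypothetical directory holding the ~18000 files
vectorizer = TfidfVectorizer()  # default input='content' expects strings
tfidf_data = vectorizer.fit_transform(iter_documents(corpus_dir))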


Source: https://stackoverflow.com/questions/51422688/sklearn-tfidf-on-large-corpus-of-documents
