TfidfVectorizer for corpus that cannot fit in memory

南笙 2021-02-02 03:07

I want to build a tf-idf model from a corpus that cannot fit in memory. I read the tutorial, but it seems to load the whole corpus at once:

from sklearn.feature_e         


        
1 Answer
  • 2021-02-02 03:29

    Yes, you can: just make your corpus an iterator. For example, if your documents reside on disk, you can define a generator that takes the list of file names and yields the documents one by one, so the whole corpus is never loaded into memory at once.

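    from sklearn.feature_extraction.text import TfidfVectorizer

    def load_doc_from_file(path):
        # Custom loader; shown here as a plain UTF-8 text read (an assumption, adapt to your format)
        with open(path, encoding="utf-8") as f:
            return f.read()

    def make_corpus(doc_files):
        # Generator: yields one document at a time instead of building a list
        for doc in doc_files:
            yield load_doc_from_file(doc)

    file_list = ...  # list of files you want to load
    corpus = make_corpus(file_list)
    vectorizer = TfidfVectorizer(min_df=1)  # min_df=1 is the default
    vectorizer.fit(corpus)  # consumes the generator once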
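
To make the idea above concrete, here is a minimal end-to-end sketch. It assumes plain UTF-8 text files; the helper `iter_docs`, the toy documents, and the temporary directory are made up for illustration and are not part of the original answer. It also shows two details that are easy to miss: a generator can only be consumed once, so a fresh one is needed for `transform`, and `TfidfVectorizer` can read the files itself when constructed with `input="filename"`.

```python
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer


def iter_docs(paths):
    """Yield one document's text at a time (hypothetical helper)."""
    for path in paths:
        yield Path(path).read_text(encoding="utf-8")


# Toy corpus written to a temporary directory, purely for the demo.
tmp_dir = Path(tempfile.mkdtemp())
texts = ["the cat sat", "the dog barked", "a cat and a dog"]
paths = []
for i, text in enumerate(texts):
    path = tmp_dir / f"doc_{i}.txt"
    path.write_text(text, encoding="utf-8")
    paths.append(str(path))

# Fit from a generator: files are read lazily, one per iteration.
vectorizer = TfidfVectorizer()
vectorizer.fit(iter_docs(paths))

# The first generator is now exhausted, so create a new one to transform.
matrix = vectorizer.transform(iter_docs(paths))

# Alternative: pass file names and let scikit-learn open each file.
file_vectorizer = TfidfVectorizer(input="filename")
file_matrix = file_vectorizer.fit_transform(paths)
```

Either way, only the raw documents are streamed. The learned vocabulary and the sparse matrix returned by `transform` are still held in memory.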
    