Use sklearn TfidfVectorizer with already tokenized inputs?

闹比i 2021-02-05 14:29

I have a list of tokenized sentences and would like to fit a TfidfVectorizer. I tried the following:

tokenized_list_of_sentences = [['this', 'is', 'one'],
3 Answers
  •  清酒与你
    2021-02-05 15:08

    Try passing an identity function as the tokenizer and initializing the TfidfVectorizer object with the parameter lowercase=False (assuming this is actually desired, as you've lowercased your tokens in previous stages).

    tokenized_list_of_sentences = [['this', 'is', 'one', 'basketball'], ['this', 'is', 'a', 'football']]
    
    def identity_tokenizer(text):
        return text
    
    tfidf = TfidfVectorizer(tokenizer=identity_tokenizer, stop_words='english', lowercase=False)    
    tfidf.fit_transform(tokenized_list_of_sentences)
    

    Note that I changed the sentences, as the originals apparently contained only stop words, which caused another error due to an empty vocabulary.
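    For completeness, a self-contained sketch of the approach above, with the fitted vocabulary inspected afterwards (this assumes a recent scikit-learn; token_pattern=None is passed only to silence the warning newer versions emit when a custom tokenizer is supplied, and is not part of the original answer):

    ```python
    from sklearn.feature_extraction.text import TfidfVectorizer

    tokenized_list_of_sentences = [['this', 'is', 'one', 'basketball'],
                                   ['this', 'is', 'a', 'football']]

    def identity_tokenizer(text):
        # The input is already a list of tokens, so pass it through unchanged.
        return text

    # lowercase=False prevents sklearn from calling .lower() on the list
    # objects; token_pattern=None suppresses the "token_pattern is ignored"
    # warning emitted when a tokenizer callable is supplied.
    tfidf = TfidfVectorizer(tokenizer=identity_tokenizer,
                            stop_words='english',
                            lowercase=False,
                            token_pattern=None)
    matrix = tfidf.fit_transform(tokenized_list_of_sentences)

    # The stop words 'this', 'is', 'one', and 'a' are filtered out,
    # leaving two features in the learned vocabulary.
    print(sorted(tfidf.vocabulary_))  # ['basketball', 'football']
    print(matrix.shape)               # (2, 2)
    ```

    Since each document then contains exactly one unique feature, the resulting TF-IDF matrix is 2x2 with one nonzero entry per row.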
