Use sklearn TfidfVectorizer with already tokenized inputs?

闹比i 2021-02-05 14:29

I have a list of tokenized sentences and would like to fit a tfidf Vectorizer. I tried the following:

tokenized_list_of_sentences = [['this', 'is', 'one'],


        
3 Answers
  • 2021-02-05 14:46

    As @Jarad said, just use a "passthrough" function for your analyzer, but it also needs to filter out stop words. You can get stop words from sklearn:

    >>> from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
    >>> stop_words = set(ENGLISH_STOP_WORDS)
    

    or from nltk:

    >>> import nltk
    >>> nltk.download('stopwords')
    >>> from nltk.corpus import stopwords
    >>> stop_words = set(stopwords.words('english'))
    

    or combine both sets:

    stop_words = stop_words.union(ENGLISH_STOP_WORDS)
    

    But then your examples contain only stop words (every one of your words is in sklearn's ENGLISH_STOP_WORDS set).

    Nonetheless, @Jarad's examples work:

    >>> tokenized_list_of_sentences =  [
    ...     ['this', 'is', 'one', 'cat', 'or', 'dog'],
    ...     ['this', 'is', 'another', 'dog']]
    >>> from sklearn.feature_extraction.text import TfidfVectorizer
    >>> tfidf = TfidfVectorizer(analyzer=lambda x:[w for w in x if w not in stop_words])
    >>> tfidf_vectors = tfidf.fit_transform(tokenized_list_of_sentences)
    

    I like pd.DataFrames for browsing TF-IDF vectors:

    >>> import pandas as pd
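    >>> # vocabulary_ maps term -> column index, so order the terms by index
    >>> pd.DataFrame(tfidf_vectors.toarray(), columns=sorted(tfidf.vocabulary_, key=tfidf.vocabulary_.get))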
            cat       dog 
    0  0.814802  0.579739
    1  0.000000  1.000000
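
To see where the numbers in that table come from, here is a minimal sketch that recomputes them by hand with the standard library only. It assumes TfidfVectorizer's default settings (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row); the helper names `idf` and `tfidf_row` are made up for illustration and are not part of scikit-learn.

```python
import math

# The two example sentences after stop-word removal.
docs = [["cat", "dog"], ["dog"]]

vocab = sorted({tok for doc in docs for tok in doc})  # ['cat', 'dog']
n_docs = len(docs)


def idf(term):
    """Smoothed inverse document frequency (assumed default formula)."""
    df = sum(term in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + df)) + 1


def tfidf_row(doc):
    """Raw term count times IDF, then L2-normalize the row."""
    raw = [doc.count(term) * idf(term) for term in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]


rows = [tfidf_row(doc) for doc in docs]
for row in rows:
    print([round(v, 6) for v in row])
```

"cat" appears in only one of the two documents, so it gets the larger IDF and therefore the larger weight in the first row.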
    
  • 2021-02-05 15:03

    Try preprocessor instead of tokenizer.

        return lambda x: strip_accents(x.lower())
    AttributeError: 'list' object has no attribute 'lower'
    

    If x in the traceback above is a list, calling x.lower() on it raises this error, because lists have no lower() method.

    Your two example sentences consist entirely of stop words, so to make this example return something, I threw in a few other words. Here's an example:

    from sklearn.feature_extraction.text import TfidfVectorizer

    tokenized_sentences = [['this', 'is', 'one', 'cat', 'or', 'dog'],
                           ['this', 'is', 'another', 'dog']]
    
    tfidf = TfidfVectorizer(preprocessor=' '.join, stop_words='english')
    tfidf.fit_transform(tokenized_sentences)
    

    Returns:

    <2x2 sparse matrix of type '<class 'numpy.float64'>'
        with 3 stored elements in Compressed Sparse Row format>
    

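    Features (get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2; on newer versions call get_feature_names_out() instead):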

    >>> tfidf.get_feature_names()
    ['cat', 'dog']
    

    UPDATE: alternatively, pass identity lambdas as both the tokenizer and the preprocessor:

    tokenized_sentences = [['this', 'is', 'one', 'cat', 'or', 'dog'],
                           ['this', 'is', 'another', 'dog']]
    
    tfidf = TfidfVectorizer(tokenizer=lambda x: x,
                            preprocessor=lambda x: x, stop_words='english')
    tfidf.fit_transform(tokenized_sentences)
    
    <2x2 sparse matrix of type '<class 'numpy.float64'>'
        with 3 stored elements in Compressed Sparse Row format>
    >>> tfidf.get_feature_names()
    ['cat', 'dog']
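
The AttributeError quoted at the top of this answer can be reproduced in a few lines. This is a sketch rather than a definitive diagnosis: it assumes a default `TfidfVectorizer()`, whose built-in preprocessor lowercases each document, and then shows the `' '.join` preprocessor from above avoiding the problem. The variable names `error_message`, `joined`, and `joined_matrix` are only illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tokenized_sentences = [["this", "is", "one", "cat", "or", "dog"],
                       ["this", "is", "another", "dog"]]

# With default settings each document is expected to be a string,
# so handing over a list of tokens fails inside the preprocessor.
try:
    TfidfVectorizer().fit_transform(tokenized_sentences)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)
print(error_message)

# Joining the tokens back into one string lets the default tokenizer
# and the built-in English stop-word list do the rest.
joined = TfidfVectorizer(preprocessor=" ".join, stop_words="english")
joined_matrix = joined.fit_transform(tokenized_sentences)
print(joined_matrix.shape, sorted(joined.vocabulary_))
```

Note that joining and re-tokenizing only round-trips cleanly if your tokens contain no spaces and survive the default token pattern (which drops single-character tokens).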
    
  • 2021-02-05 15:08

    Try initializing the TfidfVectorizer object with the parameter lowercase=False (assuming this is actually desired as you've lowercased your tokens in previous stages).

    tokenized_list_of_sentences = [['this', 'is', 'one', 'basketball'], ['this', 'is', 'a', 'football']]
    
    def identity_tokenizer(text):
        return text
    
    tfidf = TfidfVectorizer(tokenizer=identity_tokenizer, stop_words='english', lowercase=False)    
    tfidf.fit_transform(tokenized_list_of_sentences)
    

    Note that I changed the sentences, as the originals apparently contained only stop words, which caused another error due to an empty vocabulary.
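
For reference, here is a self-contained version of the identity-tokenizer approach. It is a sketch under a couple of assumptions: `token_pattern=None` is added only to silence the "token_pattern will not be used" warning that recent scikit-learn releases emit when a custom tokenizer is given, and the feature names are read from `vocabulary_` (sorted by column index) so the snippet does not depend on `get_feature_names()` or `get_feature_names_out()` being available in your version.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tokenized_list_of_sentences = [["this", "is", "one", "basketball"],
                               ["this", "is", "a", "football"]]


def identity_tokenizer(tokens):
    # The documents are already token lists, so return them unchanged.
    return tokens


tfidf = TfidfVectorizer(
    tokenizer=identity_tokenizer,
    lowercase=False,      # skip str.lower(), which a list does not have
    stop_words="english",
    token_pattern=None,   # unused when a tokenizer is supplied
)
matrix = tfidf.fit_transform(tokenized_list_of_sentences)

# vocabulary_ maps term -> column index; sort terms by that index.
feature_names = sorted(tfidf.vocabulary_, key=tfidf.vocabulary_.get)
print(feature_names)
print(matrix.toarray())
```

scikit-learn may still emit a UserWarning about the stop-word list being inconsistent with the custom tokenizer; the fit itself goes through.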
