I have a list of tokenized sentences and would like to fit a TfidfVectorizer on it. I tried the following:
tokenized_list_of_sentences = [['this', 'is', 'one'],
Try preprocessor instead of tokenizer.
return lambda x: strip_accents(x.lower())
AttributeError: 'list' object has no attribute 'lower'
If x in the above error message is a list, then calling x.lower() on it will throw this error, because lists have no lower() method.
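To make that concrete, here is a minimal sketch in plain Python (no scikit-learn needed) that reproduces the same AttributeError on a token list and shows why joining the tokens into a string first avoids it. The variable names are just for illustration.

```python
tokens = ['this', 'is', 'one']

# The default preprocessor calls .lower() on each document.
# A list has no such method, so this raises AttributeError.
try:
    tokens.lower()
except AttributeError as exc:
    error_message = str(exc)

# Joining the tokens first yields a str, which does support .lower().
joined = ' '.join(tokens)
lowered = joined.lower()

print(error_message)
print(lowered)
```

This is why passing ' '.join as the preprocessor works: each document becomes a plain string before anything else touches it.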
Both of your example sentences consist entirely of stop words, so to make the example return something, throw in a few other words. Here's an example:
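from sklearn.feature_extraction.text import TfidfVectorizer

tokenized_sentences = [['this', 'is', 'one', 'cat', 'or', 'dog'],
                       ['this', 'is', 'another', 'dog']]
# Join each token list back into one string so the default tokenizer can process it
tfidf = TfidfVectorizer(preprocessor=' '.join, stop_words='english')
tfidf.fit_transform(tokenized_sentences)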
Returns:
<2x2 sparse matrix of type '<class 'numpy.float64'>'
with 3 stored elements in Compressed Sparse Row format>
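Features (get_feature_names() was removed in scikit-learn 1.2; on newer versions call get_feature_names_out() instead):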
>>> tfidf.get_feature_names()
['cat', 'dog']
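One caveat with the ' '.join approach: the joined string is tokenized again by the vectorizer's default token_pattern, which is r"(?u)\b\w\w+\b". The sketch below applies that regex by hand to show the effect. For simple words the original tokens come back unchanged, but single-character tokens are dropped and tokens containing punctuation are split. The second token list is a made-up example, not from the question.

```python
import re

# scikit-learn's default token_pattern for the text vectorizers
token_pattern = re.compile(r"(?u)\b\w\w+\b")

simple_tokens = ['this', 'is', 'one', 'cat', 'or', 'dog']
simple_roundtrip = token_pattern.findall(' '.join(simple_tokens))

# Hypothetical tokens with punctuation and a one-letter word
tricky_tokens = ['I', 'love', 'New-York', "don't"]
tricky_roundtrip = token_pattern.findall(' '.join(tricky_tokens))

print(simple_roundtrip)   # same as simple_tokens
print(tricky_roundtrip)   # ['love', 'New', 'York', 'don']
```

If your tokenization has to survive exactly as is, joining and re-tokenizing is probably not what you want.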
UPDATE: maybe use identity lambdas for both tokenizer and preprocessor, so the token lists pass through unchanged?
tokenized_sentences = [['this', 'is', 'one', 'cat', 'or', 'dog'],
                       ['this', 'is', 'another', 'dog']]
tfidf = TfidfVectorizer(tokenizer=lambda x: x,
                        preprocessor=lambda x: x, stop_words='english')
tfidf.fit_transform(tokenized_sentences)
<2x2 sparse matrix of type '<class 'numpy.float64'>'
with 3 stored elements in Compressed Sparse Row format>
>>> tfidf.get_feature_names()
['cat', 'dog']
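As a rough sketch of the same pass-through idea, the version below swaps the lambdas for a named function (identity is my own helper name, not part of scikit-learn) and reads the fitted vocabulary_ attribute, which should be available regardless of which get_feature_names variant your version has. The hyphenated token is a made-up example to show that tokens reach the vocabulary untouched. Expect a couple of UserWarnings about token_pattern and stop word consistency; they are harmless here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def identity(tokens):
    """Return the input unchanged; used as both preprocessor and tokenizer."""
    return tokens


# Hypothetical data: note the hyphenated token in the first sentence
tokenized_sentences = [['this', 'is', 'a', 'state-of-the-art', 'cat'],
                       ['this', 'is', 'another', 'dog']]

tfidf = TfidfVectorizer(tokenizer=identity, preprocessor=identity,
                        stop_words='english')
matrix = tfidf.fit_transform(tokenized_sentences)

# vocabulary_ maps each kept token to its column index
features = sorted(tfidf.vocabulary_)
print(features)
print(matrix.shape)
```

Keep in mind that with a custom preprocessor nothing is lowercased for you, so stop words only match if your tokens are already lowercase. A named function is also easier to pickle than a lambda if you plan to save the fitted vectorizer.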