How can I use a list of lists, or a list of sets, for the TfidfVectorizer?

Question

I'm using the sklearn TfidfVectorizer for text classification.

I know this vectorizer expects raw text as input, but passing a flat list of strings works (see input1).

However, if I pass a list of lists (or a list of sets), I get the AttributeError shown below.

Does anyone know how to tackle this problem? Thanks in advance!

    from sklearn.feature_extraction.text import TfidfVectorizer

    vectorizer = TfidfVectorizer(min_df=1, stop_words="english")
    input1 = ["This", "is", "a", "test"]
    input2 = [["This", "is", "a", "test"], ["It", "is", "raining", "today"]]

    print(vectorizer.fit_transform(input1))  # works
    print(vectorizer.fit_transform(input2))  # raises AttributeError

input 1:
  (3, 0)    1.0

input 2:

    Traceback (most recent call last):
      File "", line 1, in
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 1381, in fit_transform
        X = super(TfidfVectorizer, self).fit_transform(raw_documents)
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 869, in fit_transform
        self.fixed_vocabulary_)
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 792, in _count_vocab
        for feature in analyze(doc):
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 266, in
        tokenize(preprocess(self.decode(doc))), stop_words)
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 232, in
        return lambda x: strip_accents(x.lower())
    AttributeError: 'list' object has no attribute 'lower'

Answer 1:

Note that input1 works, but the vectorizer treats each element of the list (each string) as a separate document, so the four words are vectorized as four one-word documents rather than as one sentence.
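To make this concrete, here is a minimal sketch that reuses the question's vectorizer settings and input1; the variable name `matrix` is only illustrative. Four strings give four rows, and only "test" ends up in the vocabulary, because "This" and "is" are on the English stop-word list and the default token pattern ignores single-character tokens such as "a".

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=1, stop_words="english")
input1 = ["This", "is", "a", "test"]

# Each string in the list is treated as its own document.
matrix = vectorizer.fit_transform(input1)

print(matrix.shape)                    # (4, 1): four documents, one feature
print(sorted(vectorizer.vocabulary_))  # ['test']
```

This matches the single entry `(3, 0)    1.0` printed in the question: only the fourth document ("test") has a nonzero weight.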

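In the case of input2, I assume you want to vectorize each sublist as one "sentence". The error occurs because the vectorizer calls .lower() on every document, and a list has no such method. One solution is to join each sublist into a single string with a list comprehension:

    input2_corrected = [" ".join(x) for x in input2]

which produces

    ['This is a test', 'It is raining today']

Passing input2_corrected to fit_transform no longer raises the AttributeError.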
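As a rough end-to-end sketch, the join approach can be run on both a list of lists and a list of sets. The names `input_sets`, `sets_corrected`, `tfidf`, `tfidf_sets`, `identity` and `pretokenized` are illustrative and not from the question. The last part shows an alternative that is not part of the answer above: passing a callable as `analyzer`, which makes the vectorizer use the token lists as they are. With a callable analyzer, lowercasing and stop-word filtering are skipped, so they would have to be done beforehand if needed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

input2 = [["This", "is", "a", "test"], ["It", "is", "raining", "today"]]
input_sets = [{"This", "is", "a", "test"}, {"It", "is", "raining", "today"}]

# One string per sublist, so each sublist becomes one document.
input2_corrected = [" ".join(tokens) for tokens in input2]

# Sets have no defined order; sorting keeps the joined string reproducible.
sets_corrected = [" ".join(sorted(tokens)) for tokens in input_sets]

vectorizer = TfidfVectorizer(min_df=1, stop_words="english")
tfidf = vectorizer.fit_transform(input2_corrected)
print(tfidf.shape)                     # (2, 3)
print(sorted(vectorizer.vocabulary_))  # ['raining', 'test', 'today']

tfidf_sets = TfidfVectorizer(min_df=1, stop_words="english").fit_transform(sets_corrected)
print(tfidf_sets.shape)                # (2, 3)


# Alternative (an assumption beyond the answer): keep the documents
# pre-tokenized and hand the vectorizer a callable analyzer that
# returns the tokens unchanged.
def identity(tokens):
    return tokens


pretokenized = TfidfVectorizer(analyzer=identity).fit_transform(input2)
print(pretokenized.shape)              # (2, 7): no lowercasing or stop words
```

The join approach keeps the vectorizer's usual preprocessing, which is usually what you want for plain word lists. The callable analyzer is mainly useful when the tokens come from your own tokenizer and should not be split or filtered again.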

Source: https://stackoverflow.com/questions/50633153/how-can-i-use-a-list-of-lists-or-a-list-of-sets-for-the-tfidfvectorizer

Tags

python

python-3.x

scikit-learn

text-classification

tfidfvectorizer