Question
I'm trying to use scikit-learn's TfidfVectorizer to transform a text corpus. However, when I call fit_transform on it, I get ValueError: empty vocabulary; perhaps the documents only contain stop words.
In [69]: TfidfVectorizer().fit_transform(smallcorp)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-69-ac16344f3129> in <module>()
----> 1 TfidfVectorizer().fit_transform(smallcorp)
/Users/maxsong/anaconda/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in fit_transform(self, raw_documents, y)
1217 vectors : array, [n_samples, n_features]
1218 """
-> 1219 X = super(TfidfVectorizer, self).fit_transform(raw_documents)
1220 self._tfidf.fit(X)
1221 # X is already a transformed view of raw_documents so
/Users/maxsong/anaconda/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in fit_transform(self, raw_documents, y)
778 max_features = self.max_features
779
--> 780 vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
781 X = X.tocsc()
782
/Users/maxsong/anaconda/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in _count_vocab(self, raw_documents, fixed_vocab)
725 vocabulary = dict(vocabulary)
726 if not vocabulary:
--> 727 raise ValueError("empty vocabulary; perhaps the documents only"
728 " contain stop words")
729
ValueError: empty vocabulary; perhaps the documents only contain stop words
I read through the SO question "Problems using a custom vocabulary for TfidfVectorizer scikit-learn" and tried ogrisel's suggestion of using TfidfVectorizer(**params).build_analyzer()(dataset2) to check the results of the text analysis step. That seems to work as expected; snippet below:
In [68]: TfidfVectorizer().build_analyzer()(smallcorp)
Out[68]:
[u'due',
u'to',
u'lack',
u'of',
u'personal',
u'biggest',
u'education',
u'and',
u'husband',
u'to',
Is there something else that I am doing wrong? The corpus I am feeding it is just one giant string punctuated by newlines.
Thanks!
Answer 1:
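This is most likely because you are passing a single string. fit_transform expects an iterable of documents, and iterating over a string yields one character at a time, so every "document" is a single character. The default token pattern only keeps tokens of two or more characters, so nothing survives and the vocabulary is empty. (build_analyzer() looks fine because it treats the whole string as one document.) Try splitting the string into a list of strings, e.g.: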
In [51]: smallcorp
Out[51]: 'Ah! Now I have done Philosophy,\nI have finished Law and Medicine,\nAnd sadly even Theology:\nTaken fierce pains, from end to end.\nNow here I am, a fool for sure!\nNo wiser than I was before:'
In [52]: tf = TfidfVectorizer()
In [53]: tf.fit_transform(smallcorp.split('\n'))
Out[53]:
<6x28 sparse matrix of type '<type 'numpy.float64'>'
with 31 stored elements in Compressed Sparse Row format>
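To see why a bare string produces an empty vocabulary, here is a minimal sketch that uses only the standard library. It assumes the vectorizer's default token pattern (tokens of two or more word characters) and mimics the tokenization step; tokens_per_document is a hypothetical helper for illustration, not a scikit-learn function.

```python
import re

# Default token pattern of scikit-learn's text vectorizers:
# only tokens made of two or more word characters are kept.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokens_per_document(raw_documents):
    """Tokenize every item obtained by iterating over raw_documents."""
    return [TOKEN_PATTERN.findall(doc.lower()) for doc in raw_documents]


smallcorp = "Ah! Now I have done Philosophy,\nI have finished Law and Medicine,"

# Iterating over a string yields single characters, so nothing matches.
vocab_from_string = {t for doc in tokens_per_document(smallcorp) for t in doc}

# Iterating over a list of lines yields real documents.
lines = smallcorp.split("\n")
vocab_from_lines = {t for doc in tokens_per_document(lines) for t in doc}

print(len(vocab_from_string))   # 0
print(sorted(vocab_from_lines))
```

Recent scikit-learn releases reject a bare string up front with a clearer error message, but the fix is the same: pass a list (or other iterable) of documents.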
Answer 2:
In version 0.12, we set the minimum document frequency to 2, which means that only words appearing in at least two documents are considered. For your example to work, you need to set min_df=1. Since 0.13, this is the default setting.
So I guess you are using 0.12, right?
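A short sketch of what min_df does, assuming a reasonably recent scikit-learn is installed. The two sample documents are made up for illustration. Setting min_df explicitly makes the behaviour independent of the installed version's default.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two made-up documents; only "education" occurs in both.
docs = [
    "due to lack of personal education",
    "biggest education and husband",
]

# min_df=1 keeps every term that occurs in at least one document.
keep_all = TfidfVectorizer(min_df=1).fit(docs)

# min_df=2 keeps only terms that occur in at least two documents.
keep_shared = TfidfVectorizer(min_df=2).fit(docs)

print(sorted(keep_all.vocabulary_))
print(sorted(keep_shared.vocabulary_))  # ['education']
```

With a corpus of a single document, no term can reach a document frequency of 2, so min_df=2 would leave nothing to keep.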
Answer 3:
Alternatively, if you insist on keeping a single string, you can wrap it in a tuple. Instead of:
smallcorp = "your text"
put it in a one-element tuple:
In [22]: smallcorp = ("your text",)
In [23]: tf.fit_transform(smallcorp)
Out[23]:
<1x2 sparse matrix of type '<type 'numpy.float64'>'
with 2 stored elements in Compressed Sparse Row format>
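Both the list and the tuple approach come down to handing the vectorizer an iterable of documents rather than a bare string. As a minimal sketch, a small helper can normalize the input before it reaches fit_transform; as_documents is a hypothetical name, not part of scikit-learn.

```python
def as_documents(corpus):
    """Return corpus as a list of documents (hypothetical helper).

    A bare string becomes a one-document list; any other iterable
    of strings is copied into a list unchanged.
    """
    if isinstance(corpus, str):
        return [corpus]
    return list(corpus)


print(as_documents("your text"))
print(as_documents(("first doc", "second doc")))
```

Whether one long string should be one document or be split into lines is a modelling choice; this helper only covers the one-document case.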
Answer 4:
I encountered a similar error while running a Python 3 TF-IDF script over a large corpus. Some small files (apparently) lacked keywords, which raised the error.
I tried several workarounds that did not help, such as adding dummy strings to my filtered list if len(filtered) == 0. The simplest solution was to wrap the call in a try: ... except ValueError: continue block.
pattern = "(?u)\\b[\\w-]+\\b"
cv = CountVectorizer(token_pattern=pattern)
# filtered is a list
filtered = [w for w in filtered if w not in my_stopwords and not w.isdigit()]
# ValueError:
# cv.fit(text)
# File "tfidf-sklearn.py", line 1675, in tfidf
# cv.fit(filtered)
# File "/home/victoria/venv/py37/lib/python3.7/site-packages/sklearn/feature_extraction/text.py", line 1024, in fit
# self.fit_transform(raw_documents)
# ...
# ValueError: empty vocabulary; perhaps the documents only contain stop words
# Did not help:
# https://stackoverflow.com/a/20933883/1904943
#
# if len(filtered) == 0:
# filtered = ['xxx', 'yyy', 'zzz']
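# Solution (this sits inside the per-file loop, hence "continue"):
try:
    # fit_transform learns the vocabulary and builds the matrix in one step
    doc_freq_term_matrix = cv.fit_transform(filtered)
except ValueError:
    # Nothing left to vectorize for this file; skip it
    continue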
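For context, here is a self-contained sketch of the same idea with the surrounding loop included. It assumes a recent scikit-learn. The files dictionary, the stop-word set, and the names matrices and skipped are made up for illustration and are not from the original script.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up data: file name -> list of words already split out of the file.
my_stopwords = {"the", "and"}
files = {
    "a.txt": ["the", "cat", "sat"],
    "b.txt": ["the", "and", "42"],  # nothing survives filtering
}

pattern = r"(?u)\b[\w-]+\b"
matrices = {}   # file name -> document-term matrix
skipped = []    # files that produced an empty vocabulary

for name, words in files.items():
    filtered = [w for w in words if w not in my_stopwords and not w.isdigit()]
    cv = CountVectorizer(token_pattern=pattern)
    try:
        matrices[name] = cv.fit_transform(filtered)
    except ValueError:
        # Raised when no vocabulary can be built from this file.
        skipped.append(name)
        continue

print(sorted(matrices))
print(skipped)
```

Recording the skipped files instead of silently continuing makes it easier to check later that only the expected empty files were dropped, since except ValueError can also catch unrelated errors.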
Source: https://stackoverflow.com/questions/20928769/python-tfidfvectorizer-throwing-empty-vocabulary-perhaps-the-documents-only-c