Dealing with a large number of unique words for text processing/tf-idf etc


Question


I am using scikit-learn for some text processing, such as tf-idf. The number of filenames (~40k) is handled fine, but with this many unique words I cannot work with the resulting array/matrix: printing its size fails, and so does dumping it to a file with savetxt. The traceback is below. Ideally I would only keep the top tf-idf values, since I don't need a score for every single word in every single document. Alternatively, I could exclude an extra set of words from the calculation (not stop words, but a separate word list kept in a text file), though I don't know whether removing them would be enough to fix this. Finally, if I could somehow grab pieces of the matrix, that could work too. Any example of dealing with this kind of thing would be helpful and give me some starting points. (PS: I looked at and tried HashingVectorizer, but it doesn't seem that I can do tf-idf with it?)

Traceback (most recent call last):
  File "/sklearn.py", line 40, in <module>
    array = X.toarray()
  File "/home/kba/anaconda/lib/python2.7/site-packages/scipy/sparse/compressed.py", line 790, in toarray
    return self.tocoo(copy=False).toarray(order=order, out=out)
  File "/home/kba/anaconda/lib/python2.7/site-packages/scipy/sparse/coo.py", line 239, in toarray
    B = self._process_toarray_args(order, out)
  File "/home/kba/anaconda/lib/python2.7/site-packages/scipy/sparse/base.py", line 699, in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)
ValueError: array is too big.

Relevant code:

path = "/home/files/"

fh = open('output.txt','w')


filenames = os.listdir(path)

filenames.sort()

try:
    filenames.remove('.DS_Store')
except ValueError:
    pass # or scream: thing not in some_list!
except AttributeError:
    pass # call security, some_list not quacking like a list!

vectorizer = CountVectorizer(input='filename', analyzer='word', strip_accents='unicode', stop_words='english') 
X=vectorizer.fit_transform(filenames)
fh.write(str(vectorizer.vocabulary_))

array = X.toarray()
print array.size
print array.shape
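For reference, the exclusion I had in mind would look roughly like this; exclude.txt is a placeholder name, and extending the built-in English list is just one option, since CountVectorizer's stop_words parameter also accepts a plain list:

from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# exclude.txt is a made-up name: one word per line to drop from the vocabulary
with open('exclude.txt') as f:
    extra_words = set(line.strip().lower() for line in f if line.strip())

# stop_words accepts an arbitrary list, so the built-in English stop words
# can be combined with the custom exclusion list
custom_stop_words = list(ENGLISH_STOP_WORDS | extra_words)

vectorizer = CountVectorizer(input='filename', analyzer='word',
                             strip_accents='unicode',
                             stop_words=custom_stop_words)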

Edit: In case this helps,

print 'Array is: ' + str(X.get_shape()[0]) + ' by ' + str(X.get_shape()[1]) + ' matrix.'

This prints the dimensions of the too-large sparse matrix, which in my case are:

Array is: 39436 by 113214 matrix.

Answer 1:


The traceback holds the answer here: when you call X.toarray() at the end, it converts the sparse matrix representation to a dense one. Instead of storing an entry only for the words that actually occur in each document, you are now storing a value for every word in the vocabulary for every document: 39436 × 113214 ≈ 4.5 billion entries, which at 8 bytes each is roughly 35 GB, hence the "array is too big" error from np.zeros.

Thankfully, most operations work with sparse matrices, or have sparse variants. Just avoid calling .toarray() or .todense() and you'll be good to go.
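For the "top tf-idf values only" part of the question, here is a minimal sketch that stays sparse the whole way; TfidfTransformer, the top_k cutoff and the printed output are my choices, not something prescribed by scikit-learn:

import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer().fit_transform(X)     # still a scipy.sparse CSR matrix
feature_names = vectorizer.get_feature_names()  # get_feature_names_out() in newer scikit-learn
top_k = 20                                      # assumed cutoff, tune to taste

for i in range(tfidf.shape[0]):
    row = tfidf.getrow(i)                       # 1 x n_features sparse row
    if row.nnz == 0:
        continue
    # argsort only over the stored (non-zero) values, never the full vocabulary
    order = np.argsort(row.data)[::-1][:top_k]
    top_terms = [(feature_names[row.indices[j]], row.data[j]) for j in order]
    print filenames[i], top_terms

The same chaining also works after HashingVectorizer, since TfidfTransformer only needs a sparse count matrix as input, so hashing does not rule out tf-idf.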

For more information, check out the scipy sparse matrix documentation.
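If you want to slice out pieces of the matrix or save it without going dense, a minimal sketch follows; the file name counts.mtx is a placeholder, and Matrix Market is just one convenient on-disk format for sparse matrices:

from scipy import io

first_1000_docs = X[:1000, :]     # row slicing keeps the result sparse
print first_1000_docs.shape
print first_1000_docs.nnz         # number of stored (non-zero) entries

io.mmwrite('counts.mtx', X)       # write the sparse matrix to disk
X_back = io.mmread('counts.mtx').tocsr()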



Source: https://stackoverflow.com/questions/19920808/dealing-with-a-large-amount-of-unique-words-for-text-processing-tf-idf-etc
