Sparse Efficiency Warning while changing the column

Question

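def tdm_modify(feature_names, tdm):
    # Terms whose weights should be zeroed out in the term-document matrix.
    non_useful_words = ['kill', 'stampede', 'trigger', 'cause', 'death', 'hospital',
                        'minister', 'said', 'told', 'say', 'injury', 'victim', 'report']
    # Column index of each term; feature_names is assumed to be a list.
    indexes = [feature_names.index(word) for word in non_useful_words]
    for index in indexes:
        tdm[:, index] = 0  # this assignment triggers the warning on a csr_matrix
    return tdm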

I want to manually set zero weights for some terms in the tdm matrix. With the code above I get the following warning, and I don't understand why. Is there a better way to do this?

C:\Anaconda\lib\site-packages\scipy\sparse\compressed.py:730: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
  SparseEfficiencyWarning)

Answer 1:

First, this is a warning, not an error. With Python's default warning filter it is shown only once per location, so the next time you perform this action in the same session it runs without the message.

To me the message is clear:

Changing the sparsity structure of a csr_matrix is expensive. 
lil_matrix is more efficient.

tdm is a csr_matrix. Because of the way this format stores its data, it takes quite a bit of extra computation to set a bunch of elements to 0 (or, vice versa, to change them from 0). As the message says, the lil_matrix format is better if you need to make this sort of change frequently.

Try some time tests on sample matrices. tdm.tolil() converts the matrix to lil format.
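A timing test along these lines might look like the sketch below. The matrix size, density, and column indices are made up for illustration, and zero_columns_csr and zero_columns_lil are hypothetical helper names, not part of scipy.

```python
import timeit
import warnings

import numpy as np
from scipy import sparse
from scipy.sparse import SparseEfficiencyWarning

# Made-up sample data: a random 2000 x 500 csr matrix, about 1% non-zero.
sample = sparse.random(2000, 500, density=0.01, format="csr", random_state=0)
cols = [3, 42, 137]  # arbitrary columns to zero out


def zero_columns_csr(mat, cols):
    """Zero columns directly on a csr copy (the pattern that warns)."""
    out = mat.copy()
    with warnings.catch_warnings():
        # Keep the timing output clean; the work done is unchanged.
        warnings.simplefilter("ignore", SparseEfficiencyWarning)
        for c in cols:
            out[:, c] = 0
    return out


def zero_columns_lil(mat, cols):
    """Convert to lil, zero the columns, then convert back to csr."""
    out = mat.tolil()
    for c in cols:
        out[:, c] = 0
    return out.tocsr()


t_csr = timeit.timeit(lambda: zero_columns_csr(sample, cols), number=5)
t_lil = timeit.timeit(lambda: zero_columns_lil(sample, cols), number=5)
print(f"csr: {t_csr:.4f}s  lil: {t_lil:.4f}s")
```

The conversions to and from lil have a cost of their own, so which variant wins is likely to depend on the matrix size and on how many columns are changed.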

I could get into how the data is stored and why changing csr is less efficient than lil.

I'd suggest reviewing the sparse formats, and their respective pros and cons.

A simple way to think about it: csr (and csc) are designed for fast numerical calculations, especially matrix multiplication; they were developed for linear algebra problems. coo is a convenient way of defining sparse matrices. lil is a convenient way of building matrices incrementally.
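To make the storage point concrete, here is a small sketch of the three arrays behind a csr_matrix. The mask-based zeroing at the end is an alternative that is not part of the answer above: it only overwrites values that are already stored, so the sparsity structure is never extended. zero_columns_inplace is a hypothetical helper name.

```python
import numpy as np
from scipy.sparse import csr_matrix

m = csr_matrix(np.array([[1, 0, 2],
                         [0, 0, 3],
                         [4, 5, 6]]))

# csr keeps only the stored values, row by row.
print(m.data)     # [1 2 3 4 5 6]  stored values
print(m.indices)  # [0 2 2 0 1 2]  column index of each stored value
print(m.indptr)   # [0 2 3 6]      row i occupies data[indptr[i]:indptr[i + 1]]


def zero_columns_inplace(mat, cols):
    """Zero the given columns of a csr matrix by editing stored values only."""
    mask = np.isin(mat.indices, cols)  # stored entries that sit in those columns
    mat.data[mask] = 0
    mat.eliminate_zeros()              # drop the now-explicit zeros from storage
    return mat


zero_columns_inplace(m, [2])
print(m.toarray())
```

Because indices holds the column of every stored value, selecting a column is a single vectorized comparison. Assigning to a position that is not stored, by contrast, forces scipy to insert new entries into all three arrays, which is the expensive case the warning refers to.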

How are you constructing tdm initially?

In scipy test files (e.g. scipy/sparse/linalg/dsolve/tests/test_linsolve.py) I find code that does

import warnings
from scipy.sparse import (spdiags, SparseEfficiencyWarning, csc_matrix,
    csr_matrix, isspmatrix, dok_matrix, lil_matrix, bsr_matrix)
warnings.simplefilter('ignore',SparseEfficiencyWarning)

scipy/sparse/base.py defines the warning classes:

class SparseWarning(Warning):
    pass
class SparseFormatWarning(SparseWarning):
    pass
class SparseEfficiencyWarning(SparseWarning):
    pass

These warnings use the standard Python Warning class, so standard Python methods for controlling their expression apply.
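For instance, the warning can be silenced just around the assignment, or promoted to an exception while debugging. This is a minimal sketch using only the standard warnings module; the small matrix is made up for illustration.

```python
import warnings

import numpy as np
from scipy.sparse import csr_matrix, SparseEfficiencyWarning

m = csr_matrix(np.eye(3))

# Silence the warning only inside this block.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", SparseEfficiencyWarning)
    m[0, 1] = 5.0  # adds a new stored entry, which would normally warn

# Turn the warning into an exception to find out where it comes from.
raised = False
with warnings.catch_warnings():
    warnings.simplefilter("error", SparseEfficiencyWarning)
    try:
        m[1, 2] = 7.0
    except SparseEfficiencyWarning:
        raised = True
print(raised)
```

Using catch_warnings as a context manager restores the previous filters on exit, so the rest of the program still sees the warning.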

Answer 2:

I ran into this warning message as well while working on a machine learning problem. The exact application was constructing a document-term matrix from a corpus of text. I agree with the accepted answer. I will add one empirical observation:

My exact task was to build a 25000 x 90000 matrix of uint8. My desired output was a sparse matrix in compressed sparse row format, i.e. a csr_matrix.

The fastest way to do this by far, at the cost of using quite a bit more memory in the interim, was to initialize a dense matrix using np.zeros(), build it up, then do csr_matrix(dense_matrix) once at the end.

The second fastest way was to build up a lil_matrix, then convert it to csr_matrix with the .tocsr() method. This is recommended in the accepted answer. (Thank you hpaulj).

The slowest way was to assemble the csr_matrix element by element.

To sum up: if you have enough working memory to build a dense matrix, and only want to end up with a sparse matrix later on for downstream efficiency, it might be faster to build up the matrix in dense format and then convert it once at the end. If you need to work in sparse format the whole time because of memory limitations, building up the matrix as a lil_matrix and then converting it (as in the accepted answer) is faster than building up a csr_matrix from the start.
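The two faster routes can be sketched as follows, assuming a made-up list of (row, column, count) entries in place of a real corpus; the shape and values here are illustrative only.

```python
import numpy as np
from scipy.sparse import csr_matrix, lil_matrix

shape = (4, 6)
# Hypothetical (row, col, count) entries standing in for real term counts.
entries = [(0, 1, 2), (0, 4, 1), (2, 3, 5), (3, 0, 1), (3, 5, 3)]

# Route 1: fill a dense array, convert once at the end (needs the most memory).
dense = np.zeros(shape, dtype=np.uint8)
for r, c, v in entries:
    dense[r, c] = v
from_dense = csr_matrix(dense)

# Route 2: fill a lil_matrix incrementally, then convert with .tocsr().
lil = lil_matrix(shape, dtype=np.uint8)
for r, c, v in entries:
    lil[r, c] = v
from_lil = lil.tocsr()

# Both routes should yield the same csr matrix.
print(np.array_equal(from_dense.toarray(), from_lil.toarray()))
```

At this toy size the timing difference is not measurable; the sketch only shows the shape of each approach.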

Source: https://stackoverflow.com/questions/33091397/sparse-efficiency-warning-while-changing-the-column

Tags: python, numpy, scipy, nlp