How to select some rows from a sparse matrix and use them to form a new sparse matrix

Question

I have a very large sparse matrix (100000 columns and 100000 rows). I want to select some of its rows and use them to form a new sparse matrix. I first tried converting the rows to a dense matrix and then converting the result back to a sparse matrix, but when I do this Python raises a MemoryError. Then I tried another method: I select the rows of the sparse matrix and put them into a list. But when I try to convert this list to a sparse matrix, I get: 'ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().' So how can I transform this list of sparse matrices into a single big sparse matrix?

import numpy as np
import scipy.sparse

# X_train is a 100000x100000 matrix stored in sparse form
# y_train is a 1-dimensional array of length 100000
# I try to build a new sparse matrix from some rows of X_train; the
# selection criterion is that the sum of the sparse row == 0

y_train_new = []
X_train_new = []
for i in range(len(y_train)):
    if np.sum(X_train[i].toarray()[0]) == 0:
        X_train_new.append(X_train[i])
        y_train_new.append(y_train[i])

And when I do:

X_train_new = scipy.sparse.csr_matrix(X_train_new)

I get the error message:

'ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().'

Answer 1:

I added some tags that would have helped me see your question sooner.

When asking about an error, it's a good idea to provide some or all of the traceback, so we can see where the error is occurring. Information on the inputs to the problem function call can also help.

Fortunately I can recreate the problem fairly easily, and in a reasonably sized example. No need to make a 100000 x 100000 matrix that no one can look at!

Make a modest-size sparse matrix (after import numpy as np and from scipy import sparse):

In [126]: M = sparse.random(10,10,.1,'csr')                                                              
In [127]: M                                                                                              
Out[127]: 
<10x10 sparse matrix of type '<class 'numpy.float64'>'
    with 10 stored elements in Compressed Sparse Row format>

I can do a whole matrix row sum, just as with a dense array. The sparse code actually uses matrix-vector multiplication to do this, producing a dense matrix.

In [128]: M.sum(axis=1)                                                                                  
Out[128]: 
matrix([[0.59659958],
        [0.80390719],
        [0.37251645],
        [0.        ],
        [0.85766909],
        [0.42267366],
        [0.76794737],
        [0.        ],
        [0.83131054],
        [0.46254634]])

It's sparse enough that some rows have no nonzero values. With floats, especially in the 0-1 range, I'm not going to get rows where the nonzero values cancel out, so here a zero row sum means an empty row.
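In general, a zero row sum and an empty row are not the same thing. The following is a minimal sketch of the difference, using a small made-up matrix (not the M above) in which one row holds values that cancel. The variable names are only illustrative, and counting stored entries with np.diff on indptr assumes CSR format with no explicitly stored zeros (calling eliminate_zeros() first would drop those).

```python
import numpy as np
from scipy import sparse

# Hypothetical 3x3 example: row 0 has values that cancel, row 1 is empty.
M_demo = sparse.csr_matrix(np.array([
    [1.0, -1.0, 0.0],
    [0.0,  0.0, 0.0],
    [0.0,  2.0, 0.0],
]))

# Row sums as a flat 1-D array (M_demo.sum(axis=1) returns a 2-D np.matrix).
row_sums = np.asarray(M_demo.sum(axis=1)).ravel()

# Number of stored entries per row, read from the CSR row pointer.
stored_per_row = np.diff(M_demo.indptr)

sum_is_zero = np.flatnonzero(row_sums == 0)      # rows whose values add up to 0
is_empty = np.flatnonzero(stored_per_row == 0)   # rows with nothing stored

print(sum_is_zero, is_empty)
```

If the real criterion is "the row has no entries", the stored-entry count is the more direct test; if it really is "the row sums to zero", the row sum is the right one.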

Or using your row by row calculation:

In [133]: alist = [np.sum(row.toarray()[0]) for row in M]                                                
In [134]: alist                                                                                          
Out[134]: 
[0.5965995802776853,
 0.8039071870427961,
 0.37251644566924424,
 0.0,
 0.8576690924353791,
 0.42267365715276595,
 0.7679473651419432,
 0.0,
 0.8313105376003095,
 0.4625463360625408]

And selecting the rows that do sum to zero (in this case empty ones):

In [135]: alist = [row for row in M if np.sum(row.toarray()[0])==0]                                      
In [136]: alist                                                                                          
Out[136]: 
[<1x10 sparse matrix of type '<class 'numpy.float64'>'
    with 0 stored elements in Compressed Sparse Row format>,
 <1x10 sparse matrix of type '<class 'numpy.float64'>'
    with 0 stored elements in Compressed Sparse Row format>]

Note that this is a list of sparse matrices. That's what you got too, right?

Now if I try to make a matrix from that, I get your error:

In [137]: sparse.csr_matrix(alist)                                                                       
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-137-5e20e6fc2524> in <module>
----> 1 sparse.csr_matrix(alist)

/usr/local/lib/python3.6/dist-packages/scipy/sparse/compressed.py in __init__(self, arg1, shape, dtype, copy)
     86                                  "".format(self.format))
     87             from .coo import coo_matrix
---> 88             self._set_self(self.__class__(coo_matrix(arg1, dtype=dtype)))
     89 
     90         # Read matrix dimensions given, if any

/usr/local/lib/python3.6/dist-packages/scipy/sparse/coo.py in __init__(self, arg1, shape, dtype, copy)
    189                                          (shape, self._shape))
    190 
--> 191                 self.row, self.col = M.nonzero()
    192                 self.data = M[self.row, self.col]
    193                 self.has_canonical_format = True

/usr/local/lib/python3.6/dist-packages/scipy/sparse/base.py in __bool__(self)
    285             return self.nnz != 0
    286         else:
--> 287             raise ValueError("The truth value of an array with more than one "
    288                              "element is ambiguous. Use a.any() or a.all().")
    289     __nonzero__ = __bool__

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().

OK, this error doesn't tell me a whole lot (at least without more reading of the code), but it's clearly having problems with the input list. But read the csr_matrix docs again! Do they say we can give it a list of sparse matrices?

But there is a sparse.vstack function that works with a list of matrices (modeled on np.vstack):

In [140]: sparse.vstack(alist)                                                                           
Out[140]: 
<2x10 sparse matrix of type '<class 'numpy.float64'>'
    with 0 stored elements in Compressed Sparse Row format>
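Applied to the loop in the question, this might look like the sketch below. The small X_train and y_train here are made-up stand-ins for the real data, and X_rows is a name I introduced for the intermediate list; only the final sparse.vstack call differs from the original approach.

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-ins for the question's X_train / y_train.
X_train = sparse.csr_matrix(np.array([
    [1.0, 0.0, 2.0],
    [0.0, 0.0, 0.0],
    [0.0, 3.0, 0.0],
    [0.0, 0.0, 0.0],
]))
y_train = np.array([10, 11, 12, 13])

X_rows = []        # list of 1 x n sparse matrices
y_train_new = []
for i in range(X_train.shape[0]):
    row = X_train[i]
    if row.sum() == 0:          # sum the sparse row directly, no toarray()
        X_rows.append(row)
        y_train_new.append(y_train[i])

# Stack the selected rows into one sparse matrix.
# Note: sparse.vstack raises an error if X_rows is empty.
X_train_new = sparse.vstack(X_rows, format="csr")
y_train_new = np.array(y_train_new)

print(X_train_new.shape, y_train_new)
```

Iterating row by row is still slow for 100000 rows, but it never builds a dense copy of the whole matrix.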

We get more interesting results if we select the rows that don't sum to zero:

In [141]: alist = [row for row in M if np.sum(row.toarray()[0])!=0]                                      
In [142]: M1=sparse.vstack(alist)                                                                        
In [143]: M1                                                                                             
Out[143]: 
<8x10 sparse matrix of type '<class 'numpy.float64'>'
    with 10 stored elements in Compressed Sparse Row format>

But I showed before that we can get the row sums without iterating. Applying where to Out[128], I get the row indices (of the nonzero rows):

In [151]: idx=np.where(M.sum(axis=1))                                                                    
In [152]: idx                                                                                            
Out[152]: (array([0, 1, 2, 4, 5, 6, 8, 9]), array([0, 0, 0, 0, 0, 0, 0, 0]))
In [153]: M2=M[idx[0],:]                                                                                 
In [154]: M2                                                                                             
Out[154]: 
<8x10 sparse matrix of type '<class 'numpy.float64'>'
    with 10 stored elements in Compressed Sparse Row format>
In [155]: np.allclose(M1.A, M2.A)                                                                        
Out[155]: True
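For the question's case (keep the rows that do sum to zero, and keep y_train aligned with them), the same idea can be sketched without any Python loop. The data below is a made-up stand-in, and keep is a name I chose for the index array; the only assumption about the real data is that X_train is in CSR format and y_train is a 1-D numpy array.

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-ins for the question's X_train / y_train.
X_train = sparse.csr_matrix(np.array([
    [0.0, 0.0, 0.0],
    [4.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
    [0.0, 5.0, 6.0],
    [0.0, 0.0, 0.0],
]))
y_train = np.array([0, 1, 2, 3, 4])

# sum(axis=1) returns an (n, 1) np.matrix; flatten it to a 1-D array.
row_sums = np.asarray(X_train.sum(axis=1)).ravel()

# Integer indices of the rows whose sum is zero.
keep = np.flatnonzero(row_sums == 0)

# Row indexing with an index array stays sparse throughout.
X_train_new = X_train[keep]
y_train_new = y_train[keep]

print(X_train_new.shape, y_train_new)
```

Using one index array for both X_train and y_train keeps the labels matched to the selected rows.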

====

I suspect the In[137] error was produced while trying to find the nonzero elements (as np.where does) of the input, or of the input cast as a numpy array:

In [159]: alist = [row for row in M if np.sum(row.toarray()[0])==0]                                      
In [160]: np.array(alist)                                                                                
Out[160]: 
array([<1x10 sparse matrix of type '<class 'numpy.float64'>'
    with 0 stored elements in Compressed Sparse Row format>,
       <1x10 sparse matrix of type '<class 'numpy.float64'>'
    with 0 stored elements in Compressed Sparse Row format>], dtype=object)
In [161]: np.array(alist).nonzero()                                                                      
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-161-832a25987c15> in <module>
----> 1 np.array(alist).nonzero()

/usr/local/lib/python3.6/dist-packages/scipy/sparse/base.py in __bool__(self)
    285             return self.nnz != 0
    286         else:
--> 287             raise ValueError("The truth value of an array with more than one "
    288                              "element is ambiguous. Use a.any() or a.all().")
    289     __nonzero__ = __bool__

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().

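np.array on a list of sparse matrices produces an object dtype array of those matrices. Calling nonzero() on it then asks each sparse matrix for its truth value, and the __bool__ method shown in the traceback raises this error for any matrix that is not 1x1.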

Source: https://stackoverflow.com/questions/56319794/how-to-select-some-rows-from-sparse-matrix-then-use-them-form-a-new-sparse-matri

Tags

python

numpy

scipy

sparse-matrix