Iterating through a scipy.sparse vector (or matrix)

前端未结

关注

 6  741

I\'m wondering what the best way is to iterate nonzero entries of sparse matrices with scipy.sparse. For example, if I do the following:

from scipy.sparse im


                      
              相关标签:


      
      
        
          6条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  逝去的感伤        
                
              
                            
                2020-11-30 22:52
              
            
            
                                                                       
tocoo() materializes the entire matrix into a different structure, which is not the preferred MO for python 3. You can also consider this iterator, which is especially useful for large matrices.



from itertools import chain, repeat
def iter_csr(matrix):
  for (row, col, val) in zip(
    chain(*(
          repeat(i, r)
          for (i,r) in enumerate(comparisons.indptr[1:] - comparisons.indptr[:-1])
    )),
    matrix.indices,
    matrix.data
  ):
    yield (row, col, val)


I have to admit that I'm using a lot of python-constructs which possibly should be replaced by numpy-constructs (especially enumerate).

NB:

In [43]: t=time.time(); sum(1 for x in rather_dense_sparse_matrix.data); print(time.time()-t)
52.48686504364014
In [44]: t=time.time(); sum(1 for x in enumerate(rather_dense_sparse_matrix.data)); print(time.time()-t)
70.19013023376465
In [45]: rather_dense_sparse_matrix
<99829x99829 sparse matrix of type '<class 'numpy.float16'>'
with 757622819 stored elements in Compressed Sparse Row format>


So yes, enumerate is somewhat slow(ish)

For the iterator:

In [47]: it = iter_csr(rather_dense_sparse_matrix)
In [48]: t=time.time(); sum(1 for x in it); print(time.time()-t)
113.something something


So you decide whether this overhead is acceptable, in my case the tocoo caused MemoryOverflows's.

IMHO: such an iterator should be part of the csr_matrix interface, similar to items() in a dict() :)
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  爱一瞬间的悲伤        
                
              
                            
                2020-11-30 22:55
              
            
            
                                                                       
Edit: bbtrb's method (using coo_matrix) is much faster than my original suggestion, using nonzero. Sven Marnach's suggestion to use itertools.izip also improves the speed. Current fastest is using_tocoo_izip:

import scipy.sparse
import random
import itertools

def using_nonzero(x):
    rows,cols = x.nonzero()
    for row,col in zip(rows,cols):
        ((row,col), x[row,col])

def using_coo(x):
    cx = scipy.sparse.coo_matrix(x)    
    for i,j,v in zip(cx.row, cx.col, cx.data):
        (i,j,v)

def using_tocoo(x):
    cx = x.tocoo()    
    for i,j,v in zip(cx.row, cx.col, cx.data):
        (i,j,v)

def using_tocoo_izip(x):
    cx = x.tocoo()    
    for i,j,v in itertools.izip(cx.row, cx.col, cx.data):
        (i,j,v)

N=200
x = scipy.sparse.lil_matrix( (N,N) )
for _ in xrange(N):
    x[random.randint(0,N-1),random.randint(0,N-1)]=random.randint(1,100)


yields these timeit results:

% python -mtimeit -s'import test' 'test.using_tocoo_izip(test.x)'
1000 loops, best of 3: 670 usec per loop
% python -mtimeit -s'import test' 'test.using_tocoo(test.x)'
1000 loops, best of 3: 706 usec per loop
% python -mtimeit -s'import test' 'test.using_coo(test.x)'
1000 loops, best of 3: 802 usec per loop
% python -mtimeit -s'import test' 'test.using_nonzero(test.x)'
100 loops, best of 3: 5.25 msec per loop

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  花落未央        
                
              
                            
                2020-11-30 22:59
              
            
            
                                                                       
The fastest way should be by converting to a coo_matrix:

cx = scipy.sparse.coo_matrix(x)

for i,j,v in zip(cx.row, cx.col, cx.data):
    print "(%d, %d), %s" % (i,j,v)

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  深忆病人        
                
              
                            
                2020-11-30 22:59
              
            
            
                                                                       
To loop a variety of sparse matrices from the scipy.sparse code section I would use this small wrapper function (note that for Python-2 you are encouraged to use xrange and izip for better performance on large matrices):

from scipy.sparse import *
def iter_spmatrix(matrix):
    """ Iterator for iterating the elements in a ``scipy.sparse.*_matrix`` 

    This will always return:
    >>> (row, column, matrix-element)

    Currently this can iterate `coo`, `csc`, `lil` and `csr`, others may easily be added.

    Parameters
    ----------
    matrix : ``scipy.sparse.sp_matrix``
      the sparse matrix to iterate non-zero elements
    """
    if isspmatrix_coo(matrix):
        for r, c, m in zip(matrix.row, matrix.col, matrix.data):
            yield r, c, m

    elif isspmatrix_csc(matrix):
        for c in range(matrix.shape[1]):
            for ind in range(matrix.indptr[c], matrix.indptr[c+1]):
                yield matrix.indices[ind], c, matrix.data[ind]

    elif isspmatrix_csr(matrix):
        for r in range(matrix.shape[0]):
            for ind in range(matrix.indptr[r], matrix.indptr[r+1]):
                yield r, matrix.indices[ind], matrix.data[ind]

    elif isspmatrix_lil(matrix):
        for r in range(matrix.shape[0]):
            for c, d in zip(matrix.rows[r], matrix.data[r]):
                yield r, c, d

    else:
        raise NotImplementedError("The iterator for this sparse matrix has not been implemented")

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  抹茶落季        
                
              
                            
                2020-11-30 22:59
              
            
            
                                                                       
Try filter(lambda x:x, x) instead of x.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  自闭症患者        
                
              
                            
                2020-11-30 23:11
              
            
            
                                                                       
I had the same problem and actually, if your concern is only speed, the fastest way (more than 1 order of magnitude faster) is to convert the sparse matrix to a dense one (x.todense()), and iterating over the nonzero elements in the dense matrix. (Though, of course, this approach requires a lot more memory)
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复