Iterating through a scipy.sparse vector (or matrix)

你的背包 2020-11-30 22:31

I'm wondering what the best way is to iterate over the nonzero entries of sparse matrices with scipy.sparse. For example, if I do the following:

from scipy.sparse im         


        
6 answers
  • 2020-11-30 22:52

    tocoo() materializes the entire matrix in a second structure, which is not the preferred approach in Python 3. You can also consider this iterator, which is especially useful for large matrices.

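    from itertools import chain, repeat

    def iter_csr(matrix):
      # Stored entries per row: differences of consecutive indptr values.
      counts = matrix.indptr[1:] - matrix.indptr[:-1]
      # Lazily repeat each row index once per stored entry in that row.
      rows = chain.from_iterable(repeat(i, r) for (i, r) in enumerate(counts))
      for (row, col, val) in zip(rows, matrix.indices, matrix.data):
        yield (row, col, val)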
    

    I have to admit that I'm using a lot of Python constructs that possibly should be replaced by NumPy constructs (especially enumerate).

    NB:

    In [43]: t=time.time(); sum(1 for x in rather_dense_sparse_matrix.data); print(time.time()-t)
    52.48686504364014
    In [44]: t=time.time(); sum(1 for x in enumerate(rather_dense_sparse_matrix.data)); print(time.time()-t)
    70.19013023376465
    In [45]: rather_dense_sparse_matrix
    <99829x99829 sparse matrix of type '<class 'numpy.float16'>'
    with 757622819 stored elements in Compressed Sparse Row format>
    

    So yes, enumerate is somewhat slow(ish)

    For the iterator:

    In [47]: it = iter_csr(rather_dense_sparse_matrix)
    In [48]: t=time.time(); sum(1 for x in it); print(time.time()-t)
    113.something something
    

    So you decide whether this overhead is acceptable; in my case, tocoo() caused memory overflows.

    IMHO: such an iterator should be part of the csr_matrix interface, similar to items() in a dict() :)
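
    As a possible variation on the iterator above, the per-entry Python work can be reduced by slicing one row at a time and handing back NumPy views. This is only a sketch: the name iter_csr_rows and the tiny test matrix are made up for illustration, and it assumes the input is already in CSR format.

```python
import numpy as np
from scipy.sparse import csr_matrix


def iter_csr_rows(matrix):
    """Yield (row, cols, vals) for each non-empty row of a CSR matrix.

    cols and vals are slices (views) of matrix.indices and matrix.data,
    so the matrix is not copied into another sparse format.
    """
    indptr = matrix.indptr
    for row in range(matrix.shape[0]):
        start, end = indptr[row], indptr[row + 1]
        if start == end:
            continue  # this row has no stored entries
        yield row, matrix.indices[start:end], matrix.data[start:end]


# Small illustrative matrix (hypothetical data, not from the answer above).
m = csr_matrix(np.array([[0, 2, 0], [0, 0, 0], [4, 0, 5]]))
triples = [
    (row, int(c), int(v))
    for row, cols, vals in iter_csr_rows(m)
    for c, v in zip(cols, vals)
]
print(triples)
```

    Yielding whole rows lets the caller decide whether to loop over individual entries or to process each row with vectorized NumPy operations.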

  • 2020-11-30 22:55

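    Edit: bbtrb's method (using coo_matrix) is much faster than my original suggestion, using nonzero. Sven Marnach's suggestion to use itertools.izip also improves the speed. (itertools.izip and xrange exist only in Python 2; in Python 3 the built-in zip and range are already lazy.) Current fastest is using_tocoo_izip: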

    import scipy.sparse
    import random
    import itertools
    
    def using_nonzero(x):
        rows,cols = x.nonzero()
        for row,col in zip(rows,cols):
            ((row,col), x[row,col])
    
    def using_coo(x):
        cx = scipy.sparse.coo_matrix(x)    
        for i,j,v in zip(cx.row, cx.col, cx.data):
            (i,j,v)
    
    def using_tocoo(x):
        cx = x.tocoo()    
        for i,j,v in zip(cx.row, cx.col, cx.data):
            (i,j,v)
    
    def using_tocoo_izip(x):
        cx = x.tocoo()    
        for i,j,v in itertools.izip(cx.row, cx.col, cx.data):
            (i,j,v)
    
    N=200
    x = scipy.sparse.lil_matrix( (N,N) )
    for _ in xrange(N):
        x[random.randint(0,N-1),random.randint(0,N-1)]=random.randint(1,100)
    

    yields these timeit results:

    % python -mtimeit -s'import test' 'test.using_tocoo_izip(test.x)'
    1000 loops, best of 3: 670 usec per loop
    % python -mtimeit -s'import test' 'test.using_tocoo(test.x)'
    1000 loops, best of 3: 706 usec per loop
    % python -mtimeit -s'import test' 'test.using_coo(test.x)'
    1000 loops, best of 3: 802 usec per loop
    % python -mtimeit -s'import test' 'test.using_nonzero(test.x)'
    100 loops, best of 3: 5.25 msec per loop
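
    For readers on Python 3, the comparison could be rerun roughly as follows. This is a sketch, not the original test module: the functions return lists so the results can be checked against each other, the fixed random seed and the repeat count of 20 are arbitrary choices, and the absolute timings will differ from the numbers quoted above.

```python
import random
import timeit

import scipy.sparse


def using_nonzero(x):
    # Look up every nonzero position individually (slow path).
    rows, cols = x.nonzero()
    return [((row, col), x[row, col]) for row, col in zip(rows, cols)]


def using_tocoo(x):
    # Convert once to COO, then walk the three parallel arrays.
    cx = x.tocoo()
    return [((i, j), v) for i, j, v in zip(cx.row, cx.col, cx.data)]


N = 200
random.seed(0)  # arbitrary seed, only to make runs repeatable
x = scipy.sparse.lil_matrix((N, N))
for _ in range(N):
    x[random.randint(0, N - 1), random.randint(0, N - 1)] = random.randint(1, 100)

for fn in (using_tocoo, using_nonzero):
    seconds = timeit.timeit(lambda: fn(x), number=20)
    print(f"{fn.__name__}: {seconds / 20 * 1e6:.0f} usec per call")
```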
    
  • 2020-11-30 22:59

    The fastest way should be by converting to a coo_matrix:

    import scipy.sparse

    # x is the sparse matrix from the question
    cx = scipy.sparse.coo_matrix(x)

    for i, j, v in zip(cx.row, cx.col, cx.data):
        print("(%d, %d), %s" % (i, j, v))
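
    To make the shape of the result concrete, here is a minimal worked example on a 2x3 matrix. The matrix values are invented for illustration; the index and value types are converted to plain Python ints only to keep the printed output tidy.

```python
import numpy as np
import scipy.sparse

# Hypothetical input: a small CSR matrix with three stored entries.
x = scipy.sparse.csr_matrix(np.array([[0, 1, 0], [2, 0, 3]]))

# COO keeps three parallel arrays: row indices, column indices, values.
cx = scipy.sparse.coo_matrix(x)
entries = [(int(i), int(j), int(v)) for i, j, v in zip(cx.row, cx.col, cx.data)]
print(entries)
```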
    
  • 2020-11-30 22:59

    To loop over a variety of sparse matrix formats from the scipy.sparse module, I would use this small wrapper function (note that on Python 2 you are encouraged to use xrange and izip for better performance on large matrices):

    from scipy.sparse import (isspmatrix_coo, isspmatrix_csc,
                              isspmatrix_csr, isspmatrix_lil)

    def iter_spmatrix(matrix):
        """Iterate over the stored elements of a ``scipy.sparse.*_matrix``.

        Always yields tuples of the form:
            (row, column, matrix-element)

        Currently this can iterate `coo`, `csc`, `lil` and `csr`; others may easily be added.

        Parameters
        ----------
        matrix : ``scipy.sparse.spmatrix``
          the sparse matrix whose non-zero elements are iterated
        """
        if isspmatrix_coo(matrix):
            for r, c, m in zip(matrix.row, matrix.col, matrix.data):
                yield r, c, m
    
        elif isspmatrix_csc(matrix):
            for c in range(matrix.shape[1]):
                for ind in range(matrix.indptr[c], matrix.indptr[c+1]):
                    yield matrix.indices[ind], c, matrix.data[ind]
    
        elif isspmatrix_csr(matrix):
            for r in range(matrix.shape[0]):
                for ind in range(matrix.indptr[r], matrix.indptr[r+1]):
                    yield r, matrix.indices[ind], matrix.data[ind]
    
        elif isspmatrix_lil(matrix):
            for r in range(matrix.shape[0]):
                for c, d in zip(matrix.rows[r], matrix.data[r]):
                    yield r, c, d
    
        else:
            raise NotImplementedError("The iterator for this sparse matrix has not been implemented")
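
    The final branch rejects every other format. One way to cover formats such as dok without writing a dedicated loop might be to fall back to a COO conversion, at the cost of the extra copy discussed in another answer. The helper name iter_via_coo below is hypothetical and not part of scipy or of the function above.

```python
from scipy.sparse import dok_matrix


def iter_via_coo(matrix):
    """Hypothetical fallback: convert any scipy.sparse matrix to COO
    and yield (row, column, value) for each stored entry."""
    cx = matrix.tocoo()  # makes a copy in COO format
    for r, c, v in zip(cx.row, cx.col, cx.data):
        yield int(r), int(c), v


# Illustrative DOK matrix with two entries (made-up values).
d = dok_matrix((3, 3))
d[0, 1] = 5.0
d[2, 0] = 7.0
print(sorted(iter_via_coo(d)))
```

    Whether the copy is acceptable depends on the matrix size; for very large matrices a format-specific loop like the ones above avoids it.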
    
  • 2020-11-30 22:59

    Try filter(lambda x:x, x) instead of x.

  • 2020-11-30 23:11

    I had the same problem, and if your only concern is speed, the fastest way (more than one order of magnitude faster) is to convert the sparse matrix to a dense one (x.todense()) and iterate over the nonzero elements of the dense matrix. (Of course, this approach requires a lot more memory.)
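
    A minimal sketch of this dense route, assuming the matrix fits in memory. It uses toarray(), which returns a plain ndarray, rather than todense(), which returns a numpy.matrix; the small input matrix is invented for illustration.

```python
import numpy as np
import scipy.sparse

# Hypothetical input: a small CSR matrix with three stored entries.
x = scipy.sparse.csr_matrix(np.array([[0, 1, 0], [2, 0, 3]]))

dense = x.toarray()             # full 2-D ndarray, zeros included
rows, cols = np.nonzero(dense)  # positions of the nonzero elements
entries = [(int(i), int(j), int(dense[i, j])) for i, j in zip(rows, cols)]
print(entries)
```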
