Iterating through a scipy.sparse vector (or matrix)

Anonymous (unverified), submitted 2019-12-03 08:30:34

Question:

I'm wondering what the best way is to iterate nonzero entries of sparse matrices with scipy.sparse. For example, if I do the following:

    from scipy.sparse import lil_matrix

    x = lil_matrix((20, 1))
    x[13, 0] = 1
    x[15, 0] = 2

    c = 0
    for i in x:
        print(c, i)
        c = c + 1

the output is

    0
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13   (0, 0) 1.0
    14
    15   (0, 0) 2.0
    16
    17
    18
    19

so it appears the iterator is touching every element, not just the nonzero entries. I've had a look at the API

http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.lil_matrix.html

and searched around a bit, but I can't seem to find a solution that works.

Answer 1:

Edit: bbtrb's method (using coo_matrix) is much faster than my original suggestion, using nonzero. Sven Marnach's suggestion to use itertools.izip also improves the speed. Current fastest is using_tocoo_izip:

    # Python 2 code (itertools.izip, xrange), saved as test.py
    import scipy.sparse
    import random
    import itertools

    def using_nonzero(x):
        rows,cols = x.nonzero()
        for row,col in zip(rows,cols):
            ((row,col), x[row,col])

    def using_coo(x):
        cx = scipy.sparse.coo_matrix(x)
        for i,j,v in zip(cx.row, cx.col, cx.data):
            (i,j,v)

    def using_tocoo(x):
        cx = x.tocoo()
        for i,j,v in zip(cx.row, cx.col, cx.data):
            (i,j,v)

    def using_tocoo_izip(x):
        cx = x.tocoo()
        for i,j,v in itertools.izip(cx.row, cx.col, cx.data):
            (i,j,v)

    N=200
    x = scipy.sparse.lil_matrix( (N,N) )
    for _ in xrange(N):
        x[random.randint(0,N-1),random.randint(0,N-1)]=random.randint(1,100)

yields these timeit results:

    % python -mtimeit -s'import test' 'test.using_tocoo_izip(test.x)'
    1000 loops, best of 3: 670 usec per loop
    % python -mtimeit -s'import test' 'test.using_tocoo(test.x)'
    1000 loops, best of 3: 706 usec per loop
    % python -mtimeit -s'import test' 'test.using_coo(test.x)'
    1000 loops, best of 3: 802 usec per loop
    % python -mtimeit -s'import test' 'test.using_nonzero(test.x)'
    100 loops, best of 3: 5.25 msec per loop


Answer 2:

The fastest way should be by converting to a coo_matrix:

    import scipy.sparse

    # x is the sparse matrix from the question
    cx = scipy.sparse.coo_matrix(x)

    for i, j, v in zip(cx.row, cx.col, cx.data):
        print("(%d, %d), %s" % (i, j, v))
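As a rough sketch of how this applies to the 20x1 vector from the question (assuming Python 3 and a reasonably recent SciPy), converting to COO format visits only the stored entries rather than all 20 rows. The helper name `nonzero_entries` is made up for illustration and is not part of SciPy.

```python
from scipy.sparse import lil_matrix


def nonzero_entries(x):
    """Return (row, col, value) triples for the stored entries of x.

    Hypothetical helper for illustration; not a SciPy function.
    """
    cx = x.tocoo()  # COO keeps three parallel arrays: row, col, data
    return [(int(i), int(j), float(v))
            for i, j, v in zip(cx.row, cx.col, cx.data)]


# Same vector as in the question: 20 rows, two stored values.
x = lil_matrix((20, 1))
x[13, 0] = 1
x[15, 0] = 2

entries = nonzero_entries(x)
for i, j, v in entries:
    print("(%d, %d), %s" % (i, j, v))
```

Note that this walks the stored entries, so a zero that was stored explicitly would also show up.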


Answer 3:

To loop over the various sparse matrix types in scipy.sparse, I would use this small wrapper function (note that on Python 2 you are encouraged to use xrange and izip for better performance on large matrices):

    from scipy.sparse import (isspmatrix_coo, isspmatrix_csc,
                              isspmatrix_csr, isspmatrix_lil)

    def iter_spmatrix(matrix):
        """ Iterator for iterating the elements in a ``scipy.sparse.*_matrix``

        This will always return:
        >>> (row, column, matrix-element)

        Currently this can iterate `coo`, `csc`, `lil` and `csr`, others may easily be added.

        Parameters
        ----------
        matrix : ``scipy.sparse.sp_matrix``
          the sparse matrix to iterate non-zero elements
        """
        if isspmatrix_coo(matrix):
            for r, c, m in zip(matrix.row, matrix.col, matrix.data):
                yield r, c, m

        elif isspmatrix_csc(matrix):
            for c in range(matrix.shape[1]):
                for ind in range(matrix.indptr[c], matrix.indptr[c+1]):
                    yield matrix.indices[ind], c, matrix.data[ind]

        elif isspmatrix_csr(matrix):
            for r in range(matrix.shape[0]):
                for ind in range(matrix.indptr[r], matrix.indptr[r+1]):
                    yield r, matrix.indices[ind], matrix.data[ind]

        elif isspmatrix_lil(matrix):
            for r in range(matrix.shape[0]):
                for c, d in zip(matrix.rows[r], matrix.data[r]):
                    yield r, c, d

        else:
            raise NotImplementedError("The iterator for this sparse matrix has not been implemented")


Answer 4:

I had the same problem, and if your only concern is speed, the fastest way (more than one order of magnitude faster) is to convert the sparse matrix to a dense one (x.todense()) and iterate over the nonzero elements of the dense matrix. Of course, this approach requires a lot more memory.
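A minimal sketch of the dense approach, assuming the matrix is small enough to fit in memory as a dense array. It uses toarray() rather than todense() so the result is a plain ndarray, and np.nonzero to locate the entries. The function name `iter_dense_nonzero` is illustrative only, and no speed claim is made for it here.

```python
import numpy as np
from scipy.sparse import lil_matrix


def iter_dense_nonzero(x):
    """Yield (row, col, value) for nonzero entries via a dense copy.

    Illustrative helper: needs memory for rows * cols elements.
    """
    dense = x.toarray()              # full dense ndarray copy
    rows, cols = np.nonzero(dense)   # indices in row-major order
    for r, c in zip(rows, cols):
        yield int(r), int(c), dense[r, c]


x = lil_matrix((4, 3))
x[0, 2] = 5
x[3, 1] = 7

result = list(iter_dense_nonzero(x))
```

Unlike iterating over the COO arrays, this skips zeros that happen to be stored explicitly, because np.nonzero looks at values rather than at the sparse structure.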



Answer 5:

tocoo() materializes the entire matrix into a different structure, which is not the preferred MO for python 3. You can also consider this iterator, which is especially useful for large matrices.

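    from itertools import chain, repeat

    def iter_csr(matrix):
        # Number of stored entries in each row
        row_counts = matrix.indptr[1:] - matrix.indptr[:-1]
        # Lazily repeat each row index once per stored entry in that row
        rows = chain.from_iterable(
            repeat(i, r) for (i, r) in enumerate(row_counts)
        )
        for (row, col, val) in zip(rows, matrix.indices, matrix.data):
            yield (row, col, val)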

I have to admit that I'm using a lot of python-constructs which possibly should be replaced by numpy-constructs (especially enumerate).
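One way the enumerate/repeat part might be expressed with NumPy, as a sketch only: np.diff on indptr gives the number of stored entries per row, and np.repeat expands the row numbers accordingly. The trade-off is that this allocates a full row-index array (one integer per stored entry), so it gives up part of the memory saving the pure-iterator version is after. The name `iter_csr_numpy` is made up here.

```python
import numpy as np
from scipy.sparse import csr_matrix


def iter_csr_numpy(matrix):
    """Iterate (row, col, value) over a CSR matrix using NumPy for the rows.

    Hypothetical variant: builds one row index per stored entry up front.
    """
    counts = np.diff(matrix.indptr)                       # entries per row
    rows = np.repeat(np.arange(matrix.shape[0]), counts)  # expanded row ids
    return zip(rows, matrix.indices, matrix.data)


m = csr_matrix(np.array([[0, 3, 0],
                         [0, 0, 0],
                         [4, 0, 5]]))
triples = [(int(r), int(c), int(v)) for r, c, v in iter_csr_numpy(m)]
```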

NB:

    In [43]: t=time.time(); sum(1 for x in rather_dense_sparse_matrix.data); print(time.time()-t)
    52.48686504364014
    In [44]: t=time.time(); sum(1 for x in enumerate(rather_dense_sparse_matrix.data)); print(time.time()-t)
    70.19013023376465
    In [45]: rather_dense_sparse_matrix
    <... with 757622819 stored elements in Compressed Sparse Row format>

So yes, enumerate is somewhat slow(ish).

For the iterator:

    In [47]: it = iter_csr(rather_dense_sparse_matrix)
    In [48]: t=time.time(); sum(1 for x in it); print(time.time()-t)
    113.something something

So you decide whether this overhead is acceptable; in my case, tocoo() caused out-of-memory errors.

IMHO: such an iterator should be part of the csr_matrix interface, similar to items() in a dict() :)



Answer 6:

Try filter(lambda x:x, x) instead of x.


