Question:
I'm wondering what the best way is to iterate nonzero entries of sparse matrices with scipy.sparse. For example, if I do the following:
    from scipy.sparse import lil_matrix

    x = lil_matrix( (20,1) )
    x[13,0] = 1
    x[15,0] = 2

    c = 0
    for i in x:
        print c, i
        c = c+1
the output is
    0
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13 (0, 0) 1.0
    14
    15 (0, 0) 2.0
    16
    17
    18
    19
so it appears the iterator is touching every element, not just the nonzero entries. I've had a look at the API
http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.lil_matrix.html
and searched around a bit, but I can't seem to find a solution that works.
Answer 1:
Edit: bbtrb's method (using coo_matrix) is much faster than my original suggestion, using nonzero. Sven Marnach's suggestion to use itertools.izip also improves the speed. The current fastest is using_tocoo_izip:
    import scipy.sparse
    import random
    import itertools

    def using_nonzero(x):
        rows,cols = x.nonzero()
        for row,col in zip(rows,cols):
            ((row,col), x[row,col])

    def using_coo(x):
        cx = scipy.sparse.coo_matrix(x)
        for i,j,v in zip(cx.row, cx.col, cx.data):
            (i,j,v)

    def using_tocoo(x):
        cx = x.tocoo()
        for i,j,v in zip(cx.row, cx.col, cx.data):
            (i,j,v)

    def using_tocoo_izip(x):
        cx = x.tocoo()
        for i,j,v in itertools.izip(cx.row, cx.col, cx.data):
            (i,j,v)

    N=200
    x = scipy.sparse.lil_matrix( (N,N) )
    for _ in xrange(N):
        x[random.randint(0,N-1),random.randint(0,N-1)]=random.randint(1,100)
yields these timeit results:
    % python -mtimeit -s'import test' 'test.using_tocoo_izip(test.x)'
    1000 loops, best of 3: 670 usec per loop
    % python -mtimeit -s'import test' 'test.using_tocoo(test.x)'
    1000 loops, best of 3: 706 usec per loop
    % python -mtimeit -s'import test' 'test.using_coo(test.x)'
    1000 loops, best of 3: 802 usec per loop
    % python -mtimeit -s'import test' 'test.using_nonzero(test.x)'
    100 loops, best of 3: 5.25 msec per loop
Answer 2:
The fastest way should be by converting to a coo_matrix:
    cx = scipy.sparse.coo_matrix(x)
    for i,j,v in zip(cx.row, cx.col, cx.data):
        print "(%d, %d), %s" % (i,j,v)
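On Python 3, where print is a function and the built-in zip is already lazy, the same idea might look like the sketch below. It reuses the 20x1 matrix from the question. The helper name iter_nonzero is hypothetical and not part of scipy, and note that it yields every stored entry, which would include any explicitly stored zeros.

```python
from scipy.sparse import lil_matrix


def iter_nonzero(x):
    """Yield (row, col, value) for every stored entry of a sparse matrix.

    `iter_nonzero` is a hypothetical helper name, not a scipy function.
    """
    cx = x.tocoo()  # COO keeps parallel row / col / data arrays
    for i, j, v in zip(cx.row, cx.col, cx.data):
        # Convert numpy scalars to plain Python numbers for readability.
        yield int(i), int(j), float(v)


# Same matrix as in the question.
x = lil_matrix((20, 1))
x[13, 0] = 1
x[15, 0] = 2

entries = list(iter_nonzero(x))
for i, j, v in entries:
    print("(%d, %d), %s" % (i, j, v))
```

Only the two stored entries are visited, rather than all 20 rows.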
Answer 3:
To loop a variety of sparse matrices from the scipy.sparse code section I would use this small wrapper function (note that for Python-2 you are encouraged to use xrange and izip for better performance on large matrices):
    from scipy.sparse import *

    def iter_spmatrix(matrix):
        """ Iterator for iterating the elements in a ``scipy.sparse.*_matrix``

        This will always return:
        >>> (row, column, matrix-element)

        Currently this can iterate `coo`, `csc`, `lil` and `csr`, others may easily be added.

        Parameters
        ----------
        matrix : ``scipy.sparse.sp_matrix``
          the sparse matrix to iterate non-zero elements
        """
        if isspmatrix_coo(matrix):
            for r, c, m in zip(matrix.row, matrix.col, matrix.data):
                yield r, c, m

        elif isspmatrix_csc(matrix):
            for c in range(matrix.shape[1]):
                for ind in range(matrix.indptr[c], matrix.indptr[c+1]):
                    yield matrix.indices[ind], c, matrix.data[ind]

        elif isspmatrix_csr(matrix):
            for r in range(matrix.shape[0]):
                for ind in range(matrix.indptr[r], matrix.indptr[r+1]):
                    yield r, matrix.indices[ind], matrix.data[ind]

        elif isspmatrix_lil(matrix):
            for r in range(matrix.shape[0]):
                for c, d in zip(matrix.rows[r], matrix.data[r]):
                    yield r, c, d

        else:
            raise NotImplementedError("The iterator for this sparse matrix has not been implemented")
Answer 4:
I had the same problem, and if your only concern is speed, the fastest way (more than one order of magnitude faster) is to convert the sparse matrix to a dense one (x.todense()) and iterate over the nonzero elements in the dense matrix. (Of course, this approach requires a lot more memory.)
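A minimal sketch of that dense approach, assuming the matrix is small enough to fit in memory once densified. It uses toarray() rather than todense() so that the result is a plain ndarray, and numpy.nonzero to locate the entries; both choices are assumptions of this sketch, and it does not measure speed.

```python
import numpy as np
from scipy.sparse import lil_matrix

# Same matrix as in the question.
x = lil_matrix((20, 1))
x[13, 0] = 1
x[15, 0] = 2

# Full dense copy: memory use grows with rows * cols, not with nnz.
dense = x.toarray()

# Indices of the nonzero entries, returned in row-major order.
rows, cols = np.nonzero(dense)

dense_entries = [(int(i), int(j), float(dense[i, j])) for i, j in zip(rows, cols)]
print(dense_entries)
```

Whether this beats the coo_matrix route depends on the size and density of the matrix, so it is worth timing on your own data.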
Answer 5:
tocoo() materializes the entire matrix into a different structure, which is not the preferred MO for python 3. You can also consider this iterator, which is especially useful for large matrices.
    from itertools import chain, repeat

    def iter_csr(matrix):
        # indptr[i+1] - indptr[i] is the number of stored entries in row i
        for (row, col, val) in zip(
            chain(*(
                repeat(i, r)
                for (i,r) in enumerate(matrix.indptr[1:] - matrix.indptr[:-1])
            )),
            matrix.indices,
            matrix.data
        ):
            yield (row, col, val)
I have to admit that I'm using a lot of python-constructs which possibly should be replaced by numpy-constructs (especially enumerate).
NB:
    In [43]: t=time.time(); sum(1 for x in rather_dense_sparse_matrix.data); print(time.time()-t)
    52.48686504364014

    In [44]: t=time.time(); sum(1 for x in enumerate(rather_dense_sparse_matrix.data)); print(time.time()-t)
    70.19013023376465

    In [45]: rather_dense_sparse_matrix
    ' with 757622819 stored elements in Compressed Sparse Row format>
So yes, enumerate is somewhat slow(ish)
For the iterator:
    In [47]: it = iter_csr(rather_dense_sparse_matrix)

    In [48]: t=time.time(); sum(1 for x in it); print(time.time()-t)
    113.something something
So you decide whether this overhead is acceptable; in my case, tocoo caused MemoryOverflows.
IMHO: such an iterator should be part of the csr_matrix interface, similar to items() in a dict() :)
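The row-expansion trick behind iter_csr can be seen without scipy at all. The sketch below uses plain Python lists standing in for the indptr, indices and data arrays of a CSR matrix; the 4x3 example values are hypothetical and chosen only for illustration.

```python
from itertools import chain, repeat

# Hypothetical CSR arrays for a 4x3 matrix with three stored entries:
#   row 0: column 2 -> 5.0
#   row 1: (empty)
#   row 2: column 0 -> 7.0, column 1 -> 9.0
#   row 3: (empty)
indptr = [0, 1, 1, 3, 3]
indices = [2, 0, 1]
data = [5.0, 7.0, 9.0]

# Stored entries per row: difference of consecutive indptr values.
counts = [end - start for start, end in zip(indptr[:-1], indptr[1:])]

# Repeat each row index once per stored entry, giving one row label per entry.
row_labels = chain.from_iterable(repeat(i, n) for i, n in enumerate(counts))

# Pair each row label with its column index and value.
triples = list(zip(row_labels, indices, data))
print(triples)
```

Because the row labels are generated lazily, no second copy of the matrix is built, which is the property the answer relies on to avoid the memory cost of tocoo().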
Answer 6:
Try filter(lambda x:x, x) instead of x.