How to read/traverse/slice Scipy sparse matrices (LIL, CSR, COO, DOK) faster?

Backend · unresolved · 2 answers · 975 views
悲哀的现实 asked 2021-01-06 18:44

Scipy sparse matrices are typically manipulated through their built-in methods. But sometimes you need to read the raw matrix data to assign it to a non-sparse data type. For the sake of

2 Answers
  • 2021-01-06 18:50

    Try reading the raw data. Each Scipy sparse format stores its data in a few underlying Numpy ndarrays (or lists of lists, for LIL), with a layout that differs per format.
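    For reference, the snippets below assume a matrix `lil` and a dense target `arr` that the answer never shows being created; a minimal assumed setup could look like this:

    ```python
    import numpy as np
    from scipy import sparse

    # Hypothetical 100x100 test case -- the answer does not show how
    # `lil` and `arr` were built, so this is an assumed setup.
    lil = sparse.random(100, 100, density=0.1, format="lil", random_state=0)
    arr = np.zeros(lil.shape, dtype=lil.dtype)

    # lil.rows[i] is the list of column indices of row i's stored entries,
    # and lil.data[i] holds the matching values.
    ```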

    Reading the raw data of LIL sparse matrix

    %%timeit -n3
    for i, (row, data) in enumerate(zip(lil.rows, lil.data)):
        for j, val in zip(row, data):
            arr[i,j] = val
    

    3 loops, best of 3: 4.61 ms per loop

    Reading the raw data of CSR sparse matrix

    For a CSR matrix, reading the raw data is a bit less pythonic, but it is worth the speed.

    csr = lil.tocsr()
    
    %%timeit -n3
    start = 0
    for i, end in enumerate(csr.indptr[1:]):
        for j, val in zip(csr.indices[start:end], csr.data[start:end]):
            arr[i,j] = val
        start = end
    

    3 loops, best of 3: 8.14 ms per loop

    A similar approach is used in this DBSCAN implementation.
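    The CSR loop works because row i's entries sit at positions `indptr[i]:indptr[i+1]` of the `indices` and `data` arrays. A tiny made-up matrix makes the layout visible:

    ```python
    import numpy as np
    from scipy import sparse

    # Small illustrative matrix (values chosen arbitrarily) showing the
    # three raw arrays behind the CSR format.
    m = sparse.csr_matrix(np.array([[1, 0, 2],
                                    [0, 0, 3],
                                    [4, 5, 0]]))
    print(m.indptr)   # [0 2 3 5] -- row i spans indptr[i]:indptr[i+1]
    print(m.indices)  # [0 2 2 0 1] -- column index of each stored value
    print(m.data)     # [1 2 3 4 5] -- the stored values themselves
    ```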

    Reading the raw data of COO sparse matrix

    coo = lil.tocoo()
    
    %%timeit -n3
    for i, j, d in zip(coo.row, coo.col, coo.data):
        arr[i, j] = d
    

    3 loops, best of 3: 5.97 ms per loop

    Based on these limited tests:

    • COO matrix: cleanest
    • LIL matrix: fastest
    • CSR matrix: slowest and ugliest. The only good side is that conversion to/from CSR is extremely fast.

    Edit: following @hpaulj's answer, I added the COO matrix so all the methods are in one place.
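    The question's title also mentions DOK, which neither answer times. Since DOK stores its entries as a dict keyed by `(row, col)`, iterating `items()` is the analogous raw-data read; this is an untimed sketch, not part of the original benchmarks:

    ```python
    import numpy as np
    from scipy import sparse

    # DOK raw data is a (row, col) -> value mapping, so items() walks
    # the stored entries directly (illustrative matrix, assumed setup).
    dok = sparse.dok_matrix(np.array([[1.0, 0.0],
                                      [0.0, 2.0]]))
    arr = np.zeros(dok.shape, dtype=dok.dtype)
    for (i, j), val in dok.items():
        arr[i, j] = val
    ```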

  • 2021-01-06 19:03

    A similar question, but dealing with setting sparse values, rather than just reading them:

    Efficient incremental sparse matrix in python / scipy / numpy

    More on accessing values using the underlying representation

    Efficiently select random non-zero column from each row of sparse matrix in scipy

    Also

    why is row indexing of scipy csr matrices slower compared to numpy arrays

    Why are lil_matrix and dok_matrix so slow compared to common dict of dicts?

    Take a look at what M.nonzero does:

        A = self.tocoo()
        nz_mask = A.data != 0
        return (A.row[nz_mask],A.col[nz_mask])
    

    It converts the matrix to COO format and returns the .row and .col attributes, after filtering out any 'stray' 0s in the .data attribute.

    So you could skip the middle man and use those attributes directly:

    A = lil.tocoo()
    for i, j, d in zip(A.row, A.col, A.data):
        a[i, j] = d
    

    This is almost as fast as toarray:

    In [595]: %%timeit
       .....: aa = M.tocoo()
       .....: for i,j,d in zip(aa.row,aa.col,aa.data):
       .....:   A[i,j]=d
       .....: 
    100 loops, best of 3: 14.3 ms per loop
    
    In [596]: timeit  arr=M.toarray()
    100 loops, best of 3: 12.3 ms per loop
    

    But if your target is really a dense array, you don't need to iterate at all:

    In [603]: %%timeit
       .....: A=np.empty(M.shape,M.dtype)
       .....: aa=M.tocoo()
       .....: A[aa.row,aa.col]=aa.data
       .....: 
    100 loops, best of 3: 8.22 ms per loop
    

    My times for @Thoran's 2 methods are:

    100 loops, best of 3: 5.81 ms per loop
    100 loops, best of 3: 17.9 ms per loop
    

    Same ballpark of times.
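    Putting the vectorized fill from above into a self-contained check (here `M` is assumed to be a random CSR matrix, since the answer's own `M` is not defined in the excerpt):

    ```python
    import numpy as np
    from scipy import sparse

    # Assumed test matrix; any sparse format convertible to COO works.
    M = sparse.random(200, 200, density=0.05, format="csr", random_state=1)

    A = np.zeros(M.shape, dtype=M.dtype)
    aa = M.tocoo()
    A[aa.row, aa.col] = aa.data  # one fancy-indexed assignment, no Python loop

    # The vectorized fill reproduces toarray() exactly.
    assert np.array_equal(A, M.toarray())
    ```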
