How to read/traverse/slice Scipy sparse matrices (LIL, CSR, COO, DOK) faster?

Backend · unresolved · 2 answers · 975 views
悲哀的现实 asked 2021-01-06 18:44

Scipy sparse matrices are typically manipulated through their built-in methods. But sometimes you need to read the raw matrix data to assign it to a non-sparse data type. For the sake of

2 Answers
  • 2021-01-06 18:50

    Try reading the raw data. Each Scipy sparse format stores its data in a few underlying Numpy ndarrays (or lists of lists, for LIL), with a layout that differs per format.
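    For reference, the snippets below assume a matrix `lil` and a dense target `arr` that the answer never shows being created; a minimal assumed setup could look like this:

    ```python
    import numpy as np
    from scipy import sparse

    # Hypothetical 100x100 test case -- the answer does not show how
    # `lil` and `arr` were built, so this is an assumed setup.
    lil = sparse.random(100, 100, density=0.1, format="lil", random_state=0)
    arr = np.zeros(lil.shape, dtype=lil.dtype)

    # lil.rows[i] is the list of column indices of row i's stored entries,
    # and lil.data[i] holds the matching values.
    ```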

    Reading the raw data of LIL sparse matrix

    %%timeit -n3
    for i, (row, data) in enumerate(zip(lil.rows, lil.data)):
        for j, val in zip(row, data):
            arr[i,j] = val
    

    3 loops, best of 3: 4.61 ms per loop

    Reading the raw data of CSR sparse matrix

    For a CSR matrix, reading the raw data is a bit less pythonic, but it is worth the speed.

    csr = lil.tocsr()
    
    %%timeit -n3
    start = 0
    for i, end in enumerate(csr.indptr[1:]):
        for j, val in zip(csr.indices[start:end], csr.data[start:end]):
            arr[i,j] = val
        start = end
    

    3 loops, best of 3: 8.14 ms per loop

    A similar approach is used in this DBSCAN implementation.
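    The CSR loop works because row i's entries sit at positions `indptr[i]:indptr[i+1]` of the `indices` and `data` arrays. A tiny made-up matrix makes the layout visible:

    ```python
    import numpy as np
    from scipy import sparse

    # Small illustrative matrix (values chosen arbitrarily) showing the
    # three raw arrays behind the CSR format.
    m = sparse.csr_matrix(np.array([[1, 0, 2],
                                    [0, 0, 3],
                                    [4, 5, 0]]))
    print(m.indptr)   # [0 2 3 5] -- row i spans indptr[i]:indptr[i+1]
    print(m.indices)  # [0 2 2 0 1] -- column index of each stored value
    print(m.data)     # [1 2 3 4 5] -- the stored values themselves
    ```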

    Reading the raw data of COO sparse matrix

    coo = lil.tocoo()
    
    %%timeit -n3
    for i, j, d in zip(coo.row, coo.col, coo.data):
        arr[i, j] = d
    

    3 loops, best of 3: 5.97 ms per loop

    Based on these limited tests:

    • COO matrix: cleanest
    • LIL matrix: fastest
    • CSR matrix: slowest and ugliest. The only good side is that conversion to/from CSR is extremely fast.

    Edit: following @hpaulj's answer, I added the COO matrix so all the methods are in one place.
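    The question's title also mentions DOK, which neither answer times. Since DOK stores its entries as a dict keyed by `(row, col)`, iterating `items()` is the analogous raw-data read; this is an untimed sketch, not part of the original benchmarks:

    ```python
    import numpy as np
    from scipy import sparse

    # DOK raw data is a (row, col) -> value mapping, so items() walks
    # the stored entries directly (illustrative matrix, assumed setup).
    dok = sparse.dok_matrix(np.array([[1.0, 0.0],
                                      [0.0, 2.0]]))
    arr = np.zeros(dok.shape, dtype=dok.dtype)
    for (i, j), val in dok.items():
        arr[i, j] = val
    ```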

  • 2021-01-06 19:03

    A similar question, but dealing with setting sparse values, rather than just reading them:

    Efficient incremental sparse matrix in python / scipy / numpy

    More on accessing values using the underlying representation

    Efficiently select random non-zero column from each row of sparse matrix in scipy

    Also

    why is row indexing of scipy csr matrices slower compared to numpy arrays

    Why are lil_matrix and dok_matrix so slow compared to common dict of dicts?

    Take a look at what M.nonzero does:

        A = self.tocoo()
        nz_mask = A.data != 0
        return (A.row[nz_mask],A.col[nz_mask])
    

    It converts the matrix to COO format and returns the .row and .col attributes, after filtering out any 'stray' 0s in the .data attribute.

    So you could skip the middle man and use those attributes directly:

    A = lil.tocoo()
    for i, j, d in zip(A.row, A.col, A.data):
        a[i, j] = d
    

    This is almost as fast as toarray:

    In [595]: %%timeit
       .....: aa = M.tocoo()
       .....: for i,j,d in zip(aa.row,aa.col,aa.data):
       .....:   A[i,j]=d
       .....: 
    100 loops, best of 3: 14.3 ms per loop
    
    In [596]: timeit  arr=M.toarray()
    100 loops, best of 3: 12.3 ms per loop
    

    But if your target is really a dense array, you don't need to iterate at all:

    In [603]: %%timeit
       .....: A=np.empty(M.shape,M.dtype)
       .....: aa=M.tocoo()
       .....: A[aa.row,aa.col]=aa.data
       .....: 
    100 loops, best of 3: 8.22 ms per loop
    

    My times for @Thoran's 2 methods are:

    100 loops, best of 3: 5.81 ms per loop
    100 loops, best of 3: 17.9 ms per loop
    

    Same ballpark of times.
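    Putting the vectorized fill from above into a self-contained check (here `M` is assumed to be a random CSR matrix, since the answer's own `M` is not defined in the excerpt):

    ```python
    import numpy as np
    from scipy import sparse

    # Assumed test matrix; any sparse format convertible to COO works.
    M = sparse.random(200, 200, density=0.05, format="csr", random_state=1)

    A = np.zeros(M.shape, dtype=M.dtype)
    aa = M.tocoo()
    A[aa.row, aa.col] = aa.data  # one fancy-indexed assignment, no Python loop

    # The vectorized fill reproduces toarray() exactly.
    assert np.array_equal(A, M.toarray())
    ```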
