Scipy sparse… arrays?

前端 未结 3 710
感动是毒
感动是毒 2020-12-13 03:44

So, I\'m doing some Kmeans classification using numpy arrays that are quite sparse-- lots and lots of zeroes. I figured that I\'d use scipy\'s \'sparse\' package to reduce

相关标签:
3条回答
  • 2020-12-13 04:05

    Use a scipy.sparse format that is row or column based: csc_matrix and csr_matrix.

    These use efficient, C implementations under the hood (including multiplication), and transposition is a no-op (esp. if you call transpose(copy=False)), just like with numpy arrays.

    EDIT: some timings via ipython:

    import numpy, scipy.sparse
    n = 100000
    x = (numpy.random.rand(n) * 2).astype(int).astype(float) # 50% sparse vector
    x_csr = scipy.sparse.csr_matrix(x)
    x_dok = scipy.sparse.dok_matrix(x.reshape(x_csr.shape))
    

    Now x_csr and x_dok are 50% sparse:

    print repr(x_csr)
    <1x100000 sparse matrix of type '<type 'numpy.float64'>'
            with 49757 stored elements in Compressed Sparse Row format>
    

    And the timings:

    timeit numpy.dot(x, x)
    10000 loops, best of 3: 123 us per loop
    
    timeit x_dok * x_dok.T
    1 loops, best of 3: 1.73 s per loop
    
    timeit x_csr.multiply(x_csr).sum()
    1000 loops, best of 3: 1.64 ms per loop
    
    timeit x_csr * x_csr.T
    100 loops, best of 3: 3.62 ms per loop
    

    So it looks like I told a lie. Transposition is very cheap, but there is no efficient C implementation of csr * csc (in the latest scipy 0.9.0). A new csr object is constructed in each call :-(

    As a hack (though scipy is relatively stable these days), you can do the dot product directly on the sparse data:

    timeit numpy.dot(x_csr.data, x_csr.data)
    10000 loops, best of 3: 62.9 us per loop
    

    Note this last approach does a numpy dense multiplication again. The sparsity is 50%, so it's actually faster than dot(x, x) by a factor of 2.

    0 讨论(0)
  • 2020-12-13 04:12

    I'm not sure that it is really much better or faster, but you could do this to avoid using the transpose:

    Asp.multiply(Bsp).sum()
    

    This just takes the element-by-element product of the two matrices and sums the products. You could make a subclass of whatever matrix format you are using that has the above statement as the dot product.

    However, it is probably just easier to tranpose them:

    Asp*Bsp.T
    

    That doesn't seem like so much to do, but you could also make a subclass and modify the mul() method.

    0 讨论(0)
  • 2020-12-13 04:24

    You could create a subclass of one of the existing 2d sparse arrays

    from scipy.sparse import dok_matrix
    
    class sparse1d(dok_matrix):
        def __init__(self, v):
            dok_matrix.__init__(self, (v,))
        def dot(self, other):
            return dok_matrix.dot(self, other.transpose())[0,0]
    
    a=sparse1d((1,2,3))
    b=sparse1d((4,5,6))
    print a.dot(b)
    
    0 讨论(0)
提交回复
热议问题