Scipy sparse… arrays?

前端未结


So, I'm doing some K-means classification using numpy arrays that are quite sparse: lots and lots of zeroes. I figured that I'd use scipy's 'sparse' package to reduce …


3 Answers

醉话见心

2020-12-13 04:05
Use a scipy.sparse format that is row or column based: csc_matrix and csr_matrix.

These use efficient C implementations under the hood (including multiplication), and transposition is a no-op (especially if you call transpose(copy=False)), just like with numpy arrays.
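To make the transposition point concrete, here is a minimal sketch, assuming Python 3 and a reasonably recent SciPy. The matrix `A` and its values are made up for illustration. Transposing a CSR matrix hands back a CSC matrix of the swapped shape, which is why the operation is cheap: the same three arrays are simply reinterpreted.

```python
import numpy as np
import scipy.sparse as sp

# Illustrative 2x3 matrix; the values are made up for this example.
A = sp.csr_matrix(np.array([[1.0, 0.0, 2.0],
                            [0.0, 3.0, 0.0]]))

At = A.transpose()  # by default no copy of the data is requested

print(A.format, A.shape)    # csr (2, 3)
print(At.format, At.shape)  # csc (3, 2)

# The sparse transpose holds the same values as the dense transpose.
same = np.array_equal(At.toarray(), A.toarray().T)
print(same)
```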

EDIT: some timings via ipython:
```
import numpy, scipy.sparse
n = 100000
x = (numpy.random.rand(n) * 2).astype(int).astype(float) # 50% sparse vector
x_csr = scipy.sparse.csr_matrix(x)
x_dok = scipy.sparse.dok_matrix(x.reshape(x_csr.shape))
```
Now x_csr and x_dok are 50% sparse:
```
print(repr(x_csr))
<1x100000 sparse matrix of type '<type 'numpy.float64'>'
        with 49757 stored elements in Compressed Sparse Row format>
```
And the timings:
```
timeit numpy.dot(x, x)
10000 loops, best of 3: 123 us per loop

timeit x_dok * x_dok.T
1 loops, best of 3: 1.73 s per loop

timeit x_csr.multiply(x_csr).sum()
1000 loops, best of 3: 1.64 ms per loop

timeit x_csr * x_csr.T
100 loops, best of 3: 3.62 ms per loop
```
So it looks like I told a lie. Transposition is very cheap, but there is no efficient C implementation of csr * csc (as of scipy 0.9.0, the latest release at the time). A new csr object is constructed on each call :-(

As a hack (though scipy is relatively stable these days), you can do the dot product directly on the sparse data:
```
timeit numpy.dot(x_csr.data, x_csr.data)
10000 loops, best of 3: 62.9 us per loop
```
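Note this last approach does a numpy dense multiplication again, but only over the stored nonzero values. The sparsity is 50%, so it's actually faster than dot(x, x) by a factor of 2. It gives the right answer here only because the vector is multiplied with itself, so both operands have the same sparsity pattern; for two different vectors the .data arrays do not line up.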
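For anyone on Python 3, the sketch below reruns the same comparison as a correctness check rather than a timing. It assumes a recent NumPy and SciPy; the seed and the small vectors `y` and `z` are arbitrary choices for illustration. It also shows why the `.data` shortcut should be reserved for a vector dotted with itself.

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)  # arbitrary fixed seed
n = 100_000
x = (rng.random(n) * 2).astype(int).astype(float)  # entries are 0.0 or 1.0
x_csr = sp.csr_matrix(x)  # a 1 x n sparse row vector

dense = np.dot(x, x)
elementwise = x_csr.multiply(x_csr).sum()
matmul = (x_csr @ x_csr.T)[0, 0]
data_only = np.dot(x_csr.data, x_csr.data)

# All four routes agree for a vector dotted with itself.
print(dense, elementwise, matmul, data_only)

# Two different vectors: the stored values no longer line up.
y = sp.csr_matrix(np.array([0.0, 2.0, 3.0]))
z = sp.csr_matrix(np.array([4.0, 0.0, 5.0]))
correct = y.multiply(z).sum()    # 3*5 = 15
shortcut = np.dot(y.data, z.data)  # 2*4 + 3*5 = 23, which is wrong
print(correct, shortcut)
```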

醉酒成梦

2020-12-13 04:12
I'm not sure that it is really much better or faster, but you could do this to avoid using the transpose:
```
Asp.multiply(Bsp).sum()
```
This just takes the element-by-element product of the two matrices and sums the products. You could make a subclass of whatever matrix format you are using that has the above statement as the dot product.

However, it is probably just easier to transpose them:
```
Asp*Bsp.T
```
That doesn't seem like so much to do, but you could also make a subclass and modify the `__mul__()` method.
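If subclassing feels heavy, a plain function may be enough. The sketch below assumes Python 3 and that both operands are 1 x N sparse row vectors; `sparse_dot` is a hypothetical helper name, not part of scipy, and the sample values are made up.

```python
import numpy as np
import scipy.sparse as sp

def sparse_dot(a, b):
    """Dot product of two sparse 1 x N row vectors (hypothetical helper)."""
    if a.shape != b.shape or a.shape[0] != 1:
        raise ValueError("expected two 1 x N row vectors of equal length")
    # Element-wise product, then sum over the surviving entries.
    return a.multiply(b).sum()

Asp = sp.csr_matrix(np.array([[1.0, 0.0, 0.0, 2.0]]))
Bsp = sp.csr_matrix(np.array([[3.0, 4.0, 0.0, 5.0]]))

via_multiply = sparse_dot(Asp, Bsp)   # 1*3 + 2*5 = 13
via_transpose = (Asp @ Bsp.T)[0, 0]   # same value through a 1x1 product
print(via_multiply, via_transpose)
```

Both routes should agree; which one is faster depends on the scipy version and the sparsity, so it is worth timing on real data.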


太阳男子

2020-12-13 04:24

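You could create a subclass of one of the existing 2D sparse arrays:
```
from scipy.sparse import dok_matrix

class sparse1d(dok_matrix):
    """A 1-D sparse vector stored as a 1xN dok_matrix."""
    def __init__(self, v):
        # (v,) wraps the sequence so it is read as a single row
        dok_matrix.__init__(self, (v,))
    def dot(self, other):
        # (1xN) times (Nx1) gives a 1x1 matrix; return its only entry
        return dok_matrix.dot(self, other.transpose())[0,0]

a=sparse1d((1,2,3))
b=sparse1d((4,5,6))
print(a.dot(b))
```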
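Subclassing scipy's sparse classes can be fragile across versions, so here is an alternative sketch that wraps a matrix instead of inheriting from one. This is not the approach above, just a variation on it: `SparseVector` is a hypothetical name, CSR is swapped in for DOK, and Python 3 is assumed.

```python
import numpy as np
import scipy.sparse as sp

class SparseVector:
    """Hypothetical wrapper that stores a 1-D vector as a 1 x N CSR matrix."""

    def __init__(self, values):
        row = np.asarray(values, dtype=float).reshape(1, -1)
        self._row = sp.csr_matrix(row)

    def dot(self, other):
        # (1 x N) times (N x 1) gives a 1 x 1 matrix; return its only entry.
        return (self._row @ other._row.T)[0, 0]

a = SparseVector((1, 2, 3))
b = SparseVector((4, 5, 6))
result = a.dot(b)  # 1*4 + 2*5 + 3*6 = 32
print(result)
```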
