So, I\'m doing some Kmeans classification using numpy arrays that are quite sparse-- lots and lots of zeroes. I figured that I\'d use scipy\'s \'sparse\' package to reduce
Use a scipy.sparse
format that is row or column based: csc_matrix
and csr_matrix
.
These use efficient, C implementations under the hood (including multiplication), and transposition is a no-op (esp. if you call transpose(copy=False)
), just like with numpy arrays.
EDIT: some timings via ipython:
import numpy, scipy.sparse
n = 100000
x = (numpy.random.rand(n) * 2).astype(int).astype(float) # 50% sparse vector
x_csr = scipy.sparse.csr_matrix(x)
x_dok = scipy.sparse.dok_matrix(x.reshape(x_csr.shape))
Now x_csr
and x_dok
are 50% sparse:
print repr(x_csr)
<1x100000 sparse matrix of type '<type 'numpy.float64'>'
with 49757 stored elements in Compressed Sparse Row format>
And the timings:
timeit numpy.dot(x, x)
10000 loops, best of 3: 123 us per loop
timeit x_dok * x_dok.T
1 loops, best of 3: 1.73 s per loop
timeit x_csr.multiply(x_csr).sum()
1000 loops, best of 3: 1.64 ms per loop
timeit x_csr * x_csr.T
100 loops, best of 3: 3.62 ms per loop
So it looks like I told a lie. Transposition is very cheap, but there is no efficient C implementation of csr * csc (in the latest scipy 0.9.0). A new csr object is constructed in each call :-(
As a hack (though scipy is relatively stable these days), you can do the dot product directly on the sparse data:
timeit numpy.dot(x_csr.data, x_csr.data)
10000 loops, best of 3: 62.9 us per loop
Note this last approach does a numpy dense multiplication again. The sparsity is 50%, so it's actually faster than dot(x, x)
by a factor of 2.
I'm not sure that it is really much better or faster, but you could do this to avoid using the transpose:
Asp.multiply(Bsp).sum()
This just takes the element-by-element product of the two matrices and sums the products. You could make a subclass of whatever matrix format you are using that has the above statement as the dot product.
However, it is probably just easier to tranpose them:
Asp*Bsp.T
That doesn't seem like so much to do, but you could also make a subclass and modify the mul() method.
You could create a subclass of one of the existing 2d sparse arrays
from scipy.sparse import dok_matrix
class sparse1d(dok_matrix):
def __init__(self, v):
dok_matrix.__init__(self, (v,))
def dot(self, other):
return dok_matrix.dot(self, other.transpose())[0,0]
a=sparse1d((1,2,3))
b=sparse1d((4,5,6))
print a.dot(b)