I've got a large matrix stored as a scipy.sparse.csc_matrix and want to subtract a column vector from each of its columns. This is a pretty common task.
For a start what would we do with dense arrays?
mat - vec.A              # taking advantage of broadcasting
mat - vec.A[:, [0]*3]    # explicit broadcasting
mat - vec[:, [0, 0, 0]]  # that also works with a csr matrix
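For concreteness, here is a minimal runnable sketch of those three variants, assuming the 3x3 example data that appears later in this thread:

import scipy.sparse

mat = scipy.sparse.csc_matrix([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
vec = scipy.sparse.csc_matrix([[1], [2], [3]])  # 3x1 column vector

dense = mat.A                       # dense copy of the matrix
print(dense - vec.A)                # (3,3) - (3,1): numpy broadcasting
print(dense - vec.A[:, [0]*3])      # same, with the broadcast made explicit
print((mat - vec[:, [0, 0, 0]]).A)  # fancy indexing keeps everything sparse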
In https://codereview.stackexchange.com/questions/32664/numpy-scipy-optimization/33566 we found that using as_strided on the mat.indptr vector is the most efficient way of stepping through the rows of a sparse matrix. (The x.rows and x.data of an lil_matrix are nearly as good; getrow is slow.) This function implements such an iteration:
from numpy.lib.stride_tricks import as_strided

def subtract_vec(X, v):
    # X must be CSR so that indptr delimits rows; X.data is modified in place.
    rows, cols = X.shape
    row_start_stop = as_strided(X.indptr, shape=(rows, 2),
                                strides=2 * X.indptr.strides)
    for row, (start, stop) in enumerate(row_start_stop):
        data = X.data[start:stop]  # view of this row's stored nonzeros
        data -= v[row]

mat = mat.tocsr()  # the question's mat is CSC; the iteration needs rows
subtract_vec(mat, vec.A)
print(mat.A)
I'm using vec.A for simplicity. If we kept vec sparse we'd have to add a test for a nonzero value at each row. Also, this type of iteration only modifies the nonzero elements of mat: implicit zeros stay 0 instead of becoming -vec[row].
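As a rough sketch of that idea (subtract_sparse_vec is a name invented here, and it assumes X is CSR and v is a sparse column vector), the loop can visit only the rows where v has a stored value:

def subtract_sparse_vec(X, v):
    # X: CSR matrix; v: sparse column vector.
    # Only rows with a stored value in v are touched; X.data is modified in place.
    v = v.tocoo()
    for row, val in zip(v.row, v.data):
        start, stop = X.indptr[row], X.indptr[row + 1]
        X.data[start:stop] -= val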
I suspect the time advantages will depend a lot on the sparsity of the matrix and the vector. If vec has lots of zeros, then it makes sense to iterate, modifying only those rows of mat where vec is nonzero. But when vec is nearly dense, as in this example, it may be hard to beat mat - vec.A.
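One hypothetical way to test that suspicion (the size and density below are invented for illustration):

import timeit

setup = """
import numpy as np
import scipy.sparse
mat = scipy.sparse.rand(5000, 5000, density=0.01, format='csr')
vec = np.random.rand(5000, 1)
"""
# Dense broadcasting pays for a full dense copy of mat
print(timeit.timeit('mat.toarray() - vec', setup=setup, number=10))
# Row iteration only touches the stored values in mat.data
stmt = """
for row in range(mat.shape[0]):
    mat.data[mat.indptr[row]:mat.indptr[row + 1]] -= vec[row]
"""
print(timeit.timeit(stmt, setup=setup, number=10))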
You can introduce fake dimensions by altering the strides of your vector. You can, without additional allocation, "convert" your vector to a 3 x 3 matrix using np.lib.stride_tricks.as_strided. This page has an example and a bit of discussion about it, along with some discussion of related topics (like views); search the page for "Example: fake dimensions with strides". There are also quite a few examples on SO about this, but my searching skills are failing me now.
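A minimal sketch of that trick, assuming a three-element 1-d vector (the zero stride makes every column a view of the same data, so nothing is copied):

import numpy as np
from numpy.lib.stride_tricks import as_strided

v = np.array([1, 2, 3])
# shape (3, 3) with stride 0 along axis 1: column j is always v itself
v_mat = as_strided(v, shape=(3, 3), strides=(v.strides[0], 0))
print(v_mat)
# [[1 1 1]
#  [2 2 2]
#  [3 3 3]]
print(np.array([[1, 2, 3], [2, 3, 4], [3, 4, 5]]) - v_mat)  # column-wise subtraction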
So in short, if you use CSR instead of CSC, it's a one-liner:
mat.data -= numpy.repeat(vec.T.toarray()[0], numpy.diff(mat.indptr))
As you may have noticed, this is better done in row-wise fashion, since we deduct the same number from each row. In your example: deduct 1 from the first row, 2 from the second row, and 3 from the third row.
I actually encountered this in a real-life application where I wanted to classify documents, each represented as a row in the matrix, while the columns represent words. Each document has a score which should be multiplied by the score of each word in that document. Using the row representation of the sparse matrix, I did something similar to this (I modified my code to answer your question):
import numpy
import scipy.sparse

mat = scipy.sparse.csc_matrix([[1, 2, 3],
                               [2, 3, 4],
                               [3, 4, 5]])
# vec is a 3x1 matrix (or a column vector)
vec = scipy.sparse.csc_matrix([1, 2, 3]).T

# Use the row version
mat_row = mat.tocsr()
vec_row = vec.T

# mat_row.data holds the values in a 1d array, one-by-one from top left to
# bottom right in row-wise traversal.
# mat_row.indptr (an n+1 element array) holds the index of the start of each
# row within mat_row.data, plus the end of the array.
# Taking the difference gives the number of non-zero elements in each row, so
# numpy.repeat repeats each element of the row vector that many times.
mat_row.data -= numpy.repeat(vec_row.toarray()[0], numpy.diff(mat_row.indptr))
print(mat_row.todense())
Which results in:
[[0 1 2]
 [0 1 2]
 [0 1 2]]
The visualization is something like this:
>>> mat_row.data
[1 2 3 2 3 4 3 4 5]
>>> mat_row.indptr
[0 3 6 9]
>>> numpy.diff(mat_row.indptr)
[3 3 3]
>>> numpy.repeat(vec_row.toarray()[0], numpy.diff(mat_row.indptr))
[1 1 1 2 2 2 3 3 3]
>>> mat_row.data -= numpy.repeat(vec_row.toarray()[0], numpy.diff(mat_row.indptr))
>>> mat_row.data
[0 1 2 0 1 2 0 1 2]
>>> mat_row.todense()
[[0 1 2]
 [0 1 2]
 [0 1 2]]
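If the rest of your pipeline expects the original CSC layout, you can convert the result back (a usage sketch continuing from the code above):

mat = mat_row.tocsc()  # back to column-compressed form after the in-place update
print(mat.todense())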