Efficiently Subtract Vector from Matrix (Scipy)


野性不改 2020-12-16 05:37

I've got a large matrix stored as a scipy.sparse.csc_matrix and want to subtract a column vector from each one of the columns in the large matrix. This is a pretty common t

3 Answers

有刺的猬 (original poster)

2020-12-16 06:05

Summary

In short: if you use CSR instead of CSC, it's a one-liner (with vec as a 1 x n sparse row vector, i.e. the transpose of your column vector):

mat.data -= numpy.repeat(vec.toarray()[0], numpy.diff(mat.indptr))

Explanation

This is easier to do row by row, because the same number is subtracted from every entry in a given row. In your example: subtract 1 from the first row, 2 from the second row, and 3 from the third row.

I ran into this in a real application: classifying documents, where each row of the matrix is a document and each column is a word. Each document has a score by which the score of every word in that document should be multiplied. Using the row (CSR) representation of the sparse matrix, I did something similar to this (I modified my code to answer your question):

import numpy
import scipy.sparse

mat = scipy.sparse.csc_matrix([[1, 2, 3],
                               [2, 3, 4],
                               [3, 4, 5]])

# vec is a 3x1 matrix (a column vector)
vec = scipy.sparse.csc_matrix([1, 2, 3]).T

# Use the row-oriented (CSR) versions
mat_row = mat.tocsr()
vec_row = vec.T

# mat_row.data holds the stored values in a 1-D array, in row-major order (top left to bottom right).
# mat_row.indptr (n+1 elements for n rows) holds the offset in mat_row.data where each row starts, plus a final entry equal to len(mat_row.data).
# numpy.diff(mat_row.indptr) is therefore the number of stored entries per row, and numpy.repeat repeats each vector element that many times.
mat_row.data -= numpy.repeat(vec_row.toarray()[0], numpy.diff(mat_row.indptr))
print(mat_row.todense())

Which results in:

[[0 1 2]
 [0 1 2]
 [0 1 2]]
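The document-scoring case described above follows the same pattern, with multiplication in place of subtraction. The following is only a minimal sketch of that idea, not the original application code: the names `counts` and `doc_scores` and the sample values are made up for illustration.

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical data: 3 documents (rows) x 4 words (columns).
counts = sp.csr_matrix(np.array([[1.0, 0.0, 2.0, 0.0],
                                 [0.0, 3.0, 0.0, 0.0],
                                 [4.0, 0.0, 5.0, 6.0]]))

# Hypothetical per-document scores, one per row.
doc_scores = np.array([0.5, 2.0, 1.0])

# Number of stored entries in each row.
nnz_per_row = np.diff(counts.indptr)

# Repeat each score once per stored entry of its row, then scale in place.
scaled = counts.copy()
scaled.data *= np.repeat(doc_scores, nnz_per_row)

print(scaled.toarray())
```

Because multiplying an implicit zero still gives zero, the scaled matrix should agree with the equivalent dense row-wise multiplication.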

Step by step, the subtraction on the 3x3 example looks something like this:

>>> mat_row.data
[1 2 3 2 3 4 3 4 5]
>>> mat_row.indptr
[0 3 6 9]
>>> numpy.diff(mat_row.indptr)
[3 3 3]
>>> numpy.repeat(vec_row.toarray()[0],numpy.diff(mat_row.indptr))
[1 1 1 2 2 2 3 3 3]
>>> mat_row.data -= numpy.repeat(vec_row.toarray()[0],numpy.diff(mat_row.indptr))
>>> mat_row.data
[0 1 2 0 1 2 0 1 2]
>>> mat_row.todense()
[[0 1 2]
 [0 1 2]
 [0 1 2]]
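One thing worth keeping in mind: the 3x3 example has no zeros, so every entry is stored and the result equals an ordinary dense subtraction. The trick only touches entries held in `data`, so implicit zeros stay zero. The sketch below uses a made-up matrix with uneven rows to show the difference; whether that behavior is what you want depends on whether zeros in your matrix mean "value 0" or "no value".

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical matrix with implicit zeros and a different number of
# stored entries in each row.
mat = sp.csr_matrix(np.array([[1, 0, 3],
                              [0, 0, 4],
                              [5, 6, 7]]))
vec = np.array([1, 2, 3])  # one value per row

# Subtract vec[i] from the stored entries of row i only.
sparse_result = mat.copy()
sparse_result.data -= np.repeat(vec, np.diff(mat.indptr))

# Dense reference: subtract vec[i] from every entry of row i.
dense_result = mat.toarray() - vec[:, None]

print(np.diff(mat.indptr))      # stored entries per row
print(sparse_result.toarray())  # implicit zeros are still 0
print(dense_result)             # zeros have become -vec[i]
```

The two results agree at the positions that were stored in `mat` and differ wherever `mat` had an implicit zero.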
