Efficiently Subtract Vector from Matrix (Scipy)

野性不改 2020-12-16 05:37

I've got a large matrix stored as a scipy.sparse.csc_matrix and want to subtract a column vector from each one of the columns in the large matrix. This is a pretty common t

3 Answers
  • 2020-12-16 05:42

    For a start what would we do with dense arrays?

    mat-vec.A # taking advantage of broadcasting
    mat-vec.A[:,[0]*3] # explicit broadcasting
    mat-vec[:,[0,0,0]] # that also works with csr matrix
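
    The first expression above can be tried end to end with the sketch below. The example values for mat and vec are an assumption (they match the 3 x 3 example used elsewhere on this page), and .toarray() is used in place of the .A shorthand. Note that the result is dense, so this is only practical when the matrix fits in memory as a dense array.

```python
import numpy as np
import scipy.sparse as sp

# Assumed example data: a 3x3 sparse matrix and a 3x1 sparse column vector.
mat = sp.csr_matrix(np.array([[1, 2, 3],
                              [2, 3, 4],
                              [3, 4, 5]]))
vec = sp.csr_matrix(np.array([[1], [2], [3]]))

dense_vec = vec.toarray()              # shape (3, 1), same as vec.A
# sparse - dense broadcasts the column across all columns and returns a dense result
result = np.asarray(mat - dense_vec)
print(result)
```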
    

    In https://codereview.stackexchange.com/questions/32664/numpy-scipy-optimization/33566 we found that using as_strided on the mat.indptr vector is the most efficient way of stepping through the rows of a sparse matrix. (The x.rows and x.data lists of an lil_matrix are nearly as good; getrow is slow.) This function implements such an iteration.

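    from numpy.lib.stride_tricks import as_strided
    
    def subtract_rows(X, v):
        # X must be in CSR format so that X.indptr delimits rows.
        rows, cols = X.shape
        # View indptr as overlapping (start, stop) pairs, one pair per row.
        row_start_stop = as_strided(X.indptr, shape=(rows, 2),
                                    strides=2*X.indptr.strides)
        for row, (start, stop) in enumerate(row_start_stop):
            data = X.data[start:stop]  # a view, so -= updates X in place
            data -= v[row]
    
    subtract_rows(mat, vec.toarray())  # vec.A is shorthand for vec.toarray()
    print(mat.toarray())
    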
    

    I'm using vec.A for simplicity. If we keep vec sparse we'd have to add a test for nonzero value at row. Also this type of iteration only modifies the nonzero elements of mat. 0's are unchanged.

    I suspect the time advantages will depend a lot on the sparsity of the matrix and the vector. If vec has lots of zeros, then it makes sense to iterate, modifying only those rows of mat where vec is nonzero. But if vec is nearly dense, as in this example, it may be hard to beat mat-vec.A.
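
    To make the two caveats above concrete, here is a minimal sketch. It assumes a CSR matrix and a dense 1-D vec; the data and the helper name subtract_from_stored are hypothetical. It visits only the rows where vec is nonzero, and it shows that entries not stored in the matrix stay zero, so the result differs from a true elementwise subtraction.

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical example data: mat has implicit zeros, vec is mostly zero.
mat = sp.csr_matrix(np.array([[1, 0, 3],
                              [0, 0, 4],
                              [5, 6, 0]]))
vec = np.array([0, 2, 0])  # dense 1-D vector, one entry per row


def subtract_from_stored(X, v):
    """Subtract v[row] from the stored entries of each row of CSR matrix X, in place.

    Rows where v is zero are skipped; entries that are not stored are never touched.
    """
    for row in np.flatnonzero(v):
        start, stop = X.indptr[row], X.indptr[row + 1]
        X.data[start:stop] -= v[row]


# True elementwise result, computed densely for comparison.
expected_dense = mat.toarray() - vec[:, None]

subtract_from_stored(mat, vec)
print(mat.toarray())    # only the stored entry in row 1 changed
print(expected_dense)   # the zeros in row 1 became -2 here
```

    Whether the difference matters depends on the application: if the zeros should also be shifted, the result is no longer sparse and the dense route is the honest one.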

  • 2020-12-16 05:59

    You can introduce fake dimensions by altering the strides of your vector. You can, without additional allocation, "convert" your vector to a 3 x 3 matrix using np.lib.stride_tricks.as_strided. This page has an example and a bit of a discussion about it along with some discussion of related topics (like views). Search the page for "Example: fake dimensions with strides."
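
    A minimal sketch of the fake-dimension trick described above, assuming a dense 1-D NumPy vector (the variable names are illustrative). A stride of zero along the second axis makes every column alias the same memory, so no copy is made.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

vec = np.array([1, 2, 3])

# View vec as a 3x3 array: moving down a row advances one element,
# moving across a column advances zero bytes.
tiled = as_strided(vec, shape=(3, 3), strides=(vec.strides[0], 0))
print(tiled)
print(np.shares_memory(tiled, vec))
```

    Since all columns share memory, writing to such a view is unsafe. np.broadcast_to(vec[:, None], (3, 3)) gives the same read-only view with less room for error.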

    There are also quite a few examples on SO about this... but my searching skills are failing me now.

  • 2020-12-16 06:05

    Summary

    So in short, if you use CSR instead of CSC, it's a one-liner:

    mat.data -= numpy.repeat(vec.toarray()[0], numpy.diff(mat.indptr))
    

    Explanation

    This is better done row-wise, since the same number is subtracted from every element of a row. In your example: subtract 1 from the first row, 2 from the second row, and 3 from the third row.

    I actually encountered this in a real-life application where I wanted to classify documents, each represented as a row in the matrix, while the columns represent words. Each document has a score by which the score of each word in that document should be multiplied. Using the row representation of the sparse matrix, I did something similar to this (I modified my code to answer your question):

    import numpy
    import scipy.sparse
    
    mat = scipy.sparse.csc_matrix([[1, 2, 3],
                                   [2, 3, 4],
                                   [3, 4, 5]])
    
    # vec is a 3x1 matrix (a column vector)
    vec = scipy.sparse.csc_matrix([1, 2, 3]).T
    
    # Use the row version
    mat_row = mat.tocsr()
    vec_row = vec.T
    
    # mat_row.data holds the stored values in a 1-D array, in row-wise order from top left to bottom right.
    # mat_row.indptr (an n+1 element array) holds the offset in mat_row.data where each row starts, plus the end of the array.
    # Taking the difference gives the number of stored elements per row, so each element of the row vector is repeated that many times.
    mat_row.data -= numpy.repeat(vec_row.toarray()[0], numpy.diff(mat_row.indptr))
    print(mat_row.todense())
    

    Which results in:

    [[0 1 2]
     [0 1 2]
     [0 1 2]]
    

    The visualization is something like this:

    >>> mat_row.data
    [1 2 3 2 3 4 3 4 5]
    >>> mat_row.indptr
    [0 3 6 9]
    >>> numpy.diff(mat_row.indptr)
    [3 3 3]
    >>> numpy.repeat(vec_row.toarray()[0],numpy.diff(mat_row.indptr))
    [1 1 1 2 2 2 3 3 3]
    >>> mat_row.data -= numpy.repeat(vec_row.toarray()[0],numpy.diff(mat_row.indptr))
    >>> mat_row.data
    [0 1 2 0 1 2 0 1 2]
    >>> mat_row.todense()
    [[0 1 2]
     [0 1 2]
     [0 1 2]]
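
    If converting to CSR is not desirable, a similar one-liner seems possible directly on the CSC matrix from the question. This is a sketch of a different technique, not part of the answer above, and it assumes vec is available as a dense 1-D array with one entry per row. In CSC format mat.indices holds the row index of every stored value, so indexing vec with it lines the two arrays up.

```python
import numpy as np
import scipy.sparse as sp

mat = sp.csc_matrix(np.array([[1, 2, 3],
                              [2, 3, 4],
                              [3, 4, 5]]))
vec = np.array([1, 2, 3])  # assumed dense 1-D vector, one entry per row

# mat.indices[k] is the row index of the stored value mat.data[k],
# so vec[mat.indices] picks the matching entry of vec for every stored value.
mat.data -= vec[mat.indices]
print(mat.toarray())
```

    As with the CSR version, only stored values are modified; implicit zeros stay zero.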
    