Efficiently Subtract Vector from Matrix (Scipy)

野性不改 2020-12-16 05:37

I've got a large matrix stored as a scipy.sparse.csc_matrix and want to subtract a column vector from each one of the columns in the large matrix. This is a pretty common t

3 Answers
  •  有刺的猬
    2020-12-16 06:05

    Summary

    In short, if you use CSR instead of CSC, it's a one-liner (here vec is a sparse 1 x n row vector with one entry per row of mat):

    mat.data -= numpy.repeat(vec.toarray()[0], numpy.diff(mat.indptr))
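
As a minimal sketch of what the one-liner above does, the snippet below wraps it in a helper and runs it on a matrix that contains zeros. The names `subtract_per_row` and `row_values` are hypothetical and not part of SciPy, and the values are passed as a plain 1-D NumPy array rather than a sparse vector. Float data is assumed so that the in-place subtraction needs no casting.

```python
import numpy as np
import scipy.sparse as sp


def subtract_per_row(mat_csr, row_values):
    """Subtract row_values[i] from every stored entry of row i, in place.

    Hypothetical helper: mat_csr must be in CSR format and row_values
    a 1-D array with one value per row.
    """
    # np.diff(indptr) is the number of stored entries in each row.
    counts = np.diff(mat_csr.indptr)
    # Repeat each row's value once per stored entry, matching the data layout.
    mat_csr.data -= np.repeat(row_values, counts)
    return mat_csr


mat = sp.csr_matrix(np.array([[1.0, 0.0, 3.0],
                              [0.0, 3.0, 0.0],
                              [3.0, 4.0, 5.0]]))
result = subtract_per_row(mat, np.array([1.0, 2.0, 3.0]))
dense = result.toarray()
print(dense)
```

Only stored entries are modified, so the positions that were implicit zeros stay at 0 instead of becoming negative. If a full dense-style subtraction is needed, this shortcut is not equivalent to it.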
    

    Explanation

    This is easier to do row-wise, because the same number is subtracted from every entry in a row. In your example: subtract 1 from the first row, 2 from the second row, and 3 from the third row.

    I ran into this in a real-life application where I wanted to classify documents, each represented as a row of the matrix, with the columns representing words. Each document has a score by which the score of each word in that document should be multiplied. Using the row representation of the sparse matrix, I did something similar to this (I modified my code to answer your question):

    import numpy
    import scipy.sparse
    
    mat = scipy.sparse.csc_matrix([[1, 2, 3],
                                   [2, 3, 4],
                                   [3, 4, 5]])
    
    # vec is a 3x1 matrix (a column vector)
    vec = scipy.sparse.csc_matrix([1, 2, 3]).T
    
    # Use the row version
    mat_row = mat.tocsr()
    vec_row = vec.T
    
    # mat_row.data holds the stored values in a 1-D array, in row-wise order from top left to bottom right.
    # mat_row.indptr (an n+1 element array) holds the index in mat_row.data where each row starts, plus a final entry equal to len(mat_row.data).
    # numpy.diff(mat_row.indptr) is therefore the number of stored entries per row, so numpy.repeat copies each element of the row vector once per stored entry in its row.
    mat_row.data -= numpy.repeat(vec_row.toarray()[0], numpy.diff(mat_row.indptr))
    print(mat_row.todense())
    

    Which results in:

    [[0 1 2]
     [0 1 2]
     [0 1 2]]
    

    Step by step, the intermediate values look like this:

    >>> mat_row.data
    [1 2 3 2 3 4 3 4 5]
    >>> mat_row.indptr
    [0 3 6 9]
    >>> numpy.diff(mat_row.indptr)
    [3 3 3]
    >>> numpy.repeat(vec_row.toarray()[0],numpy.diff(mat_row.indptr))
    [1 1 1 2 2 2 3 3 3]
    >>> mat_row.data -= numpy.repeat(vec_row.toarray()[0],numpy.diff(mat_row.indptr))
    >>> mat_row.data
    [0 1 2 0 1 2 0 1 2]
    >>> mat_row.todense()
    [[0 1 2]
     [0 1 2]
     [0 1 2]]
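
The same repeat-by-row-count trick should carry over to the document-scoring case mentioned earlier, where each row is multiplied by a per-document score. The following is only a sketch under assumed data: `word_scores` and `doc_scores` are hypothetical names and values, not taken from the original application.

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical data: 3 documents (rows) x 3 words (columns).
word_scores = sp.csr_matrix(np.array([[1.0, 2.0, 0.0],
                                      [0.0, 3.0, 4.0],
                                      [5.0, 0.0, 6.0]]))
# One hypothetical score per document (per row).
doc_scores = np.array([2.0, 0.5, 1.0])

# Repeat each document's score once per stored entry in its row,
# then scale the stored values in place.
word_scores.data *= np.repeat(doc_scores, np.diff(word_scores.indptr))
scaled = word_scores.toarray()
print(scaled)
```

For multiplication the result matches dense row scaling exactly, because an implicit zero times any score is still zero.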
    
