scipy.sparse.coo_matrix: how to quickly find all-zero columns, fill them with 1s, and normalize

滥情空心 2021-01-24 06:09

For a matrix, I want to find the columns that are all zeros, fill them with 1s, and then normalize the matrix by column. I know how to do that with np.arrays

[[0 0 0 0 0]
         


        
1 answer
  • 2021-01-24 06:28

    This will be a lot easier with the lil format, and working with rows rather than columns:

    In [1]: import numpy as np
       ...: from scipy import sparse
    In [2]: A=np.array([[0,0,0,0,0],[0,0,1,0,0],[1,0,0,1,0],[0,0,0,0,1],[1,0,0,0,0]])
    In [3]: A
    Out[3]: 
    array([[0, 0, 0, 0, 0],
           [0, 0, 1, 0, 0],
           [1, 0, 0, 1, 0],
           [0, 0, 0, 0, 1],
           [1, 0, 0, 0, 0]])
    In [4]: At=A.T                # switch to work with rows
    
    In [5]: M=sparse.lil_matrix(At)
    

    Now it is obvious which row is all zeros (the one with empty lists):

    In [6]: M.data
    Out[6]: array([[1, 1], [], [1], [1], [1]], dtype=object)
    In [7]: M.rows
    Out[7]: array([[2, 4], [], [1], [2], [3]], dtype=object)
    

    And lil format allows us to fill that row:

    In [8]: M.data[1]=[1,1,1,1,1]
    In [9]: M.rows[1]=[0,1,2,3,4]
    In [10]: M.A
    Out[10]: 
    array([[0, 0, 1, 0, 1],
           [1, 1, 1, 1, 1],
           [0, 1, 0, 0, 0],
           [0, 0, 1, 0, 0],
           [0, 0, 0, 1, 0]], dtype=int32)
    

    I could have also used M[1,:]=np.ones(5,int)
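
    The same idea can be extended to fill every empty row rather than one hard-coded row. The following is only a minimal sketch: the loop and the name `empty_rows` are illustrative, and it assumes the matrix holds no explicitly stored zeros, so that an empty index list really means an all-zero row.

```python
import numpy as np
from scipy import sparse

A = np.array([[0, 0, 0, 0, 0],
              [0, 0, 1, 0, 0],
              [1, 0, 0, 1, 0],
              [0, 0, 0, 0, 1],
              [1, 0, 0, 0, 0]])

# Work on the transpose so that columns of A become rows of M.
M = sparse.lil_matrix(A.T)

# In lil format an all-zero row has an empty list of column indices
# (assumption: no explicitly stored zeros).
empty_rows = [i for i, cols in enumerate(M.rows) if len(cols) == 0]

n_cols = M.shape[1]
for i in empty_rows:
    # Fill the row by writing the index list and the value list directly.
    M.rows[i] = list(range(n_cols))
    M.data[i] = [1] * n_cols

filled = M.toarray()
```

    Writing both lists directly mirrors the manual assignment shown above; `M[i, :] = 1` inside the loop should have the same effect.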

    The coo format is great for creating the array from the data/row/col arrays, but doesn't implement indexing or math. It has to be converted to csr for that, or to csc for column-oriented operations.

    The row that I filled isn't so obvious in the csr format:

    In [14]: Mc=M.tocsr()
    In [15]: Mc.data
    Out[15]: array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)
    In [16]: Mc.indices
    Out[16]: array([2, 4, 0, 1, 2, 3, 4, 1, 2, 3], dtype=int32)
    In [17]: Mc.indptr
    Out[17]: array([ 0,  2,  7,  8,  9, 10], dtype=int32)
    

    On the other hand normalizing is probably easier in this format.

    In [18]: Mc.sum(axis=1)
    Out[18]: 
    matrix([[2],
            [5],
            [1],
            [1],
            [1]], dtype=int32)
    In [19]: Mc/Mc.sum(axis=1)
    Out[19]: 
    matrix([[ 0. ,  0. ,  0.5,  0. ,  0.5],
            [ 0.2,  0.2,  0.2,  0.2,  0.2],
            [ 0. ,  1. ,  0. ,  0. ,  0. ],
            [ 0. ,  0. ,  1. ,  0. ,  0. ],
            [ 0. ,  0. ,  0. ,  1. ,  0. ]])
    

    Notice that the division converted the sparse matrix to a dense one. The sum is dense, and math involving sparse and dense usually produces dense.

    I have to use a more roundabout calculation to keep the result sparse:

    In [27]: Mc.multiply(sparse.csr_matrix(1/Mc.sum(axis=1)))
    Out[27]: 
    <5x5 sparse matrix of type '<class 'numpy.float64'>'
        with 10 stored elements in Compressed Sparse Row format>
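
    Another way to keep the result sparse, not used in the session above, is to multiply by a sparse diagonal matrix of reciprocal row sums. This is a sketch under the assumption that every row sum is nonzero (true here once the empty row is filled); the names `row_sums`, `D` and `Mn` are illustrative.

```python
import numpy as np
from scipy import sparse

# Same matrix as Mc above (the transposed example with row 1 filled).
Mc = sparse.csr_matrix(np.array([[0, 0, 1, 0, 1],
                                 [1, 1, 1, 1, 1],
                                 [0, 1, 0, 0, 0],
                                 [0, 0, 1, 0, 0],
                                 [0, 0, 0, 1, 0]]))

# Row sums as a flat 1-D float array (sum() returns a 2-D np.matrix).
row_sums = np.asarray(Mc.sum(axis=1)).ravel().astype(float)

# Left-multiplying by diag(1/row_sums) scales row i by 1/row_sums[i].
# Assumption: no row sum is zero, otherwise this divides by zero.
D = sparse.diags(1.0 / row_sums)
Mn = (D @ Mc).tocsr()
```

    For column normalization the diagonal matrix would go on the right instead, built from the column sums.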
    

    Here's a way of doing this with the csc format (on A)

    In [40]: Ms=sparse.csc_matrix(A)
    In [41]: Ms.sum(axis=0)
    Out[41]: matrix([[2, 0, 1, 1, 1]], dtype=int32)
    

    Use sum to find the all-zeros column. Obviously this could be wrong if the columns have negative values and happen to sum to 0. If that's a concern I can see making a copy of the matrix with all data values replaced by 1.
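
    To illustrate the cancellation problem and the "replace all values by 1" idea, here is a small sketch. The matrix `B` is a made-up example, not from the question; `getnnz(axis=0)` and the `indptr` differences are shown as alternatives that count stored entries per column instead of summing values.

```python
import numpy as np
from scipy import sparse

# Hypothetical example: column 0 has values that cancel, column 1 is empty.
B = np.array([[ 1, 0, 2],
              [-1, 0, 0],
              [ 0, 0, 3]])
Ms = sparse.csc_matrix(B)

# Sum-based test: wrongly flags column 0 because 1 + (-1) == 0.
by_sum = np.asarray(Ms.sum(axis=0)).ravel() == 0

# Copy with every stored value replaced by 1, then sum.
ones = Ms.copy()
ones.data[:] = 1
by_ones = np.asarray(ones.sum(axis=0)).ravel() == 0

# Count stored entries per column instead of summing values.
# Drop explicitly stored zeros first so the count means "nonzeros".
Ms.eliminate_zeros()
by_count = Ms.getnnz(axis=0) == 0

# For csc the same counts are the differences of indptr.
by_indptr = np.diff(Ms.indptr) == 0
```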

    In [43]: Ms[:,1]=np.ones(5,int)[:,None]
    /usr/lib/python3/dist-packages/scipy/sparse/compressed.py:730: SparseEfficiencyWarning: Changing the sparsity structure of a csc_matrix is expensive. lil_matrix is more efficient.
      SparseEfficiencyWarning)
    In [44]: Ms.A
    Out[44]: 
    array([[0, 1, 0, 0, 0],
           [0, 1, 1, 0, 0],
           [1, 1, 0, 1, 0],
           [0, 1, 0, 0, 1],
           [1, 1, 0, 0, 0]])
    

    The warning matters more if you do this sort of change repeatedly. Notice I have to adjust the dimension of the RHS array (the [:,None]) so that it is a column. Depending on the number of all-zero columns this action can change the sparsity of the matrix substantially.

    ==================

    I could search the col attribute of the coo format for missing column indices with:

    In [69]: Mo=sparse.coo_matrix(A)
    In [70]: Mo.col
    Out[70]: array([2, 0, 3, 4, 0], dtype=int32)
    
    In [71]: Mo.col==np.arange(Mo.shape[1])[:,None]
    Out[71]: 
    array([[False,  True, False, False,  True],
           [False, False, False, False, False],
           [ True, False, False, False, False],
           [False, False,  True, False, False],
           [False, False, False,  True, False]], dtype=bool)
    
    In [72]: idx = np.nonzero(~(Mo.col==np.arange(Mo.shape[1])[:,None]).any(axis=1))[0]
    In [73]: idx
    Out[73]: array([1], dtype=int32)
    

    I could then add a column of 1s at this idx with:

    In [75]: N=Mo.shape[0]
    In [76]: data = np.concatenate([Mo.data, np.ones(N,int)])
    In [77]: row = np.concatenate([Mo.row, np.arange(N)])
    In [78]: col = np.concatenate([Mo.col, np.ones(N,int)*idx])
    In [79]: Mo1 = sparse.coo_matrix((data,(row, col)), shape=Mo.shape)
    In [80]: Mo1.A
    Out[80]: 
    array([[0, 1, 0, 0, 0],
           [0, 1, 1, 0, 0],
           [1, 1, 0, 1, 0],
           [0, 1, 0, 0, 1],
           [1, 1, 0, 0, 0]])
    

    As written it works for just one column, but it could be generalized to several. I also created a new matrix rather than updating Mo. But this in-place assignment seems to work as well:

    Mo.data,Mo.col,Mo.row = data,col,row
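
    One possible generalization to several empty columns is sketched below. The matrix `A2` is a made-up example with two all-zero columns, and `np.setdiff1d` is swapped in here as one way to list the missing column indices; it assumes no explicitly stored zeros in `Mo`.

```python
import numpy as np
from scipy import sparse

# Hypothetical example with two all-zero columns (1 and 3).
A2 = np.array([[0, 0, 1, 0],
               [2, 0, 0, 0],
               [0, 0, 3, 0]])
Mo = sparse.coo_matrix(A2)
n_rows, n_cols = Mo.shape

# Column indices that never appear in Mo.col
# (assumption: no explicitly stored zeros).
idx = np.setdiff1d(np.arange(n_cols), Mo.col)

# One new entry per (row, empty column) pair.
new_row = np.tile(np.arange(n_rows), len(idx))
new_col = np.repeat(idx, n_rows)
new_data = np.ones(n_rows * len(idx), dtype=Mo.dtype)

Mo1 = sparse.coo_matrix(
    (np.concatenate([Mo.data, new_data]),
     (np.concatenate([Mo.row, new_row]),
      np.concatenate([Mo.col, new_col]))),
    shape=Mo.shape,
)
```

    With a single empty column this reduces to the concatenation shown above.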
    

    The normalization still requires csr conversion, though I think sparse can hide that for you.

    In [87]: Mo1/Mo1.sum(axis=0)
    Out[87]: 
    matrix([[ 0. ,  0.2,  0. ,  0. ,  0. ],
            [ 0. ,  0.2,  1. ,  0. ,  0. ],
            [ 0.5,  0.2,  0. ,  1. ,  0. ],
            [ 0. ,  0.2,  0. ,  0. ,  1. ],
            [ 0.5,  0.2,  0. ,  0. ,  0. ]])
    

    Even when I do the extra work of keeping the result sparse, I still get a csr matrix:

    In [89]: Mo1.multiply(sparse.coo_matrix(1/Mo1.sum(axis=0)))
    Out[89]: 
    <5x5 sparse matrix of type '<class 'numpy.float64'>'
        with 10 stored elements in Compressed Sparse Row format>
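
    If the result should stay in coo format, one option (a sketch, not something the session above does) is to scale the stored values directly: each entry is divided by the sum of its own column, looked up through `Mo1.col`. It assumes no column sum is zero, which holds after the fill step.

```python
import numpy as np
from scipy import sparse

# Same matrix as Mo1 above (column 1 already filled with 1s).
Mo1 = sparse.coo_matrix(np.array([[0, 1, 0, 0, 0],
                                  [0, 1, 1, 0, 0],
                                  [1, 1, 0, 1, 0],
                                  [0, 1, 0, 0, 1],
                                  [1, 1, 0, 0, 0]]))

# Column sums as a flat 1-D array.
col_sums = np.asarray(Mo1.sum(axis=0)).ravel()

# Divide every stored value by the sum of the column it lives in.
# Assumption: no column sum is zero.
scaled = Mo1.data / col_sums[Mo1.col]

# Rebuild a coo matrix with the same structure and the scaled values.
Mn = sparse.coo_matrix((scaled, (Mo1.row, Mo1.col)), shape=Mo1.shape)
```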
    

    See "Find all-zero columns in pandas sparse matrix" for more methods of finding the all-zero columns. It turns out Mo.col==np.arange(Mo.shape[1])[:,None] is too slow with large Mo, because it builds a boolean array with one row per column and one column per stored entry. A test using np.in1d is much better:

    1 - np.in1d(np.arange(Mo.shape[1]),Mo.col)
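
    A sketch of the membership test as a boolean mask, using `np.isin` (the newer spelling of `np.in1d`), together with a `np.bincount` variant that counts stored entries per column. Both assume `Mo` has no explicitly stored zeros; the names `empty_mask`, `counts` and `empty_idx` are illustrative.

```python
import numpy as np
from scipy import sparse

A = np.array([[0, 0, 0, 0, 0],
              [0, 0, 1, 0, 0],
              [1, 0, 0, 1, 0],
              [0, 0, 0, 0, 1],
              [1, 0, 0, 0, 0]])
Mo = sparse.coo_matrix(A)
n_cols = Mo.shape[1]

# Boolean mask, True where a column index never appears in Mo.col.
empty_mask = ~np.isin(np.arange(n_cols), Mo.col)

# Alternative: count stored entries per column in one pass.
counts = np.bincount(Mo.col, minlength=n_cols)
empty_idx = np.flatnonzero(counts == 0)
```

    Neither version builds the large two-dimensional comparison array, so memory use stays proportional to the number of columns plus the number of stored entries.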
    