This will be a lot easier with the lil format, and working with rows rather than columns:
    In [1]: from scipy import sparse
    In [2]: A=np.array([[0,0,0,0,0],[0,0,1,0,0],[1,0,0,1,0],[0,0,0,0,1],[1,0,0,0,0]])
    In [3]: A
    Out[3]:
    array([[0, 0, 0, 0, 0],
           [0, 0, 1, 0, 0],
           [1, 0, 0, 1, 0],
           [0, 0, 0, 0, 1],
           [1, 0, 0, 0, 0]])
    In [4]: At=A.T    # switch to work with rows
    In [5]: M=sparse.lil_matrix(At)
Now it is obvious which row is all zeros:
    In [6]: M.data
    Out[6]: array([[1, 1], [], [1], [1], [1]], dtype=object)
    In [7]: M.rows
    Out[7]: array([[2, 4], [], [1], [2], [3]], dtype=object)
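If I don't want to eyeball it, the empty row can be picked out programmatically from M.rows, since a lil row is all zeros exactly when its index list is empty. A minimal sketch, reusing the same A:

    import numpy as np
    from scipy import sparse

    A = np.array([[0,0,0,0,0],[0,0,1,0,0],[1,0,0,1,0],[0,0,0,0,1],[1,0,0,0,0]])
    M = sparse.lil_matrix(A.T)

    # a lil row is all-zero exactly when its rows list is empty
    empty = [i for i, r in enumerate(M.rows) if len(r) == 0]
    print(empty)        # [1]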
And the lil format allows us to fill that row:
    In [8]: M.data[1]=[1,1,1,1,1]
    In [9]: M.rows[1]=[0,1,2,3,4]
    In [10]: M.A
    Out[10]:
    array([[0, 0, 1, 0, 1],
           [1, 1, 1, 1, 1],
           [0, 1, 0, 0, 0],
           [0, 0, 1, 0, 0],
           [0, 0, 0, 1, 0]], dtype=int32)
I could also have used M[1,:]=np.ones(5,int).
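For completeness, a sketch of that whole-row assignment; lil is designed for this kind of incremental change, so it shouldn't raise the efficiency warning that csr/csc would:

    import numpy as np
    from scipy import sparse

    A = np.array([[0,0,0,0,0],[0,0,1,0,0],[1,0,0,1,0],[0,0,0,0,1],[1,0,0,0,0]])
    M = sparse.lil_matrix(A.T)

    # assign the whole row instead of touching .data and .rows directly
    M[1, :] = np.ones(5, int)
    print(M.A)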
The coo format is great for creating the array from the data/row/col arrays, but it doesn't implement indexing or math. It has to be converted to csr for that, and to csc for column-oriented operations.
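A quick sketch of that limitation; the exact error raised by coo indexing may vary with the scipy version:

    import numpy as np
    from scipy import sparse

    A = np.array([[0,0,0,0,0],[0,0,1,0,0],[1,0,0,1,0],[0,0,0,0,1],[1,0,0,0,0]])
    Mo = sparse.coo_matrix(A)

    try:
        Mo[1, 2]                     # coo does not implement indexing
    except Exception as e:           # exact exception depends on the scipy version
        print('coo indexing failed:', e)

    print(Mo.tocsr()[1, 2])          # after conversion to csr this works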
The row that I filled isn't so obvious in the csr format:
    In [14]: Mc=M.tocsr()
    In [15]: Mc.data
    Out[15]: array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)
    In [16]: Mc.indices
    Out[16]: array([2, 4, 0, 1, 2, 3, 4, 1, 2, 3], dtype=int32)
    In [17]: Mc.indptr
    Out[17]: array([ 0,  2,  7,  8,  9, 10], dtype=int32)
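The filled row can still be recovered from indptr, since consecutive differences give the number of stored values per row. A sketch:

    import numpy as np
    from scipy import sparse

    A = np.array([[0,0,0,0,0],[0,0,1,0,0],[1,0,0,1,0],[0,0,0,0,1],[1,0,0,0,0]])
    M = sparse.lil_matrix(A.T)
    M[1, :] = np.ones(5, int)
    Mc = M.tocsr()

    counts = np.diff(Mc.indptr)                  # nonzeros per row: [2 5 1 1 1]
    print(np.nonzero(counts == Mc.shape[1])[0])  # [1] -- the row that was filled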
On the other hand, normalizing is probably easier in this format:
    In [18]: Mc.sum(axis=1)
    Out[18]:
    matrix([[2],
            [5],
            [1],
            [1],
            [1]], dtype=int32)
    In [19]: Mc/Mc.sum(axis=1)
    Out[19]:
    matrix([[ 0. ,  0. ,  0.5,  0. ,  0.5],
            [ 0.2,  0.2,  0.2,  0.2,  0.2],
            [ 0. ,  1. ,  0. ,  0. ,  0. ],
            [ 0. ,  0. ,  1. ,  0. ,  0. ],
            [ 0. ,  0. ,  0. ,  1. ,  0. ]])
Notice that this converted the sparse matrix to a dense one. The sum is dense, and math involving sparse and dense usually produces a dense result. I have to use a more roundabout calculation to preserve the sparse format:
    In [27]: Mc.multiply(sparse.csr_matrix(1/Mc.sum(axis=1)))
    Out[27]:
    <5x5 sparse matrix of type '<class 'numpy.float64'>'
        with 10 stored elements in Compressed Sparse Row format>
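A sketch checking that this roundabout version matches the plain dense division:

    import numpy as np
    from scipy import sparse

    A = np.array([[0,0,0,0,0],[0,0,1,0,0],[1,0,0,1,0],[0,0,0,0,1],[1,0,0,0,0]])
    M = sparse.lil_matrix(A.T)
    M[1, :] = np.ones(5, int)
    Mc = M.tocsr()

    dense_norm  = Mc / Mc.sum(axis=1)                                   # np.matrix
    sparse_norm = Mc.multiply(sparse.csr_matrix(1 / Mc.sum(axis=1)))    # stays sparse
    print(np.allclose(sparse_norm.A, dense_norm))                       # True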
Here's a way of doing this with the csc format (on A):
    In [40]: Ms=sparse.csc_matrix(A)
    In [41]: Ms.sum(axis=0)
    Out[41]: matrix([[2, 0, 1, 1, 1]], dtype=int32)
Use sum to find the all-zeros column. Obviously this could be wrong if the columns have negative values that happen to sum to 0. If that's a concern, I can see making a copy of the matrix with all data values replaced by 1.
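A sketch of that idea; I've swapped in an A with a negative value (my own variation) so the difference shows, and I count stored entries per column instead of summing values:

    import numpy as np
    from scipy import sparse

    # column 2 sums to 0 without being empty; only column 1 is truly all-zero
    A = np.array([[0,0,-1,0,0],[0,0,1,0,0],[1,0,0,1,0],[0,0,0,0,1],[1,0,0,0,0]])
    Ms = sparse.csc_matrix(A)

    counts = Ms.copy()
    counts.data = np.ones_like(counts.data)   # every stored value becomes 1
    print(Ms.sum(axis=0))                     # [[ 2  0  0  1  1]] -- column 2 looks empty
    print(counts.sum(axis=0))                 # [[2 0 2 1 1]]      -- it isn't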
    In [43]: Ms[:,1]=np.ones(5,int)[:,None]
    /usr/lib/python3/dist-packages/scipy/sparse/compressed.py:730: SparseEfficiencyWarning: Changing the sparsity structure of a csc_matrix is expensive. lil_matrix is more efficient.
      SparseEfficiencyWarning)
    In [44]: Ms.A
    Out[44]:
    array([[0, 1, 0, 0, 0],
           [0, 1, 1, 0, 0],
           [1, 1, 0, 1, 0],
           [0, 1, 0, 0, 1],
           [1, 1, 0, 0, 0]])
The warning matters more if you do this sort of change repeatedly. Notice that I have to adjust the dimensions of the RHS array to match the column. Depending on the number of all-zero columns, this action can change the sparsity of the matrix substantially.
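If several columns need filling, one way to sidestep the repeated warnings is to round-trip through lil; a sketch (whether the conversions pay off depends on the matrix size):

    import numpy as np
    from scipy import sparse

    A = np.array([[0,0,0,0,0],[0,0,1,0,0],[1,0,0,1,0],[0,0,0,0,1],[1,0,0,0,0]])
    Ms = sparse.csc_matrix(A)

    Ml = Ms.tolil()                      # structural changes are cheap in lil
    Ml[:, 1] = np.ones((5, 1), int)      # no SparseEfficiencyWarning here
    Ms = Ml.tocsc()
    print(Ms.A)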
==================
I could search the col attribute of the coo format for missing values with:
    In [69]: Mo=sparse.coo_matrix(A)
    In [70]: Mo.col
    Out[70]: array([2, 0, 3, 4, 0], dtype=int32)
    In [71]: Mo.col==np.arange(Mo.shape[1])[:,None]
    Out[71]:
    array([[False,  True, False, False,  True],
           [False, False, False, False, False],
           [ True, False, False, False, False],
           [False, False,  True, False, False],
           [False, False, False,  True, False]], dtype=bool)
    In [72]: idx = np.nonzero(~(Mo.col==np.arange(Mo.shape[1])[:,None]).any(axis=1))[0]
    In [73]: idx
    Out[73]: array([1], dtype=int32)
I could then add a column of 1s at this idx with:
    In [75]: N=Mo.shape[0]
    In [76]: data = np.concatenate([Mo.data, np.ones(N,int)])
    In [77]: row = np.concatenate([Mo.row, np.arange(N)])
    In [78]: col = np.concatenate([Mo.col, np.ones(N,int)*idx])
    In [79]: Mo1 = sparse.coo_matrix((data,(row, col)), shape=Mo.shape)
    In [80]: Mo1.A
    Out[80]:
    array([[0, 1, 0, 0, 0],
           [0, 1, 1, 0, 0],
           [1, 1, 0, 1, 0],
           [0, 1, 0, 0, 1],
           [1, 1, 0, 0, 0]])
As written it works for just one column, but it could be generalized to several (see the sketch below). I also created a new matrix rather than update Mo. But this in-place assignment seems to work as well:

    Mo.data,Mo.col,Mo.row = data,col,row
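A sketch of one way to generalize this to several missing columns at once; the A here is my own variation with two empty columns:

    import numpy as np
    from scipy import sparse

    # columns 1 and 3 are all-zero in this variation
    A = np.array([[0,0,0,0,0],[0,0,1,0,0],[1,0,0,0,0],[0,0,0,0,1],[1,0,0,0,0]])
    Mo = sparse.coo_matrix(A)
    N = Mo.shape[0]

    idx = np.nonzero(~(Mo.col == np.arange(Mo.shape[1])[:,None]).any(axis=1))[0]  # [1 3]

    # one block of N ones per missing column
    data = np.concatenate([Mo.data, np.ones(N*len(idx), int)])
    row  = np.concatenate([Mo.row,  np.tile(np.arange(N), len(idx))])
    col  = np.concatenate([Mo.col,  np.repeat(idx, N)])
    Mo1  = sparse.coo_matrix((data, (row, col)), shape=Mo.shape)
    print(Mo1.A)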
The normalization still requires csr conversion, though I think sparse can hide that for you:
    In [87]: Mo1/Mo1.sum(axis=0)
    Out[87]:
    matrix([[ 0. ,  0.2,  0. ,  0. ,  0. ],
            [ 0. ,  0.2,  1. ,  0. ,  0. ],
            [ 0.5,  0.2,  0. ,  1. ,  0. ],
            [ 0. ,  0.2,  0. ,  0. ,  1. ],
            [ 0.5,  0.2,  0. ,  0. ,  0. ]])
Even when I do the extra work of maintaining the sparse nature, I still get a csr matrix:
    In [89]: Mo1.multiply(sparse.coo_matrix(1/Mo1.sum(axis=0)))
    Out[89]:
    <5x5 sparse matrix of type '<class 'numpy.float64'>'
        with 10 stored elements in Compressed Sparse Row format>
See Find all-zero columns in pandas sparse matrix for more methods of finding the 0 columns. It turns out Mo.col==np.arange(Mo.shape[1])[:,None] is too slow with large Mo. A test using np.in1d is much better:
    1 - np.in1d(np.arange(Mo.shape[1]),Mo.col)
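Written out as a sketch, the faster test and the idx it yields:

    import numpy as np
    from scipy import sparse

    A = np.array([[0,0,0,0,0],[0,0,1,0,0],[1,0,0,1,0],[0,0,0,0,1],[1,0,0,0,0]])
    Mo = sparse.coo_matrix(A)

    mask = ~np.in1d(np.arange(Mo.shape[1]), Mo.col)   # True where a column never occurs in Mo.col
    print(np.nonzero(mask)[0])                        # [1]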