I have a matrix of size N*M and I want to find the mean value of each row. The values are from 1 to 5, and entries that do not have any value are set to 0. However, when I compute the mean, the zeros are included and skew the result. How can I compute the mean of each row using only the non-zero entries?
I will detail here the more general solution that uses a masked array. To illustrate the details, let's create a lower triangular matrix containing only ones:
import numpy as np

matrix = np.tril(np.ones((5, 5)), 0)
If the terminology above is not clear, this matrix looks like this:
[[ 1., 0., 0., 0., 0.],
 [ 1., 1., 0., 0., 0.],
 [ 1., 1., 1., 0., 0.],
 [ 1., 1., 1., 1., 0.],
 [ 1., 1., 1., 1., 1.]]
Now, we want our function to return a mean of 1 for each of the rows; in other words, the mean over axis 1 should equal a vector of five ones. To achieve this, we create a masked array where the entries whose value is zero are considered invalid. This can be done with np.ma.masked_equal:
masked = np.ma.masked_equal(matrix, 0)
Finally, we perform NumPy operations on this array, which will systematically ignore the masked elements (the zeros). With this in mind, we obtain the desired result with:
masked.mean(axis=1)
This produces a vector whose entries are all ones.
In more detail, the output of np.ma.masked_equal(matrix, 0) should look like this:
masked_array(data =
[[1.0 -- -- -- --]
[1.0 1.0 -- -- --]
[1.0 1.0 1.0 -- --]
[1.0 1.0 1.0 1.0 --]
[1.0 1.0 1.0 1.0 1.0]],
mask =
[[False True True True True]
[False False True True True]
[False False False True True]
[False False False False True]
[False False False False False]],
fill_value = 0.0)
This indicates that the values shown as -- are considered invalid. The same information appears in the mask attribute of the masked array, where True marks an invalid element that should therefore be ignored.
Finally, the output of the mean operation on this array is:
masked_array(data = [1.0 1.0 1.0 1.0 1.0],
mask = [False False False False False],
fill_value = 1e+20)
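Putting the steps above together, a complete runnable version looks like this:

```python
import numpy as np

# Lower triangular matrix of ones, as in the example above
matrix = np.tril(np.ones((5, 5)), 0)

# Mask out the zeros so subsequent operations ignore them
masked = np.ma.masked_equal(matrix, 0)

# Row means computed only over the unmasked (non-zero) entries
row_means = masked.mean(axis=1)

print(row_means)  # [1.0 1.0 1.0 1.0 1.0]
```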
Get the count of non-zeros in each row and use that to average the summation along each row. Thus, the implementation would look something like this -
np.true_divide(matrix.sum(1),(matrix!=0).sum(1))
If you are on an older version of NumPy, you can use a float conversion of the count to replace np.true_divide, like so -
matrix.sum(1)/(matrix!=0).sum(1).astype(float)
Sample run -
In [160]: matrix
Out[160]:
array([[0, 0, 1, 0, 2],
       [1, 0, 0, 2, 0],
       [0, 1, 1, 0, 0],
       [0, 2, 2, 2, 2]])
In [161]: np.true_divide(matrix.sum(1),(matrix!=0).sum(1))
Out[161]: array([ 1.5, 1.5, 1. , 2. ])
Another way to solve the problem would be to replace zeros with NaNs and then use np.nanmean, which ignores those NaNs and, in effect, the original zeros, like so -
np.nanmean(np.where(matrix!=0,matrix,np.nan),1)
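Applied to the same sample matrix as in the run above, this gives the same result:

```python
import numpy as np

matrix = np.array([[0, 0, 1, 0, 2],
                   [1, 0, 0, 2, 0],
                   [0, 1, 1, 0, 0],
                   [0, 2, 2, 2, 2]])

# Zeros become NaN, then nanmean skips them in each row
result = np.nanmean(np.where(matrix != 0, matrix, np.nan), 1)
print(result)  # [1.5 1.5 1.  2. ]
```

Note that a row containing only zeros would become all NaN and produce a NaN mean (with a RuntimeWarning), just as the count-based approach would divide by a zero count there.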
From a performance point of view, I would recommend the first approach.
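As a rough sanity check of that recommendation, you can time both approaches yourself with timeit; this is only a sketch (it assumes a recent NumPy with np.random.default_rng, and exact numbers will vary by machine and matrix size):

```python
import timeit

import numpy as np

# Random test matrix with values 0-5, where 0 means "no value";
# the size is an arbitrary choice for benchmarking
rng = np.random.default_rng(0)
matrix = rng.integers(0, 6, size=(1000, 100))

def count_based():
    # Sum each row, divide by the number of non-zero entries
    return np.true_divide(matrix.sum(1), (matrix != 0).sum(1))

def nan_based():
    # Replace zeros with NaN, then take the NaN-ignoring mean
    return np.nanmean(np.where(matrix != 0, matrix, np.nan), 1)

print("count-based:", timeit.timeit(count_based, number=100))
print("nan-based:  ", timeit.timeit(nan_based, number=100))
```

Both functions should return the same row means; the count-based version avoids allocating a full float copy of the matrix, which is why it tends to be faster.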