Why are Numpy masked arrays useful?

余生长醉 提交于 2020-07-18 05:59:48

问题


I've been reading through the masked array documentation and I'm confused - what is different about MaskedArray than just maintaining an array of values and a boolean mask? Can someone give me an example where MaskedArrays are way more convenient, or higher performing?

Update 6/5

To be more concrete about my question, here is the classic example of how one uses a MaskedArray:

>>>data = np.arange(12).reshape(3, 4)
>>>mask = np.array([[0., 0., 1., 0.],
                    [0., 0., 0., 1.],
                    [0., 1., 0., 0.]])

>>>masked = np.ma.array(data, mask=mask)
>>>masked

masked_array(
  data=[[0, 1, --, 3],
        [4, 5, 6, --],
        [8, --, 10, 11]],
  mask=[[False, False,  True, False],
        [False, False, False,  True],
        [False,  True, False, False]],
  fill_value=999999)

>>>masked.sum(axis=0)

masked_array(data=[12, 6, 16, 14], mask=[False, False, False, False], fill_value=999999)

I could just as easily well do the same thing this way:

>>>data = np.arange(12).reshape(3, 4).astype(float)
>>>mask = np.array([[0., 0., 1., 0.],
                    [0., 0., 0., 1.],
                    [0., 1., 0., 0.]]).astype(bool)

>>>masked = data.copy()  # this keeps the original data reuseable, as would
                         # the MaskedArray. If we only need to perform one 
                         # operation then we could avoid the copy
>>>masked[mask] = np.nan
>>>np.nansum(masked, axis=0)

array([12.,  6., 16., 14.])

I suppose the MaskedArray version looks a bit nicer, and avoids the copy if you need a reuseable array. Doesn't it use just as much memory when converting from standard ndarray to MaskedArray? And does it avoid the copy under the hood when applying the mask to the data? Are there other advantages?


回答1:


The official answer is reported here:

In theory, IEEE nan was specifically designed to address the problem of missing values, but the reality is that different platforms behave differently, making life more difficult. On some platforms, the presence of nan slows calculations 10-100 times. For integer data, no nan value exists.

In fact, masked arrays can be quite slow compared to the analogous array of nans:

import numpy as np
g = np.random.random((5000,5000))
indx = np.random.randint(0,4999,(500,2))
g_nan = g.copy()
g_nan[indx] = np.nan
mask =  np.full((5000,5000),False,dtype=bool)
mask[indx] = True
g_mask = np.ma.array(g,mask=mask)

%timeit (g_mask + g_mask)**2
1.27 s ± 35.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
(g_nan + g_nan)**2
%timeit (g_nan + g_nan)**2
76.5 ms ± 715 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

When are they useful?

In many years of programming, I found them useful on the following occasions:

  • when you want to preserve the values you masked for later processing, without copying the array.
  • you don't want to get tricked by the strange behaviour of nan operations (you might be tricked by the behaviour of masked array by the way).
  • when you have to handle many arrays with their masks if the mask is part of the array you avoid code and confusions.
  • you can assign different meaning to the masked value compared to the nan value. For instance, I use np.nan for missing values but I mask also the value with poor SNR, so I can identify both.

In general, you can consider masked array as a more compact representation. The best approach is to test case by case the more comprehensible and efficient solution.



来源:https://stackoverflow.com/questions/55987642/why-are-numpy-masked-arrays-useful

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!