Extend numpy mask by n cells to the right for each bad value, efficiently

[愿得一人] 2021-02-15 15:39

Let's say I have a length 30 array with 4 bad values in it. I want to create a mask for those bad values, but since I will be using rolling window functions, I'd also like a f

7 answers

时光说笑 (OP)

2021-02-15 15:42
A few years late, but I've come up with a fully vectorized solution that requires no loops or copies (besides the mask itself). This solution is potentially dangerous because it uses numpy.lib.stride_tricks.as_strided, which can read or write memory outside the array if the shape or strides are wrong. It's also not as fast as @swentzel's solution.

The idea is to take the mask and create a 2D view of it, where the second dimension is just the elements that follow the current element. Then you can just set an entire column to True if the head is True. Since you are dealing with a view, setting a column will actually set the following elements in the mask.

Start with the data:
```
import numpy as np
a = np.array([4, 0, 8, 5, 10, 9, np.nan, 1, 4, 9, 9, np.nan, np.nan, 9,
              9, 8, 0, 3, 7, 9, 2, 6, 7, 2, 9, 4, 1, 1, np.nan, 10])
n = 3
```
Now, we will make the mask a.size + n elements long, so that you don't have to process the last n elements manually:
```
mask = np.empty(a.size + n, dtype=bool)  # plain bool: np.bool is missing in NumPy 1.24-1.26
np.isnan(a, out=mask[:a.size])
mask[a.size:] = False
```
Now the cool part:
```
view = np.lib.stride_tricks.as_strided(mask, shape=(n + 1, a.size),
                                       strides=mask.strides * 2)
```
That last part is crucial. mask.strides is a tuple like (1,), since a NumPy bool is one byte wide. Repeating the tuple with * 2 gives (1, 1), which means that you take a 1-byte step to move one element in either dimension.
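To make the stride arithmetic concrete, here is a small sketch on a toy buffer. The array x and the shape (3, 4) are made up for illustration; they are not part of the solution itself. It shows that row k of the view is just the buffer shifted by k elements, and that writing to the view writes to the buffer.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

# Hypothetical toy buffer, one byte per element like a bool mask.
x = np.arange(6, dtype=np.uint8)
print(x.strides)        # (1,)
print(x.strides * 2)    # (1, 1): tuple repetition, not arithmetic

# 3 rows of 4 columns; row k starts k bytes into x, so the rows overlap.
v = as_strided(x, shape=(3, 4), strides=x.strides * 2)
print(v)
# [[0 1 2 3]
#  [1 2 3 4]
#  [2 3 4 5]]

# Writing through the view changes the underlying buffer:
# v[1, 0] is x[1] and v[2, 0] is x[2].
v[1:, 0] = 99
print(x.tolist())       # [0, 99, 99, 3, 4, 5]
```

The largest offset touched is (rows - 1) + (columns - 1), which is why the mask needs the extra padding cells at the end.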

Now all you need to do is expand the mask:
```
view[1:, view[0]] = True
```
That's it. Now mask has what you want. Keep in mind that this only works because the assignment index precedes the last changed value. You could not get away with view[1:] |= view[0].
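As a sanity check, the steps above can be run end to end and compared with a plain loop. This is only a sketch: the names result and expected are mine, and the loop is a hypothetical reference, not something from the question. Note that mask still carries the n padding cells, so the usable part is mask[:a.size].

```python
import numpy as np

a = np.array([4, 0, 8, 5, 10, 9, np.nan, 1, 4, 9, 9, np.nan, np.nan, 9,
              9, 8, 0, 3, 7, 9, 2, 6, 7, 2, 9, 4, 1, 1, np.nan, 10])
n = 3

# Same steps as above, with the built-in bool as the dtype.
mask = np.zeros(a.size + n, dtype=bool)
np.isnan(a, out=mask[:a.size])
view = np.lib.stride_tricks.as_strided(mask, shape=(n + 1, a.size),
                                       strides=mask.strides * 2)
view[1:, view[0]] = True
result = mask[:a.size]              # drop the n padding cells

# Hypothetical reference: mark each NaN and the n cells to its right.
expected = np.isnan(a)
for i in np.flatnonzero(np.isnan(a)):
    expected[i:i + n + 1] = True

print(np.array_equal(result, expected))   # True
print(np.flatnonzero(result).tolist())
# [6, 7, 8, 9, 11, 12, 13, 14, 15, 28, 29]
```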

For benchmarking purposes, it appears that the definition of n has changed from the question: here n is the total length of each masked run, counting the bad cell itself. The following function takes that into account:
```
def madphysicist0(a, n):
    m = np.empty(a.size + n - 1, dtype=bool)
    np.isnan(a, out=m[:a.size])
    m[a.size:] = False

    v = np.lib.stride_tricks.as_strided(m, shape=(n, a.size), strides=m.strides * 2)
    v[1:, v[0]] = True
    return v[0]
```
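To pin down what n means in the benchmark version, here is a rough cross-check against a convolution. The helper reference_mask is hypothetical (my own, not from any of the answers), and madphysicist0 is repeated only so the snippet runs on its own, with the built-in bool as the dtype.

```python
import numpy as np

def madphysicist0(a, n):
    # Repeated from the answer so this snippet is self-contained.
    m = np.empty(a.size + n - 1, dtype=bool)
    np.isnan(a, out=m[:a.size])
    m[a.size:] = False
    v = np.lib.stride_tricks.as_strided(m, shape=(n, a.size), strides=m.strides * 2)
    v[1:, v[0]] = True
    return v[0]

def reference_mask(a, n):
    # Hypothetical reference: cell k is masked if any of the n cells
    # ending at k (that is, k - n + 1 .. k) is NaN.
    bad = np.isnan(a).astype(int)
    return np.convolve(bad, np.ones(n, dtype=int))[:a.size] > 0

a = np.array([1.0, np.nan, 2.0, 3.0, 4.0, np.nan, 5.0])
for n in (1, 2, 3, 5):
    assert np.array_equal(madphysicist0(a, n), reference_mask(a, n))

# With n = 3 each NaN masks itself plus the next two cells.
print(np.flatnonzero(madphysicist0(a, 3)).tolist())   # [1, 2, 3, 5, 6]
```

So madphysicist0(a, n) here corresponds to extending the mask by n - 1 cells in the first version.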
V2

Taking a leaf out of the existing fastest answer, we only need to copy log₂(n) rows, not n rows:
```
def madphysicist1(a, n):
    m = np.empty(a.size + n - 1, dtype=bool)
    np.isnan(a, out=m[:a.size])
    m[a.size:] = False

    v = np.lib.stride_tricks.as_strided(m, shape=(n, a.size), strides=m.strides * 2)

    stop = int(np.log2(n))
    for k in range(1, stop + 1):
        v[k, v[0]] = True
    if (1<
```
Since this doubles the length of each masked run at every iteration, it works a bit faster than Fibonacci: https://math.stackexchange.com/q/894743/295281
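The doubling idea can also be sketched without as_strided, using ordinary shifted slices. To be clear, this is an independent illustration and not a reconstruction of the function above; the name extend_mask_doubling and the variable covered are my own. It assumes the benchmark meaning of n (total run length, including the bad cell).

```python
import numpy as np

def extend_mask_doubling(a, n):
    # Hypothetical helper: mask every NaN and the n - 1 cells to its right
    # using about log2(n) shifted ORs instead of n - 1 of them.
    m = np.isnan(a)
    covered = 1                      # cells currently masked per NaN
    while covered < n:
        # Shift by at most `covered` so the run stays contiguous, and by
        # no more than what is still missing so it never overshoots n.
        shift = min(covered, n - covered)
        # Copy the source slice so the overlapping in-place OR is unambiguous.
        m[shift:] |= m[:-shift].copy()
        covered += shift
    return m

a = np.array([1.0, np.nan, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, np.nan])
print(np.flatnonzero(extend_mask_doubling(a, 5)).tolist())   # [1, 2, 3, 4, 5, 8]
```

Each pass ORs the mask with a shifted copy of itself, so a run of length covered grows to covered + shift. The copy costs one temporary per pass, which is the price for staying away from stride tricks.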