Mask only where consecutive NaNs exceed x

醉梦人生 2021-02-11 04:47

I was answering a question about the pandas interpolation method. The OP wanted to interpolate only where the number of consecutive np.nans was one. The lim

2 answers
  • 2021-02-11 05:30

    I really like numba for problems like this that are easy to grasp but hard to "numpyfy"! Even though the package might be too heavy a dependency for most libraries, it lets you write such "python"-like functions without losing too much speed:

    import numpy as np
    import numba as nb
    import math
    
    @nb.njit
    def mask_nan_if_consecutive(arr, limit):  # I'm not good at function names :(
        result = np.ones_like(arr)
        cnt = 0
        for idx in range(len(arr)):
            if math.isnan(arr[idx]):
                cnt += 1
                # If we just reached the limit we need to backtrack,
                # otherwise just mask current.
                if cnt == limit:
                    for subidx in range(idx-limit+1, idx+1):
                        result[subidx] = 0
                elif cnt > limit:
                    result[idx] = 0
            else:
                cnt = 0
    
        return result
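One detail worth noting: `np.ones_like(arr)` copies the dtype of `arr`, so for a float input the result is a float array of ones and zeros rather than a boolean mask. Below is a minimal sketch of converting it before using it for indexing. The hard-coded `float_mask` is simply the `limit=2` output from the demo further down in this answer, and the variable names are illustrative.

```python
import numpy as np

a = np.array([1, np.nan, np.nan, np.nan, 1, np.nan, 1, 1, np.nan, np.nan, 1, 1])

# Output of mask_nan_if_consecutive(a, 2) as shown in the demo (float dtype).
float_mask = np.array([1., 0., 0., 0., 1., 1., 1., 1., 0., 0., 1., 1.])

# Boolean indexing needs a bool array; a float array is rejected as an index.
bool_mask = float_mask.astype(bool)

# Keep only the positions that were not masked out.
kept = a[bool_mask]
print(kept)
```

The single remaining NaN in `kept` is the isolated one, which is shorter than the limit and therefore not masked.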
    

    If you have worked with pure Python, mask_nan_if_consecutive should be easy to follow, and it gives the expected results:

    >>> a = np.array([1, np.nan, np.nan, np.nan, 1, np.nan, 1, 1, np.nan, np.nan, 1, 1])
    >>> mask_nan_if_consecutive(a, 1)
    array([ 1.,  0.,  0.,  0.,  1.,  0.,  1.,  1.,  0.,  0.,  1.,  1.])
    >>> mask_nan_if_consecutive(a, 2)
    array([ 1.,  0.,  0.,  0.,  1.,  1.,  1.,  1.,  0.,  0.,  1.,  1.])
    >>> mask_nan_if_consecutive(a, 3)
    array([ 1.,  0.,  0.,  0.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])
    >>> mask_nan_if_consecutive(a, 4)
    array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])
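To tie the mask back to the interpolation question, here is a minimal sketch. It assumes the goal is to fill only NaN runs shorter than `limit` and to leave longer runs untouched. The helper `run_mask` is a hypothetical plain-Python stand-in for `mask_nan_if_consecutive` (same loop, no numba, boolean output) so that the snippet runs on its own.

```python
import math

import numpy as np
import pandas as pd


def run_mask(arr, limit):
    """Return False where a NaN belongs to a run of at least `limit` NaNs.

    Hypothetical stand-in for mask_nan_if_consecutive, without numba.
    """
    keep = np.ones(len(arr), dtype=bool)
    cnt = 0
    for idx, value in enumerate(arr):
        if math.isnan(value):
            cnt += 1
            if cnt == limit:
                # The run just reached the limit: mask it from its start.
                keep[idx - limit + 1: idx + 1] = False
            elif cnt > limit:
                keep[idx] = False
        else:
            cnt = 0
    return keep


s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 6.0])

# Interpolate everything, then put NaN back wherever the run was too long.
filled = s.interpolate().where(run_mask(s.to_numpy(), 2))
print(filled.tolist())
```

Here the single NaN between 1.0 and 3.0 is filled, while the run of two NaNs stays NaN.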
    

    But the really nice thing about the @nb.njit decorator is that mask_nan_if_consecutive will be fast:

    a = np.array([1, np.nan, np.nan, np.nan, 1, np.nan, 1, 1, np.nan, np.nan, 1, 1])
    i = 2
    
    res1 = mask_nan_if_consecutive(a, i)
    res2 = mask_knans(a, i)
    np.testing.assert_array_equal(res1, res2)
    
    %timeit mask_nan_if_consecutive(a, i)  # 100000 loops, best of 3: 6.03 µs per loop
    %timeit mask_knans(a, i)               # 1000 loops, best of 3: 302 µs per loop
    

    So for short arrays this is approximately 50 times faster. The difference shrinks for longer arrays, but it is still faster:

    a = np.array([1, np.nan, np.nan, np.nan, 1, np.nan, 1, 1, np.nan, np.nan, 1, 1]*100000)
    i = 2
    
    %timeit mask_nan_if_consecutive(a, i)  # 10 loops, best of 3: 20.9 ms per loop
    %timeit mask_knans(a, i)               # 10 loops, best of 3: 154 ms per loop
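`%timeit` is an IPython magic, so the timing lines above only work in IPython or Jupyter. A rough equivalent with the standard-library `timeit` module might look like the sketch below. The `try`/`except` fallback to plain Python is an assumption added so the snippet also runs without numba installed, and the warm-up call keeps numba's one-time compilation out of the measurement.

```python
import math
import timeit

import numpy as np

try:
    import numba as nb
    njit = nb.njit
except ImportError:
    # Assumption for this sketch: fall back to plain Python without numba.
    def njit(func):
        return func


@njit
def mask_nan_if_consecutive(arr, limit):
    # Same loop as in the answer above.
    result = np.ones_like(arr)
    cnt = 0
    for idx in range(len(arr)):
        if math.isnan(arr[idx]):
            cnt += 1
            if cnt == limit:
                for subidx in range(idx - limit + 1, idx + 1):
                    result[subidx] = 0
            elif cnt > limit:
                result[idx] = 0
        else:
            cnt = 0
    return result


a = np.array([1, np.nan, np.nan, np.nan, 1, np.nan, 1, 1, np.nan, np.nan, 1, 1] * 1000)

# Warm-up call: with numba, the first call compiles the function.
expected = mask_nan_if_consecutive(a, 2)

# Best of 3 repeats with 10 calls each, converted to seconds per call.
timings = timeit.repeat(lambda: mask_nan_if_consecutive(a, 2), repeat=3, number=10)
best = min(timings) / 10
print(f"{best * 1e3:.3f} ms per call")
```

Absolute numbers depend on the machine and library versions, so they will not match the figures quoted above.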
    
  • 2021-02-11 05:38

    I created this generalized solution:

    import pandas as pd
    import numpy as np
    from numpy.lib.stride_tricks import as_strided as strided
    
    def mask_knans(a, x):
        a = np.asarray(a)
        k = a.shape[0]
    
        # I will stride n.  I want to pad with 1 less False than
        # the required number of np.nan's
        n = np.append(np.isnan(a), [False] * (x - 1))
    
        # prepare the mask and fill it with True
        m = np.empty(k, np.bool_)  # np.bool8 was removed in NumPy 2.0
        m.fill(True)
    
        # stride n into a number of columns equal to
        # the required number of np.nan's to mask
        # this is essentially a rolling all operation on isnull
        # also reshape with `[:, None]` in preparation for broadcasting
        # np.where finds the indices where we successfully start
        # x consecutive np.nan's
        s = n.strides[0]
        i = np.where(strided(n, (k + 1 - x, x), (s, s)).all(1))[0][:, None]
    
        # since I prepped with `[:, None]` when I add `np.arange(x)`
        # I'm including the subsequent indices where the remaining
        # x - 1 np.nan's are
        i = i + np.arange(x)
    
        # I use `pd.unique` because it doesn't sort and I don't need to sort
        i = pd.unique(i[i < k])
    
        m[i] = False
    
        return m
    

    Without comments:

    import pandas as pd
    import numpy as np
    from numpy.lib.stride_tricks import as_strided as strided
    
    def mask_knans(a, x):
        a = np.asarray(a)
        k = a.shape[0]
        n = np.append(np.isnan(a), [False] * (x - 1))
        m = np.empty(k, np.bool_)
        m.fill(True)
        s = n.strides[0]
        i = np.where(strided(n, (k + 1 - x, x), (s, s)).all(1))[0][:, None]
        i = i + np.arange(x)
        i = pd.unique(i[i < k])
        m[i] = False
        return m
    

    Demo:

    mask_knans(a, 2)
    
    [ True False False False  True  True  True  True False False  True  True]
    

    mask_knans(a, 3)
    
    [ True False False False  True  True  True  True  True  True  True  True]
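The rolling-`all` step in `mask_knans` can also be written with `sliding_window_view` (available in NumPy 1.20 and later), which builds the same windows without computing strides by hand. This is a sketch of that idea, not the answer's implementation. It assumes `x` is at least 1 and no larger than `len(a)`.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

a = np.array([1, np.nan, np.nan, np.nan, 1, np.nan, 1, 1, np.nan, np.nan, 1, 1])
x = 2

isnan = np.isnan(a)

# Each row is one window of x consecutive flags; shape is (len(a) - x + 1, x).
windows = sliding_window_view(isnan, x)

# A window that is all True marks the start of x consecutive NaNs.
starts = np.flatnonzero(windows.all(axis=1))

# Expand every start into the x positions it covers and mask them.
mask = np.ones(len(a), dtype=bool)
mask[(starts[:, None] + np.arange(x)).ravel()] = False
print(mask)
```

Because only complete windows are produced, the expanded indices never run past the end of the array, so no padding or `i < k` filter is needed in this variant.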
    