I was answering a question about the pandas interpolation method. The OP wanted to interpolate only where the number of consecutive np.nans was one. The lim
I really like numba for problems like this that are easy to grasp but hard to "numpyfy"! Even though that package might be a bit too heavy a dependency for most libraries, it lets you write such "python"-like functions without losing too much speed:
import numpy as np
import numba as nb
import math

@nb.njit
def mask_nan_if_consecutive(arr, limit):  # I'm not good at function names :(
    result = np.ones_like(arr)
    cnt = 0
    for idx in range(len(arr)):
        if math.isnan(arr[idx]):
            cnt += 1
            # If we just reached the limit we need to backtrack,
            # otherwise just mask current.
            if cnt == limit:
                for subidx in range(idx-limit+1, idx+1):
                    result[subidx] = 0
            elif cnt > limit:
                result[idx] = 0
        else:
            cnt = 0
    return result
If you have worked with pure Python, this should be quite easy to understand, and it works as expected (a 0 marks a NaN that belongs to a run of at least limit consecutive NaNs):
>>> a = np.array([1, np.nan, np.nan, np.nan, 1, np.nan, 1, 1, np.nan, np.nan, 1, 1])
>>> mask_nan_if_consecutive(a, 1)
array([ 1., 0., 0., 0., 1., 0., 1., 1., 0., 0., 1., 1.])
>>> mask_nan_if_consecutive(a, 2)
array([ 1., 0., 0., 0., 1., 1., 1., 1., 0., 0., 1., 1.])
>>> mask_nan_if_consecutive(a, 3)
array([ 1., 0., 0., 0., 1., 1., 1., 1., 1., 1., 1., 1.])
>>> mask_nan_if_consecutive(a, 4)
array([ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
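To connect the mask back to the original interpolation problem, here is a minimal sketch of one way it could be used. It is not necessarily what the OP ended up doing. The helper name interpolate_short_gaps is made up for illustration, np.interp stands in for the pandas interpolation, and the @nb.njit decorator is left out so the snippet runs without Numba. It assumes a 1-D float array with at least one non-NaN value.

```python
import math
import numpy as np

def mask_nan_if_consecutive(arr, limit):
    # Same logic as the function above, without @nb.njit (plain Python here).
    result = np.ones_like(arr)
    cnt = 0
    for idx in range(len(arr)):
        if math.isnan(arr[idx]):
            cnt += 1
            if cnt == limit:
                # Just reached the limit: mask the whole run so far.
                for subidx in range(idx - limit + 1, idx + 1):
                    result[subidx] = 0
            elif cnt > limit:
                result[idx] = 0
        else:
            cnt = 0
    return result

def interpolate_short_gaps(arr, limit):
    # Hypothetical helper: linearly fill every NaN first, then put NaN back
    # wherever the run of NaNs was at least `limit` long.
    idx = np.arange(len(arr))
    valid = ~np.isnan(arr)
    filled = np.interp(idx, idx[valid], arr[valid])
    filled[mask_nan_if_consecutive(arr, limit) == 0] = np.nan
    return filled

a = np.array([1, np.nan, np.nan, np.nan, 5, np.nan, 7, 8, np.nan, np.nan, 11, 12])
out = interpolate_short_gaps(a, 2)
print(out)
```

With limit=2 only the isolated NaN at index 5 is filled (with 6.0). The runs at indices 1 to 3 and 8 to 9 stay NaN. Note that np.interp fills leading and trailing NaNs with the nearest valid value, which may or may not be what you want.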
But the really nice thing about the @nb.njit decorator is that this function will be fast. Here it is compared with mask_knans (not shown here):
a = np.array([1, np.nan, np.nan, np.nan, 1, np.nan, 1, 1, np.nan, np.nan, 1, 1])
i = 2
res1 = mask_nan_if_consecutive(a, i)
res2 = mask_knans(a, i)
np.testing.assert_array_equal(res1, res2)
%timeit mask_nan_if_consecutive(a, i) # 100000 loops, best of 3: 6.03 µs per loop
%timeit mask_knans(a, i) # 1000 loops, best of 3: 302 µs per loop
So for short arrays this is approximately 50 times faster. The difference shrinks for longer arrays, but it is still faster (roughly 7 times here):
a = np.array([1, np.nan, np.nan, np.nan, 1, np.nan, 1, 1, np.nan, np.nan, 1, 1]*100000)
i = 2
%timeit mask_nan_if_consecutive(a, i) # 10 loops, best of 3: 20.9 ms per loop
%timeit mask_knans(a, i) # 10 loops, best of 3: 154 ms per loop
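The %timeit lines above are IPython magics. If you want to repeat the measurement from a plain script, something like the following sketch should do. The best_time helper and the jit fallback are my own names for illustration and are not part of Numba. The fallback only exists so the snippet still runs, much more slowly, when Numba is not installed.

```python
import math
import timeit
import numpy as np

try:
    import numba as nb
    jit = nb.njit
except ImportError:
    def jit(func):
        # No-op fallback: the function then runs as plain Python.
        return func

@jit
def mask_nan_if_consecutive(arr, limit):
    result = np.ones_like(arr)
    cnt = 0
    for idx in range(len(arr)):
        if math.isnan(arr[idx]):
            cnt += 1
            if cnt == limit:
                for subidx in range(idx - limit + 1, idx + 1):
                    result[subidx] = 0
            elif cnt > limit:
                result[idx] = 0
        else:
            cnt = 0
    return result

def best_time(func, *args, repeat=3, number=100):
    # Hypothetical helper: best average seconds per call.
    func(*args)  # warm-up call, so Numba's one-time compilation is not timed
    timings = timeit.repeat(lambda: func(*args), repeat=repeat, number=number)
    return min(timings) / number

a = np.array([1, np.nan, np.nan, np.nan, 1, np.nan, 1, 1, np.nan, np.nan, 1, 1])
seconds = best_time(mask_nan_if_consecutive, a, 2)
print(f"{seconds * 1e6:.2f} µs per call")
```

The warm-up call matters because a Numba-compiled function is compiled the first time it is called with a given argument type, and that first call is much slower than the following ones. The absolute numbers will of course differ from machine to machine.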