mask only where consecutive nans exceeds x

前端未结
关注
 2  2083
醉梦人生 2021-02-11 04:47
I was answering a question about pandas interpolation method. The OP wanted to use only interpolate where the number of consecutive np.nans was one. The lim

      
      
        
          2条回答        

        
                    
            
            
                         
                
              
              
                
                   小蘑菇
                                             
                
                
                (楼主)
            
              
              
                2021-02-11 05:30
              

            
            
                        
I really like numba for such easy to grasp but hard to "numpyfy" problems! Even though that package might be a bit too heavy for most libraries it allows to write such "python"-like functions without loosing too much speed:

import numpy as np
import numba as nb
import math

@nb.njit
def mask_nan_if_consecutive(arr, limit):  # I'm not good at function names :(
    result = np.ones_like(arr)
    cnt = 0
    for idx in range(len(arr)):
        if math.isnan(arr[idx]):
            cnt += 1
            # If we just reached the limit we need to backtrack,
            # otherwise just mask current.
            if cnt == limit:
                for subidx in range(idx-limit+1, idx+1):
                    result[subidx] = 0
            elif cnt > limit:
                result[idx] = 0
        else:
            cnt = 0

    return result


At least if you worked with pure-python this should be quite easy to understand and it should work:

>>> a = np.array([1, np.nan, np.nan, np.nan, 1, np.nan, 1, 1, np.nan, np.nan, 1, 1])
>>> mask_nan_if_consecutive(a, 1)
array([ 1.,  0.,  0.,  0.,  1.,  0.,  1.,  1.,  0.,  0.,  1.,  1.])
>>> mask_nan_if_consecutive(a, 2)
array([ 1.,  0.,  0.,  0.,  1.,  1.,  1.,  1.,  0.,  0.,  1.,  1.])
>>> mask_nan_if_consecutive(a, 3)
array([ 1.,  0.,  0.,  0.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])
>>> mask_nan_if_consecutive(a, 4)
array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])


But the really nice thing about @nb.njit-decorator is, that this function will be fast:

a = np.array([1, np.nan, np.nan, np.nan, 1, np.nan, 1, 1, np.nan, np.nan, 1, 1])
i = 2

res1 = mask_nan_if_consecutive(a, i)
res2 = mask_knans(a, i)
np.testing.assert_array_equal(res1, res2)

%timeit mask_nan_if_consecutive(a, i)  # 100000 loops, best of 3: 6.03 µs per loop
%timeit mask_knans(a, i)               # 1000 loops, best of 3: 302 µs per loop


So for short arrays this is approximatly 50 times faster, even though the difference gets lower it's still faster for longer arrays:

a = np.array([1, np.nan, np.nan, np.nan, 1, np.nan, 1, 1, np.nan, np.nan, 1, 1]*100000)
i = 2

%timeit mask_nan_if_consecutive(a, i)  # 10 loops, best of 3: 20.9 ms per loop
%timeit mask_knans(a, i)               # 10 loops, best of 3: 154 ms per loop

    
             
                                                        
            
            
              
                
                0
              
                   
                
               讨论(0)
              
                                                  
              
              
                          
             
       
          
              
                                       
     查看其它2个回答


            
                         
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
                              			
        
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复