pandas how to find continuous values in a series whose differences are within a certain distance

后端未结

关注

 2  605

独厮守ぢ 2021-02-15 02:44

I have a pandas Series that is composed of ints

a = np.array([1,2,3,5,7,10,13,16,20])
pd.Series(a)

0  1
1  2
2  3
3  5
4


      
      
        
          2条回答        

        
                    
            
            
                         
                
              
              
                
                   轻奢々
                                             
                
                
                (楼主)
            
              
              
                2021-02-15 03:22
              

            
            
                        
Here's one approach -

np.split(a,np.flatnonzero(np.diff(a)>d)+1)


As a function to output list of lists -

def splitme(a,d) : 
    return list(map(list,np.split(a,np.flatnonzero(np.diff(a)>d)+1)))


For performance, I would suggest using zip to get the start, stop indices and then slicing, thus avoiding np.split which might prove to be the bottleneck -

def splitme_zip(a,d) : 
    m = np.concatenate(([True],a[1:] > a[:-1] + d,[True]))
    idx = np.flatnonzero(m)
    l = a.tolist()
    return [l[i:j] for i,j in zip(idx[:-1],idx[1:])]


If you need the output as a list of arrays, skip the list conversion with  .tolist/map(list,).

Sample runs -

In [122]: a = np.array([1,2,3,5,7,10,13,16,20])

In [123]: splitme(a,1)
Out[123]: [[1, 2, 3], [5], [7], [10], [13], [16], [20]]

In [124]: splitme(a,2)
Out[124]: [[1, 2, 3, 5, 7], [10], [13], [16], [20]]

In [125]: splitme(a,3)
Out[125]: [[1, 2, 3, 5, 7, 10, 13, 16], [20]]


Runtime test -

In [180]: a = np.sort(np.random.randint(1,10000*2,(10000)))

In [181]: s = pd.Series(a)

In [182]: d = 3

In [183]: %timeit pandas_way(s,d) #@cᴏʟᴅsᴘᴇᴇᴅ's soln
10 loops, best of 3: 55.1 ms per loop

In [184]: %timeit np.split(a,np.flatnonzero(np.diff(a)>d)+1)
     ...: %timeit splitme(a,d)
     ...: %timeit splitme_zip(a,d)
1000 loops, best of 3: 1.47 ms per loop
100 loops, best of 3: 2.87 ms per loop
1000 loops, best of 3: 516 µs per loop

In [185]: a
Out[185]: array([    2,     2,     2, ..., 19992, 19996, 19999])

    
             
                                                        
            
            
              
                
                0
              
                   
                
               讨论(0)
              
                                                  
              
              
                          
             
       
          
              
                                       
     查看其它2个回答


            
                         
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
                              			
        
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复