Delimiting contiguous regions with values above a certain threshold in Pandas DataFrame

后端未结

关注

 2  1512

I have a Pandas Dataframe of indices and values between 0 and 1, something like this:

 6  0.047033
 7  0.047650
 8  0.054067
 9  0.064767
10  0.073183
11  0.


                      
              相关标签:


      
      
        
          2条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  忘掉有多难        
                
              
                            
                2021-01-05 00:03
              
            
            
                                                                       
I think this prints what you want. It is based heavily on Joe Kington's answer here I guess it is appropriate to up-vote that.

import numpy as np

# from Joe Kington's answer here https://stackoverflow.com/a/4495197/3751373
# with minor edits
def contiguous_regions(condition):
    """Finds contiguous True regions of the boolean array "condition". Returns
    a 2D array where the first column is the start index of the region and the
    second column is the end index."""

    # Find the indicies of changes in "condition"
    d = np.diff(condition,n=1, axis=0)
    idx, _ = d.nonzero() 

    # We need to start things after the change in "condition". Therefore, 
    # we'll shift the index by 1 to the right. -JK
    # LB this copy to increment is horrible but I get 
    # ValueError: output array is read-only without it 

    mutable_idx = np.array(idx)
    mutable_idx +=  1
    idx = mutable_idx

    if condition[0]:
        # If the start of condition is True prepend a 0
        idx = np.r_[0, idx]

    if condition[-1]:
        # If the end of condition is True, append the length of the array
        idx = np.r_[idx, condition.size] # Edit

    # Reshape the result into two columns
    idx.shape = (-1,2)
    return idx

def main():
    import pandas as pd
    RUN_LENGTH_THRESHOLD = 5
    VALUE_THRESHOLD = 0.5

    np.random.seed(seed=901212)
    data = np.random.rand(500)*.5 + .35

    df = pd.DataFrame(data=data,columns=['values'])

    match_bools =  df.values > VALUE_THRESHOLD 


    print('with boolian array')
    for start, stop in contiguous_regions(match_bools):
        if (stop - start > RUN_LENGTH_THRESHOLD):
            print (start, stop)



if __name__ == '__main__':
    main()


I would be surprised if there were not more elegant ways
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  太阳男子        
                
              
                            
                2021-01-05 00:04
              
            
            
                                                                       
You can find the first and last element of each consecutive region by looking at the series and 1-row shifted values, and then filter the pairs which are adequately apart from each other:

# tag rows based on the threshold
df['tag'] = df['values'] > .5

# first row is a True preceded by a False
fst = df.index[df['tag'] & ~ df['tag'].shift(1).fillna(False)]

# last row is a True followed by a False
lst = df.index[df['tag'] & ~ df['tag'].shift(-1).fillna(False)]

# filter those which are adequately apart
pr = [(i, j) for i, j in zip(fst, lst) if j > i + 4]


so for example the first region would be:

>>> i, j = pr[0]
>>> df.loc[i:j]
    indices    values   tag
15       16  0.639992  True
16       17  0.593427  True
17       18  0.810888  True
18       19  0.596243  True
19       20  0.812684  True
20       21  0.617945  True

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复