Delimiting contiguous regions with values above a certain threshold in Pandas DataFrame

迷失自我 2021-01-04 23:29

I have a Pandas DataFrame of indices and values between 0 and 1, something like this:

 6  0.047033
 7  0.047650
 8  0.054067
 9  0.064767
10  0.073183
11  0.         


        
2 Answers
  • 2021-01-05 00:03

    I think this prints what you want. It is based heavily on Joe Kington's answer (linked in the code comment below), so it would be appropriate to up-vote that one.

    import numpy as np
    
    # from Joe Kington's answer here https://stackoverflow.com/a/4495197/3751373
    # with minor edits
    def contiguous_regions(condition):
        """Finds contiguous True regions of the boolean array "condition". Returns
        a 2D array where the first column is the start index of the region and the
        second column is the end index."""
    
        # Find the indices of changes in "condition"
        d = np.diff(condition,n=1, axis=0)
        idx, _ = d.nonzero() 
    
        # We need to start things after the change in "condition". Therefore,
        # we'll shift the index by 1 to the right. -JK
        # LB: "idx + 1" allocates a new array. An in-place "idx += 1" gave me
        # ValueError: output array is read-only
        idx = idx + 1
    
        if condition[0]:
            # If the start of condition is True prepend a 0
            idx = np.r_[0, idx]
    
        if condition[-1]:
            # If the end of condition is True, append the length of the array
            idx = np.r_[idx, condition.size] # Edit
    
        # Reshape the result into two columns
        idx.shape = (-1,2)
        return idx
    
    def main():
        import pandas as pd
        RUN_LENGTH_THRESHOLD = 5
        VALUE_THRESHOLD = 0.5
    
        np.random.seed(seed=901212)
        data = np.random.rand(500)*.5 + .35
    
        df = pd.DataFrame(data=data,columns=['values'])
    
        match_bools =  df.values > VALUE_THRESHOLD 
    
    
        print('with boolean array')
        for start, stop in contiguous_regions(match_bools):
            if stop - start > RUN_LENGTH_THRESHOLD:
                print(start, stop)
    
    
    
    if __name__ == '__main__':
        main()
    

    I would be surprised if there were not more elegant ways.
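One possibly tidier take on the same edge-detection idea is sketched below for the 1-D case. It is not Joe Kington's code: the function name `contiguous_regions_1d` and the sample values are made up for illustration. Padding the boolean array with False on both sides means the first and last elements need no special handling.

```python
import numpy as np

def contiguous_regions_1d(condition):
    """Return an (n, 2) array of [start, stop) index pairs, one per run of True."""
    condition = np.asarray(condition, dtype=bool).ravel()
    # Pad with False on both sides so every run has a rising and a falling edge.
    padded = np.concatenate(([False], condition, [False]))
    # An edge is any position where two consecutive padded values differ.
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    # Edges alternate start, stop, start, stop, ... so pair them up.
    return edges.reshape(-1, 2)

# Illustrative data, not taken from the question.
values = np.array([0.2, 0.6, 0.7, 0.1, 0.9, 0.8, 0.95, 0.3])
regions = contiguous_regions_1d(values > 0.5)
print(regions)
```

Each row is a half-open interval, so `stop - start` is the run length and can be compared directly against a minimum-length threshold.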

  • 2021-01-05 00:04

    You can find the first and last element of each consecutive region by comparing the series with its values shifted by one row, and then keep only the pairs that are far enough apart:

    # tag rows based on the threshold
    df['tag'] = df['values'] > .5
    
    # first row is a True preceded by a False
    fst = df.index[df['tag'] & ~df['tag'].shift(1, fill_value=False)]
    
    # last row is a True followed by a False
    lst = df.index[df['tag'] & ~df['tag'].shift(-1, fill_value=False)]
    
    # keep only the regions that span more than 5 rows
    pr = [(i, j) for i, j in zip(fst, lst) if j > i + 4]
    

    so for example the first region would be:

    >>> i, j = pr[0]
    >>> df.loc[i:j]
        indices    values   tag
    15       16  0.639992  True
    16       17  0.593427  True
    17       18  0.810888  True
    18       19  0.596243  True
    19       20  0.812684  True
    20       21  0.617945  True
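To make the shift-based approach easy to try, here is a small self-contained sketch on made-up data. The sample values and the `MIN_LENGTH` name are assumptions for illustration only. It passes `fill_value=False` to `shift` rather than calling `fillna`, so that the shifted column stays boolean, and it assumes the default integer index so that `j - i + 1` is the number of rows in a region.

```python
import pandas as pd

# Illustrative data, not taken from the question.
df = pd.DataFrame({"values": [0.2, 0.6, 0.7, 0.1, 0.9, 0.8, 0.95, 0.3]})
MIN_LENGTH = 3  # hypothetical minimum number of rows per region

# Tag rows that are above the threshold.
tag = df["values"] > 0.5

# A region starts at a True whose previous row is False (or missing).
fst = df.index[tag & ~tag.shift(1, fill_value=False)]

# A region ends at a True whose next row is False (or missing).
lst = df.index[tag & ~tag.shift(-1, fill_value=False)]

# Keep (first, last) label pairs that cover at least MIN_LENGTH rows.
# This length arithmetic assumes the default 0, 1, 2, ... index.
pairs = [(i, j) for i, j in zip(fst, lst) if j - i + 1 >= MIN_LENGTH]
print(pairs)
```

Because `fst` and `lst` hold inclusive labels, `df.loc[i:j]` returns exactly the rows of a region.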
    