I have a Pandas Dataframe of indices and values between 0 and 1, something like this:
6 0.047033
7 0.047650
8 0.054067
9 0.064767
10 0.073183
11 0.
I think this prints what you want. It is based heavily on Joe Kington's answer here I guess it is appropriate to up-vote that.
import numpy as np
# from Joe Kington's answer here https://stackoverflow.com/a/4495197/3751373
# with minor edits
def contiguous_regions(condition):
"""Finds contiguous True regions of the boolean array "condition". Returns
a 2D array where the first column is the start index of the region and the
second column is the end index."""
# Find the indicies of changes in "condition"
d = np.diff(condition,n=1, axis=0)
idx, _ = d.nonzero()
# We need to start things after the change in "condition". Therefore,
# we'll shift the index by 1 to the right. -JK
# LB this copy to increment is horrible but I get
# ValueError: output array is read-only without it
mutable_idx = np.array(idx)
mutable_idx += 1
idx = mutable_idx
if condition[0]:
# If the start of condition is True prepend a 0
idx = np.r_[0, idx]
if condition[-1]:
# If the end of condition is True, append the length of the array
idx = np.r_[idx, condition.size] # Edit
# Reshape the result into two columns
idx.shape = (-1,2)
return idx
def main():
import pandas as pd
RUN_LENGTH_THRESHOLD = 5
VALUE_THRESHOLD = 0.5
np.random.seed(seed=901212)
data = np.random.rand(500)*.5 + .35
df = pd.DataFrame(data=data,columns=['values'])
match_bools = df.values > VALUE_THRESHOLD
print('with boolian array')
for start, stop in contiguous_regions(match_bools):
if (stop - start > RUN_LENGTH_THRESHOLD):
print (start, stop)
if __name__ == '__main__':
main()
I would be surprised if there were not more elegant ways
You can find the first and last element of each consecutive region by looking at the series and 1-row shifted values, and then filter the pairs which are adequately apart from each other:
# tag rows based on the threshold
df['tag'] = df['values'] > .5
# first row is a True preceded by a False
fst = df.index[df['tag'] & ~ df['tag'].shift(1).fillna(False)]
# last row is a True followed by a False
lst = df.index[df['tag'] & ~ df['tag'].shift(-1).fillna(False)]
# filter those which are adequately apart
pr = [(i, j) for i, j in zip(fst, lst) if j > i + 4]
so for example the first region would be:
>>> i, j = pr[0]
>>> df.loc[i:j]
indices values tag
15 16 0.639992 True
16 17 0.593427 True
17 18 0.810888 True
18 19 0.596243 True
19 20 0.812684 True
20 21 0.617945 True