I am trying to filter out inner data in my large data frame (1,400,000 rows).
This is a short, simplified version of the sample data:
a b c dt
If you only want a new column that flags the rows matching the mask, as @Quang Hoang suggested, you could try this:
import pandas as pd
import io
s_e='''
a    b    c     dt                  e
35   0.1  234   2020/6/15 14:27:00  0
1    0.1  554   2020/6/15 15:28:00  1
2    0.2  654   2020/6/15 16:29:00  0
23   0.4  2345  2020/6/15 17:26:00  0
34   0.8  245   2020/6/15 18:25:00  0
8    0.9  123   2020/6/15 18:26:00  0
7    0.1  22    2020/6/15 18:27:00  0
2    0.3  99    2020/6/15 18:28:00  0
219  0.2  17    2020/6/15 19:26:00  0
'''
# columns are separated by two or more spaces; the date and time share one column
df = pd.read_csv(io.StringIO(s_e), sep=r'\s\s+', parse_dates=[3], engine='python')
print(df)
# masking the first set of conditions:
mask = (df['a'].lt(25) & df['a'].gt(10) ) | df['b'].gt(0.2) | df['c'].gt(500)
mask = mask & df['e'].eq(0)
# Quang Hoang's recommendation:
df['indicator'] = mask.astype(int)
print(df)
Output:
     a    b     c                  dt  e  indicator
0   35  0.1   234 2020-06-15 14:27:00  0          0
1    1  0.1   554 2020-06-15 15:28:00  1          0
2    2  0.2   654 2020-06-15 16:29:00  0          1
3   23  0.4  2345 2020-06-15 17:26:00  0          1
4   34  0.8   245 2020-06-15 18:25:00  0          1
5    8  0.9   123 2020-06-15 18:26:00  0          1
6    7  0.1    22 2020-06-15 18:27:00  0          0
7    2  0.3    99 2020-06-15 18:28:00  0          1
8  219  0.2    17 2020-06-15 19:26:00  0          0
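Since the question is about filtering, the same boolean mask can also be used to select rows directly instead of only flagging them. The following is a minimal sketch, assuming the same column names as the sample data; `build_mask` is a hypothetical helper name, not something from pandas, and the frame is rebuilt inline (without the `dt` column) only so the snippet runs on its own.

```python
import pandas as pd

# Same values as the sample data above (the dt column is not needed here).
df = pd.DataFrame({
    'a': [35, 1, 2, 23, 34, 8, 7, 2, 219],
    'b': [0.1, 0.1, 0.2, 0.4, 0.8, 0.9, 0.1, 0.3, 0.2],
    'c': [234, 554, 654, 2345, 245, 123, 22, 99, 17],
    'e': [0, 1, 0, 0, 0, 0, 0, 0, 0],
})

def build_mask(frame):
    """Hypothetical helper: the same conditions as the mask above."""
    cond = (
        (frame['a'].gt(10) & frame['a'].lt(25))
        | frame['b'].gt(0.2)
        | frame['c'].gt(500)
    )
    # only rows with e == 0 may match
    return cond & frame['e'].eq(0)

mask = build_mask(df)
matching = df[mask]        # rows that satisfy the conditions
non_matching = df[~mask]   # all remaining rows

print(matching.index.tolist())  # [2, 3, 4, 5, 7]
```

Boolean indexing like this is vectorized, so it should scale to a frame with over a million rows without an explicit Python loop.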
If you want to flag only the rows that match the mask and also have the minimum `c`
value within each 30-minute window, you could try:
import pandas as pd
import io
s_e='''
a    b    c     dt                  e
35   0.1  234   2020/6/15 14:27:00  0
1    0.1  554   2020/6/15 15:28:00  1
2    0.2  654   2020/6/15 16:29:00  0
23   0.4  2345  2020/6/15 17:26:00  0
34   0.8  245   2020/6/15 18:25:00  0
8    0.9  123   2020/6/15 18:26:00  0
7    0.1  22    2020/6/15 18:27:00  0
2    0.3  99    2020/6/15 18:28:00  0
219  0.2  17    2020/6/15 19:26:00  0
'''
# columns are separated by two or more spaces; the date and time share one column
df = pd.read_csv(io.StringIO(s_e), sep=r'\s\s+', parse_dates=[3], engine='python')
print(df)
# masking the first set of conditions:
mask = (df['a'].lt(25) & df['a'].gt(10) ) | df['b'].gt(0.2) | df['c'].gt(500)
mask = mask & df['e'].eq(0)
# start with every row unflagged
df['indicator'] = 0
temp = df[mask]
# for each 30-minute window, get the index label of the row with the smallest `c`
c_min = temp.groupby(temp['dt'].dt.floor('30min'))['c'].idxmin()
# flag those rows in the original frame
df.loc[c_min, 'indicator'] = 1
print(df)
Output:
     a    b     c                  dt  e  indicator
0   35  0.1   234 2020-06-15 14:27:00  0          0
1    1  0.1   554 2020-06-15 15:28:00  1          0
2    2  0.2   654 2020-06-15 16:29:00  0          1
3   23  0.4  2345 2020-06-15 17:26:00  0          1
4   34  0.8   245 2020-06-15 18:25:00  0          0
5    8  0.9   123 2020-06-15 18:26:00  0          0
6    7  0.1    22 2020-06-15 18:27:00  0          0
7    2  0.3    99 2020-06-15 18:28:00  0          1
8  219  0.2    17 2020-06-15 19:26:00  0          0
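The grouped-minimum step can be wrapped in a reusable function so the window size is easy to change. This is only a sketch under a few assumptions: `flag_min_per_window` and its `freq` parameter are hypothetical names, the frame has a unique index (because `idxmin` returns index labels), and when several rows tie for the minimum only the first one in each window is flagged.

```python
import pandas as pd

# Same values as the sample data above, rebuilt inline so the snippet is self-contained.
df = pd.DataFrame({
    'a': [35, 1, 2, 23, 34, 8, 7, 2, 219],
    'b': [0.1, 0.1, 0.2, 0.4, 0.8, 0.9, 0.1, 0.3, 0.2],
    'c': [234, 554, 654, 2345, 245, 123, 22, 99, 17],
    'dt': pd.to_datetime([
        '2020-06-15 14:27:00', '2020-06-15 15:28:00', '2020-06-15 16:29:00',
        '2020-06-15 17:26:00', '2020-06-15 18:25:00', '2020-06-15 18:26:00',
        '2020-06-15 18:27:00', '2020-06-15 18:28:00', '2020-06-15 19:26:00',
    ]),
    'e': [0, 1, 0, 0, 0, 0, 0, 0, 0],
})

def flag_min_per_window(frame, mask, freq='30min'):
    """Hypothetical helper: flag the masked row with the smallest `c` per time window."""
    out = frame.copy()               # leave the caller's frame untouched
    out['indicator'] = 0
    selected = out[mask]
    if not selected.empty:
        # floor each timestamp to the start of its window, e.g. 18:28 -> 18:00
        windows = selected['dt'].dt.floor(freq)
        # index label of the smallest `c` within each window
        winners = selected.groupby(windows)['c'].idxmin()
        out.loc[winners, 'indicator'] = 1
    return out

mask = (
    (df['a'].gt(10) & df['a'].lt(25)) | df['b'].gt(0.2) | df['c'].gt(500)
) & df['e'].eq(0)

result = flag_min_per_window(df, mask)
print(result['indicator'].tolist())  # [0, 0, 1, 1, 0, 0, 0, 1, 0]
```

Rows 4, 5 and 7 all pass the mask and fall into the 18:00 window, so only row 7 (the smallest `c`, 99) keeps the flag. Note that `floor('30min')` uses fixed clock-aligned windows (18:00, 18:30, ...), not a rolling 30-minute span.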