I have a dataframe of Report Date, Time Interval and Total Volume for a full year. I would like to be able to remove outliers within each Time Interval.
This is as f
One way is to filter out as follows:
In [11]: res = df.groupby("Date")["Interval"].quantile([0.05, 0.95]).unstack(level=1)
In [12]: res
Out[12]:
             0.05   0.95
Date
2016-03-01  489.6  913.4
Now we can look up these values for each row using loc and filter:
In [13]: (res.loc[df.Date, 0.05] < df.Interval.values) & (df.Interval.values < res.loc[df.Date, 0.95])
Out[13]:
Date
2016-03-01 False
2016-03-01 True
2016-03-01 True
2016-03-01 True
2016-03-01 False
dtype: bool
In [14]: df.loc[((res.loc[df.Date, 0.05] < df.Interval.values) & (df.Interval.values < res.loc[df.Date, 0.95])).values]
Out[14]:
Report Date Time Interval Total Volume
1 5785 2016-03-01 25 580.0 NaN
2 5786 2016-03-01 26 716.0 NaN
3 5787 2016-03-01 27 803.0 NaN
Note: grouping by 'Time Interval' works the same way, but in your example it doesn't filter out any rows!
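If you want to run the steps above end to end, here is a minimal sketch. The frame is made up (400.0 and 950.0 are placeholder values, not the data from the question), and it assumes the columns are named ReportDate, TimeInterval and TotalVolume rather than the split names that appear in the session above.

```python
import pandas as pd

# Hypothetical sample data; the first and last volumes are placeholders.
df = pd.DataFrame({
    "ReportDate": ["2016-03-01"] * 5,
    "TimeInterval": [24, 25, 26, 27, 28],
    "TotalVolume": [400.0, 580.0, 716.0, 803.0, 950.0],
})

# Per-date 5% and 95% quantiles: one row per date, one column per quantile.
bounds = (
    df.groupby("ReportDate")["TotalVolume"]
    .quantile([0.05, 0.95])
    .unstack(level=1)
)

# Look up each row's bounds by its date. to_numpy() drops the date index,
# so the comparison below is positional rather than label-aligned.
lower = bounds.loc[df["ReportDate"], 0.05].to_numpy()
upper = bounds.loc[df["ReportDate"], 0.95].to_numpy()

volume = df["TotalVolume"].to_numpy()
mask = (volume > lower) & (volume < upper)

# Keep only rows strictly inside their date's quantile band.
filtered = df.loc[mask]
print(filtered)
```

Converting to plain arrays before comparing avoids alignment errors, since the looked-up bounds are indexed by date (with duplicates) while df keeps its original row index.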
df[df.groupby("ReportDate").TotalVolume.\
transform(lambda x : (x<x.quantile(0.95))&(x>(x.quantile(0.05)))).eq(1)]
Out[1033]:
      ReportDate  TimeInterval  TotalVolume
5785  2016-03-01            25        580.0
5786  2016-03-01            26        716.0
5787  2016-03-01            27        803.0
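Since the question asks for outliers to be removed within each Time Interval, the same transform idea can be grouped by the interval column instead of the date. This is only a sketch: the data are invented (two intervals observed on five days each), the column names are assumed to match the ones above, and the 5%/95% cut-offs are carried over from the earlier code.

```python
import pandas as pd

# Hypothetical data: two time intervals, each observed on five dates.
dates = list(pd.date_range("2016-03-01", periods=5, freq="D"))
df = pd.DataFrame({
    "ReportDate": dates * 2,
    "TimeInterval": [25] * 5 + [26] * 5,
    "TotalVolume": [100.0, 110.0, 120.0, 130.0, 900.0,
                    200.0, 10.0, 210.0, 220.0, 230.0],
})

grouped = df.groupby("TimeInterval")["TotalVolume"]

# transform broadcasts each group's quantile back onto that group's rows,
# so lower and upper share df's index and can be compared directly.
lower = grouped.transform(lambda s: s.quantile(0.05))
upper = grouped.transform(lambda s: s.quantile(0.95))

# Keep rows strictly inside their own interval's quantile band.
filtered = df[(df["TotalVolume"] > lower) & (df["TotalVolume"] < upper)]
print(filtered)
```

Note that with strict inequalities and only a handful of rows per group, this drops the smallest and largest value of each group whether or not they are really outliers (here 100.0 and 230.0 go along with 900.0 and 10.0). With a full year of data per interval that matters less, but you may want to widen the quantiles or use >= and <=.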