Remove outliers in Pandas dataframe with groupby

生来不讨喜 2020-12-17 05:53

I have a dataframe of Report Date, Time Interval and Total Volume for a full year. I would like to be able to remove outliers within each Time Interval.

This is as f

2 Answers
  • 2020-12-17 06:11

    One way is to compute the 5th and 95th percentiles for each group and keep only the rows that fall strictly between them:

    In [11]: res = df.groupby("Date")["Interval"].quantile([0.05, 0.95]).unstack(level=1)
    
    In [12]: res
    Out[12]:
                 0.05   0.95
    Date
    2016-03-01  489.6  913.4
    

    Now we can look up these bounds for each row using loc and filter:

    In [13]: (res.loc[df.Date, 0.05] < df.Interval.values) & (df.Interval.values < res.loc[df.Date, 0.95])
    Out[13]:
    Date
    2016-03-01    False
    2016-03-01     True
    2016-03-01     True
    2016-03-01     True
    2016-03-01    False
    dtype: bool
    
    In [14]: df.loc[((res.loc[df.Date, 0.05] < df.Interval.values) & (df.Interval.values < res.loc[df.Date, 0.95])).values]
    Out[14]:
       Report        Date  Time  Interval  Total Volume
    1    5785  2016-03-01    25     580.0           NaN
    2    5786  2016-03-01    26     716.0           NaN
    3    5787  2016-03-01    27     803.0           NaN
    

    Note: grouping by 'Time Interval' will work the same, but in your example doesn't filter any rows!
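
For the question's actual goal (outliers within each Time Interval), the same lookup idea might be sketched as below. This is only an illustration under assumptions: the sample values are made up, only the column names come from the question, and `bounds`, `lower`, `upper` and `filtered` are names chosen here.

```python
import pandas as pd

# Made-up sample data: column names follow the question, values are illustrative only.
df = pd.DataFrame({
    "Report Date": pd.date_range("2016-03-01", periods=10).tolist() * 2,
    "Time Interval": [25] * 10 + [26] * 10,
    "Total Volume": [100, 101, 102, 103, 104, 105, 106, 107, 108, 5000,
                     1, 200, 201, 202, 203, 204, 205, 206, 207, 208],
})

# Per-interval 5th and 95th percentiles: one row per interval, one column per quantile.
bounds = (
    df.groupby("Time Interval")["Total Volume"]
      .quantile([0.05, 0.95])
      .unstack(level=1)
)

# Look up each row's bounds by its interval, then keep rows strictly inside them.
lower = df["Time Interval"].map(bounds[0.05])
upper = df["Time Interval"].map(bounds[0.95])
filtered = df[(df["Total Volume"] > lower) & (df["Total Volume"] < upper)]
```

Because the comparisons are strict, each group's minimum and maximum are always dropped, even when they are not extreme.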

  • 2020-12-17 06:12
    df[df.groupby("ReportDate").TotalVolume.\
          transform(lambda x : (x<x.quantile(0.95))&(x>(x.quantile(0.05)))).eq(1)]
    Out[1033]: 
          ReportDate  TimeInterval  TotalVolume
    5785  2016-03-01            25        580.0
    5786  2016-03-01            26        716.0
    5787  2016-03-01            27        803.0
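
The transform-based filter above could be wrapped in a small reusable function, for example to group by TimeInterval instead of ReportDate. The following is a minimal sketch, not a definitive implementation: `drop_group_outliers` is a hypothetical helper name, and the sample data is invented for illustration.

```python
import pandas as pd

def drop_group_outliers(df, group_col, value_col, lower=0.05, upper=0.95):
    """Hypothetical helper: keep rows strictly between their group's quantiles."""
    grouped = df.groupby(group_col)[value_col]
    # transform broadcasts each group's scalar quantile back to that group's rows.
    lo = grouped.transform(lambda s: s.quantile(lower))
    hi = grouped.transform(lambda s: s.quantile(upper))
    return df[(df[value_col] > lo) & (df[value_col] < hi)]

# Invented sample data, using the column names from the output above.
sample = pd.DataFrame({
    "TimeInterval": [25] * 5 + [26] * 5,
    "TotalVolume": [10.0, 11.0, 12.0, 13.0, 900.0, 50.0, 51.0, 52.0, 53.0, 0.5],
})

result = drop_group_outliers(sample, "TimeInterval", "TotalVolume")
```

Computing the bounds as row-aligned Series keeps the original index, so the boolean mask can be applied to the dataframe directly.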
    