pandas group by remove outliers

庸人自扰 2021-01-13 16:35

I want to remove outliers group-wise, based on each group's 99th percentile.

    import pandas as pd
    df = pd.DataFrame({'Group': ['A','A','A','B','B','


        
2 Answers
  • 2021-01-13 16:59

    I don't think you want to filter on a quantile, because you would exclude your lowest values. Here is the 1st percentile per group:

    import pandas as pd
    df = pd.DataFrame({'Group': ['A','A','A','B','B','B','B'], 'count': [1.1,11.2,1.1,3.3,3.40,3.3,100.0]})
    print(pd.DataFrame(df.groupby('Group').quantile(.01)['count']))
    

    output:

           count
    Group       
    A        1.1
    B        3.3
    

    Those aren't outliers, right? So you wouldn't want to exclude them.
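
For reference, the cut the question describes is one-sided: only values above each group's 99th percentile are dropped, so the low values shown above are never touched. A minimal sketch, assuming the same sample frame and pandas' default linear interpolation for `quantile` (the names `upper` and `trimmed` are just for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'Group': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'count': [1.1, 11.2, 1.1, 3.3, 3.40, 3.3, 100.0],
})

# Per-group 99th percentile, broadcast back to one value per row
upper = df.groupby('Group')['count'].transform(lambda s: s.quantile(0.99))

# Keep rows at or below their group's cutoff; low values are never removed
trimmed = df[df['count'] <= upper]
print(trimmed)
```

On this data it keeps rows 0, 2, 3, 4 and 5. Note that with groups this small the 99th percentile sits just below the group's largest value whenever that value is unique, so the maximum of each such group is dropped whether or not it is really an outlier.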

    You could instead set left and right limits at one standard deviation on either side of each group's median. This is a bit verbose, but it gives the expected result for this data:

    # Per-group median and standard deviation of 'count'
    grouped = df.groupby('Group')[['count']]
    left = grouped.median() - grouped.std()
    right = grouped.median() + grouped.std()

    left.columns = ['left']
    right.columns = ['right']

    # Attach each row's group limits, matching 'Group' against the limits' index
    df = df.merge(left, left_on='Group', right_index=True)
    df = df.merge(right, left_on='Group', right_index=True)

    # Keep rows strictly inside (left, right), then drop the helper columns
    df = df[(df['count'] > df['left']) & (df['count'] < df['right'])]
    df = df.drop(['left', 'right'], axis=1)
    print(df)
    

    output:

      Group  count
    0     A    1.1
    2     A    1.1
    3     B    3.3
    4     B    3.4
    5     B    3.3
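
The same median plus or minus one standard deviation filter can probably be written without the two merges, by broadcasting the group statistics with `transform`. A sketch under that assumption, reusing the sample frame (`center`, `spread`, `inside` and `result` are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({
    'Group': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'count': [1.1, 11.2, 1.1, 3.3, 3.40, 3.3, 100.0],
})

grouped = df.groupby('Group')['count']

# transform returns one value per original row, aligned with df's index
center = grouped.transform('median')
spread = grouped.transform('std')

# Keep rows strictly within one standard deviation of their group's median
inside = (df['count'] > center - spread) & (df['count'] < center + spread)
result = df[inside]
print(result)
```

Because nothing is merged, no helper columns need to be dropped afterwards and the original row index is kept as-is.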
    
  • 2021-01-13 17:13

    Here is my solution:

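    def is_outlier(s):
        # Flag values more than 3 standard deviations from the group mean
        lower_limit = s.mean() - (s.std() * 3)
        upper_limit = s.mean() + (s.std() * 3)
        return ~s.between(lower_limit, upper_limit)

    # transform keeps the mask aligned with df's index; on pandas 2.x,
    # apply adds the group key to the index and the mask no longer aligns
    df = df[~df.groupby('Group')['count'].transform(is_outlier)]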
    

    You can swap in your own is_outlier function to use a different rule.
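
As one illustration of a custom rule, here is a sketch that swaps the 3-sigma test for Tukey's IQR fences. This is not the answer's method, just a hypothetical drop-in: `is_outlier_iqr` and its `k` parameter are made-up names, and the data is the sample frame from the first answer.

```python
import pandas as pd

df = pd.DataFrame({
    'Group': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'count': [1.1, 11.2, 1.1, 3.3, 3.40, 3.3, 100.0],
})

def is_outlier_iqr(s, k=1.5):
    # Hypothetical alternative rule: flag values outside
    # [Q1 - k*IQR, Q3 + k*IQR] within the group
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return ~s.between(q1 - k * iqr, q3 + k * iqr)

# transform keeps the original row index, so the mask aligns with df
mask = df.groupby('Group')['count'].transform(is_outlier_iqr).astype(bool)
filtered = df[~mask]
print(filtered)
```

On the sample data this drops only the 100.0 row in group B. The 3-sigma rule above drops nothing there, because with only three or four rows per group no value can lie three sample standard deviations from its group mean.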
