Remove Outliers in Pandas DataFrame using Percentiles

自闭症患者 2021-01-30 09:28

I have a DataFrame df with 40 columns and many records.

df:

User_id | Col1 | Col2 | Col3 | Col4 | Col5 | Col6 | Col7 |...| Col39

For e

4 answers
  • 2021-01-30 10:09

    Use this code and don't waste your time:

    Q1 = df.quantile(0.25)
    Q3 = df.quantile(0.75)
    IQR = Q3 - Q1
    
    df = df[~((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).any(axis=1)]
    

    In case you want to check specific columns only:

    cols = ['col_1', 'col_2'] # one or more
    
    Q1 = df[cols].quantile(0.25)
    Q3 = df[cols].quantile(0.75)
    IQR = Q3 - Q1
    
    df = df[~((df[cols] < (Q1 - 1.5 * IQR)) | (df[cols] > (Q3 + 1.5 * IQR))).any(axis=1)]
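To see what the IQR rule does on concrete numbers, here is a minimal sketch on a made-up frame. The `toy` data and the variable names are assumptions for illustration, not part of the original question.

```python
import pandas as pd

# Hypothetical toy data: the last row holds an obvious outlier in column "a".
toy = pd.DataFrame({
    "a": [10, 11, 12, 13, 14, 500],
    "b": [1, 2, 3, 4, 5, 6],
})

q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1

# True where a value falls outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for its column
is_outlier = (toy < (q1 - 1.5 * iqr)) | (toy > (q3 + 1.5 * iqr))

# Drop every row that has an outlier in at least one column
cleaned = toy[~is_outlier.any(axis=1)]
print(cleaned)
```

Note that a row is dropped as soon as a single column flags it, so with 40 columns this rule can remove a large share of the rows.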
    
  • 2021-01-30 10:12

    What you are describing is similar to the process of winsorizing, which clips values (for example, at the 5th and 95th percentiles) instead of eliminating them completely.

    Here's an example:

    import pandas as pd
    from scipy.stats import mstats
    %matplotlib inline
    
    test_data = pd.Series(range(30))
    test_data.plot()
    

    # Winsorize: clip the lowest 5% and the highest 5% of the values
    transformed_test_data = pd.Series(mstats.winsorize(test_data, limits=[0.05, 0.05]))
    transformed_test_data.plot()
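If SciPy is not available, a similar clipping effect can be sketched with pandas alone. This is a variant, not the same algorithm: `clip` uses the interpolated quantile values as bounds, so the numbers can differ slightly from what `mstats.winsorize` returns. The variable names below are assumptions.

```python
import pandas as pd

values = pd.Series(range(30))

# Percentile bounds computed from the data itself
lower = values.quantile(0.05)
upper = values.quantile(0.95)

# clip() replaces anything outside [lower, upper] with the nearest bound,
# so the Series keeps its original length
clipped = values.clip(lower=lower, upper=upper)
print(len(values), len(clipped))
print(clipped.min(), clipped.max())
```

Unlike the row-dropping approaches, no records are lost here, which matters when every `User_id` has to stay in the result.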
    

  • 2021-01-30 10:31

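    Use an inner join on the row index, so that only rows inside the 5th-95th percentile band of every column survive. Something like this should work:

    import numpy as np

    cols = df.columns.tolist()
    cols.remove('User_id')  # remove User_id from the list of columns

    P = np.percentile(df[cols[0]], [5, 95])
    new_df = df[(df[cols[0]] > P[0]) & (df[cols[0]] < P[1])]
    for col in cols[1:]:
        P = np.percentile(df[col], [5, 95])
        in_band = df.index[(df[col] > P[0]) & (df[col] < P[1])]
        # Intersecting the indexes acts as an inner join without duplicating columns
        new_df = new_df.loc[new_df.index.intersection(in_band)]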
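The same per-column percentile filter can also be written without a loop. The sketch below is one possible vectorized form; the helper name `filter_percentile_band`, its defaults, and the random `demo` frame are all hypothetical.

```python
import numpy as np
import pandas as pd


def filter_percentile_band(df, id_col="User_id", low=0.05, high=0.95):
    """Keep rows whose every non-id column lies strictly between two quantiles.

    Hypothetical helper for illustration; name and defaults are assumptions.
    """
    value_cols = df.columns.drop(id_col)
    lo = df[value_cols].quantile(low)
    hi = df[value_cols].quantile(high)
    # gt/lt align the quantile Series with the DataFrame's columns
    inside = df[value_cols].gt(lo) & df[value_cols].lt(hi)
    return df[inside.all(axis=1)]


# Made-up data with the same shape of problem: one id column plus value columns
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.integers(0, 100, size=(200, 3)),
                    columns=["Col0", "Col1", "Col2"])
demo.insert(0, "User_id", np.arange(200))

result = filter_percentile_band(demo)
print(len(demo), len(result))
```

The id column is left out of the quantile computation but kept in the returned rows, so nothing has to be joined back afterwards.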
    
  • 2021-01-30 10:33

    The initial dataset.

    print(df.head())
    
       Col0  Col1  Col2  Col3  Col4  User_id
    0    49    31    93    53    39       44
    1    69    13    84    58    24       47
    2    41    71     2    43    58       64
    3    35    56    69    55    36       67
    4    64    24    12    18    99       67
    

    First removing the User_id column

    filt_df = df.loc[:, df.columns != 'User_id']
    

    Then, computing percentiles.

    low = .05
    high = .95
    quant_df = filt_df.quantile([low, high])
    print(quant_df)
    
           Col0   Col1  Col2   Col3   Col4
    0.05   2.00   3.00   6.9   3.95   4.00
    0.95  95.05  89.05  93.0  94.00  97.05
    

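    Next, filter the values based on the computed percentiles. A column-wise apply does this: each column keeps only the values strictly between its own bounds, and the dropped positions become NaN when the columns are realigned.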

    filt_df = filt_df.apply(lambda x: x[(x>quant_df.loc[low,x.name]) & 
                                        (x < quant_df.loc[high,x.name])], axis=0)
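The NaN-masking step can also be expressed with `DataFrame.where`, which may be easier to read than the lambda. This is a sketch on a made-up frame; `toy`, `bounds` and `masked` are assumed names.

```python
import pandas as pd

# Hypothetical toy frame; with only five rows, the smallest and largest
# value of each column fall outside the 5%-95% band
toy = pd.DataFrame({
    "Col0": [1, 2, 3, 4, 100],
    "Col1": [-50, 6, 7, 8, 9],
})

bounds = toy.quantile([0.05, 0.95])

# where() keeps values that satisfy the condition and writes NaN elsewhere
masked = toy.where((toy > bounds.loc[0.05]) & (toy < bounds.loc[0.95]))
print(masked)

# Rows that still contain a NaN had at least one out-of-band value
print(masked.dropna())
```

Because the comparison is strict, a value exactly equal to a bound is masked as well, which matches the behaviour of the apply-based version.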
    

    Bringing the User_id back.

    filt_df = pd.concat([df.loc[:,'User_id'], filt_df], axis=1)
    

    Last, rows with NaN values can be dropped simply like this.

    filt_df.dropna(inplace=True)
    print(filt_df.head())
    
       User_id  Col0  Col1  Col2  Col3  Col4
    1       47    69    13    84    58    24
    3       67    35    56    69    55    36
    5        9    95    79    44    45    69
    6       83    69    41    66    87     6
    9       87    50    54    39    53    40
    

    Checking the result (these outputs were captured before the dropna step, so the NaN values are still visible):

    print(filt_df.head())
    
       User_id  Col0  Col1  Col2  Col3  Col4
    0       44    49    31   NaN    53    39
    1       47    69    13    84    58    24
    2       64    41    71   NaN    43    58
    3       67    35    56    69    55    36
    4       67    64    24    12    18   NaN
    
    print(filt_df.describe())
    
              User_id       Col0       Col1       Col2       Col3       Col4
    count  100.000000  89.000000  88.000000  88.000000  89.000000  89.000000
    mean    48.230000  49.573034  45.659091  52.727273  47.460674  57.157303
    std     28.372292  25.672274  23.537149  26.509477  25.823728  26.231876
    min      0.000000   3.000000   5.000000   7.000000   4.000000   5.000000
    25%     23.000000  29.000000  29.000000  29.500000  24.000000  36.000000
    50%     47.000000  50.000000  40.500000  52.500000  49.000000  59.000000
    75%     74.250000  69.000000  67.000000  75.000000  70.000000  79.000000
    max     99.000000  95.000000  89.000000  92.000000  91.000000  97.000000
    

    How to generate the test dataset

    import numpy as np
    from pandas import DataFrame

    np.random.seed(0)
    nb_sample = 100
    num_sample = (0,100)
    
    d = dict()
    d['User_id'] = np.random.randint(num_sample[0], num_sample[1], nb_sample)
    for i in range(5):
        d['Col' + str(i)] = np.random.randint(num_sample[0], num_sample[1], nb_sample)
    
    df = DataFrame.from_dict(d)
    