How to use pandas filter with IQR?

迷失自我 2020-12-28 13:17

Is there a built-in way to filter a column by IQR (i.e., keep values between Q1 - 1.5*IQR and Q3 + 1.5*IQR)? Also, are there any other generalized filtering approaches in pandas you would suggest?

6 answers
  • 2020-12-28 13:54

    This returns the subset of df whose values in the given column lie between Q1 - 1.5*IQR and Q3 + 1.5*IQR:

    def subset_by_iqr(df, column, whisker_width=1.5):
        """Remove outliers from a dataframe by column, dropping rows whose
           column value is less than Q1 - whisker_width*IQR or greater than
           Q3 + whisker_width*IQR.
        Args:
            df (:obj:`pd.DataFrame`): A pandas dataframe to subset.
            column (str): Name of the column to calculate the subset from.
            whisker_width (float): Optional, loosen the IQR filter by a
                                   factor of `whisker_width` * IQR.
        Returns:
            (:obj:`pd.DataFrame`): Filtered dataframe
        """
        # Calculate Q1, Q3 and IQR
        q1 = df[column].quantile(0.25)
        q3 = df[column].quantile(0.75)
        iqr = q3 - q1
        # Apply filter with respect to IQR, including optional whiskers
        # (named `mask` to avoid shadowing the built-in `filter`)
        mask = (df[column] >= q1 - whisker_width*iqr) & (df[column] <= q3 + whisker_width*iqr)
        return df.loc[mask]
    
    # Example for whiskers = 1.5, as requested by the OP
    df_filtered = subset_by_iqr(df, 'column_name', whisker_width=1.5)
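
As a quick sanity check, here is a minimal, self-contained sketch of the same filter on toy data. The column name `value` and the numbers are made up for illustration, and the function is restated in compact form (using `Series.between`, which is inclusive by default) only so the snippet runs on its own.

```python
import pandas as pd

def subset_by_iqr(df, column, whisker_width=1.5):
    """Keep rows whose `column` value lies within the whisker range."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    # between() includes both endpoints by default
    mask = df[column].between(q1 - whisker_width * iqr, q3 + whisker_width * iqr)
    return df.loc[mask]

# Hypothetical toy data: six typical values and one obvious outlier (100)
df = pd.DataFrame({"value": [10, 12, 11, 13, 12, 11, 100]})
df_filtered = subset_by_iqr(df, "value")
print(df_filtered["value"].tolist())
```

For this data Q1 = 11 and Q3 = 12.5, so the kept range is [8.75, 14.75] and only the value 100 is dropped.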
    
  • 2020-12-28 13:57

    Another approach uses Series.clip:

    q = s.quantile([.25, .75])
    s = s[~s.clip(*q).isin(q)]
    

    Here are the details:

    import numpy as np
    import pandas as pd

    s = pd.Series(np.random.randn(100))
    q = s.quantile([.25, .75])  # calculate lower and upper bounds
    s = s.clip(*q)  # assigns values outside boundary to boundary values
    s = s[~s.isin(q)]  # take only observations within bounds
    

    Using it to filter a whole dataframe df is straightforward:

    def iqr(df, colname, bounds=(.25, .75)):
        # A tuple default avoids the mutable-default-argument pitfall
        s = df[colname]
        q = s.quantile(list(bounds))
        return df[~s.clip(*q).isin(q)]
    

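    Note: this method keeps only values strictly between the two quantiles, so the boundaries themselves are excluded. With the default bounds it keeps the range Q1 to Q3, not Q1 - 1.5*IQR to Q3 + 1.5*IQR.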
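
To make the behavior of the clip trick concrete, here is a small sketch that compares it with a plain strict comparison against the two quantiles. The data and the names `via_clip` and `via_mask` are illustrative assumptions, not part of the answer above.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])
q = s.quantile([.25, .75])  # 3.0 and 7.0 for this data

# Clip-based trick: values at or beyond a bound collapse onto it and are dropped
via_clip = s[~s.clip(*q).isin(q)]

# Equivalent strict comparison against the same two quantiles
via_mask = s[(s > q.iloc[0]) & (s < q.iloc[1])]

print(via_clip.tolist(), via_mask.tolist())
```

Both selections keep only 4, 5 and 6 here, which shows that values equal to a quantile are removed along with everything outside.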

  • 2020-12-28 14:00

    As far as I know, the query method offers the most compact notation.

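    # Some test data
    import numpy as np
    import pandas as pd

    np.random.seed(33454)
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    df = pd.concat(
        [
            # Uniformly distributed integers in [0, 100)
            pd.DataFrame({'nb': np.random.randint(0, 100, 20)}),
            # Adding some outliers
            pd.DataFrame({'nb': np.random.randint(100, 200, 2)}),
        ],
        # Resetting the index
        ignore_index=True,
    )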
    
    # Computing IQR
    Q1 = df['nb'].quantile(0.25)
    Q3 = df['nb'].quantile(0.75)
    IQR = Q3 - Q1
    
    # Filtering Values between Q1-1.5IQR and Q3+1.5IQR
    filtered = df.query('(@Q1 - 1.5 * @IQR) <= nb <= (@Q3 + 1.5 * @IQR)')
    

    Then we can plot the result to check the difference. We observe that the outlier in the left boxplot (the cross at 183) does not appear anymore in the filtered series.

    # Plotting the result to check the difference
    df.join(filtered, rsuffix='_filtered').boxplot()
    

    Since writing this answer, I've written a post on this topic where you may find more information.
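
The query expression above can also be wrapped in a small helper so the whisker factor becomes a parameter. This is only a sketch: the function name `query_iqr`, the parameter `k`, and the toy data are assumptions for illustration, and the column name `nb` is reused from the example above.

```python
import pandas as pd

def query_iqr(df, k=1.5):
    """Filter column 'nb' to [Q1 - k*IQR, Q3 + k*IQR] using DataFrame.query."""
    Q1 = df['nb'].quantile(0.25)
    Q3 = df['nb'].quantile(0.75)
    IQR = Q3 - Q1
    # '@name' refers to a local Python variable inside the query string
    return df.query('(@Q1 - @k * @IQR) <= nb <= (@Q3 + @k * @IQR)')

# Hypothetical toy data with one obvious outlier (100)
df = pd.DataFrame({'nb': [10, 12, 11, 13, 12, 11, 100]})
filtered = query_iqr(df)
print(filtered['nb'].tolist())
```

A larger `k` loosens the filter; with a sufficiently large value no rows are removed at all.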

  • 2020-12-28 14:03

    You can also try the code below, which likewise computes the IQR. Using the lower and upper bounds derived from the IQR, it replaces the outliers in each column with the nearest bound instead of removing rows. The code handles the numeric columns of the dataframe one at a time, so each column is capped against its own bounds.

    Function:

        def mod_outlier(df):
            df1 = df.copy()
            # Work only on the numeric columns
            num = df1.select_dtypes(include='number')

            q1 = num.quantile(0.25)
            q3 = num.quantile(0.75)

            iqr = q3 - q1

            lower_bound = q1 - (1.5 * iqr)
            upper_bound = q3 + (1.5 * iqr)

            # Cap each numeric column at its own bounds; clip avoids the
            # chained assignment df[col][i] = ..., which may not write back
            for col in num.columns:
                df1[col] = num[col].clip(lower_bound[col], upper_bound[col])

            return df1
    

    Function call:

    df = mod_outlier(df)
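
The per-column capping described above can also be sketched without an explicit loop by passing per-column bounds to `DataFrame.clip` with `axis=1`. The function name `cap_outliers`, the parameter `k`, and the toy data are assumptions made for this illustration.

```python
import pandas as pd

def cap_outliers(df, k=1.5):
    """Cap numeric columns at [Q1 - k*IQR, Q3 + k*IQR]; other columns are untouched."""
    out = df.copy()
    num = out.select_dtypes(include='number')
    q1 = num.quantile(0.25)
    q3 = num.quantile(0.75)
    iqr = q3 - q1
    # axis=1 aligns the bounds (Series indexed by column name) with the columns
    out[num.columns] = num.clip(lower=q1 - k * iqr, upper=q3 + k * iqr, axis=1)
    return out

# Hypothetical toy data: 'a' contains an outlier, 'label' is non-numeric
df = pd.DataFrame({
    'a': [1, 2, 3, 4, 100],
    'b': [10, 20, 30, 40, 50],
    'label': list('vwxyz'),
})
capped = cap_outliers(df)
print(capped)
```

In this example the upper bound for column a is 7, so the value 100 is replaced by 7 while column b and the label column keep their values. The input dataframe is not modified.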
    
  • 2020-12-28 14:06

    Another approach using Series.between():

    iqr = df['col'][df['col'].between(df['col'].quantile(.25), df['col'].quantile(.75), inclusive='both')]
    

    Drawn out:

    q1 = df['col'].quantile(.25)
    q3 = df['col'].quantile(.75)
    # inclusive='both' replaces the boolean form, which was removed in pandas 2.0
    mask = df['col'].between(q1, q3, inclusive='both')
    iqr = df.loc[mask, 'col']
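
Note that the snippet above keeps only the values between Q1 and Q3, which is the middle half of the data. Here is a sketch of how the same `between` call might be widened to the Q1 - 1.5*IQR to Q3 + 1.5*IQR range from the question. The toy data and the names `middle_half` and `within_whiskers` are illustrative assumptions; `inclusive` is left at its default, which includes both endpoints.

```python
import pandas as pd

# Hypothetical toy data with one obvious outlier (100)
df = pd.DataFrame({'col': [10, 12, 11, 13, 12, 11, 100]})
q1 = df['col'].quantile(.25)
q3 = df['col'].quantile(.75)
iqr = q3 - q1

# Values between the quartiles only
middle_half = df.loc[df['col'].between(q1, q3), 'col']

# Values within the 1.5*IQR whiskers
within_whiskers = df.loc[df['col'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr), 'col']

print(middle_half.tolist())
print(within_whiskers.tolist())
```

The first selection drops 10 and 13 along with the outlier, while the second drops only 100.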
    
  • 2020-12-28 14:09

    Find the 1st and 3rd quartiles with quantile, then apply a boolean mask to the dataframe. To remove the outliers, use no_outliers; to select only the outliers, invert the condition in the mask.

    Q1 = df.col.quantile(0.25)
    Q3 = df.col.quantile(0.75)
    IQR = Q3 - Q1
    no_outliers = df.col[(Q1 - 1.5*IQR < df.col) & (df.col < Q3 + 1.5*IQR)]
    outliers = df.col[(Q1 - 1.5*IQR >= df.col) | (df.col >= Q3 + 1.5*IQR)]
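
Instead of writing the inverted condition by hand, the mask can be built once and negated with `~`. This is a minimal sketch on made-up data, and the name `inside` is an assumption. One caveat: comparisons with NaN are False, so with `~` any missing values end up in the outliers selection.

```python
import pandas as pd

# Hypothetical toy data with one obvious outlier (100)
df = pd.DataFrame({'col': [10, 12, 11, 13, 12, 11, 100]})
Q1 = df['col'].quantile(0.25)
Q3 = df['col'].quantile(0.75)
IQR = Q3 - Q1

# True for rows strictly inside the whisker range
inside = (Q1 - 1.5 * IQR < df['col']) & (df['col'] < Q3 + 1.5 * IQR)

no_outliers = df.loc[inside, 'col']
outliers = df.loc[~inside, 'col']  # '~' negates the boolean mask

print(no_outliers.tolist(), outliers.tolist())
```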
    