Is there a function that can remove outliers?

生来不讨喜 2021-01-19 10:17

I found a function that detects outliers in a column, but I do not know how to remove them.

Is there a function for excluding or removing outliers from a column?

4 Answers
  • 2021-01-19 10:37

    Here are two methods for one-dimensional datasets.

    Part 1: Using upper and lower limits of 3 standard deviations from the mean

    import numpy as np

    # Detect outliers in a one-dimensional dataset.
    def find_anomalies(data):
        # Start from an empty list on every call so results do not accumulate
        anomalies = []

        # Set the upper and lower limits to 3 standard deviations from the mean
        data_std = np.std(data)
        data_mean = np.mean(data)
        anomaly_cut_off = data_std * 3

        lower_limit = data_mean - anomaly_cut_off
        upper_limit = data_mean + anomaly_cut_off

        # Collect the values that fall outside the limits
        for value in data:
            if value > upper_limit or value < lower_limit:
                anomalies.append(value)
        return anomalies
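The function above only reports the outliers. To remove them, one option is to build a Boolean mask with the same 3-standard-deviation limits and keep the values inside them. This is a minimal sketch: the name `remove_outliers_std`, the `n_std` parameter, and the sample values are made up for illustration.

```python
import numpy as np

def remove_outliers_std(data, n_std=3.0):
    """Return only the values within n_std standard deviations of the mean."""
    data = np.asarray(data, dtype=float)
    cut_off = n_std * data.std()
    # True for values inside [mean - cut_off, mean + cut_off]
    mask = np.abs(data - data.mean()) <= cut_off
    return data[mask]

# Made-up sample: twenty ordinary values and one extreme value
values = np.array([10.0] * 20 + [1000.0])
cleaned = remove_outliers_std(values)
```

Because the mean and standard deviation are themselves pulled toward extreme values, this rule can miss outliers in very small samples.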
    
    

    Part 2: Using IQR (interquartile range)

    data = np.asarray(data)  # so the comparison below also works on plain lists
    q1, q3 = np.percentile(data, [25, 75])  # first and third quartiles
    iqr = q3 - q1  # the IQR value
    lower_bound = q1 - (1.5 * iqr)  # lower bound
    upper_bound = q3 + (1.5 * iqr)  # upper bound

    np.sum(data > upper_bound)  # how many data points are above the upper bound?
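The same bounds can be used to drop the outliers rather than count them. The sketch below keeps only the values between the lower and upper bound. The function name `remove_outliers_iqr`, the `k` parameter, and the sample array are assumptions made for illustration.

```python
import numpy as np

def remove_outliers_iqr(data, k=1.5):
    """Return only the values within k * IQR of the first and third quartiles."""
    data = np.asarray(data, dtype=float)
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - k * iqr
    upper_bound = q3 + k * iqr
    # Boolean mask: True for values that are NOT outliers
    mask = (data >= lower_bound) & (data <= upper_bound)
    return data[mask]

# Made-up sample with one obvious outlier
sample = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
filtered = remove_outliers_iqr(sample)
```

With this sample the quartiles are 3.25 and 7.75, so the bounds are -3.5 and 14.5, and only the value 100 is dropped.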
    
  • 2021-01-19 10:44

    An easy solution would be to use scipy.stats.zscore

    from scipy.stats import zscore
    # calculate z-score values
    df["zscore"] = zscore(df["Pre_TOTAL_PURCHASE_ADJ"])

    # create an `is_outlier` column of True/False values,
    # so that you can filter your dataframe accordingly
    # (|z| >= 1.96 flags roughly 5% of normally distributed values)
    df["is_outlier"] = df["zscore"].abs() >= 1.96
    
  • 2021-01-19 10:45

    I presume that by "remove the outliers" you mean "remove rows from the df dataframe which contain an outlier in the 'Pre_TOTAL_PURCHASE_ADJ' column." If this is incorrect, perhaps you could revise the question to make your meaning clear.

    Sample data are also helpful, rather than forcing would-be answerers to formulate their own.

    It's generally much more efficient to avoid iterating over the rows of a dataframe. For row selection, Boolean array indexing is a fast way of achieving your ends. Since you already have a predicate (a function returning a truth value) that identifies the rows you want to exclude, you can use it to build another dataframe that contains only the outliers or, by negating the predicate, only the non-outliers.

    Since @political_scientist has already given a practical solution using scipy.stats.zscore to produce the predicate values in a new is_outlier column, I will leave this answer as simple, general advice for working in numpy and pandas. Given that answer, the rows you want would be given by

    df[~df['is_outlier']]
    

    though it might be slightly more comprehensible to include the negation (~) in the generation of the selector column rather than in the indexing as above, renaming the column 'is_not_outlier'.
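As a rough end-to-end sketch of the Boolean indexing described above, the example below builds the predicate without iterating over rows and uses it to select both the non-outliers and the outliers. The data are made up, and the z-score is computed directly in pandas with `ddof=0`, which matches the default of `scipy.stats.zscore`.

```python
import pandas as pd

# Made-up sample data; only the column name comes from the question
df = pd.DataFrame({"Pre_TOTAL_PURCHASE_ADJ": [10.0] * 20 + [1000.0]})

col = df["Pre_TOTAL_PURCHASE_ADJ"]
# Population z-score (ddof=0), computed for the whole column at once
z = (col - col.mean()) / col.std(ddof=0)

# Boolean Series: True where the row holds an outlier
is_outlier = z.abs() >= 1.96

df_clean = df[~is_outlier]     # rows without outliers
df_outliers = df[is_outlier]   # rows with outliers only
```

Keeping the mask in a separate variable rather than a dataframe column avoids having to drop helper columns afterwards. Either approach selects the same rows.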

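  • 2021-01-19 10:59

    import pandas as pd

    def outlier():
        df1 = pd.read_csv("......\\train.csv")
        # return_type='both' yields the axes and a dict of the plot's line artists
        _, bp = df1.boxplot(return_type='both')
        # the "fliers" are the points drawn beyond the whiskers, i.e. the outliers
        outliers = [flier.get_ydata() for flier in bp["fliers"]]
        out_liers = [i.tolist() for i in outliers]
        return out_liers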
    