Filtering out outliers in Pandas dataframe with rolling median

花落未央 2021-01-13 11:20

I am trying to filter out some outliers from a scatter plot of GPS elevation displacements plotted against dates.

I'm trying to use df.rolling to compute a median and standard deviation for each window.

3 Answers
  • 2021-01-13 11:45

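    Just filter the DataFrame against the rolling bounds:

    # window is the rolling window size (number of rows)
    df['median'] = df['b'].rolling(window).median()
    df['std'] = df['b'].rolling(window).std()
    
    # keep rows within 3 rolling standard deviations of the rolling median;
    # the first window-1 rows have NaN bounds, compare False, and are dropped
    df = df[(df.b <= df['median'] + 3*df['std']) & (df.b >= df['median'] - 3*df['std'])]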
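
A self-contained sketch of the same filter on synthetic data, so it can be run as is. The data, the two injected spikes, and the window size of 20 are assumptions made for illustration and do not come from the question.

```python
import numpy as np
import pandas as pd

# Synthetic data (an assumption for illustration): a smooth signal plus noise.
rng = np.random.default_rng(0)
df = pd.DataFrame({"b": np.sin(np.linspace(0, 6, 200)) + rng.normal(0, 0.05, 200)})
df.loc[[50, 120], "b"] += 50.0  # inject two obvious outliers

window = 20  # hypothetical window size; tune for real data

rolling = df["b"].rolling(window)
df["median"] = rolling.median()
df["std"] = rolling.std()

# Keep rows within 3 rolling standard deviations of the rolling median.
# Rows whose window is incomplete have NaN bounds, compare False, and are dropped.
within = df["b"].between(df["median"] - 3 * df["std"], df["median"] + 3 * df["std"])
filtered = df[within]
```

One thing to watch: the spike itself inflates the rolling standard deviation. A lone spike among otherwise near-constant values sits about sqrt(n) sample standard deviations from the median of an n-row window, so a 3-std rule can only flag it when the window holds more than about 9 rows.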
    
  • 2021-01-13 11:46

    This is my take on creating a median filter:

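    import numpy as np
    
    def median_filter(num_std=3):
        def _median_filter(x):
            # x is the current window as a NumPy array (requires raw=True)
            _median = np.median(x)
            _std = np.std(x)
            s = x[-1]  # newest sample in the window
            # keep s if it lies within num_std standard deviations of the median
            return s if _median - num_std * _std <= s <= _median + num_std * _std else np.nan
        return _median_filter
    
    # returns a new Series: outliers and incomplete windows become NaN
    df.y.rolling(window).apply(median_filter(num_std=3), raw=True)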
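
The rolling call returns a new Series and leaves df untouched, so the result still has to be stored and the NaN rows dropped. A minimal usage sketch follows; the column name `y_filtered`, the synthetic ramp with one spike, and the window of 20 are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

def median_filter(num_std=3):
    """Return a function suitable for Series.rolling(...).apply(..., raw=True)."""
    def _median_filter(x):
        median = np.median(x)
        std = np.std(x)  # population std (ddof=0), unlike pandas' default ddof=1
        latest = x[-1]   # the newest sample in the trailing window
        inside = median - num_std * std <= latest <= median + num_std * std
        return latest if inside else np.nan
    return _median_filter

# Synthetic data (assumed for illustration): a linear ramp with one spike at row 30.
y = np.arange(60, dtype=float)
y[30] = 1000.0
df = pd.DataFrame({"y": y})

window = 20  # hypothetical window size
df["y_filtered"] = df["y"].rolling(window).apply(median_filter(num_std=3), raw=True)

# Outliers and the first window-1 rows are NaN; drop them to keep accepted samples.
cleaned = df.dropna(subset=["y_filtered"])
```

Here `cleaned` keeps the original index, so the surviving rows can still be matched back to their dates.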
    
  • 2021-01-13 11:47

    There might well be a more pandastic way to do this. This is a bit of a hack that relies on manually mapping the original df's index to each rolling window (I picked a window size of 6). The records up to and including row 6 are associated with the first window, row 7 with the second window, and so on.

    import numpy as np
    import pandas as pd
    
    n = 100
    df = pd.DataFrame(np.random.randint(0,n,size=(n,2)), columns = ['a','b'])
    
    ## set window size
    window=6
    std = 1  # number of standard deviations; just 1 here, can be larger with real data and larger windows
    
    ## create df with rolling stats, upper and lower bounds
    bounds = pd.DataFrame({'median':df['b'].rolling(window).median(),
                           'std':df['b'].rolling(window).std()})
    
    bounds['upper']=bounds['median']+bounds['std']*std
    bounds['lower']=bounds['median']-bounds['std']*std
    
    ## here, we set an identifier for each window which maps to the original df
    ## the first six rows are the first window; then each additional row is a new window
    bounds['window_id']=np.append(np.zeros(window),np.arange(1,n-window+1))
    
    ## then we can assign the original 'b' value back to the bounds df
    bounds['b']=df['b']
    
    ## and finally, keep only rows where b falls within the desired bounds
    bounds.loc[bounds.eval("lower<b<upper")]
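
The row-to-window mapping described above can be checked directly. The sketch below uses a tiny made-up series and a window of 3 (both assumptions chosen for readability) to show that the rolling result already shares the original index, with each entry summarizing the rows that end at that position.

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50, 60, 70, 80])
window = 3  # small window so the alignment is easy to read

rolled = s.rolling(window).median()

# The result shares s's index: rolled[i] summarizes rows i-window+1 .. i.
# The first window-1 entries are NaN because their windows are incomplete.
manual = [np.median(s.iloc[i - window + 1 : i + 1]) for i in range(window - 1, len(s))]
```

Because of this alignment, assigning `bounds['b'] = df['b']` lines up row by row, and the final filter works on the bounds frame without consulting `window_id`.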
    