Making pyplot.hist() first and last bins include outliers

余生分开走 2021-02-12 14:30

The pyplot.hist() documentation specifies that, when a range is set for a histogram, "lower and upper outliers are ignored".

Is it possible to make th

2 Answers
  • 2021-02-12 14:56

    No. Looking at matplotlib.axes.Axes.hist and the direct use of numpy.histogram I'm fairly confident in saying that there is no smarter solution than using clip (other than extending the bins that you histogram with).
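
As a minimal sketch of the two options mentioned above (clipping the data, or extending the outer bins), assuming a made-up sample `data` and illustrative bounds of 0 and 10 that are not part of the original question. It uses numpy.histogram directly so the counts can be inspected without drawing anything:

```python
import numpy as np

# Hypothetical sample: eight values inside [0, 10] plus two outliers.
data = np.array([-5.0, 0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 42.0])
lower, upper = 0.0, 10.0  # illustrative bounds (assumption)

# With range=..., values outside [lower, upper] are simply dropped.
counts_range, edges = np.histogram(data, bins=5, range=(lower, upper))

# Option 1: clip first, so outliers are counted in the first and last bins.
clipped = np.clip(data, lower, upper)
counts_clip, _ = np.histogram(clipped, bins=5, range=(lower, upper))

# Option 2: leave the data alone and stretch the outermost bin edges instead.
wide_edges = edges.copy()
wide_edges[0] = min(data.min(), lower)
wide_edges[-1] = max(data.max(), upper)
counts_wide, _ = np.histogram(data, bins=wide_edges)

print(counts_range.sum(), counts_clip.sum(), counts_wide.sum())
```

Both options give the same counts, but they plot differently: passing the clipped data to plt.hist keeps all bars the same width, while passing `bins=wide_edges` draws the outer bars at their true, wider extent.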

    I'd encourage you to look at the source of matplotlib.axes.Axes.hist (it's just Python code, though admittedly hist is slightly more complex than most of the Axes methods) - it is the best way to verify this kind of question.

    HTH

  • 2021-02-12 15:05

    I was also struggling with this, and didn't want to use .clip() because it could be misleading, so I wrote a little function (borrowing heavily from this) to indicate that the upper and lower bins contained outliers:

    import matplotlib.pyplot as plt
    import numpy as np

    def outlier_aware_hist(data, lower=None, upper=None):
        """Plot data on [lower, upper], adding outlier counts to the end bins."""
        data = np.asarray(data)

        # Test against None explicitly so that a bound of 0 is not ignored.
        if lower is None or lower < data.min():
            lower = data.min()
            lower_outliers = False
        else:
            lower_outliers = True

        if upper is None or upper > data.max():
            upper = data.max()
            upper_outliers = False
        else:
            upper_outliers = True

        n, bins, patches = plt.hist(data, range=(lower, upper), bins='auto')

        if lower_outliers:
            # Stack the values below the range onto the first bar and recolor it.
            n_lower_outliers = (data < lower).sum()
            patches[0].set_height(patches[0].get_height() + n_lower_outliers)
            patches[0].set_facecolor('c')
            patches[0].set_label('Lower outliers: ({:.2f}, {:.2f})'.format(data.min(), lower))

        if upper_outliers:
            # Stack the values above the range onto the last bar and recolor it.
            n_upper_outliers = (data > upper).sum()
            patches[-1].set_height(patches[-1].get_height() + n_upper_outliers)
            patches[-1].set_facecolor('m')
            patches[-1].set_label('Upper outliers: ({:.2f}, {:.2f})'.format(upper, data.max()))

        if lower_outliers or upper_outliers:
            plt.legend()
    

    You can also combine it with an automatic outlier detector (borrowed from here) like so:

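    import numpy as np

    def mad(data):
        """Median absolute deviation of data."""
        median = np.median(data)
        return np.median(np.abs(data - median))

    def calculate_bounds(data, z_thresh=3.5):
        """Bounds beyond which the modified z-score exceeds z_thresh."""
        median = np.median(data)
        # Dividing by 0.6745 makes the MAD comparable to a standard
        # deviation for normally distributed data.
        const = z_thresh * mad(data) / 0.6745
        return (median - const, median + const)

    outlier_aware_hist(data, *calculate_bounds(data))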
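
To make the bounds concrete, here is a small worked example of the same median-absolute-deviation rule on a made-up sample. The name `mad_bounds` and the sample values are assumptions for illustration only; the helper just repeats the formula above so the snippet runs on its own:

```python
import numpy as np

def mad_bounds(data, z_thresh=3.5):
    """Hypothetical helper: median +/- z_thresh * MAD / 0.6745."""
    data = np.asarray(data, dtype=float)
    median = np.median(data)
    mad = np.median(np.abs(data - median))
    half_width = z_thresh * mad / 0.6745
    return median - half_width, median + half_width

# Made-up sample: nine well-behaved values and one obvious outlier.
sample = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0])

lower, upper = mad_bounds(sample)

# Values outside the bounds are the ones that would be stacked onto
# the first or last bar by outlier_aware_hist.
outliers = sample[(sample < lower) | (sample > upper)]
print(lower, upper, outliers)
```

Here the median is 5.5 and the MAD is 2.5, so the bounds sit about 12.97 either side of the median and only the value 100 falls outside them. Because the median and MAD are barely affected by that one extreme value, the bounds stay tight around the bulk of the data, which is the reason for preferring them over mean and standard deviation in this setting.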
    
