Finding outliers in a data set

说谎 2021-02-07 07:38

I have a Python script that creates a list of lists of server uptime and performance data, where each sub-list (or 'row') contains a particular cluster's stats. For example,

4 Answers
  • 2021-02-07 07:44

    I think your best bet is to look into scipy's scoreatpercentile function (scipy.stats.scoreatpercentile). For instance, you could try excluding all the values that are above the 99th percentile.
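    As a rough sketch of the percentile cutoff described above: the helper below uses statistics.quantiles from the standard library in place of scipy's scoreatpercentile, purely so that it runs without extra packages. The function name, the sample response times, and the 90th-percentile cutoff (chosen because the sample is tiny) are all assumptions for illustration.

```python
import statistics

def drop_above_percentile(values, pct=99):
    """Return (cutoff, kept) where kept holds the values at or below the pct-th percentile."""
    # quantiles(n=100) returns 99 cut points; index pct - 1 is the pct-th percentile.
    cut_points = statistics.quantiles(values, n=100, method="inclusive")
    cutoff = cut_points[pct - 1]
    kept = [v for v in values if v <= cutoff]
    return cutoff, kept

# Hypothetical response times in ms, with one extreme value.
response_ms = [120, 125, 130, 118, 122, 127, 119, 124, 121, 900]
cutoff, kept = drop_above_percentile(response_ms, pct=90)
print(cutoff, kept)
```

    With a real data set you would likely keep pct=99 as suggested; a percentile cutoff always discards a fixed share of the data, whether or not those points are actually bad.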

    Mean and standard deviation are unreliable for this if your data are not roughly normally distributed, because both are themselves pulled around by extreme values.

    Generally it's good to have a rough visual idea of what your data looks like. There is matplotlib; I recommend you make some plots of your data with it before deciding on a plan.

  • 2021-02-07 07:48

    You need to calculate the mean (average) and standard deviation for the column. Standard deviation is a bit confusing, but the important fact is that, for roughly normally distributed data, about 2/3 of the values (roughly 68%) lie within

    Mean +/- StandardDeviation

    Generally anything outside Mean +/- 2 * StandardDeviation is an outlier, but you can tweak the multiplier.

    http://en.wikipedia.org/wiki/Standard_deviation

    So to be clear, you want to convert the data to standard deviations from the mean.

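    i.e.

    def getdeviations(x, mean, stddev):
        # Number of standard deviations that x lies from the mean.
        # Uses the built-in abs(); the math module has no abs function.
        return abs(x - mean) / stddev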
    

    NumPy has functions for computing the mean and standard deviation (numpy.mean and numpy.std).
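    Putting the pieces of this answer together, a minimal sketch might look like the following. It uses the standard-library statistics module rather than NumPy so that it runs without extra packages; the function name find_outliers, the sample uptime figures, and the default multiplier of 2 are assumptions for illustration.

```python
import statistics

def find_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stddev = statistics.pstdev(values)  # population standard deviation
    if stddev == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mean) / stddev > threshold]

# Hypothetical uptime percentages for one column of the data.
uptime_pct = [99.1, 99.3, 98.9, 99.0, 99.2, 99.1, 92.0, 99.4]
outliers = find_outliers(uptime_pct)
print(outliers)
```

    Note that the outlier itself inflates the standard deviation, so with small samples a single bad value can only ever reach a limited number of deviations from the mean.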

  • 2021-02-07 07:51

    One good way of identifying outliers visually is to make a boxplot (or box-and-whiskers plot), which shows the median, the quartiles above and below it, and the points that lie "far" from this box (see the Wikipedia entry: http://en.wikipedia.org/wiki/Box_plot). In R, there's a boxplot function to do just that.

    One way to discard/identify outliers programmatically is to use the MAD, or Median Absolute Deviation: the median of the absolute differences between each point and the median. Unlike the standard deviation, the MAD is not sensitive to outliers. As a rule of thumb, I sometimes treat all points more than 5*MAD away from the median as outliers.
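    A small sketch of the 5*MAD rule of thumb, in Python to match the question. The function name mad_outliers and the sample data are assumptions for illustration, and the multiplier k is left as a parameter since 5 is only a rule of thumb.

```python
import statistics

def mad_outliers(values, k=5.0):
    """Return the values lying more than k * MAD away from the median."""
    med = statistics.median(values)
    # MAD: the median of the absolute deviations from the median.
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return []  # at least half the values equal the median; the rule is not usable
    return [v for v in values if abs(v - med) > k * mad]

# Hypothetical uptime percentages for one column of the data.
uptime_pct = [99.1, 99.3, 98.9, 99.0, 99.2, 99.1, 92.0, 99.4]
outliers = mad_outliers(uptime_pct)
print(outliers)
```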

  • 2021-02-07 08:00

    Your stated goal of "finding badness" implies that it is not the outliers that you are looking for, but observations that fall above or below some threshold, and I would presume that the threshold would remain the same over time.

    As an example, if all of your servers were at 98 ± 0.1 % availability, a server at 100% availability would be an outlier, as would a server at 97.6% availability. But these may be within your desired limits.

    On the other hand, there may be good reasons a priori to want to be notified of any server at less than 95% availability, whether one server or many fall below this threshold.

    For this reason, a search for outliers may not provide the information that you are interested in. The thresholds could be determined statistically from historical data, e.g. by modeling error counts as Poisson or percent availability as beta-distributed variables. In an applied setting, these thresholds could probably be set directly from performance requirements.
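    A fixed-threshold check of this kind is simple to express. In the sketch below, the 95% limit comes from the example above, while the server names, the availability figures, and the function name are hypothetical.

```python
# Hypothetical requirement: alert on anything under 95% availability.
MIN_AVAILABILITY_PCT = 95.0

def below_threshold(availability_by_server, minimum=MIN_AVAILABILITY_PCT):
    """Return the servers whose availability is below the fixed minimum."""
    return {
        name: pct
        for name, pct in availability_by_server.items()
        if pct < minimum
    }

# Hypothetical availability percentages per server.
availability = {"web-01": 98.0, "web-02": 97.6, "web-03": 100.0, "db-01": 94.2}
flagged = below_threshold(availability)
print(flagged)
```

    Here web-02 and web-03 would be statistical outliers relative to a tight cluster around 98%, yet neither is flagged, because both meet the requirement.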
