Finding outliers in a data set

说谎 2021-02-07 07:38

I have a Python script that creates a list of lists of server uptime and performance data, where each sub-list (or 'row') contains a particular cluster's stats. For example,

4 answers
  •  走了就别回头了
    2021-02-07 08:00

    Your stated goal of "finding badness" implies that you are not really looking for outliers, but for observations that fall above or below some threshold, and I would presume that this threshold stays the same over time.

    As an example, if all of your servers were at 98 ± 0.1% availability, a server at 100% availability would be an outlier, as would a server at 97.6% availability. Yet both may be within your desired limits.

    On the other hand, there may be good reasons a priori to want to be notified of any server at less than 95% availability, whether one server or many fall below this threshold.

    For this reason, a search for outliers may not provide the information that you are interested in. The thresholds could be determined statistically from historical data, e.g. by modeling the error rate as a Poisson variable or the percent availability as a beta-distributed variable. In an applied setting, the thresholds could probably be set directly from performance requirements.
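
The distinction drawn in this answer can be sketched in Python. This is only an illustration under stated assumptions: the function names `mad_outliers`, `below_threshold` and `poisson_upper_limit`, the sample numbers, the cutoff `k=3` and the tail probability `alpha=0.01` are all hypothetical and not taken from the question. The outlier rule used here (distance from the median measured in median absolute deviations) is just one of several possible choices, and the beta model for availability is left out because it would need a third-party library.

```python
import math
import statistics


def mad_outliers(values, k=3.0):
    """Indices of values more than k median absolute deviations from the median.

    Assumed outlier rule for illustration; the source does not name one.
    """
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread to measure against
    return [i for i, v in enumerate(values) if abs(v - med) / mad > k]


def below_threshold(values, threshold=95.0):
    """Indices of values strictly below a fixed requirement."""
    return [i for i, v in enumerate(values) if v < threshold]


def poisson_upper_limit(historical_counts, alpha=0.01):
    """Smallest k with P(X <= k) >= 1 - alpha, where X ~ Poisson(lam)
    and lam is estimated as the mean of the historical counts."""
    lam = statistics.mean(historical_counts)
    k = 0
    pmf = math.exp(-lam)  # P(X = 0)
    cdf = pmf
    while cdf < 1 - alpha:
        k += 1
        pmf *= lam / k    # P(X = k) from P(X = k - 1)
        cdf += pmf
    return k


# Hypothetical availability figures (percent), echoing the example above.
healthy_fleet = [98.0, 98.1, 97.9, 98.0, 98.1, 97.9, 98.0, 98.0, 100.0, 97.6]
degraded_fleet = [94.0, 94.1, 93.9, 94.0]

print("healthy, outliers:", mad_outliers(healthy_fleet))
print("healthy, below 95%:", below_threshold(healthy_fleet))
print("degraded, outliers:", mad_outliers(degraded_fleet))
print("degraded, below 95%:", below_threshold(degraded_fleet))

# Hypothetical daily error counts used to derive an alert limit.
print("error-count limit:", poisson_upper_limit([2, 3, 1, 4, 2, 3, 2, 3]))
```

With these made-up numbers, the outlier rule flags the servers at 100% and 97.6% in the healthy fleet although none is below 95%, and it flags nothing in the degraded fleet although every server there is below 95%. The fixed threshold behaves the other way round, which is the behavior the answer argues for.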
