Recommended anomaly detection technique for simple, one-dimensional scenario?

Asked 2020-12-07 14:45

I have a scenario where I have several thousand instances of data. The data itself is represented as a single integer value. I want to be able to detect when an instance is an anomaly (an extreme outlier) relative to the rest of the data.

3 Answers
  • 2020-12-07 15:25

    Check out the three-sigma rule:

    mu  = mean of the data
    std = standard deviation of the data
    IF abs(x-mu) > 3*std  THEN  x is outlier
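
    A minimal NumPy sketch of this rule, assuming the data fits in a 1-D array (the function name and the synthetic data below are only for illustration):

    import numpy as np

    def three_sigma_outliers(values):
        # Flag values lying more than three standard deviations from the mean.
        values = np.asarray(values, dtype=float)
        mu, std = values.mean(), values.std()
        return np.abs(values - mu) > 3 * std

    # Example: a few thousand well-behaved values plus two injected outliers.
    rng = np.random.default_rng(0)
    data = np.concatenate([rng.normal(100, 15, 5000).round(), [500, -300]])
    print(data[three_sigma_outliers(data)])   # the injected outliers get flagged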
    

    An alternative method is the IQR outlier test:

    Q25 = 25th_percentile
    Q75 = 75th_percentile
    IQR = Q75 - Q25         // inter-quartile range
    IF (x < Q25 - 1.5*IQR) OR (Q75 + 1.5*IQR < x) THEN  x is a mild outlier
    IF (x < Q25 - 3.0*IQR) OR (Q75 + 3.0*IQR < x) THEN  x is an extreme outlier
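
    A matching NumPy sketch for the IQR test (the function is illustrative; pass factor=1.5 for mild outliers or factor=3.0 for extreme ones):

    import numpy as np

    def iqr_outliers(values, factor=1.5):
        # Flag values outside [Q25 - factor*IQR, Q75 + factor*IQR].
        values = np.asarray(values, dtype=float)
        q25, q75 = np.percentile(values, [25, 75])
        iqr = q75 - q25
        return (values < q25 - factor * iqr) | (values > q75 + factor * iqr)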
    

    This test is the one typically used by box plots (indicated by the whiskers):

    [figure: box plot showing the IQR box and whiskers]


    EDIT:

    For your case (simple 1D univariate data), I think my first suggestion is well suited. It is not, however, applicable to multivariate data.

    @smaclell suggested using K-means to find the outliers. Besides the fact that it is mainly a clustering algorithm (not really an outlier detection technique), the problem with k-means is that it requires knowing a good value for the number of clusters K in advance.

    A better-suited technique is DBSCAN, a density-based clustering algorithm. Basically, it grows regions of sufficiently high density into clusters, each of which is a maximal set of density-connected points.

    [figure: DBSCAN clustering result, with dense clusters and surrounding noise points]

    DBSCAN requires two parameters: epsilon and minPoints. It starts with an arbitrary point that has not been visited. It then finds all the neighbor points within distance epsilon of the starting point.

    If the number of neighbors is greater than or equal to minPoints, a cluster is formed. The starting point and its neighbors are added to this cluster and the starting point is marked as visited. The algorithm then repeats the evaluation process for all the neighbors recursively.

    If the number of neighbors is less than minPoints, the point is marked as noise.

    Once a cluster is fully expanded (every point within reach has been visited), the algorithm moves on to the remaining unvisited points and repeats the process until none remain.

    Finally, all points marked as noise are considered outliers.
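
    If you would rather not implement DBSCAN yourself, scikit-learn provides a DBSCAN estimator; here is a rough sketch for 1-D data, where eps and min_samples are placeholder guesses that have to be tuned to the scale of your values:

    import numpy as np
    from sklearn.cluster import DBSCAN

    rng = np.random.default_rng(0)
    values = np.concatenate([rng.normal(100, 15, 5000).round(), [500, -300]])
    X = values.reshape(-1, 1)              # DBSCAN expects a 2-D array of samples

    # eps is the neighborhood radius, min_samples corresponds to minPoints
    db = DBSCAN(eps=3.0, min_samples=10).fit(X)

    outliers = values[db.labels_ == -1]    # points labelled -1 are noise, i.e. the outliers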

  • 2020-12-07 15:27

    Both the three-sigma rule and the IQR test are often used; they are a couple of simple checks for detecting anomalies.

    The three-sigma rule is correct:

    mu  = mean of the data
    std = standard deviation of the data
    IF abs(x-mu) > 3*std  THEN  x is outlier
    

    The IQR test should be:

    Q25 = 25th_percentile
    Q75 = 75th_percentile
    IQR = Q75 - Q25         // inter-quartile range
    IF x > Q75 + 1.5*IQR OR x < Q25 - 1.5*IQR THEN  x is a mild outlier
    IF x > Q75 + 3.0*IQR OR x < Q25 - 3.0*IQR THEN  x is an extreme outlier
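
    A compact way to run both checks at once, assuming SciPy and NumPy are available (scipy.stats.zscore and scipy.stats.iqr are the helpers used; the synthetic data is only for illustration):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    data = np.concatenate([rng.normal(50, 10, 3000), [200.0, -120.0]])

    # three-sigma rule via z-scores
    three_sigma_mask = np.abs(stats.zscore(data)) > 3

    # IQR test; scipy.stats.iqr computes Q75 - Q25 directly
    q25, q75 = np.percentile(data, [25, 75])
    mild_mask = (data < q25 - 1.5 * stats.iqr(data)) | (data > q75 + 1.5 * stats.iqr(data))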
    
  • 2020-12-07 15:34

    There are a variety of clustering techniques you could use to try to identify the central tendencies within your data. One such algorithm, which we used heavily in my pattern recognition course, is K-Means. It would let you identify whether there is more than one related set of data, such as a bimodal distribution. It does require knowing how many clusters to expect, but it is fairly efficient and easy to implement.

    After you have the means, you could then try to find out whether any point is far from its nearest mean. You can define 'far' however you want, but I would recommend the suggestions by @Amro above as a good starting point; see the sketch below.
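
    As a rough scikit-learn sketch of that idea (the number of clusters, the synthetic bimodal data, and the 'far' threshold are placeholder choices you would tune for your own data):

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    values = np.concatenate([rng.normal(60, 5, 2500), rng.normal(140, 5, 2500), [400.0]])
    X = values.reshape(-1, 1)

    # k must be chosen up front; 2 matches the bimodal example above
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

    # distance from every point to the centre of its own cluster
    dist = np.abs(values - km.cluster_centers_[km.labels_, 0])

    # flag points 'far' from their centre, here with a three-sigma-style cut on the distances
    outliers = values[dist > dist.mean() + 3 * dist.std()]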

    For a more in-depth discussion of clustering algorithms, refer to the Wikipedia entry on clustering.
