Finding very large jumps in data

Submitted by 拜拜、爱过 on 2020-01-03 01:45:08

Question


I need to find very large jumps only so that I can find clusters and later the noise as well. The sample data is as under:

0.000000
0.000500
0.001500
0.003000
0.005500
0.008700
0.012400
0.000000
0.000500
0.001500
0.003000
0.005500
0.008700
0.012400
0.000000
0.000500
0.001500
0.003000
0.005500
0.008700
0.012400
0.000000
0.000500
0.001500
0.003000
0.005500
0.008700
0.012400
0.000000
0.000500
0.001500
0.003000
0.005500
0.008700
0.012400
0.000000
0.000500
0.001500
0.003000
0.005500
0.008700
0.012400
0.012400

I need to do this in python, but any generic algorithm would be welcome as well.

I have already tried

  1. Finding the distance between each consecutive pair of points.
  2. Finding the ratio of consecutive distances.
  3. Finding the closeness of consecutive ratios.

The problem I face is that the comparison function numpy.allclose() uses a static tolerance; for varying degrees of jump it stops working and produces false positives and false negatives.
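In NumPy, those three steps look roughly like this (a minimal sketch of the approach described above, not the asker's exact code; the fixed atol passed to numpy.isclose/allclose is the static approximation factor in question):

```python
import numpy as np

# Shortened version of the sample data: one ramp plus the start of the next
data = np.array([0.000000, 0.000500, 0.001500, 0.003000, 0.005500,
                 0.008700, 0.012400, 0.000000, 0.000500, 0.001500])

steps = np.diff(data)              # 1. distance between consecutive points
ratios = steps[1:] / steps[:-1]    # 2. ratio of consecutive distances
# 3. closeness of consecutive ratios -- the hardcoded atol is the weak point:
close = np.isclose(ratios[1:], ratios[:-1], atol=0.5)
```

Any fixed atol that separates the in-ramp ratios here will misclassify data whose jumps have a different scale, which is the failure mode described.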

(Graphs visualizing the data were attached; the bottom graph in each shows the total number of points.)


Answer 1:


First, you should visualise your problem to get a better understanding of what's going on:

import matplotlib.pyplot as plt
data = (0.000000, 0.000500, 0.001500, 0.003000, 0.005500, 0.008700,
        0.012400, 0.000000, 0.000500, 0.001500, 0.003000, 0.005500,
        0.008700, 0.012400, 0.000000, 0.000500, 0.001500, 0.003000,
        0.005500, 0.008700, 0.012400, 0.000000, 0.000500, 0.001500,
        0.003000, 0.005500, 0.008700, 0.012400, 0.000000, 0.000500,
        0.001500, 0.003000, 0.005500, 0.008700, 0.012400, 0.000000,
        0.000500, 0.001500, 0.003000, 0.005500, 0.008700, 0.012400, 
        0.012400)
plt.scatter(range(len(data)), data)
plt.show()

Second, you need to implement step detection, which is well described on Wikipedia: http://en.wikipedia.org/wiki/Step_detection

Choose a method you think would fit best and play around with it.
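As an illustration (my sketch, not the answer's code), the simplest member of that family is thresholding the absolute first difference; on the question's sample data the resets stand out as the largest steps:

```python
import numpy as np

# One cycle of the question's data, repeated to mimic the full sample
cycle = [0.0, 0.0005, 0.0015, 0.003, 0.0055, 0.0087, 0.0124]
data = np.array(cycle * 6 + [0.0124])

d = np.abs(np.diff(data))
threshold = 0.01              # chosen by eye for this sample; tune per dataset
step_idx = np.where(d > threshold)[0]
print(step_idx)               # -> [ 6 13 20 27 34]
```

The hardcoded threshold is the catch, of course, which is where the more principled methods on the wiki page come in.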

UPDATE

Just a thought: if all your data look similar to your example, you could also simply try a least-squares fit (http://en.wikipedia.org/wiki/Least_squares) of a sawtooth wave (http://en.wikipedia.org/wiki/Sawtooth_wave) to find the "jumps". This could be a starting point for further analysis.
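For instance, assuming the period is known (7 samples in the question's data), a ramp-per-period model can be fitted with plain linear least squares; this is only a sketch of the idea, not code from the answer:

```python
import numpy as np

cycle = [0.0, 0.0005, 0.0015, 0.003, 0.0055, 0.0087, 0.0124]
data = np.array(cycle * 6 + [0.0124])

period = 7                                  # assumed known here
phase = np.arange(len(data)) % period       # position within each ramp
A = np.column_stack([phase, np.ones(len(data))])
(slope, offset), *_ = np.linalg.lstsq(A, data, rcond=None)
residuals = data - A @ np.array([slope, offset])
# Large residuals mark where the data departs from the ideal sawtooth
```

If the period is unknown it becomes a nonlinear fit, but the residuals of a good fit would highlight the jumps either way.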




Answer 2:


Your approach works if you compute the closeness "manually" this way:

import numpy as np
data = np.array([0.000000, 0.000500, 0.001500, 0.003000, 0.005500, 0.008700,
        0.012400, 0.000000, 0.000500, 0.001500, 0.003000, 0.005500,
        0.008700, 0.012400, 0.000000, 0.000500, 0.001500, 0.003000,
        0.005500, 0.008700, 0.012400, 0.000000, 0.000500, 0.001500,
        0.003000, 0.005500, 0.008700, 0.012400, 0.000000, 0.000500,
        0.001500, 0.003000, 0.005500, 0.008700, 0.012400, 0.000000,
        0.000500, 0.001500, 0.003000, 0.005500, 0.008700, 0.012400, 
        0.012400])
steps = np.diff(data)               # distances between consecutive points
ratios = steps[1:] / steps[:-1]     # ratios of consecutive distances
jumps = np.diff(ratios)             # change between consecutive ratios
largest_jumps = np.max(jumps)
print(largest_jumps)

>> 3.31102877071



Answer 3:


This is nothing fancy, but you could try it. Using forward and backward differences you can detect single outliers. It will fail in complex cases where multiple outliers lie together, though, but for simple cases it might work:

import numpy as np
import matplotlib.pyplot as plt

x = np.arange(20)
# Synthetic data
sample = np.random.randn(20)
# Synthetic noise: spike a few random positions
sample[np.random.randint(0, 20, 5)] += np.random.randn(5) * 100

plt.plot(x, sample, 'o')
plt.show()

Get forward and backward derivatives (their absolute value as we care about magnitude of differences not direction):

d1 = np.r_[0, np.abs(np.diff(sample))]   # backward difference, padded at the start
d2 = np.r_[np.abs(np.diff(sample)), 0]   # forward difference, padded at the end

Mask the inliers with a threshold (they have at least another node close):

mask = (d1 < 5) | (d2 < 5)

Show results:

plt.plot(x[mask], sample[mask], 'o')
plt.show()

The hardcoded threshold of 5 in this example could be replaced by something adaptive, such as the mean or the median plus a few standard deviations.
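One concrete way to do that (my own suggestion, not part of the answer) is a median + MAD threshold, which adapts to the scale of the differences; on the question's data it isolates exactly the resets:

```python
import numpy as np

cycle = [0.0, 0.0005, 0.0015, 0.003, 0.0055, 0.0087, 0.0124]
data = np.array(cycle * 6 + [0.0124])

d = np.abs(np.diff(data))
med = np.median(d)
mad = np.median(np.abs(d - med))          # median absolute deviation
threshold = med + 3 * 1.4826 * mad        # 1.4826 rescales MAD to a Gaussian std
jump_idx = np.where(d > threshold)[0]
```

Unlike a fixed constant, this threshold scales with the data, so the same code works when the typical step size changes.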

Again, this is nothing fancy; as @septi pointed out, there is a lot of theory in outlier detection and no easy approach solves every problem. For more information you can look into density-based outlier detection, which I think is suitable for your problem.
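To illustrate that direction (assuming scikit-learn is available; this is my sketch, not the answer's code), DBSCAN labels dense regions as clusters and isolated points as noise (-1), which matches the asker's goal of finding clusters and then the noise:

```python
import numpy as np
from sklearn.cluster import DBSCAN

cycle = [0.0, 0.0005, 0.0015, 0.003, 0.0055, 0.0087, 0.0124]
data = np.array(cycle * 6 + [0.0124]).reshape(-1, 1)   # DBSCAN expects 2-D input

# eps chosen for this sample's scale; points within eps of enough
# neighbours form clusters, everything else is labelled -1 (noise)
labels = DBSCAN(eps=0.002, min_samples=3).fit_predict(data)
```

With these parameters the repeated values are all dense enough to land in clusters; on noisier data the -1 labels would pick out the outliers directly.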



Source: https://stackoverflow.com/questions/30366142/finding-very-large-jumps-in-data
