Question
I need to find only the very large jumps, so that I can find the clusters and later the noise as well. The sample data is as follows:
0.000000
0.000500
0.001500
0.003000
0.005500
0.008700
0.012400
0.000000
0.000500
0.001500
0.003000
0.005500
0.008700
0.012400
0.000000
0.000500
0.001500
0.003000
0.005500
0.008700
0.012400
0.000000
0.000500
0.001500
0.003000
0.005500
0.008700
0.012400
0.000000
0.000500
0.001500
0.003000
0.005500
0.008700
0.012400
0.000000
0.000500
0.001500
0.003000
0.005500
0.008700
0.012400
0.012400
I need to do this in Python, but any generic algorithm would be welcome as well.
I have already tried:
- Finding the distance between each consecutive pair of points.
- Finding the ratio of consecutive distances.
- Finding the closeness of consecutive ratios.
The problem I face is that the comparison function numpy.allclose() uses a static tolerance, so for varying degrees of jumps it stops working and gives false positives and false negatives.
Here are some graphs visualizing the data. The bottom graph in each shows the total number of points.
Answer 1:
First, you should visualise your problem to get a better understanding of what's going on:
import matplotlib.pyplot as plt
data = (0.000000, 0.000500, 0.001500, 0.003000, 0.005500, 0.008700,
0.012400, 0.000000, 0.000500, 0.001500, 0.003000, 0.005500,
0.008700, 0.012400, 0.000000, 0.000500, 0.001500, 0.003000,
0.005500, 0.008700, 0.012400, 0.000000, 0.000500, 0.001500,
0.003000, 0.005500, 0.008700, 0.012400, 0.000000, 0.000500,
0.001500, 0.003000, 0.005500, 0.008700, 0.012400, 0.000000,
0.000500, 0.001500, 0.003000, 0.005500, 0.008700, 0.012400,
0.012400)
plt.scatter(range(len(data)), data)
plt.show()
Second, you need to implement a step detection, which is well described on the wiki: http://en.wikipedia.org/wiki/Step_detection
Choose a method you think would fit best and play around with it.
UPDATE
Just a thought: if all your data look similar to your example, you could also simply try to make a sawtooth wave (http://en.wikipedia.org/wiki/Sawtooth_wave) least square fit (http://en.wikipedia.org/wiki/Least_squares) to find the "jumps". This could be a starting point for further analysis.
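As a minimal illustration of the step-detection idea (a plain difference threshold, not one of the formal methods from the wiki page), the resets in the sample above can be isolated by flagging first differences far above the typical step size:

```python
import numpy as np

# The sample from the question: six sawtooth periods plus a trailing point.
data = np.array([0.0, 0.0005, 0.0015, 0.003, 0.0055, 0.0087, 0.0124] * 6
                + [0.0124])

diffs = np.diff(data)

# Flag steps whose magnitude is well above the median step size.
# The factor 3 is an arbitrary choice for this example, not a tuned value.
threshold = 3 * np.median(np.abs(diffs))
jump_indices = np.where(np.abs(diffs) > threshold)[0] + 1

print(jump_indices)  # a new cluster begins at indices 7, 14, 21, 28, 35
```

For this sample the median step is about 0.0025, so the large -0.0124 drops stand out clearly; on noisier data the multiplier would need tuning, which is exactly where the step-detection literature comes in.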
Answer 2:
Your approach works if you compute the closeness "manually" this way:
import numpy as np
data = np.array([0.000000, 0.000500, 0.001500, 0.003000, 0.005500, 0.008700,
0.012400, 0.000000, 0.000500, 0.001500, 0.003000, 0.005500,
0.008700, 0.012400, 0.000000, 0.000500, 0.001500, 0.003000,
0.005500, 0.008700, 0.012400, 0.000000, 0.000500, 0.001500,
0.003000, 0.005500, 0.008700, 0.012400, 0.000000, 0.000500,
0.001500, 0.003000, 0.005500, 0.008700, 0.012400, 0.000000,
0.000500, 0.001500, 0.003000, 0.005500, 0.008700, 0.012400,
0.012400])
steps = data[1:] - data[:-1]
ratios = 1. * steps[1:] / steps[:-1]
jumps = ratios[1:] - ratios[:-1]
largest_jumps = np.max(jumps)
print(largest_jumps)
# 3.31102877071
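As a follow-up sketch: instead of reporting only the largest jump, the same ratio array can be thresholded to locate where each cluster restarts. The cutoff 2.5 is picked by eye for this sample, not a general-purpose value:

```python
import numpy as np

data = np.array([0.0, 0.0005, 0.0015, 0.003, 0.0055, 0.0087, 0.0124] * 6
                + [0.0124])

steps = data[1:] - data[:-1]
ratios = steps[1:] / steps[:-1]

# A ratio far from 1 means the step size changed abruptly; at each reset
# the ratio here is about -3.35. The cutoff 2.5 is an ad-hoc choice.
starts = np.where(np.abs(ratios) > 2.5)[0] + 2  # +2 maps back to data indices

print(starts)  # clusters restart at data indices 7, 14, 21, 28, 35
```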
Answer 3:
This is nothing fancy, but you could try it. Using forward and backward differences you can detect single outliers. It will fail in more complex cases where multiple outliers lie next to each other, but for simple cases it might work:
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(20)
# Synthetic data
sample = np.random.randn(20)
# Synthetic noise
sample[np.random.randint(0, 20, 5)] += np.random.randn(5) * 100
plt.plot(x, sample, 'o')
Get the forward and backward differences (their absolute values, as we care about the magnitude of the differences, not their direction):
d1 = np.r_[0, np.abs(sample[1:] - sample[:-1])]
d2 = np.r_[np.abs(sample[1:] - sample[:-1]), 0]
Mask the inliers with a threshold (each inlier has at least one other node close by):
mask = (d1 < 5) | (d2 < 5)
Show results:
plt.plot(x[mask], sample[mask], 'o')
The hardcoded 5 in the example could be replaced by the mean, or the median plus the standard deviation, or something similar.
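A sketch of such an adaptive cutoff on deterministic toy data, using the median plus a multiple of the median absolute deviation (a robust stand-in for the plain std, which the spikes themselves would inflate). The spike positions and the factor 3 are assumptions for this example:

```python
import numpy as np

# Deterministic toy data: a smooth signal with three injected spikes
# (positions 4, 10 and 16 are arbitrary choices for this sketch).
sample = np.sin(np.linspace(0, 3, 20))
sample[[4, 10, 16]] += 50.0

d1 = np.r_[0, np.abs(np.diff(sample))]   # difference to the previous point
d2 = np.r_[np.abs(np.diff(sample)), 0]   # difference to the next point

# Robust adaptive cutoff: median plus a multiple of the median absolute
# deviation of the differences. The factor 3 is an assumption.
diffs = np.abs(np.diff(sample))
threshold = np.median(diffs) + 3 * np.median(np.abs(diffs - np.median(diffs)))

mask = (d1 < threshold) | (d2 < threshold)
print(np.where(~mask)[0])  # recovers the spike positions 4, 10, 16
```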
Again, this is nothing fancy; as @septi pointed out, there is a lot of theory in outlier detection and no easy approach solves every problem. For more information you can check density-based outlier detection, which I think is suitable for your problem.
Source: https://stackoverflow.com/questions/30366142/finding-very-large-jumps-in-data