Question
I am doing SPC analysis using numpy/pandas.
Part of this is checking data series against the Nelson rules and the Western Electric rules.
For instance (rule 2 from the Nelson rules): Check if nine (or more) points in a row are on the same side of the mean.
Now I could simply implement checking a rule like this by iterating over the array.
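For concreteness, the straightforward loop I have in mind would look roughly like this (the function name and the hard-coded default of 9 are just for illustration):

import numpy as np

def nine_same_side_loop(x, n=9):
    # Naive baseline: True if n (or more) consecutive points fall on the
    # same side of the mean; a point exactly at the mean breaks the run.
    signs = np.sign(x - x.mean())
    run, prev = 0, 0
    for s in signs:
        if s != 0 and s == prev:
            run += 1
        else:
            run = 1 if s != 0 else 0
        prev = s
        if run >= n:
            return True
    return False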
- But before I do that, I'm checking here on SO: does numpy/pandas have a way to do this without iteration?
- In any case: What is the "numpy-ic" way to implement a check like the one described above?
Answer 1:
As I mentioned in a comment, you may want to try using some stride tricks.
First, let's make an array of the signs of your anomalies; we can cast it to np.int8 to save some space:
anomalies = x - x.mean()
signs = np.sign(anomalies).astype(np.int8)
Now for the strides. If you want to consider N consecutive points, you'll use
from numpy.lib.stride_tricks import as_strided
strided = as_strided(signs, strides=(signs.itemsize, signs.itemsize), shape=(signs.size, N))
That gives us a (x.size, N) rolling array: the first row is x[0:N], the second x[1:N+1], and so on. Of course, the last N-1 rows are meaningless (they read past the end of signs), so from now on we'll use
strided = strided[:-N+1]
Let's sum along the rows:
consecutives = strided.sum(axis=-1)
That gives us an array of size (x.size-N+1) of values between -N and +N: we just have to find where the absolute values are N:
(indices,) = np.nonzero(np.abs(consecutives) == N)
indices is the array of the indices i of your array x for which the values x[i:i+N] are on the same side of the mean...
Example with x = np.random.rand(10) and N = 3:
>>> x
array([ 0.57016436,  0.79360943,  0.89535982,  0.83632245,  0.31046202,
        0.91398363,  0.62358298,  0.72148491,  0.99311681,  0.94852957])
>>> signs = np.sign(x - x.mean()).astype(np.int8)
>>> signs
array([-1,  1,  1,  1, -1,  1, -1, -1,  1,  1], dtype=int8)
>>> strided = as_strided(signs, strides=(1, 1), shape=(signs.size, 3))
>>> strided
array([[  -1,    1,    1],
       [   1,    1,    1],
       [   1,    1,   -1],
       [   1,   -1,    1],
       [  -1,    1,   -1],
       [   1,   -1,   -1],
       [  -1,   -1,    1],
       [  -1,    1,    1],
       [   1,    1, -106],
       [   1, -106,  -44]], dtype=int8)
>>> consecutive = strided[:-N+1].sum(axis=-1)
>>> consecutive
array([ 1,  3,  1,  1, -1, -1, -1,  1])
>>> np.nonzero(np.abs(consecutive) == N)
(array([1]),)
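Putting the pieces together, one hedged way to wrap the recipe into a function (the name is mine; the shape is clamped to signs.size - N + 1 up front so the meaningless out-of-bounds rows are never built):

import numpy as np
from numpy.lib.stride_tricks import as_strided

def same_side_run_starts(x, N=9):
    # Indices i such that x[i:i+N] all lie on the same side of the mean.
    signs = np.sign(x - x.mean()).astype(np.int8)
    strided = as_strided(signs,
                         strides=(signs.itemsize, signs.itemsize),
                         shape=(signs.size - N + 1, N))   # no row reads past the end
    consecutives = strided.sum(axis=-1)
    return np.flatnonzero(np.abs(consecutives) == N)

On NumPy 1.20+, np.lib.stride_tricks.sliding_window_view(signs, N) builds the same (read-only) window view without the manual stride arithmetic.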
Answer 2:
import numpy as np
x = np.random.rand(100)
f = np.sign(x - x.mean())
c = np.cumsum(f)
d = c[9:] - c[:-9]
print(np.max(d), np.min(d))
If np.max(d) == 9 or np.min(d) == -9, then there are nine (or more) points in a row on the same side of the mean.
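A self-contained sketch of that check (the function name and the prepended 0, which lets the window starting at index 0 be counted as well, are my additions):

import numpy as np

def nine_in_a_row(x, n=9):
    f = np.sign(x - x.mean())
    c = np.cumsum(np.r_[0, f])              # c[i] = sum of the first i signs
    d = c[n:] - c[:-n]                      # sum over each window of n consecutive signs
    return np.flatnonzero(np.abs(d) == n)   # start index of every violating window

x = np.random.rand(100)
print(nine_in_a_row(x))                     # usually empty for uncorrelated data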
Or you can use the following code to calculate the length of every run:
np.diff(np.where(np.diff(np.r_[-2,f,-2]))[0])
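A sketch of how those run lengths might be combined with the sign of each run to flag a rule-2 violation (the variable names are illustrative; the -2 sentinel is the same trick as above):

import numpy as np

x = np.random.rand(100)
f = np.sign(x - x.mean())

# Boundaries sit wherever the padded sign sequence changes value; the -2
# sentinels guarantee the first and last runs are bounded on both sides.
boundaries = np.where(np.diff(np.r_[-2, f, -2]))[0]
lengths = np.diff(boundaries)          # length of every run
values = f[boundaries[:-1]]            # sign (+1, -1 or 0) of every run
rule2_broken = np.any((lengths >= 9) & (values != 0))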
Answer 3:
Given data and a minimal length, you could check whether the array
np.diff(np.cumsum(np.sign(data - np.mean(data))), length)
contains a zero: length consecutive points on the same side give length equal signs, so the cumulative sum is locally linear there and its length-th order difference vanishes on that window.
Answer 4:
Another possibility: use correlate or convolve.
>>> a = np.random.randn(50)
>>> b = (a - a.mean()) > 0
>>> b.astype(int)
array([0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1,
1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0,
1, 1, 1, 1])
>>> c = np.correlate(b, np.ones(3), mode='valid')
>>> c
array([ 2., 2., 1., 1., 1., 1., 0., 0., 1., 2., 3., 2., 2.,
1., 1., 0., 0., 1., 2., 3., 3., 3., 3., 3., 2., 2.,
2., 2., 2., 1., 1., 1., 1., 2., 1., 2., 2., 2., 1.,
0., 0., 1., 2., 2., 2., 2., 3., 3.])
>>> c.max() == 3
True
>>> c.min() == 0
True
It will be slower than HYRY's cumsum version.
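For the convolve spelling mentioned at the top of this answer, a hedged sketch that sums ±1 signs instead of a 0/1 indicator, so a single window catches runs on either side of the mean (N and the variable names are my choices):

import numpy as np

a = np.random.randn(50)
N = 9
s = np.sign(a - a.mean())
c = np.convolve(s, np.ones(N), mode='valid')   # sum of each window of N consecutive signs
rule2_broken = bool(np.any(np.abs(c) == N))    # True if N or more points in a row on one side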
Aside: there is a runstest in statsmodels for testing similar runs.
Source: https://stackoverflow.com/questions/12370349/reasoning-about-consecutive-data-points-without-using-iteration