Question
I am doing SPC analysis using numpy/pandas.
Part of this is checking data series against the Nelson rules and the Western Electric rules.
For instance (rule 2 from the Nelson rules): Check if nine (or more) points in a row are on the same side of the mean.
Now I could simply implement checking a rule like this by iterating over the array.
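For concreteness, the straightforward loop I have in mind would look roughly like this (the function name and the hard-coded default of 9 are just for illustration):

import numpy as np

def nine_same_side_loop(x, n=9):
    # Naive baseline: True if n (or more) consecutive points fall on the
    # same side of the mean; a point exactly at the mean breaks the run.
    signs = np.sign(x - x.mean())
    run, prev = 0, 0
    for s in signs:
        if s != 0 and s == prev:
            run += 1
        else:
            run = 1 if s != 0 else 0
        prev = s
        if run >= n:
            return True
    return False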
- But before I do that, I'm checking here on SO: does numpy/pandas have a way to do this without iteration?
- In any case: What is the "numpy-ic" way to implement a check like the one described above?
Answer 1:
As I mentioned in a comment, you may want to try using some stride tricks.
First, let's make an array of the signs of your anomalies; we can cast it to np.int8 to save some space:
anomalies = x - x.mean()
signs = np.sign(anomalies).astype(np.int8)
Now for the strides. If you want to consider N consecutive points, you'll use
from numpy.lib.stride_tricks import as_strided
strided = as_strided(signs, strides=(signs.itemsize, signs.itemsize), shape=(signs.size, N))
That gives us a (x.size, N) rolling array: the first row is x[0:N], the second x[1:N+1], and so on. Of course, the last N-1 rows are meaningless (they read past the end of signs), so from now on we'll use
strided = strided[:-N+1]
Let's sum along the rows:
consecutives = strided.sum(axis=-1)
That gives us an array of size (x.size-N+1) of values between -N and +N: we just have to find where the absolute values are N:
(indices,) = np.nonzero(np.abs(consecutives) == N)
indices is the array of the indices i of your array x for which the values x[i:i+N] are on the same side of the mean...
Example with x = np.random.rand(10) and N = 3:
>>> x
array([ 0.57016436,  0.79360943,  0.89535982,  0.83632245,  0.31046202,
        0.91398363,  0.62358298,  0.72148491,  0.99311681,  0.94852957])
>>> signs = np.sign(x - x.mean()).astype(np.int8)
>>> signs
array([-1,  1,  1,  1, -1,  1, -1, -1,  1,  1], dtype=int8)
>>> strided = as_strided(signs, strides=(1, 1), shape=(signs.size, 3))
>>> strided
array([[  -1,    1,    1],
       [   1,    1,    1],
       [   1,    1,   -1],
       [   1,   -1,    1],
       [  -1,    1,   -1],
       [   1,   -1,   -1],
       [  -1,   -1,    1],
       [  -1,    1,    1],
       [   1,    1, -106],
       [   1, -106,  -44]], dtype=int8)
>>> consecutive = strided[:-N+1].sum(axis=-1)
>>> consecutive
array([ 1,  3,  1,  1, -1, -1, -1,  1])
>>> np.nonzero(np.abs(consecutive) == N)
(array([1]),)
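Putting the pieces together, one hedged way to wrap the recipe into a function (the name is mine; the shape is clamped to signs.size - N + 1 up front so the meaningless out-of-bounds rows are never built):

import numpy as np
from numpy.lib.stride_tricks import as_strided

def same_side_run_starts(x, N=9):
    # Indices i such that x[i:i+N] all lie on the same side of the mean.
    signs = np.sign(x - x.mean()).astype(np.int8)
    strided = as_strided(signs,
                         strides=(signs.itemsize, signs.itemsize),
                         shape=(signs.size - N + 1, N))   # no row reads past the end
    consecutives = strided.sum(axis=-1)
    return np.flatnonzero(np.abs(consecutives) == N)

On NumPy 1.20+, np.lib.stride_tricks.sliding_window_view(signs, N) builds the same (read-only) window view without the manual stride arithmetic.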
Answer 2:
import numpy as np
x = np.random.rand(100)
f = np.sign(x - x.mean())
c = np.cumsum(f)
d = c[9:] - c[:-9]
print(np.max(d), np.min(d))
If np.max(d) == 9 or np.min(d) == -9, then there are nine (or more) points in a row on the same side of the mean.
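A self-contained sketch of that check (the function name and the prepended 0, which lets the window starting at index 0 be counted as well, are my additions):

import numpy as np

def nine_in_a_row(x, n=9):
    f = np.sign(x - x.mean())
    c = np.cumsum(np.r_[0, f])              # c[i] = sum of the first i signs
    d = c[n:] - c[:-n]                      # sum over each window of n consecutive signs
    return np.flatnonzero(np.abs(d) == n)   # start index of every violating window

x = np.random.rand(100)
print(nine_in_a_row(x))                     # usually empty for uncorrelated data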
Or you can use the following code to calculate the length of every run:
np.diff(np.where(np.diff(np.r_[-2,f,-2]))[0])
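A sketch of how those run lengths might be combined with the sign of each run to flag a rule-2 violation (the variable names are illustrative; the -2 sentinel is the same trick as above):

import numpy as np

x = np.random.rand(100)
f = np.sign(x - x.mean())

# Boundaries sit wherever the padded sign sequence changes value; the -2
# sentinels guarantee the first and last runs are bounded on both sides.
boundaries = np.where(np.diff(np.r_[-2, f, -2]))[0]
lengths = np.diff(boundaries)          # length of every run
values = f[boundaries[:-1]]            # sign (+1, -1 or 0) of every run
rule2_broken = np.any((lengths >= 9) & (values != 0))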
Answer 3:
Given data and a minimal length, you could check whether the array
np.diff(np.cumsum(np.sign(data - np.mean(data))), length)
contains a zero: length consecutive points on the same side give length equal signs, so the cumulative sum is locally linear there and its length-th order difference vanishes on that window.
Answer 4:
Another possibility: use correlate or convolve.
>>> a = np.random.randn(50)
>>> b = (a - a.mean()) > 0
>>> b.astype(int)
array([0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1,
1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0,
1, 1, 1, 1])
>>> c = np.correlate(b, np.ones(3), mode='valid')
>>> c
array([ 2., 2., 1., 1., 1., 1., 0., 0., 1., 2., 3., 2., 2.,
1., 1., 0., 0., 1., 2., 3., 3., 3., 3., 3., 2., 2.,
2., 2., 2., 1., 1., 1., 1., 2., 1., 2., 2., 2., 1.,
0., 0., 1., 2., 2., 2., 2., 3., 3.])
>>> c.max() == 3
True
>>> c.min() == 0
True
It will be slower than HYRY's cumsum version.
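For the convolve spelling mentioned at the top of this answer, a hedged sketch that sums ±1 signs instead of a 0/1 indicator, so a single window catches runs on either side of the mean (N and the variable names are my choices):

import numpy as np

a = np.random.randn(50)
N = 9
s = np.sign(a - a.mean())
c = np.convolve(s, np.ones(N), mode='valid')   # sum of each window of N consecutive signs
rule2_broken = bool(np.any(np.abs(c) == N))    # True if N or more points in a row on one side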
Aside: there is a runstest in statsmodels for testing similar runs.
Source: https://stackoverflow.com/questions/12370349/reasoning-about-consecutive-data-points-without-using-iteration