I have a NumPy array [1,2,3,4,5,6,7,8,9,10,11,12,13,14]
and want to have an array structured like [[1,2,3,4], [2,3,4,5], [3,4,5,6], ..., [11,12,13,14]]
The fastest way seems to be to preallocate the array, given as option 7 right at the bottom of this answer.
>>> import numpy as np
>>> A=np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14])
>>> A
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])
>>> np.array(zip(A,A[1:],A[2:],A[3:]))
array([[ 1, 2, 3, 4],
[ 2, 3, 4, 5],
[ 3, 4, 5, 6],
[ 4, 5, 6, 7],
[ 5, 6, 7, 8],
[ 6, 7, 8, 9],
[ 7, 8, 9, 10],
[ 8, 9, 10, 11],
[ 9, 10, 11, 12],
[10, 11, 12, 13],
[11, 12, 13, 14]])
>>>
You can easily adapt this to do it for variable chunk size.
>>> n=5
>>> np.array(zip(*(A[i:] for i in range(n))))
array([[ 1, 2, 3, 4, 5],
[ 2, 3, 4, 5, 6],
[ 3, 4, 5, 6, 7],
[ 4, 5, 6, 7, 8],
[ 5, 6, 7, 8, 9],
[ 6, 7, 8, 9, 10],
[ 7, 8, 9, 10, 11],
[ 8, 9, 10, 11, 12],
[ 9, 10, 11, 12, 13],
[10, 11, 12, 13, 14]])
You may wish to compare performance between this and using itertools.islice
.
>>> from itertools import islice
>>> n=4
>>> np.array(zip(*[islice(A,i,None) for i in range(n)]))
array([[ 1, 2, 3, 4],
[ 2, 3, 4, 5],
[ 3, 4, 5, 6],
[ 4, 5, 6, 7],
[ 5, 6, 7, 8],
[ 6, 7, 8, 9],
[ 7, 8, 9, 10],
[ 8, 9, 10, 11],
[ 9, 10, 11, 12],
[10, 11, 12, 13],
[11, 12, 13, 14]])
1. timeit np.array(zip(A,A[1:],A[2:],A[3:]))
10000 loops, best of 3: 92.9 us per loop
2. timeit np.array(zip(*(A[i:] for i in range(4))))
10000 loops, best of 3: 101 us per loop
3. timeit np.array(zip(*[islice(A,i,None) for i in range(4)]))
10000 loops, best of 3: 101 us per loop
4. timeit numpy.array([ A[i:i+4] for i in range(len(A)-3) ])
10000 loops, best of 3: 37.8 us per loop
5. timeit numpy.array(list(chunks(A, 4)))
10000 loops, best of 3: 43.2 us per loop
6. timeit numpy.array(byN(A, 4))
10000 loops, best of 3: 100 us per loop
# Does preallocation of the array help? (11 is from len(A)+1-4)
7. timeit B=np.zeros(shape=(11, 4),dtype=np.int32)
100000 loops, best of 3: 2.19 us per loop
timeit for i in range(4):B[:,i]=A[i:11+i]
10000 loops, best of 3: 20.9 us per loop
total 23.1us per loop
As len(A) increases (20000) 4 and 5 converge to be equivalent speed (44 ms). 1,2,3 and 6 all remain about 3 times slower (135 ms). 7 is much faster (1.36 ms).
Quick&dirty solution:
>>> a = numpy.arange(1,15)
>>> numpy.array([ a[i:i+4] for i in range(len(a)-3) ])
array([[ 1, 2, 3, 4],
[ 2, 3, 4, 5],
[ 3, 4, 5, 6],
[ 4, 5, 6, 7],
[ 5, 6, 7, 8],
[ 6, 7, 8, 9],
[ 7, 8, 9, 10],
[ 8, 9, 10, 11],
[ 9, 10, 11, 12],
[10, 11, 12, 13],
[11, 12, 13, 14]])
Using itertools, and assuming Python 2.6:
import itertools
def byN(iterable, N):
itrs = itertools.tee(iter(iterable), N)
for n in range(N):
for i in range(n):
next(itrs[n], None)
return zip(*itrs)
aby4 = numpy.array(byN(thearray, 4))
I know of no Python stdlib function that quite does that. It's easy enough to do. Here is a generator that basically does it:
def chunks(sequence, length):
for index in xrange(0, len(sequence) - length + 1):
yield sequence[index:index + length]
You could use it like so
>>> import numpy
>>> a = numpy.arange(1, 15)
>>> a
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])
>>> numpy.array(list(chunks(a, 4)))
array([[ 1, 2, 3, 4],
[ 2, 3, 4, 5],
[ 3, 4, 5, 6],
[ 4, 5, 6, 7],
[ 5, 6, 7, 8],
[ 6, 7, 8, 9],
[ 7, 8, 9, 10],
[ 8, 9, 10, 11],
[ 9, 10, 11, 12],
[10, 11, 12, 13],
[11, 12, 13, 14]])
The only weird thing about this code is that I called list
on the result of chunks(a, 4)
. This is since numpy.array
does not accept an arbitrary iterable, such as the generator chunks
returns. If you just want to iterate over these chunks, you don't need to bother. If you really need to put the result into an array you can do it this way or some more efficient ways.
The efficient NumPy way to do this is given here, which is somewhat too long to reproduce here. It boils down to using some stride tricks, and is much faster than itertools for largish window sizes. For example, using a method essentially the same as that of of Alex Martelli's:
In [16]: def windowed(sequence, length):
seqs = tee(sequence, length)
[ seq.next() for i, seq in enumerate(seqs) for j in xrange(i) ]
return zip(*seqs)
We get:
In [19]: data = numpy.random.randint(0, 2, 1000000)
In [20]: %timeit windowed(data, 2)
100000 loops, best of 3: 6.62 us per loop
In [21]: %timeit windowed(data, 10)
10000 loops, best of 3: 29.3 us per loop
In [22]: %timeit windowed(data, 100)
1000 loops, best of 3: 1.41 ms per loop
In [23]: %timeit segment_axis(data, 2, 1)
10000 loops, best of 3: 30.1 us per loop
In [24]: %timeit segment_axis(data, 10, 9)
10000 loops, best of 3: 30.2 us per loop
In [25]: %timeit segment_axis(data, 100, 99)
10000 loops, best of 3: 30.5 us per loop
You should use stride_tricks
. When I first saw this, the word 'magic' did spring to mind. It's simple and is by far the fastest method.
>>> as_strided = numpy.lib.stride_tricks.as_strided
>>> a = numpy.arange(1,15)
>>> a
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])
>>> b = as_strided(a, (11,4), a.strides*2)
>>> b
array([[ 1, 2, 3, 4],
[ 2, 3, 4, 5],
[ 3, 4, 5, 6],
[ 4, 5, 6, 7],
[ 5, 6, 7, 8],
[ 6, 7, 8, 9],
[ 7, 8, 9, 10],
[ 8, 9, 10, 11],
[ 9, 10, 11, 12],
[10, 11, 12, 13],
[11, 12, 13, 14]])
Be aware that the values in array b
are those in a
, just viewed differently. Do a .copy()
on b
if you plan to modify it.
I saw this at a SciPy conference. Here are the slides for more explanation.