Let's say I want to implement NumPy's
x[:] += 1
in Cython. I could write
@cython.boundscheck(False)
@cython.wraparoundcheck(F
http://docs.scipy.org/doc/numpy/reference/arrays.nditer.html is a nice tutorial on using nditer. It ends with a Cython version. nditer is meant to be the all-purpose iterator for one or more arrays in NumPy C-level code.
There are also good array examples on the Cython memoryview page:
http://cython.readthedocs.io/en/latest/src/userguide/memoryviews.html
The C-level iterator API is documented at:
http://docs.scipy.org/doc/numpy/reference/c-api.iterator.html
The data buffer of an ndarray is a flat buffer. So regardless of the array's shape and strides, you can iterate over the whole buffer in a flat C pointer fashion. But things like nditer and memoryview take care of the element size details. So in C-level coding it is actually easier to step through all the elements in a flat fashion than it is to step by rows; going by rows has to take the row stride into account.
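To make the flat-buffer point concrete, here is a minimal Python sketch. It assumes a small C-contiguous int64 array (my choice for illustration, not something from the question), where a 1-d view of the buffer reaches every element without any row bookkeeping.

```python
import numpy as np

# Assumption for this sketch: a small C-contiguous int64 array.
a = np.arange(6, dtype=np.int64).reshape(2, 3)

# Row stride is 3 elements * 8 bytes; element stride is 8 bytes.
print(a.strides)

# reshape(-1) on a contiguous array is a view onto the same flat buffer,
# so one 1-d loop visits every element with no row stride to track.
flat = a.reshape(-1)
for k in range(flat.shape[0]):
    flat[k] += 1

# By default (order='K') nditer walks even the transposed view in memory
# order, so the values come out in flat-buffer order.
visited = [int(v) for v in np.nditer(a.T)]
print(a)
print(visited)
```

For a non-contiguous view the flat buffer also holds elements that are not part of the view, which is where nditer earns its keep.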
This runs in Python, and I think it will translate nicely into Cython (I don't have that set up on my machine at the moment):
import numpy as np

def add1(x):
    it = np.nditer([x], op_flags=[['readwrite']])
    for i in it:
        i[...] += 1
    return it.operands[0]

x = np.arange(10).reshape(2,5)
y = add1(x)
print(x)
print(y)
https://github.com/hpaulj/numpy-einsum/blob/master/sop.pyx is a sum-of-products script that I wrote a while back to simulate einsum. The core of its w = sop(x,y) calculation is:
it = np.nditer(ops, flags, op_flags, op_axes=op_axes, order=order)
it.operands[nop][...] = 0
it.reset()
for xarr, yarr, warr in it:
    x = xarr
    y = yarr
    w = warr
    size = x.shape[0]
    for i in range(size):
        w[i] = w[i] + x[i] * y[i]
return it.operands[nop]
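The same zero-then-reset-then-accumulate pattern can be tried in plain Python. The sketch below is only an illustration: the matrix-vector shapes, the op_axes values and the name sop_matvec are my assumptions, not taken from sop.pyx, and it follows the reduction example in the nditer tutorial.

```python
import numpy as np

def sop_matvec(x, y):
    """Sum of products w[i] = sum_j x[i, j] * y[j] via nditer (hypothetical helper)."""
    it = np.nditer(
        [x, y, None],
        flags=['reduce_ok'],
        op_flags=[['readonly'], ['readonly'], ['readwrite', 'allocate']],
        # Iterator dims are (i, j): x uses both, y only j, w only i.
        op_axes=[[0, 1], [-1, 0], [0, -1]],
    )
    with it:
        # Zero the allocated output, then restart the iteration.
        it.operands[2][...] = 0
        it.reset()
        for xv, yv, wv in it:
            wv[...] += xv * yv
        return it.operands[2]

x = np.arange(6).reshape(2, 3)
y = np.arange(3)
w = sop_matvec(x, y)
print(w)
print(x @ y)
```

The two printed results should agree, since the iterator accumulates over the j axis exactly as the matrix-vector product does.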
===================
Copying ideas from the nditer.html document, I got a version of add1 that is only half the speed of the native NumPy a+1. The naive nditer (above) isn't much faster in Cython than in Python. A lot of the speedup may be due to the external loop.
cimport cython
cimport numpy as np
import numpy as np

@cython.boundscheck(False)
def add11(arg):
    cdef np.ndarray[double] x
    cdef int size
    cdef double value
    it = np.nditer([arg],
            flags=['external_loop','buffered'],
            op_flags=[['readwrite']])
    # each xarr is a 1-d chunk; the inner loop runs over its elements
    for xarr in it:
        x = xarr
        size = x.shape[0]
        for i in range(size):
            x[i] = x[i]+1.0
    return it.operands[0]
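To see what the external loop changes, here is a small pure-Python sketch. The helper name add1_chunks and the 10x10 test array are made up for illustration, and buffering is left off so each chunk is a direct view into the array.

```python
import numpy as np

def add1_chunks(arr):
    """Add 1 in place, one 1-d chunk at a time; return the number of chunks."""
    it = np.nditer([arr], flags=['external_loop'], op_flags=[['readwrite']])
    n_chunks = 0
    with it:
        for chunk in it:
            # chunk is a 1-d view, so this is one vectorized add per chunk
            chunk[...] += 1
            n_chunks += 1
    return n_chunks

a = np.zeros((10, 10))
b = a[1:-1, 1:-1]           # non-contiguous 8x8 view of a

chunks_b = add1_chunks(b)   # one chunk per row of the view
chunks_a = add1_chunks(a)   # contiguous array collapses to a single chunk
print(chunks_b, chunks_a)
```

With the external loop, the Python-level iteration count drops from one per element to one per contiguous run, which is where the inner typed loop in Cython gets its chance to run at C speed.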
I also coded the add11 nditer in Python with a print of size, and found that it iterated on your b with 78 blocks of size 8192. That's a buffer size, not some characteristic of b and its data layout.
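A rough way to repeat that size check is sketched below. It is an assumption-laden version: the operand is opened read-only so nothing is written, and the exact chunk sizes may differ between NumPy versions, so no expected output is shown.

```python
import numpy as np

a = np.zeros((1000, 1000))
b = a[100:-100, 100:-100]   # 800x800 non-contiguous view, as in the timings

# Read-only here (an assumption of this sketch); add11 uses 'readwrite'.
it = np.nditer([b], flags=['external_loop', 'buffered'],
               op_flags=[['readonly']])

# Record the length of every 1-d chunk the iterator hands out.
sizes = [chunk.shape[0] for chunk in it]

print(len(sizes), max(sizes), sum(sizes))
```

Whatever the individual chunk sizes turn out to be, they should add up to b.size, which is 640000 here.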
In [15]: a = np.zeros((1000, 1000))
In [16]: b = a[100:-100, 100:-100]
In [17]: timeit add1.add11(b)
100 loops, best of 3: 4.48 ms per loop
In [18]: timeit b[:] += 1
100 loops, best of 3: 8.76 ms per loop
In [19]: timeit add1.add1(b) # for the unbuffered nditer
1 loop, best of 3: 3.1 s per loop
In [21]: timeit add1.add11(a)
100 loops, best of 3: 5.44 ms per loop