I'm looking for the most efficient way to determine whether a large array contains at least one nonzero value. At first glance np.any seems like the obvious tool for the job.
As has been guessed in the comments, I can confirm that the processing of the array is being done in chunks. First, I will show you where things are in the code and then I will show you how you can change the chunk size and the effect that doing so has on your benchmark.
np.all(x) is the same as x.all(), and all() in turn calls np.core.umath.logical_and.reduce(x).
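As a quick sanity check of that equivalence (nothing beyond the public names above is assumed here):

import numpy as np

x = np.zeros(10, dtype=np.bool)
# All three spellings give the same answer; the first two end up
# in the same logical_and reduction internally.
print np.all(x), x.all(), np.core.umath.logical_and.reduce(x)
# -> False False False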
If you want to dig into the numpy source, I will try to guide you through finding that a buffer/chunk size is used. The folder with all of the code we will be looking at is numpy/core/src/umath/.
PyUFunc_Reduce() in ufunc_object.c is the C function that handles the reduce. In PyUFunc_Reduce(), the chunk, or buffer, size is found by looking up the value for reduce in a global dictionary via the PyUFunc_GetPyValues() function (ufunc_object.c). On my machine, compiling from the development branch, the chunk size is 8192. PyUFunc_ReduceWrapper() in reduction.c is called to set up the iterator (with a stride equal to the chunk size), and it calls the passed-in loop function, which is reduce_loop() in ufunc_object.c.
reduce_loop() basically just drives the iterator and calls another innerloop() function for each chunk. The inner loop functions live in loops.c.src. For a boolean array and our case of all/logical_and, the appropriate inner loop function is BOOL_logical_and. You can find it by searching for BOOLEAN LOOPS; it is the second function below that comment (it is hard to spot because of the template-like code generation used there). There you will find that short circuiting is in fact being done within each chunk.
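To make that control flow concrete, here is a rough pure-Python model of the chunked reduce as described above (chunked_all is a hypothetical name; this is a behavioral sketch, not the actual C code):

import numpy as np

def chunked_all(x, bufsize=8192):
    # Model of reduce_loop(): visit the array one buffer-sized chunk at a time.
    result = True
    for start in xrange(0, x.size, bufsize):
        chunk = x[start:start + bufsize]
        # Model of BOOL_logical_and: short circuit within the current chunk.
        chunk_result = True
        for value in chunk:
            if not value:
                chunk_result = False
                break
        result = result and chunk_result
    return result

The short circuit happens inside each chunk; the cost of stepping through the chunks themselves is what shows up in the timings below.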
You can get the chunk/buffer size with np.getbufsize(). For me, that returns 8192 without setting it manually, which matches what I found by printing out the buffer size in the code. You can use np.setbufsize() to change the chunk size.
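For example (np.setbufsize() returns the previous value, so the default is easy to restore):

import numpy as np

print np.getbufsize()       # 8192 by default on my build
old = np.setbufsize(10**6)  # returns the previous buffer size
print np.getbufsize()       # 1000000
np.setbufsize(old)          # restore the default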
I changed your benchmark code to the following:
import timeit
import numpy as np
print 'Numpy v%s' %np.version.full_version
stmt = "np.all(x)"
for ii in xrange(9):
    setup = "import numpy as np; x = np.zeros(%d,dtype=np.bool); np.setbufsize(%d)" %(10**ii, max(8192, min(10**ii, 10**7)))
    timer = timeit.Timer(stmt,setup)
    n,r = 1,3
    t = np.min(timer.repeat(r,n))
    while t < 0.2:
        n *= 10
        t = np.min(timer.repeat(r,n))
    t /= n
    if t < 1E-3:
        timestr = "%1.3f us" %(t*1E6)
    elif t < 1:
        timestr = "%1.3f ms" %(t*1E3)
    else:
        timestr = "%1.3f s" %t
    print "Array size: 1E%i, %i loops, best of %i: %s/loop" %(ii,n,r,timestr)
Numpy doesn't like the buffer size being too small or too big, so I made sure that it didn't get smaller than 8192 or larger than 1E7 (Numpy raises an error for a buffer size of 1E8). Otherwise, the buffer size is set to the size of the array being processed. I only went up to 1E8 elements because my machine only has 4GB of memory at the moment. Here are the results:
Numpy v1.8.0.dev-2a5c2c8
Array size: 1E0, 100000 loops, best of 3: 5.351 us/loop
Array size: 1E1, 100000 loops, best of 3: 5.390 us/loop
Array size: 1E2, 100000 loops, best of 3: 5.366 us/loop
Array size: 1E3, 100000 loops, best of 3: 5.360 us/loop
Array size: 1E4, 100000 loops, best of 3: 5.433 us/loop
Array size: 1E5, 100000 loops, best of 3: 5.400 us/loop
Array size: 1E6, 100000 loops, best of 3: 5.397 us/loop
Array size: 1E7, 100000 loops, best of 3: 5.381 us/loop
Array size: 1E8, 100000 loops, best of 3: 6.126 us/loop
There is a small uptick in the last timing because multiple chunks are being processed there: with the buffer size capped at 1E7, the 1E8-element array is split into ten chunks, so the per-chunk overhead is paid ten times.
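If you want to see the chunk-count effect directly, you can re-time the same array while shrinking the buffer (a hypothetical follow-up in the same Python 2 style; I haven't recorded numbers for it here, and they will vary by machine):

import timeit
import numpy as np

x = np.zeros(10**8, dtype=np.bool)
for bufsize in (8192, 10**5, 10**7):
    np.setbufsize(bufsize)
    # More chunks per call should mean proportionally more per-chunk overhead.
    t = min(timeit.repeat("np.all(x)", "from __main__ import np, x",
                          repeat=3, number=10)) / 10
    print "bufsize=%d: %.3f us/loop" % (bufsize, t * 1E6)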