Why is numpy.any so slow over large arrays?


I'm looking for the most efficient way to determine whether a large array contains at least one nonzero value. At first glance np.any seems like the obvious tool for the job, but it is unexpectedly slow over large arrays.

1 Answer

    As has been guessed in the comments, I can confirm that the array is processed in chunks. First I will show where this happens in the code, and then I will show how you can change the chunk size and what effect doing so has on your benchmark.

    Where to find the reduction processing in the Numpy source files

    np.all(x) is the same as x.all(), and x.all() in turn calls np.core.umath.logical_and.reduce(x). (Likewise, np.any maps to np.core.umath.logical_or.reduce.)
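
    A quick way to check this equivalence yourself (a minimal example, using the public np.logical_and/np.logical_or spellings of the same ufuncs):

    import numpy as np

    x = np.zeros(1000, dtype=bool)

    # All of these dispatch to the same ufunc reduction machinery:
    print(np.all(x))                      # False
    print(x.all())                        # False
    print(np.logical_and.reduce(x))       # False

    # The any()/logical_or pair behaves the same way:
    print(np.any(x) == np.logical_or.reduce(x))   # True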

    If you want to dig into the numpy source, I will try to guide you through finding that a buffer/chunk size is used. The folder with all of the code we will be looking at is numpy/core/src/umath/.

    PyUFunc_Reduce() in ufunc_object.c is the C function that handles the reduce. In PyUFunc_Reduce(), the chunk (or buffer) size is found by looking up the value for reduce in a global dictionary via the PyUFunc_GetPyValues() function (also in ufunc_object.c). On my machine, compiling from the development branch, the chunk size is 8192. PyUFunc_ReduceWrapper() in reduction.c is then called to set up the iterator (with a stride equal to the chunk size), and it calls the passed-in loop function, reduce_loop() in ufunc_object.c.

    reduce_loop() basically just drives the iterator and calls an innerloop() function for each chunk. The innerloop functions live in loops.c.src; for a boolean array and our case of all/logical_and, the appropriate one is BOOL_logical_and. You can find it by searching for BOOLEAN LOOPS; it is the second function below that comment (it is hard to spot because of the template-like code generation used in this file). There you will see that short-circuiting is in fact done, but only within each chunk.
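
    In outline, the behavior looks like this rough Python sketch (chunked_all and bufsize are illustrative names, not NumPy API; the real work happens in C):

    import numpy as np

    def chunked_all(x, bufsize=8192):
        # The iterator feeds the inner loop one buffer-sized chunk at
        # a time. BOOL_logical_and short-circuits *within* a chunk,
        # but every chunk is still set up and visited, so the
        # per-chunk overhead grows with the number of chunks.
        result = True
        for start in range(0, x.size, bufsize):
            chunk = x[start:start + bufsize]   # one iterator/buffer step
            for value in chunk:                # the innerloop
                if not value:
                    result = False
                    break                      # short-circuit this chunk only
        return result

    This is why making the buffer as large as the array (next section) flattens the timings: the reduction is then left with a single chunk.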

    How to change the buffer size used in ufuncs (and thus in any/all)

    You can get the chunk/buffer size with np.getbufsize(). For me, that returns 8192 without setting it manually, which matches what I found by printing the buffer size in the code. You can use np.setbufsize() to change the chunk size.
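
    For example (the validation limits noted in the comments reflect the numpy source as of this writing and may differ between versions):

    import numpy as np

    print(np.getbufsize())        # 8192 by default here

    old = np.setbufsize(10**6)    # setbufsize returns the previous size
    # ... run the reduction with the larger buffer ...
    np.setbufsize(old)            # restore the old size

    # setbufsize validates its argument: it must be a multiple of 16
    # and no larger than 10e6, otherwise it raises ValueError.
    try:
        np.setbufsize(10**8)
    except ValueError as e:
        print(e)                  # complains that the size is too big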

    Results using a bigger buffer size

    I changed your benchmark code to the following:

    import timeit
    import numpy as np

    # (Python 2 syntax, matching the numpy 1.8 development version used below.)
    print 'Numpy v%s' % np.version.full_version
    stmt = "np.all(x)"
    for ii in xrange(9):
        # Set the ufunc buffer size to the array size, clamped to
        # [8192, 1E7]; numpy rejects buffer sizes above 1E7.
        setup = "import numpy as np; x = np.zeros(%d, dtype=np.bool); np.setbufsize(%d)" % (10**ii, max(8192, min(10**ii, 10**7)))
        timer = timeit.Timer(stmt, setup)
        n, r = 1, 3
        t = np.min(timer.repeat(r, n))
        # Scale up the loop count until the best-of-r total exceeds 0.2 s.
        while t < 0.2:
            n *= 10
            t = np.min(timer.repeat(r, n))
        t /= n
        if t < 1E-3:
            timestr = "%1.3f us" % (t*1E6)
        elif t < 1:
            timestr = "%1.3f ms" % (t*1E3)
        else:
            timestr = "%1.3f s" % t
        print "Array size: 1E%i, %i loops, best of %i: %s/loop" % (ii, n, r, timestr)
    

    Numpy doesn't like the buffer size being too small or too big, so I made sure it didn't drop below 8192 or exceed 1E7; Numpy rejected a buffer size of 1E8. Otherwise, I set the buffer size to the size of the array being processed. I only went up to arrays of 1E8 elements because my machine has only 4GB of memory at the moment. Here are the results:

    Numpy v1.8.0.dev-2a5c2c8
    Array size: 1E0, 100000 loops, best of 3: 5.351 us/loop
    Array size: 1E1, 100000 loops, best of 3: 5.390 us/loop
    Array size: 1E2, 100000 loops, best of 3: 5.366 us/loop
    Array size: 1E3, 100000 loops, best of 3: 5.360 us/loop
    Array size: 1E4, 100000 loops, best of 3: 5.433 us/loop
    Array size: 1E5, 100000 loops, best of 3: 5.400 us/loop
    Array size: 1E6, 100000 loops, best of 3: 5.397 us/loop
    Array size: 1E7, 100000 loops, best of 3: 5.381 us/loop
    Array size: 1E8, 100000 loops, best of 3: 6.126 us/loop
    

    There is a small uptick in the last timing because multiple chunks have to be processed there, due to the cap on how big the buffer can be: a 1E8-element array with a 1E7-element buffer is handled in 10 chunks rather than one.
