NumPy performance: uint8 vs. float and multiplication vs. division?

暖寄归人 2021-02-02 14:50

I have just noticed that the execution time of a script of mine nearly halves by only changing a multiplication to a division.

To investigate this, I have written a smal

4 Answers
  • 2021-02-02 15:13

    The very first operation you time will typically take longer, because nothing is "warmed up" yet (e.g. memory allocated, caches filled).

    See the same effect using the reverse order of dividing and multiplying:

    >>> print_time("arrdiv", timeit.timeit("arrdiv(arr2)", "from __main__ import arrdiv, arr2", number=timeit_iterations))
    >>> print_time("arrmult", timeit.timeit("arrmult(arr2)", "from __main__ import arrmult, arr2", number=timeit_iterations))
    
    arrdiv:  3.2630s
    arrmult:  2.5873s
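
    One way to reduce the warm-up effect is to run the code once untimed and then take the best of several repeated measurements. This is only a minimal sketch: work() is a hypothetical stand-in for arrmult/arrdiv from the question, and the repeat counts are arbitrary.

```python
import timeit

def work():
    # Hypothetical stand-in workload; replace with the function under test.
    return sum(i * i for i in range(1000))

work()  # one untimed call so allocation and caching happen before timing

# Five independent measurements of 200 calls each.
runs = timeit.repeat(work, repeat=5, number=200)
print("best of 5: %.4fs" % min(runs))
```

    Taking the minimum rather than the first measurement makes the result independent of the order in which the functions are timed.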
    
  • 2021-02-02 15:14

    This answer only looks at vectorised operations, as the reason for the other operations being slow has been answered by ead.

    A lot of "optimisations" are based on old hardware. The assumptions under which those optimisations held true on older hardware do not hold true on newer hardware.

    Pipelines and division

    Division is slow. A division consists of several steps, each handled by its own unit, and these steps have to run one after another. This is what makes division slow.

    However, in a floating-point processing unit (FPU) [common on most modern CPUs] there are dedicated units arranged in a "pipeline" for the division instruction. Once a unit is done, that unit isn't needed for the rest of the operation. If you have several division operations, the units that have nothing left to do can get started on the next division. So though each operation is slow, the FPU can actually achieve a high throughput of division operations. Pipelining isn't the same as vectorisation, but the results are mostly the same -- higher throughput when you have lots of the same operations to do.

    Think of pipelining like traffic. Compare three lanes of traffic moving at 30 mph versus one lane of traffic moving at 90 mph. The slower traffic is definitely slower individually, but the three-lane road still has the same throughput.
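
    The latency-versus-throughput point can be put into numbers with a toy model. This is a sketch of an idealised pipeline, not of any real CPU: the latency of 20 cycles and the issue interval of 1 cycle are made-up values chosen for illustration.

```python
def unpipelined_cycles(n_ops, latency):
    # Each operation must finish before the next one starts.
    return n_ops * latency

def pipelined_cycles(n_ops, latency, issue_interval=1):
    # The first result arrives after `latency` cycles; every further
    # operation finishes `issue_interval` cycles after the previous one.
    if n_ops == 0:
        return 0
    return latency + (n_ops - 1) * issue_interval

n_ops, latency = 1000, 20   # illustrative numbers only
print(unpipelined_cycles(n_ops, latency))  # 20000
print(pipelined_cycles(n_ops, latency))    # 1019
```

    A single operation is just as slow in both models; only a long run of independent operations benefits.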

  • 2021-02-02 15:22

    The problem is your assumption that you are measuring the time needed for division or multiplication, which is not true. You are measuring the overhead that surrounds a division or multiplication.

    One really has to look at the exact code to explain every effect, and that can vary from version to version. This answer can only give an idea of what one has to consider.

    The problem is that a simple int is not simple at all in Python: it is a full object that has to be allocated and managed by the interpreter, and it grows in size with its value. You pay for all of that: for example, 24 bytes of memory are needed for a value that would fit into 8 bits! The same goes for Python floats.

    On the other hand, a NumPy array consists of simple C-style integers/floats without that overhead. You save a lot of memory, but you pay for it when you access an element of the array. a[i] means: a Python-level integer object must be constructed and handed to the interpreter, and only then can it be used - there is a lot of overhead.
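
    The size difference can be checked directly. A small sketch, assuming CPython and a reasonably recent NumPy; the exact byte counts depend on the interpreter version and platform, so none are hard-coded here. Note that in current NumPy the object created on access is a NumPy scalar such as numpy.uint8 rather than a built-in int, but the cost of creating an object per access is the same idea.

```python
import sys
import numpy as np

arr = np.arange(10, dtype=np.uint8)

# Storage per element inside the array: one raw byte.
print("array itemsize:", arr.itemsize)

# Storage for one Python int object (implementation-dependent).
print("Python int size:", sys.getsizeof(5))

# Indexing builds a new scalar object around the raw byte.
elem = arr[3]
print("type on access:", type(elem).__name__)
print("scalar object size:", sys.getsizeof(elem))
```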

    Consider this code:

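    import numpy as np

    # The same values once as a Python list and once as a uint8 array.
    li1 = [x % 256 for x in range(10**4)]
    arr1 = np.array(li1, np.uint8)

    def arrmult(a):
        # Multiply every element in place, one Python-level access at a time.
        for i in range(len(a)):
            a[i] *= 5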
    

    arrmult(li1) is 25 times faster than arrmult(arr1), because the integers in the list are already Python ints and don't have to be created! The lion's share of the calculation time is spent creating the objects - everything else is almost negligible.
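
    To reproduce this comparison, something like the following sketch could be used. The names scale_in_place and best_time are made up for this example, and the values are kept below 50 (an assumption that differs from the snippet above) so that multiplying by 5 never overflows uint8. The actual ratio will depend on the machine and on the Python and NumPy versions.

```python
import timeit
import numpy as np

values = [x % 50 for x in range(10**4)]   # small values: x * 5 still fits in uint8

def scale_in_place(a):
    # Element-by-element loop: every a[i] goes through the interpreter.
    for i in range(len(a)):
        a[i] *= 5

def best_time(make_input, repeats=5):
    # The loop mutates its argument, so each run gets a fresh input from setup.
    timer = timeit.Timer(
        "scale_in_place(a)",
        setup="a = make_input()",
        globals={"scale_in_place": scale_in_place, "make_input": make_input},
    )
    return min(timer.repeat(repeat=repeats, number=1))

t_list = best_time(lambda: list(values))
t_arr = best_time(lambda: np.array(values, np.uint8))
print("list: %.4fs   uint8 array: %.4fs" % (t_list, t_arr))
```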


    Let's take a look at your code, first the multiplication:

    def arrmult2(a):
        ...
        b[i, j] = (b[i, j] + 5) * 0.5
    

    In the case of the uint8 the following must happen (I neglect +5 for simplicity):

    1. a python-int must be created
    2. it must be cast to a float (python-float creation) in order to be able to do the float multiplication
    3. and cast back to a python-int and/or uint8

    For float32, there is less work to do (multiplication does not cost much): 1. a python-float is created, 2. it is cast back to float32.

    So the float-version should be faster and it is.
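
    The conversions can be made visible by doing one element by hand. This is a sketch with made-up one-element arrays b and f; the exact scalar types printed depend on the NumPy version, so only the int-versus-float distinction matters here.

```python
import numpy as np

# uint8 case: integer element -> float intermediate -> truncated back to uint8.
b = np.array([10], dtype=np.uint8)
x = b[0]                 # an integer scalar object is created
y = (x + 5) * 0.5        # multiplying by 0.5 yields a floating-point scalar
b[0] = y                 # storing converts back to uint8 (7.5 becomes 7)
print(type(x).__name__, type(y).__name__, b[0])

# float32 case: the value is a float all the way through.
f = np.array([10], dtype=np.float32)
z = (f[0] + 5) * 0.5
f[0] = z
print(type(z).__name__, f[0])
```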


    Now let's take a look at the division:

    def arrdiv2(a):
        ...
        b[i, j] = (b[i, j] + 5)  / 2 
    

    The pitfall here: all operations are integer operations (with Python 2 semantics, "/" on two integers is an integer division). So, compared to the multiplication, there is no need to cast to a python-float, and we have less overhead than in the multiplication case. That is why division is "faster" than multiplication for uint8 in your case.

    However, division and multiplication are equally fast/slow for float32, because almost nothing has changed in this case - we still need to create a python-float.
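
    Keep in mind that this depends on what "/" does with integers. Under Python 3, "/" is always a true division and "//" is the integer one, as this small sketch shows (the exact scalar types depend on the NumPy version):

```python
import numpy as np

x = np.uint8(15)

true_div = x / 2     # Python 3: true division, gives a floating-point value
floor_div = x // 2   # floor division stays in the integer domain

print(type(true_div).__name__, true_div)
print(type(floor_div).__name__, floor_div)
```

    So on Python 3 the uint8 version of arrdiv2 would need "//" to avoid the float conversion described above.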


    Now the vectorized versions: they work with C-style "raw" float32s/uint8s without conversion (and its cost!) to the corresponding Python objects under the hood. To get meaningful results you should increase the number of iterations (right now the running time is too small to say anything with certainty).

    1. Division and multiplication for float32 could have the same running time, because I would expect NumPy to replace the division by 2 with a multiplication by 0.5 (but to be sure one has to look into the code).

    2. Multiplication for uint8 should be slower, because every uint8 integer must be cast to a float prior to the multiplication by 0.5 and then cast back to uint8 afterwards.

    3. For the uint8 case, NumPy cannot replace the division by 2 with a multiplication by 0.5, because it is an integer division. Integer division is slower than float multiplication on a lot of architectures - this is the slowest vectorized operation.
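
    These expectations are easy to test. The sketch below is not the code from the question (the array size, the repeat counts and the names are my own choices); it prints the result dtypes, which show where a float conversion takes place, and then times the four vectorized cases. No timings are quoted here because they vary between machines and NumPy versions.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
arr_u8 = rng.integers(0, 256, size=10**6, dtype=np.uint8)
arr_f32 = arr_u8.astype(np.float32)

# Result dtypes reveal the conversions.
print((arr_u8 * 0.5).dtype)    # a float type: every uint8 is converted first
print((arr_u8 // 2).dtype)     # stays uint8: a real integer division
print((arr_f32 * 0.5).dtype)   # float32 in, float32 out
print((arr_f32 / 2).dtype)     # float32 in, float32 out

cases = {
    "uint8   * 0.5 (cast back)": lambda: (arr_u8 * 0.5).astype(np.uint8),
    "uint8   // 2": lambda: arr_u8 // 2,
    "float32 * 0.5": lambda: arr_f32 * 0.5,
    "float32 / 2": lambda: arr_f32 / 2,
}

timings = {}
for label, fn in cases.items():
    # Best of 5 measurements, 20 calls each.
    timings[label] = min(timeit.repeat(fn, repeat=5, number=20))
    print("%-26s %.4fs" % (label, timings[label]))
```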


    PS: I would not dwell too much on the cost of multiplication vs. division - there are too many other things that can have a bigger impact on performance. For example, creating unnecessary temporary objects. Or, if the NumPy array is large and does not fit into the cache, memory access will be the bottleneck - then you will see no difference between multiplication and division at all.
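
    As an illustration of the temporaries, here is a sketch that computes (a + 5) * 0.5 once the usual way and once writing into an existing buffer through the out= argument of the ufuncs. The array a is a made-up example; whether this pays off depends on the array size.

```python
import numpy as np

a = np.arange(8, dtype=np.float32)

# Usual expression: a + 5 allocates a temporary, the product a second array.
b = (a + 5) * 0.5

# Same computation reusing one buffer for both steps.
c = a.copy()
np.add(c, 5, out=c)
np.multiply(c, 0.5, out=c)

print(np.array_equal(b, c))
```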

  • 2021-02-02 15:22

    It's because you multiply an int by a float and store the result as an int. Try your arr_mult and arr_div tests with different integer or float values for the multiplication / division. In particular, compare multiplying by '2' and multiplying by '2.' (with the trailing dot, i.e. a float).
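
    A quick way to see the difference, as a sketch with a made-up three-element array:

```python
import numpy as np

arr = np.array([1, 2, 3], dtype=np.uint8)

by_int = arr * 2      # integer factor: the result stays uint8
by_float = arr * 2.   # float factor: the result is a float array

print(by_int.dtype, by_float.dtype)

# Storing the float result as uint8 again needs one more conversion.
back = by_float.astype(np.uint8)
print(back)
```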
