NumPy performance: uint8 vs. float and multiplication vs. division?

暖寄归人 2021-02-02 14:50

I have just noticed that the execution time of a script of mine nearly halves by only changing a multiplication to a division.

To investigate this, I have written a smal

4 answers
  •  梦谈多话
    2021-02-02 15:22

    The problem is your assumption that you are measuring the time needed for division or multiplication, which is not true. You are measuring the overhead that surrounds a division or multiplication.

    One really has to look at the exact code to explain every effect, and the details can vary from version to version. This answer can only give an idea of what one has to consider.

    The problem is that a simple int is not simple at all in Python: it is a real heap-allocated object with a reference count and a type pointer, and it grows in size with its value. You pay for all of that: for example, an integer that would fit into 8 bits needs 24 bytes of memory (on a 64-bit CPython 2 build; CPython 3 needs slightly more). The same goes for Python floats.
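    The object overhead described above can be checked with sys.getsizeof. This is only a small sketch: the variable names are made up for illustration, and the exact sizes are CPython implementation details that differ between versions and platforms, so they are printed rather than assumed.

```python
import sys

# Sizes are CPython implementation details and differ between versions
# and platforms, so the values are printed rather than assumed.
small_int = 7          # would fit in a single byte as a C integer
big_int = 2 ** 100     # needs more internal digits, so the object grows
some_float = 0.5

int_size = sys.getsizeof(small_int)
big_int_size = sys.getsizeof(big_int)
float_size = sys.getsizeof(some_float)

print("small int :", int_size, "bytes")
print("large int :", big_int_size, "bytes")
print("float     :", float_size, "bytes")
```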

    On the other hand, a NumPy array consists of simple C-style integers/floats without that overhead. You save a lot of memory, but you pay for it whenever you access a single element: a[i] means that a new Python object (a NumPy scalar wrapping the raw value) must be constructed, and only then can it be used. That is a lot of overhead.
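    A small sketch of what element access does, assuming a recent NumPy (the names arr, first and second are made up for illustration): every access hands back a freshly built scalar object, even though the array itself stores one raw byte per element.

```python
import numpy as np

arr = np.array([1, 2, 3], dtype=np.uint8)

first = arr[0]
second = arr[0]

# Each access builds a new NumPy scalar object around the raw byte.
print(type(first))
same_object = first is second
print("same object on repeated access:", same_object)

# The raw storage really is one byte per element.
print("bytes per element:", arr.itemsize)
```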

    Consider this code:

    import numpy as np

    li1 = [x % 256 for x in range(10**4)]
    arr1 = np.array(li1, np.uint8)

    def arrmult(a):
        # multiply every element in place, one Python-level access at a time
        for i in range(len(a)):
            a[i] *= 5

    arrmult(li1) is about 25 times faster than arrmult(arr1), because the integers in the list are already Python ints and don't have to be created. The lion's share of the running time goes into creating objects; everything else is almost negligible.
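    To reproduce this comparison yourself, something like the following could be used. It is a hedged sketch, not the original measurement: best_time and make_container are hypothetical helper names, the input values are kept small so the uint8 products cannot overflow, and the ratio you see will depend on your Python and NumPy versions.

```python
import timeit
import numpy as np

def arrmult(a):
    # Element-wise loop driven from Python, as in the answer above.
    for i in range(len(a)):
        a[i] *= 5

# Values stay below 256 / 5 so the uint8 products cannot overflow.
values = [x % 50 for x in range(10**4)]

def best_time(make_container):
    # Build a fresh container for every run so values never accumulate.
    timer = timeit.Timer(
        stmt="arrmult(data)",
        setup="data = make_container()",
        globals={"arrmult": arrmult, "make_container": make_container},
    )
    return min(timer.repeat(repeat=5, number=1))

t_list = best_time(lambda: list(values))
t_arr = best_time(lambda: np.array(values, dtype=np.uint8))

print("list  : %.6f s" % t_list)
print("uint8 : %.6f s" % t_arr)
print("ratio : %.1f" % (t_arr / t_list))
```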


    Let's take a look at your code, first the multiplication:

    def arrmult2(a):
        ...
        b[i, j] = (b[i, j] + 5) * 0.5
    

    In the case of uint8, the following must happen (I neglect the +5 for simplicity):

    1. a Python int must be created
    2. it must be cast to a float (creating a Python float) in order to do the float multiplication
    3. the result must be cast back to a Python int and/or uint8

    For float32 there is less work to do (the multiplication itself does not cost much): 1. a Python float is created, 2. the result is cast back to float32.

    So the float version should be faster, and it is.
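    The uint8 round trip can be observed on a single element. This is a minimal sketch, assuming a recent NumPy; the names element, product and stored are made up for illustration, and the exact intermediate scalar types depend on the NumPy version, so they are printed rather than stated.

```python
import numpy as np

b = np.array([10, 20, 30], dtype=np.uint8)

element = b[0]                 # step 1: a scalar object is created from the raw byte
product = (element + 5) * 0.5  # step 2: the multiplication happens in floating point
print(type(element), type(product), product)

b[0] = product                 # step 3: storing converts back to uint8, dropping the fraction
stored = int(b[0])
print(b.dtype, stored)
```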


    Now let's take a look at the division:

    def arrdiv2(a):
        ...
        b[i, j] = (b[i, j] + 5)  / 2 
    

    The pitfall here: all operations are integer operations (this holds under Python 2, where / on two integers is integer division; in Python 3 the same effect requires //). So, compared to the multiplication, there is no need to cast to a Python float, and we have less overhead than in the multiplication case. Division is "faster" than multiplication for uint8 in your case.
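    Whether the division stays in integer arithmetic depends on the operator and the Python version. A short sketch for Python 3 with a recent NumPy (the variable names are illustrative only):

```python
import numpy as np

value = np.uint8(15)

floor_div = value // 2   # integer division: the result stays an integer
true_div = value / 2     # Python 3 true division: the result is a float

print(type(floor_div), floor_div)
print(type(true_div), true_div)
```

    Under Python 3 only the // form avoids the detour through a float, so that is the one to time if you want to reproduce the effect described here.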

    However, division and multiplication are equally fast/slow for float32, because almost nothing has changed in this case - we still need to create a python-float.


    Now the vectorized versions: they work with C-style "raw" float32s/uint8s under the hood, without conversion (and its cost!) to the corresponding Python objects. To get meaningful results you should increase the number of iterations (right now the running time is too short to say anything with certainty).

    1. Division and multiplication for float32 could have the same running time, because I would expect NumPy to replace the division by 2 with a multiplication by 0.5 (but to be sure one has to look into the code).

    2. Multiplication for uint8 should be slower, because every uint8 integer must be cast to a float before the multiplication by 0.5 and then cast back to uint8 afterwards.

    3. For the uint8 case, NumPy cannot replace the division by 2 with a multiplication by 0.5, because it is an integer division. Integer division is slower than float multiplication on a lot of architectures, so this is the slowest vectorized operation.
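    The three expectations above can be checked with a small timing harness. This is a sketch under assumptions: the array size, the repeat counts and the names u8, f32, cases and timings are arbitrary choices, // is used for the integer division (Python 3), and the uint8 multiplication is explicitly cast back to uint8 to mirror storing the result in a uint8 array. No particular ordering of the timings is claimed; it depends on hardware and NumPy version.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
# Values stay below 200 so that adding 5 cannot overflow uint8.
u8 = rng.integers(0, 200, size=(512, 512)).astype(np.uint8)
f32 = u8.astype(np.float32)

cases = {
    "uint8   * 0.5": lambda: ((u8 + 5) * 0.5).astype(np.uint8),
    "uint8   // 2 ": lambda: (u8 + 5) // 2,
    "float32 * 0.5": lambda: (f32 + 5) * 0.5,
    "float32 / 2  ": lambda: (f32 + 5) / 2,
}

timings = {}
for name, func in cases.items():
    # Take the best of several repeats to reduce noise.
    timings[name] = min(timeit.repeat(func, repeat=5, number=50))
    print("%s : %.4f s" % (name, timings[name]))
```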


    PS: I would not dwell too much on the cost of multiplication vs. division; there are too many other things that can have a bigger impact on performance. For example, creating unnecessary temporary objects, or, if the NumPy array is large and does not fit into the cache, memory access becomes the bottleneck and you will see no difference between multiplication and division at all.
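    As an illustration of the temporary-object point, here is one possible way to avoid intermediate arrays by reusing a preallocated buffer through the out argument of NumPy ufuncs. It is a sketch, not a recommendation for every case; the names a, buf and with_temporaries are made up for the example.

```python
import numpy as np

a = np.arange(12, dtype=np.float32).reshape(3, 4)

# Straightforward expression: (a + 5) allocates one temporary array and
# the multiplication allocates a second array for the result.
with_temporaries = (a + 5) * 0.5

# Same arithmetic reusing one preallocated buffer via the `out` argument.
buf = np.empty_like(a)
np.add(a, 5, out=buf)
np.multiply(buf, 0.5, out=buf)

print(np.array_equal(with_temporaries, buf))
```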
