Interpretation vs dynamic dispatch penalty in Python

無奈伤痛  2020-12-01 17:25

I watched Brandon Rhodes' talk about Cython - "The Day of the EXE Is Upon Us".

Brandon mentions at 09:30 that, for a specific short piece of code, skipping the interpretation gives a large speed-up, and skipping the dynamic type dispatch on top of that gives an even larger one. How much of Python's overhead is due to interpretation, and how much to dynamic dispatch?

1 Answer
  • 2020-12-01 18:00

    Let's take a look at this python function:

    def py_fun(i,N,step):
         res=0.0
         while i<N:
            res+=i
            i+=step
         return res
    

    and use IPython magic to time it:

    In [11]: %timeit py_fun(0.0,1.0e5,1.0)
    10 loops, best of 3: 25.4 ms per loop
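
    As an aside, we can look at what the interpreter actually has to execute for py_fun; a quick sketch using the standard dis module (exact opcode names differ between CPython versions):

    # each opcode below is fetched, decoded and dispatched by the ceval loop
    import dis
    dis.dis(py_fun)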
    

    The interpreter runs through this bytecode and interprets it opcode by opcode. However, we can cut the interpreter out by cythonizing the very same code:

    %load_ext Cython
    %%cython
    def cy_fun(i,N,step):
         res=0.0
         while i<N:
            res+=i
            i+=step
         return res
    

    This alone cuts the running time by more than half:

    In [13]: %timeit cy_fun(0.0,1.0e5,1.0)
    100 loops, best of 3: 10.9 ms per loop
    

    When we look into the generated C code, we see that the right functions are called directly, without going through the interpreter loop (ceval). Here it is with the boilerplate stripped away:

    static PyObject *__pyx_pf_4test_cy_fun(CYTHON_UNUSED PyObject *__pyx_self, PyObject *__pyx_v_i, PyObject *__pyx_v_N, PyObject *__pyx_v_step) {
      ...
      while (1) {
        __pyx_t_1 = PyObject_RichCompare(__pyx_v_i, __pyx_v_N, Py_LT); 
        ...
        __pyx_t_2 = __Pyx_PyObject_IsTrue(__pyx_t_1);
        ...
        if (!__pyx_t_2) break;
        ...
        __pyx_t_1 = PyNumber_InPlaceAdd(__pyx_v_res, __pyx_v_i);
        ...
        __pyx_t_1 = PyNumber_InPlaceAdd(__pyx_v_i, __pyx_v_step); 
      }
      ...
      return __pyx_r;
    }
    

    However, this Cython function still handles Python objects and not C-style floats, so inside PyNumber_InPlaceAdd it is still necessary to figure out what these objects really are (integer, float, something else?) and to dispatch the call to the right function that does the job.
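
    To see how generic that call is: the very same in-place add goes through the same protocol for completely different operand types, and the decision is made at run time (a small illustration at the Python level, not part of the benchmark):

    x = 1.0;  x += 2        # float += int: dispatched to float addition
    y = "ab"; y += "cd"     # str += str: same opcode, ends up in string concatenation
    z = [1];  z += [2, 3]   # list += list: in-place extension of the list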

    With the help of Cython we can also eliminate the need for this dispatch and call the addition for doubles directly:

     %%cython
     def c_fun(double i,double N, double step):
          cdef double res=0.0
          while i<N:
             res+=i
             i+=step
          return res
    

    In this version, i, N, step and res are C-style doubles and no longer Python objects. So there is no need to call dispatch functions like PyNumber_InPlaceAdd; the + operator for double can be used directly:

    static PyObject *__pyx_pf_4test_c_fun(CYTHON_UNUSED PyObject *__pyx_self, double __pyx_v_i, double __pyx_v_N, double __pyx_v_step) {
      ...
      __pyx_v_res = 0.0;  
      ... 
      while (1) {
        __pyx_t_1 = ((__pyx_v_i < __pyx_v_N) != 0);
        if (!__pyx_t_1) break;
        __pyx_v_res = (__pyx_v_res + __pyx_v_i);
        __pyx_v_i = (__pyx_v_i + __pyx_v_step);
      }
      ...
      return __pyx_r;
    }
    

    And the result is:

    In [15]: %timeit c_fun(0.0,1.0e5,1.0)
    10000 loops, best of 3: 148 µs per loop
    

    Now, this is a speed-up of roughly a factor of 70 compared to the version without the interpreter but with dispatch.
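
    As a side note: a convenient way to see where such Python-object overhead remains is Cython's annotation view; in the notebook this is just a flag on the magic (a sketch of the same cell as above; the --annotate flag highlights lines that still interact with Python objects):

    %%cython --annotate
    def c_fun(double i, double N, double step):
        cdef double res = 0.0
        while i < N:          # the loop body should show up as plain C, with no Python interaction
            res += i
            i += step
        return res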

    Actually, concluding that dispatch+allocation is the bottleneck here (because eliminating it gave a speed-up of roughly a factor of 70) would be a fallacy: the interpreter is responsible for more than 50% of the running time (about 15 ms), while dispatch and allocation account for "only" about 11 ms.
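
    Put into numbers, using the three timings measured above:

    total     = 25.4    # ms, py_fun: interpreter + dispatch/allocation + arithmetic
    no_interp = 10.9    # ms, cy_fun: dispatch/allocation + arithmetic
    typed     = 0.148   # ms, c_fun : essentially just the arithmetic
    print("interpreter        :", total - no_interp)   # ~14.5 ms
    print("dispatch/allocation:", no_interp - typed)   # ~10.8 ms
    print("arithmetic         :", typed)               # ~0.15 ms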


    However, there are more performance problems than just the interpreter and dynamic dispatch: float is immutable, so every time it "changes" a new object must be created and registered/unregistered with the garbage collector.
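
    This is easy to see at the Python level (a small illustration, not part of the benchmark):

    x = 1.0
    before = id(x)
    x += 1.0            # builds a brand-new float object ...
    id(x) == before     # ... so this is False; the old object is thrown away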

    We can introduce mutable floats, which are changed in place and don't need registering/unregistering:

    %%cython
    cdef class MutableFloat:
        cdef double x
        def __cinit__(self, x):
            self.x = x
        def __iadd__(self, MutableFloat other):
            self.x = self.x + other.x
            return self
        def __lt__(MutableFloat self, MutableFloat other):
            return self.x < other.x
        def __gt__(MutableFloat self, MutableFloat other):
            return self.x > other.x
        def __repr__(self):
            return str(self.x)
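
    A quick sanity check that += now really mutates in place (using the MutableFloat class from the cell above):

    m = MutableFloat(1.0)
    before = id(m)
    m += MutableFloat(2.0)   # __iadd__ updates self.x and returns self
    print(id(m) == before)   # True: no new object was created
    print(m)                 # 3.0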
    

    The timings (I'm now using a different machine, so the timings are a little bit different; cy_fun is the cythonized version of the same four-argument function):

    def py_fun(i,N,step,acc):
            while i<N:
                 acc+=i
                 i+=step
            return acc
    
    %timeit py_fun(1.0, 5e5, 1.0, 0.0)
    30.2 ms ± 1.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    %timeit cy_fun(1.0, 5e5, 1.0, 0.0)
    16.9 ms ± 612 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    %timeit i,N,step,acc=MutableFloat(1.0),MutableFloat(5e5),MutableFloat(1.0),MutableFloat(0.0); py_fun(i,N,step,acc)
    23 ms ± 254 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    %timeit i,N,step,acc=MutableFloat(1.0),MutableFloat(5e5),MutableFloat(1.0),MutableFloat(0.0); cy_fun(i,N,step,acc)
    11 ms ± 66.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    Don't forget to reinitialize i in every run, because it is mutable! The results:

                immutable       mutable
     py_fun       30ms           23ms
     cy_fun       17ms           11ms
    

    So up to 7 ms (about 20%) is spent on registering/unregistering floats (I'm not sure there isn't something else playing a role as well) in the version with the interpreter, and more than 33% in the version without the interpreter.

    As it looks now:

    • about 40% (13 ms / 30 ms) of the time is used by the interpreter
    • up to 33% of the time is used for the dynamic dispatch
    • up to 20% of the time is used for creating/deleting temporary objects
    • about 1% for the arithmetical operations themselves

    Another problem is data locality, which becomes important for memory-bandwidth-bound problems: modern caches work well if the data is processed linearly, one consecutive memory address after another. This is the case when looping over a std::vector<> (or an array.array), but not when looping over a Python list, because such a list consists of pointers which can point to any place in memory.

    Consider the following python scripts:

    #list.py
    N=int(1e7)
    lst=[0]*int(N)
    for i in range(N):
      lst[i]=i
    print(sum(lst)) 
    

    and

    #bytearray.py
    N=int(1e7)
    b=bytearray(8*N)
    m=memoryview(b).cast('L') #reinterpret as an array of unsigned longs
    for i in range(N):
      m[i]=i
    print(sum(m))
    

    Both scripts create 1e7 integers: the first version Python integers, the second lowly C integers, which are placed contiguously in memory.
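
    We can check the compact, contiguous layout of the second variant directly (a small check, assuming the memoryview m from the script above and a typical 64-bit Linux build where 'L' is an 8-byte unsigned long):

    print(m.itemsize, m.nbytes)   # 8 bytes per element, 8*N bytes in one contiguous buffer
    print(m.contiguous)           # True: the elements sit back to back in memory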

    The interesting part is how many D1 cache misses these scripts produce:

    valgrind --tool=cachegrind python list.py 
    ...
    D1  misses:        33,964,276  (   27,473,138 rd   +     6,491,138 wr)
    

    versus

    valgrind --tool=cachegrind python bytearray.py 
    ...
    D1  misses:         4,796,626  (    2,140,357 rd   +     2,656,269 wr)
    

    That means about 8 times more cache misses for the Python integers. Part of this is due to the fact that a Python integer needs more than 8 bytes of memory (probably 32 bytes, i.e. a factor of 4), and part (maybe, not 100% sure, because neighboring integers are created one after another, so the chances are high that they end up next to each other somewhere in memory; further investigation needed) is due to the fact that they aren't laid out contiguously in memory, as the C integers of the bytearray are.
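
    The size difference itself is easy to check (CPython 3 on a 64-bit machine; the exact number can vary between versions):

    import sys
    print(sys.getsizeof(10**6))   # ~28 bytes for a Python int object, vs. 8 bytes per slot in the bytearray/memoryview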
