Demystifying sharedctypes performance

后端 未结 3 2106
自闭症患者
自闭症患者 2021-02-13 05:36

In python it is possible to share ctypes objects between multiple processes. However I notice that allocating these objects seems to be extremely expensive.

Consider fol

3条回答
  •  梦如初夏
    2021-02-13 05:52

    Sample Code

    I rewrote your sample code a little bit to look into this issue. Here's where I landed, I'll use it in my answer below:

    so.py:

    from multiprocessing import sharedctypes as sct
    import ctypes as ct
    import numpy as np
    
    n = 100000
    l = np.random.randint(0, 10, size=n)
    
    
    def sct_init():
        sh = sct.RawArray(ct.c_int, l)
        return sh
    
    def sct_subscript():
        sh = sct.RawArray(ct.c_int, n)
        sh[:] = l
        return sh
    
    def ct_init():
        sh = (ct.c_int * n)(*l)
        return sh
    
    def ct_subscript():
        sh = (ct.c_int * n)(n)
        sh[:] = l
        return sh
    

    Note that I added two test cases that do not use shared memory (and use regular a ctypes array instead).

    timer.py:

    import traceback
    from timeit import timeit
    
    for t in ["sct_init", "sct_subscript", "ct_init", "ct_subscript"]:
        print(t)
        try:
            print(timeit("{0}()".format(t), setup="from so import {0}".format(t), number=100))
        except Exception as e:
            print("Failed:", e)
            traceback.print_exc()
        print
    
    print()
    
    print ("Test",)
    from so import *
    sh1 = sct_init()
    sh2 = sct_subscript()
    
    for i in range(n):
        assert sh1[i] == sh2[i]
    print("OK")
    

    Test results

    The results from running the above code using Python 3.6a0 (specifically 3c2fbdb) are:

    sct_init
    2.844902500975877
    sct_subscript
    0.9383537038229406
    ct_init
    2.7903486443683505
    ct_subscript
    0.978101353161037
    
    Test
    OK
    

    What's interesting is that if you change n, the results scale linearly. For example, using n = 100000 (10 times bigger), you get something that's pretty much 10 times slower:

    sct_init
    30.57974253082648
    sct_subscript
    9.48625904135406
    ct_init
    30.509132395964116
    ct_subscript
    9.465419146697968
    
    Test
    OK
    

    Speed difference

    In the end, the speed difference lies in the hot loop that is called to initialize the array by copying every single value over from the Numpy array (l) to the new array (sh). This makes sense, because as we noted speed scales linearly with array size.

    When you pass the Numpy array as a constructor argument, the function that does this is Array_init. However, if you assign using sh[:] = l, then it's Array_ass_subscript that does the job.

    Again, what matters here are the hot loops. Let's look at them.

    Array_init hot loop (slower):

    for (i = 0; i < n; ++i) {
        PyObject *v;
        v = PyTuple_GET_ITEM(args, i);
        if (-1 == PySequence_SetItem((PyObject *)self, i, v))
            return -1;
    }
    

    Array_ass_subscript hot loop (faster):

    for (cur = start, i = 0; i < otherlen; cur += step, i++) {
        PyObject *item = PySequence_GetItem(value, i);
        int result;
        if (item == NULL)
            return -1;
        result = Array_ass_item(myself, cur, item);
        Py_DECREF(item);
        if (result == -1)
            return -1;
    }
    

    As it turns out, the majority of the speed difference lies in using PySequence_SetItem vs. Array_ass_item.

    Indeed, if you change the code for Array_init to use Array_ass_item instead of PySequence_SetItem (if (-1 == Array_ass_item((PyObject *)self, i, v))), and recompile Python, the new results become:

    sct_init
    11.504781467840075
    sct_subscript
    9.381130554247648
    ct_init
    11.625461496878415
    ct_subscript
    9.265848568174988
    
    Test
    OK
    

    Still a bit slower, but not by much.

    In other words, most of the overhead is caused by a slower hot loop, and mostly caused by the code that PySequence_SetItem wraps around Array_ass_item.

    This code might appear like little overhead at first read, but it really isn't.

    PySequence_SetItem actually calls into the entire Python machinery to resolve the __setitem__ method and call it.

    This eventually resolves in a call to Array_ass_item, but only after a large number of levels of indirection (which a direct call to Array_ass_item would bypass entirely!)

    Going through the rabbit hole, the call sequence looks a bit like this:

    • s->ob_type->tp_as_sequence->sq_ass_item points to slot_sq_ass_item.
    • slot_sq_ass_item calls into call_method.
    • call_method calls into PyObject_Call
    • And on and on until we eventually get to Array_ass_item..!

    In other words, we have C code in Array_init that's calling Python code (__setitem__) in a hot loop. That's slow.

    Why ?

    Now, why does Python use PySequence_SetItem in Array_init and not Array_ass_item in Array_init?

    That's because if it did, it would be bypassing the hooks that are exposed to the developer in Python-land.

    Indeed, you can intercept calls to sh[:] = ... by subclassing the array and overriding __setitem__ (__setslice__ in Python 2). It will be called once, with a slice argument for the index.

    Likewise, defining your own __setitem__ also overrides the logic in the constructor. It will be called N times, with an integer argument for the index.

    This means that if Array_init directly called into Array_ass_item, then you would lose something: __setitem__ would no longer be called in the constructor, and you wouldn't be able to override the behavior anymore.

    Now can we try to retain the faster speed all the while still exposing the same Python hooks?

    Well, perhaps, using this code in Array_init instead of the existing hot loop:

     return PySequence_SetSlice((PyObject*)self, 0, PyTuple_GET_SIZE(args), args);
    

    Using this will call into __setitem__ once with a slice argument (on Python 2, it would call into __setslice__). We still go through the Python hooks, but we only do it once instead of N times.

    Using this code, the performance becomes:

    sct_init
    12.24651838419959
    sct_subscript
    10.984305887017399
    ct_init
    12.138383641839027
    ct_subscript
    11.79078131634742
    
    Test
    OK
    

    Other overhead

    I think the rest of the overhead may be due to the tuple instantiation that takes place when calling __init__ on the array object (note the *, and the fact that Array_init expects a tuple for args) — this presumably scales with n as well.

    Indeed, if you replace sh[:] = l with sh[:] = tuple(l) in the test case, then the performance results become almost identical. With n = 100000:

    sct_init
    11.538272527977824
    sct_subscript
    10.985187001060694
    ct_init
    11.485244687646627
    ct_subscript
    10.843198659364134
    
    Test
    OK
    

    There's probably still something smaller going on, but ultimately we're comparing two substantially different hot loops. There's simply little reason to expect them to have identical performance.

    I think it might be interesting to try calling Array_ass_subscript from Array_init for the hot loop and see the results, though!

    Baseline speed

    Now, to your second question, regarding allocating shared memory.

    Note that there isn't really a cost to allocating shared memory. As noted in the results above, there isn't a substantial difference between using shared memory or not.

    Looking at the Numpy code (np.arange is implemented here), we can finally understand why it's so much faster than sct.RawArray: np.arange doesn't appear to make calls to Python "user-land" (i.e. no call to PySequence_GetItem or PySequence_SetItem).

    That doesn't necessarily explain all the difference, but you'd probably want to start investigating there.

提交回复
热议问题