Demystifying sharedctypes performance

后端 未结 3 2142
南方客
南方客 2021-02-13 05:16

In python it is possible to share ctypes objects between multiple processes. However I notice that allocating these objects seems to be extremely expensive.

Consider fol

相关标签:
3条回答
  • 2021-02-13 05:41

    This should be a comment, but I do not have enough reputation :-(

    Starting with Python 3.5, shared arrays in Linux are created as temp files mapped to memory (see https://bugs.python.org/issue30919). I think this explains why creating a Numpy array, which is created in memory, is faster than creating and initializing a large shared array. To force Python to use shared memory, a workaround is to execute these two lines of code (ref. No space left while using Multiprocessing.Array in shared memory):

    from multiprocessing.process import current_process current_process()._config[‘tempdir’] = ‘/dev/shm’

    0 讨论(0)
  • 2021-02-13 05:45

    Sample Code

    I rewrote your sample code a little bit to look into this issue. Here's where I landed, I'll use it in my answer below:

    so.py:

    from multiprocessing import sharedctypes as sct
    import ctypes as ct
    import numpy as np
    
    n = 100000
    l = np.random.randint(0, 10, size=n)
    
    
    def sct_init():
        sh = sct.RawArray(ct.c_int, l)
        return sh
    
    def sct_subscript():
        sh = sct.RawArray(ct.c_int, n)
        sh[:] = l
        return sh
    
    def ct_init():
        sh = (ct.c_int * n)(*l)
        return sh
    
    def ct_subscript():
        sh = (ct.c_int * n)(n)
        sh[:] = l
        return sh
    

    Note that I added two test cases that do not use shared memory (and use regular a ctypes array instead).

    timer.py:

    import traceback
    from timeit import timeit
    
    for t in ["sct_init", "sct_subscript", "ct_init", "ct_subscript"]:
        print(t)
        try:
            print(timeit("{0}()".format(t), setup="from so import {0}".format(t), number=100))
        except Exception as e:
            print("Failed:", e)
            traceback.print_exc()
        print
    
    print()
    
    print ("Test",)
    from so import *
    sh1 = sct_init()
    sh2 = sct_subscript()
    
    for i in range(n):
        assert sh1[i] == sh2[i]
    print("OK")
    

    Test results

    The results from running the above code using Python 3.6a0 (specifically 3c2fbdb) are:

    sct_init
    2.844902500975877
    sct_subscript
    0.9383537038229406
    ct_init
    2.7903486443683505
    ct_subscript
    0.978101353161037
    
    Test
    OK
    

    What's interesting is that if you change n, the results scale linearly. For example, using n = 100000 (10 times bigger), you get something that's pretty much 10 times slower:

    sct_init
    30.57974253082648
    sct_subscript
    9.48625904135406
    ct_init
    30.509132395964116
    ct_subscript
    9.465419146697968
    
    Test
    OK
    

    Speed difference

    In the end, the speed difference lies in the hot loop that is called to initialize the array by copying every single value over from the Numpy array (l) to the new array (sh). This makes sense, because as we noted speed scales linearly with array size.

    When you pass the Numpy array as a constructor argument, the function that does this is Array_init. However, if you assign using sh[:] = l, then it's Array_ass_subscript that does the job.

    Again, what matters here are the hot loops. Let's look at them.

    Array_init hot loop (slower):

    for (i = 0; i < n; ++i) {
        PyObject *v;
        v = PyTuple_GET_ITEM(args, i);
        if (-1 == PySequence_SetItem((PyObject *)self, i, v))
            return -1;
    }
    

    Array_ass_subscript hot loop (faster):

    for (cur = start, i = 0; i < otherlen; cur += step, i++) {
        PyObject *item = PySequence_GetItem(value, i);
        int result;
        if (item == NULL)
            return -1;
        result = Array_ass_item(myself, cur, item);
        Py_DECREF(item);
        if (result == -1)
            return -1;
    }
    

    As it turns out, the majority of the speed difference lies in using PySequence_SetItem vs. Array_ass_item.

    Indeed, if you change the code for Array_init to use Array_ass_item instead of PySequence_SetItem (if (-1 == Array_ass_item((PyObject *)self, i, v))), and recompile Python, the new results become:

    sct_init
    11.504781467840075
    sct_subscript
    9.381130554247648
    ct_init
    11.625461496878415
    ct_subscript
    9.265848568174988
    
    Test
    OK
    

    Still a bit slower, but not by much.

    In other words, most of the overhead is caused by a slower hot loop, and mostly caused by the code that PySequence_SetItem wraps around Array_ass_item.

    This code might appear like little overhead at first read, but it really isn't.

    PySequence_SetItem actually calls into the entire Python machinery to resolve the __setitem__ method and call it.

    This eventually resolves in a call to Array_ass_item, but only after a large number of levels of indirection (which a direct call to Array_ass_item would bypass entirely!)

    Going through the rabbit hole, the call sequence looks a bit like this:

    • s->ob_type->tp_as_sequence->sq_ass_item points to slot_sq_ass_item.
    • slot_sq_ass_item calls into call_method.
    • call_method calls into PyObject_Call
    • And on and on until we eventually get to Array_ass_item..!

    In other words, we have C code in Array_init that's calling Python code (__setitem__) in a hot loop. That's slow.

    Why ?

    Now, why does Python use PySequence_SetItem in Array_init and not Array_ass_item in Array_init?

    That's because if it did, it would be bypassing the hooks that are exposed to the developer in Python-land.

    Indeed, you can intercept calls to sh[:] = ... by subclassing the array and overriding __setitem__ (__setslice__ in Python 2). It will be called once, with a slice argument for the index.

    Likewise, defining your own __setitem__ also overrides the logic in the constructor. It will be called N times, with an integer argument for the index.

    This means that if Array_init directly called into Array_ass_item, then you would lose something: __setitem__ would no longer be called in the constructor, and you wouldn't be able to override the behavior anymore.

    Now can we try to retain the faster speed all the while still exposing the same Python hooks?

    Well, perhaps, using this code in Array_init instead of the existing hot loop:

     return PySequence_SetSlice((PyObject*)self, 0, PyTuple_GET_SIZE(args), args);
    

    Using this will call into __setitem__ once with a slice argument (on Python 2, it would call into __setslice__). We still go through the Python hooks, but we only do it once instead of N times.

    Using this code, the performance becomes:

    sct_init
    12.24651838419959
    sct_subscript
    10.984305887017399
    ct_init
    12.138383641839027
    ct_subscript
    11.79078131634742
    
    Test
    OK
    

    Other overhead

    I think the rest of the overhead may be due to the tuple instantiation that takes place when calling __init__ on the array object (note the *, and the fact that Array_init expects a tuple for args) — this presumably scales with n as well.

    Indeed, if you replace sh[:] = l with sh[:] = tuple(l) in the test case, then the performance results become almost identical. With n = 100000:

    sct_init
    11.538272527977824
    sct_subscript
    10.985187001060694
    ct_init
    11.485244687646627
    ct_subscript
    10.843198659364134
    
    Test
    OK
    

    There's probably still something smaller going on, but ultimately we're comparing two substantially different hot loops. There's simply little reason to expect them to have identical performance.

    I think it might be interesting to try calling Array_ass_subscript from Array_init for the hot loop and see the results, though!

    Baseline speed

    Now, to your second question, regarding allocating shared memory.

    Note that there isn't really a cost to allocating shared memory. As noted in the results above, there isn't a substantial difference between using shared memory or not.

    Looking at the Numpy code (np.arange is implemented here), we can finally understand why it's so much faster than sct.RawArray: np.arange doesn't appear to make calls to Python "user-land" (i.e. no call to PySequence_GetItem or PySequence_SetItem).

    That doesn't necessarily explain all the difference, but you'd probably want to start investigating there.

    0 讨论(0)
  • 2021-02-13 05:49

    Not an answer (the accepted answer explains this quite well), but for those looking for how to fix this, here's how: Don't use RawArrays slice assignment operator.

    As noted in the accepted answer, RawArrays slice assignment operator doesn't take advantage of the fact that you're copying between two wrappers around C-style arrays of identical type and size. But RawArray implements the buffer protocol, so you can wrap it in a memoryview to access it in an "even more raw" way (and it will make Foo2 win, because you can only do this after constructing the object, not as part of construction):

    def foo2():
        sh = sct.RawArray(ct.c_int, len(l))
        # l must be another buffer protocol object w/the same C format, which is the case here
        memoryview(sh)[:] = l
        return sh
    

    In tests solving this problem on another question, the time to copy using a memoryview wrapper is less than 1% of the time required to copy with RawArrays normal slice assignment. One trick here is that the sizes of the elements of the output of np.random.randint are np.int, and on a 64 bit system, np.int is 64 bits, so on 64 bit Python, you need another round of copying to coerce it to the right size (or you need to declare the RawArray to be of a type that matches the size of np.int). Even if you do need to make that temporary copy though, it's still much cheaper with a memoryview:

    >>> l = np.random.randint(0, 10, size=100000)
    >>> %time sh = sct.RawArray(ct.c_int, len(l))
    Wall time: 472 µs  # Creation is cheap
    
    >>> %time sh[:] = l
    Wall time: 14.4 ms  # TOO LONG!
    
    # Must convert to numpy array with matching element size when c_int and np.int don't match
    >>> %time memoryview(sh)[:] = np.array(l, dtype=np.int32)
    Wall time: 424 µs
    

    As you can see, even when you need to copy the np.array to resize the elements first, the total time is less than 3% of the time required using RawArray's own slice assignment operator.

    If you avoid the temporary copy by making the size of the RawArray match the source, the cost drops further:

    # Make it 64 bit to match size of np.int on my machine
    >>> %time sh = sct.RawArray(ct.c_int64, len(l))
    Wall time: 522 µs  # Creation still cheap, even at double the size
    
    # No need to convert source array now:
    >>> %time memoryview(sh)[:] = l
    Wall time: 123 µs
    

    which gets us down to 0.85% of the RawArray slice assignment time; at this point, you're basically running at memcpy speeds; the rest of your actual Python code will swamp the miniscule amount of time spent on data copying.

    0 讨论(0)
提交回复
热议问题