In Python it is possible to share ctypes objects between multiple processes. However, I noticed that allocating these objects seems to be extremely expensive.
Consider fol
This should be a comment, but I do not have enough reputation :-(
Starting with Python 3.5, shared arrays in Linux are created as temp files mapped to memory (see https://bugs.python.org/issue30919). I think this explains why creating a Numpy array, which is created in memory, is faster than creating and initializing a large shared array. To force Python to use shared memory, a workaround is to execute these two lines of code (ref. No space left while using Multiprocessing.Array in shared memory):
from multiprocessing.process import current_process
current_process()._config['tempdir'] = '/dev/shm'
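A slightly more defensive variant of the same workaround is sketched below. It relies on the same private _config dictionary as the two lines above, which is an implementation detail that may change between Python versions, and the helper name use_dev_shm is made up for illustration. It only redirects the temp directory when the target directory exists and is writable:

```python
import os
from multiprocessing.process import current_process


def use_dev_shm(path="/dev/shm"):
    """Point multiprocessing's temp dir at `path` if it is usable.

    Hypothetical helper: `_config` is a private attribute, so treat this
    as a sketch rather than a supported API.
    """
    if os.path.isdir(path) and os.access(path, os.W_OK):
        current_process()._config["tempdir"] = path
        return True
    return False


# Call this before creating any shared arrays.
redirected = use_dev_shm()
```

If redirected is False (for example on a platform without /dev/shm), the default temp directory stays in effect.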
I rewrote your sample code a little to look into this issue. Here is where I landed; I use it in the rest of this answer.
so.py:
from multiprocessing import sharedctypes as sct
import ctypes as ct
import numpy as np
n = 10000  # baseline size; the later runs use n = 100000 (10 times bigger)
l = np.random.randint(0, 10, size=n)
def sct_init():
    sh = sct.RawArray(ct.c_int, l)
    return sh
def sct_subscript():
    sh = sct.RawArray(ct.c_int, n)
    sh[:] = l
    return sh
def ct_init():
    sh = (ct.c_int * n)(*l)
    return sh
def ct_subscript():
    sh = (ct.c_int * n)(n)
    sh[:] = l
    return sh
Note that I added two test cases that do not use shared memory (they use a regular ctypes array instead).
timer.py:
import traceback
from timeit import timeit
for t in ["sct_init", "sct_subscript", "ct_init", "ct_subscript"]:
    print(t)
    try:
        print(timeit("{0}()".format(t), setup="from so import {0}".format(t), number=100))
    except Exception as e:
        print("Failed:", e)
        traceback.print_exc()
print()
print("Test")
from so import *
# Both initialization paths must produce identical contents.
sh1 = sct_init()
sh2 = sct_subscript()
for i in range(n):
    assert sh1[i] == sh2[i]
print("OK")
The results from running the above code using Python 3.6a0 (specifically 3c2fbdb) are:
sct_init
2.844902500975877
sct_subscript
0.9383537038229406
ct_init
2.7903486443683505
ct_subscript
0.978101353161037
Test
OK
What's interesting is that if you change n, the results scale linearly. For example, using n = 100000 (10 times bigger), you get something that's pretty much 10 times slower:
sct_init
30.57974253082648
sct_subscript
9.48625904135406
ct_init
30.509132395964116
ct_subscript
9.465419146697968
Test
OK
In the end, the speed difference lies in the hot loop that is called to initialize the array by copying every single value over from the NumPy array (l) to the new array (sh). This makes sense because, as noted above, the time scales linearly with array size.
When you pass the NumPy array as a constructor argument, the function that does this is Array_init. However, if you assign using sh[:] = l, then it's Array_ass_subscript that does the job.
Again, what matters here are the hot loops. Let's look at them.
Array_init hot loop (slower):
for (i = 0; i < n; ++i) {
    PyObject *v;
    v = PyTuple_GET_ITEM(args, i);
    if (-1 == PySequence_SetItem((PyObject *)self, i, v))
        return -1;
}
Array_ass_subscript hot loop (faster):
for (cur = start, i = 0; i < otherlen; cur += step, i++) {
    PyObject *item = PySequence_GetItem(value, i);
    int result;
    if (item == NULL)
        return -1;
    result = Array_ass_item(myself, cur, item);
    Py_DECREF(item);
    if (result == -1)
        return -1;
}
As it turns out, the majority of the speed difference lies in using PySequence_SetItem vs. Array_ass_item.
Indeed, if you change the code for Array_init to use Array_ass_item instead of PySequence_SetItem (if (-1 == Array_ass_item((PyObject *)self, i, v))), and recompile Python, the new results become:
sct_init
11.504781467840075
sct_subscript
9.381130554247648
ct_init
11.625461496878415
ct_subscript
9.265848568174988
Test
OK
Still a bit slower, but not by much.
In other words, most of the overhead comes from the slower hot loop, and specifically from the layers that PySequence_SetItem wraps around Array_ass_item.
This code might look like little overhead at first read, but it really isn't.
PySequence_SetItem actually calls into the entire Python machinery to resolve the __setitem__ method and call it.
This eventually resolves to a call to Array_ass_item, but only after a large number of levels of indirection (which a direct call to Array_ass_item would bypass entirely!).
Going down the rabbit hole, the call sequence looks a bit like this:
s->ob_type->tp_as_sequence->sq_ass_item points to slot_sq_ass_item.
slot_sq_ass_item calls into call_method.
call_method calls into PyObject_Call.
And so on, until we eventually get to Array_ass_item!
In other words, we have C code in Array_init that's calling Python code (__setitem__) in a hot loop. That's slow.
Now, why does Array_init use PySequence_SetItem and not Array_ass_item?
That's because if it did, it would be bypassing the hooks that are exposed to the developer in Python-land.
Indeed, you can intercept calls to sh[:] = ... by subclassing the array and overriding __setitem__ (__setslice__ in Python 2). It will be called once, with a slice argument for the index.
Likewise, defining your own __setitem__ also overrides the logic in the constructor. It will be called N times, with an integer argument for the index.
This means that if Array_init directly called into Array_ass_item, then you would lose something: __setitem__ would no longer be called in the constructor, and you wouldn't be able to override the behavior anymore.
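The hook behavior described above can be observed from Python with a small sketch. The class name LoggedArray and its seen list are made up for illustration, and the example assumes a CPython whose Array_init still goes through PySequence_SetItem as shown earlier. The subclass records every index that reaches __setitem__, once for the constructor path and once for the slice-assignment path:

```python
import ctypes as ct


class LoggedArray(ct.c_int * 3):
    """ctypes array subclass that records every index passed to __setitem__."""

    seen = []  # class-level log; good enough for a demonstration

    def __setitem__(self, index, value):
        LoggedArray.seen.append(index)
        super().__setitem__(index, value)


arr = LoggedArray(1, 2, 3)            # constructor path (Array_init)
constructor_calls = list(LoggedArray.seen)

LoggedArray.seen.clear()
arr[:] = [4, 5, 6]                    # slice-assignment path
slice_calls = list(LoggedArray.seen)

print(constructor_calls)  # one integer index per element
print(slice_calls)        # a single slice object
```

Under those assumptions, the constructor reaches the hook once per element while the slice assignment reaches it only once, which is the asymmetry the timings reflect.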
Now, can we try to retain the faster speed while still exposing the same Python hooks?
Well, perhaps, using this code in Array_init instead of the existing hot loop:
return PySequence_SetSlice((PyObject*)self, 0, PyTuple_GET_SIZE(args), args);
Using this will call into __setitem__ once with a slice argument (on Python 2, it would call into __setslice__). We still go through the Python hooks, but we only do it once instead of N times.
Using this code, the performance becomes:
sct_init
12.24651838419959
sct_subscript
10.984305887017399
ct_init
12.138383641839027
ct_subscript
11.79078131634742
Test
OK
I think the rest of the overhead may be due to the tuple instantiation that takes place when calling __init__ on the array object (note the * in the call, and the fact that Array_init expects a tuple for args); this presumably scales with n as well.
Indeed, if you replace sh[:] = l with sh[:] = tuple(l) in the test case, then the performance results become almost identical. With n = 100000:
sct_init
11.538272527977824
sct_subscript
10.985187001060694
ct_init
11.485244687646627
ct_subscript
10.843198659364134
Test
OK
There's probably still something smaller going on, but ultimately we're comparing two substantially different hot loops. There's simply little reason to expect them to have identical performance.
I think it might be interesting to try calling Array_ass_subscript from Array_init for the hot loop and see the results, though!
Now, to your second question, regarding allocating shared memory.
Note that there isn't really a cost to allocating shared memory. As noted in the results above, there isn't a substantial difference between using shared memory or not.
Looking at the NumPy code (np.arange is implemented here), we can finally understand why it's so much faster than sct.RawArray: np.arange doesn't appear to make calls to Python "user-land" (i.e. no call to PySequence_GetItem or PySequence_SetItem).
That doesn't necessarily explain all of the difference, but you'd probably want to start investigating there.
Not an answer (the accepted answer explains this quite well), but for those looking for how to fix this, here's how: don't use RawArray's slice assignment operator.
As noted in the accepted answer, RawArray's slice assignment operator doesn't take advantage of the fact that you're copying between two wrappers around C-style arrays of identical type and size. But RawArray implements the buffer protocol, so you can wrap it in a memoryview to access it in an "even more raw" way (and it will make foo2 win, because you can only do this after constructing the object, not as part of construction):
def foo2():
    sh = sct.RawArray(ct.c_int, len(l))
    # l must be another buffer protocol object w/the same C format, which is the case here
    memoryview(sh)[:] = l
    return sh
In tests solving this problem on another question, the time to copy using a memoryview wrapper is less than 1% of the time required to copy with RawArray's normal slice assignment.
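Slice assignment on a memoryview requires the source and destination to report the same format and shape; otherwise it raises ValueError. A minimal sketch of one way to sidestep a format mismatch, assuming both buffers have the same total byte size and the same in-memory element layout, is to cast both sides to unsigned bytes before copying. Here array.array stands in for the NumPy array so the example needs no third-party packages; that substitution is mine, not part of the original answer:

```python
import array
import ctypes as ct
from multiprocessing import sharedctypes as sct

# Source data in a buffer-protocol object whose items are C ints.
src = array.array("i", range(1000))

sh = sct.RawArray(ct.c_int, len(src))

# View both buffers as flat unsigned bytes so their formats agree exactly.
dst_bytes = memoryview(sh).cast("B")
src_bytes = memoryview(src).cast("B")
if dst_bytes.nbytes != src_bytes.nbytes:
    raise ValueError("source and destination must have the same byte size")

dst_bytes[:] = src_bytes  # bulk copy, no per-element Python-level calls
```

Because the cast discards element type information, it is up to the caller to make sure the element sizes and byte order really match; a mismatch would copy without error but produce wrong values.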
One catch here is that the elements produced by np.random.randint are of type np.int, and on a 64-bit system np.int is 64 bits wide, so on 64-bit Python you need another round of copying to coerce the data to the right size (or you need to declare the RawArray with a type that matches the size of np.int). Even if you do need to make that temporary copy, though, it's still much cheaper with a memoryview:
>>> l = np.random.randint(0, 10, size=100000)
>>> %time sh = sct.RawArray(ct.c_int, len(l))
Wall time: 472 µs # Creation is cheap
>>> %time sh[:] = l
Wall time: 14.4 ms # TOO LONG!
# Must convert to numpy array with matching element size when c_int and np.int don't match
>>> %time memoryview(sh)[:] = np.array(l, dtype=np.int32)
Wall time: 424 µs
As you can see, even when you need to copy the np.array to resize the elements first, the total time is less than 3% of the time required using RawArray's own slice assignment operator.
If you avoid the temporary copy by making the element size of the RawArray match the source, the cost drops further:
# Make it 64 bit to match size of np.int on my machine
>>> %time sh = sct.RawArray(ct.c_int64, len(l))
Wall time: 522 µs # Creation still cheap, even at double the size
# No need to convert source array now:
>>> %time memoryview(sh)[:] = l
Wall time: 123 µs
which gets us down to 0.85% of the RawArray slice assignment time; at this point, you're basically running at memcpy speeds, and the rest of your actual Python code will swamp the minuscule amount of time spent on data copying.