In Python it is possible to share ctypes objects between multiple processes. However, allocating these objects seems to be extremely expensive.
I rewrote your sample code a little bit to look into this issue. Here's where I landed, I'll use it in my answer below:
so.py:

    from multiprocessing import sharedctypes as sct
    import ctypes as ct
    import numpy as np

    n = 10000
    l = np.random.randint(0, 10, size=n)


    def sct_init():
        sh = sct.RawArray(ct.c_int, l)
        return sh

    def sct_subscript():
        sh = sct.RawArray(ct.c_int, n)
        sh[:] = l
        return sh

    def ct_init():
        sh = (ct.c_int * n)(*l)
        return sh

    def ct_subscript():
        sh = (ct.c_int * n)()
        sh[:] = l
        return sh
Note that I added two test cases that do not use shared memory (and use a regular ctypes array instead).
timer.py:

    import traceback
    from timeit import timeit

    for t in ["sct_init", "sct_subscript", "ct_init", "ct_subscript"]:
        print(t)
        try:
            print(timeit("{0}()".format(t), setup="from so import {0}".format(t), number=100))
        except Exception as e:
            print("Failed:", e)
            traceback.print_exc()
        print()

    print("Test")
    from so import *
    sh1 = sct_init()
    sh2 = sct_subscript()

    for i in range(n):
        assert sh1[i] == sh2[i]
    print("OK")
The results from running the above code using Python 3.6a0 (specifically commit 3c2fbdb) are:
    sct_init
    2.844902500975877
    sct_subscript
    0.9383537038229406
    ct_init
    2.7903486443683505
    ct_subscript
    0.978101353161037
    Test
    OK
What's interesting is that if you change n, the results scale linearly. For example, using n = 100000 (10 times bigger), you get something that's pretty much 10 times slower:
    sct_init
    30.57974253082648
    sct_subscript
    9.48625904135406
    ct_init
    30.509132395964116
    ct_subscript
    9.465419146697968
    Test
    OK
In the end, the speed difference lies in the hot loop that is called to initialize the array by copying every single value over from the Numpy array (l) to the new array (sh). This makes sense, because as we noted, speed scales linearly with array size.
When you pass the Numpy array as a constructor argument, the function that does this copy is Array_init. However, if you assign using sh[:] = l, then it is Array_ass_subscript that does the job.
Again, what matters here are the hot loops. Let's look at them.
The Array_init hot loop (slower):

    for (i = 0; i < n; ++i) {
        PyObject *v;
        v = PyTuple_GET_ITEM(args, i);
        if (-1 == PySequence_SetItem((PyObject *)self, i, v))
            return -1;
    }
The Array_ass_subscript hot loop (faster):

    for (cur = start, i = 0; i < otherlen; cur += step, i++) {
        PyObject *item = PySequence_GetItem(value, i);
        int result;
        if (item == NULL)
            return -1;
        result = Array_ass_item(myself, cur, item);
        Py_DECREF(item);
        if (result == -1)
            return -1;
    }
As it turns out, the majority of the speed difference lies in using PySequence_SetItem vs. Array_ass_item.
Indeed, if you change the code of Array_init to use Array_ass_item instead of PySequence_SetItem (i.e. if (-1 == Array_ass_item((PyObject *)self, i, v))) and recompile Python, the new results become:
    sct_init
    11.504781467840075
    sct_subscript
    9.381130554247648
    ct_init
    11.625461496878415
    ct_subscript
    9.265848568174988
    Test
    OK
Still a bit slower, but not by much.
In other words, most of the overhead comes from the slower hot loop, and within it, mostly from the code that PySequence_SetItem wraps around Array_ass_item. That code might look like it adds little overhead at first read, but it really doesn't.
PySequence_SetItem actually calls into the entire Python machinery to resolve the __setitem__ method and call it. This eventually resolves into a call to Array_ass_item, but only after many levels of indirection, all of which a direct call to Array_ass_item would bypass entirely!
Going down the rabbit hole, the call sequence looks a bit like this:

    s->ob_type->tp_as_sequence->sq_ass_item points to slot_sq_ass_item.
    slot_sq_ass_item calls into call_method.
    call_method calls into PyObject_Call, and eventually we land in Array_ass_item..!

In other words, we have C code in Array_init that's calling Python code (__setitem__) in a hot loop. That's slow.
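You can get a rough feel for this kind of dispatch overhead from pure Python. The following is only an analogy of mine, not the actual C code paths: l.__setitem__(i, v) goes through full method resolution on every call, while l[i] = v dispatches straight to the C slot:

    from timeit import timeit

    setup = "l = [0] * 100000"
    # Resolving and calling __setitem__ each iteration, loosely analogous
    # to what PySequence_SetItem does in the Array_init hot loop:
    print(timeit("for i in range(100000): l.__setitem__(i, 1)", setup=setup, number=10))
    # Subscript syntax goes straight to the C slot, loosely analogous
    # to calling Array_ass_item directly:
    print(timeit("for i in range(100000): l[i] = 1", setup=setup, number=10))

On my understanding, the first loop should come out measurably slower, for the same reason Array_init does.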
Now, why does Python use PySequence_SetItem in Array_init, rather than Array_ass_item?
That's because if it did, it would be bypassing the hooks that are exposed to the developer in Python-land.
Indeed, you can intercept calls to sh[:] = ... by subclassing the array type and overriding __setitem__ (__setslice__ in Python 2). It will be called once, with a slice argument for the index.
Likewise, defining your own __setitem__ also overrides the logic in the constructor. It will be called N times, with an integer argument for the index.
This means that if Array_init directly called into Array_ass_item, then you would lose something: __setitem__ would no longer be called in the constructor, and you wouldn't be able to override the behavior anymore.
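To make this concrete, here's a small sketch demonstrating both hooks (the class name LoggingArray is mine; the behavior is as described above):

    import ctypes as ct

    class LoggingArray(ct.c_int * 5):
        def __setitem__(self, index, value):
            # Called with an integer index (N times) from the constructor,
            # and with a slice (once) for sh[:] = ... style assignments.
            print("__setitem__ called with index:", index)
            super().__setitem__(index, value)

    arr = LoggingArray(1, 2, 3, 4, 5)  # prints five integer indices
    arr[:] = [6, 7, 8, 9, 10]          # prints a single slice(None, None, None)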
Now, can we retain the faster speed while still exposing the same Python hooks?
Well, perhaps, by using this code in Array_init instead of the existing hot loop:

    return PySequence_SetSlice((PyObject*)self, 0, PyTuple_GET_SIZE(args), args);
Using this will call into __setitem__ once, with a slice argument (on Python 2, it would call into __setslice__). We still go through the Python hooks, but we only do it once instead of N times.
Using this code, the performance becomes:
    sct_init
    12.24651838419959
    sct_subscript
    10.984305887017399
    ct_init
    12.138383641839027
    ct_subscript
    11.79078131634742
    Test
    OK
I think the rest of the overhead may be due to the tuple instantiation that takes place when calling __init__ on the array object (note the *, and the fact that Array_init expects a tuple for args); this presumably scales with n as well.
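As a quick sanity check of that guess (a sketch of mine, not part of the original benchmark), you can time the tuple conversion on its own and confirm it is an O(n) pass:

    from timeit import timeit

    # Converting the Numpy array to a tuple touches every element once,
    # so its cost should scale linearly with n, like the constructor overhead.
    print(timeit("tuple(l)",
                 setup="import numpy as np; l = np.random.randint(0, 10, size=100000)",
                 number=100))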
Indeed, if you replace sh[:] = l with sh[:] = tuple(l) in the test case, then the performance results become almost identical. With n = 100000:
    sct_init
    11.538272527977824
    sct_subscript
    10.985187001060694
    ct_init
    11.485244687646627
    ct_subscript
    10.843198659364134
    Test
    OK
There's probably still something smaller going on, but ultimately we're comparing two substantially different hot loops. There's simply little reason to expect them to have identical performance.
I think it might be interesting to try calling Array_ass_subscript directly from Array_init for the hot loop and see the results, though!
Now, to your second question, regarding allocating shared memory.
Note that there isn't really a cost to allocating shared memory itself: as the results above show, there is no substantial difference between using shared memory and not using it.
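If you want to verify this yourself, a minimal sketch is to time the allocation alone, without copying any data in:

    from timeit import timeit

    # Allocate a shared array without initializing its contents:
    print(timeit("sct.RawArray(ct.c_int, n)",
                 setup="from multiprocessing import sharedctypes as sct; "
                       "import ctypes as ct; n = 100000",
                 number=100))
    # Compare with a plain, non-shared ctypes array of the same size:
    print(timeit("(ct.c_int * n)()",
                 setup="import ctypes as ct; n = 100000",
                 number=100))

Both should come out cheap relative to the element-by-element initialization measured above.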
Looking at the Numpy code (np.arange is implemented here), we can finally understand why it's so much faster than sct.RawArray: np.arange doesn't appear to make calls back into Python "user-land" (i.e. no calls to PySequence_GetItem or PySequence_SetItem).
That doesn't necessarily explain all the difference, but you'd probably want to start investigating there.
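As a starting point, here's a minimal comparison sketch (assuming the so.py file above is importable; exact numbers will vary by machine, the point is the order-of-magnitude gap):

    from timeit import timeit

    # np.arange fills its buffer entirely in C, with no per-element
    # round-trip through PySequence_GetItem/PySequence_SetItem:
    print(timeit("np.arange(n)", setup="import numpy as np; n = 10000", number=100))
    # For contrast, the shared-memory constructor path benchmarked above
    # (so.py also uses n = 10000):
    print(timeit("sct_init()", setup="from so import sct_init", number=100))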