I am a bit confused about the multiprocessing feature of mod_wsgi and about the general design of WSGI applications that will run on WSGI servers with multiprocessing ability.
If you are using multiprocessing, there are multiple ways to share data between processes. Value and Array objects only work if the processes have a parent/child relation (they are shared by inheritance). If that is not the case, use a Manager and proxy objects.
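To illustrate the parent/child case, here is a minimal sketch, assuming plain Python 3 and the standard multiprocessing module. The names add_one and count_with_children are made up for this example. The Value is created in the parent and handed to the children that the parent itself starts:

```python
import multiprocessing


def add_one(counter):
    # get_lock() guards the read-modify-write so that increments are not lost
    with counter.get_lock():
        counter.value += 1


def count_with_children(n_children=4):
    # The Value is created in the parent and passed to each child at start time
    counter = multiprocessing.Value("i", 0)
    children = [
        multiprocessing.Process(target=add_one, args=(counter,))
        for _ in range(n_children)
    ]
    for child in children:
        child.start()
    for child in children:
        child.join()
    return counter.value
```

Called from under an `if __name__ == "__main__":` guard, count_with_children(4) should return 4, since every increment happens under the Value's lock. This only works because the parent creates the children; processes started by something else have no such Value to inherit.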
There are several aspects to consider in your question.
First, the interaction between Apache MPMs and mod_wsgi applications. If you run the mod_wsgi application in embedded mode (no WSGIDaemonProcess needed, WSGIProcessGroup %{GLOBAL}), you inherit multiprocessing/multithreading from the Apache MPM. This should be the fastest option, and you end up with multiple processes and multiple threads per process, depending on your MPM configuration. By contrast, if you run mod_wsgi in daemon mode, with WSGIDaemonProcess <name> [options] and WSGIProcessGroup <name>, you have fine control over multiprocessing/multithreading at the cost of a small overhead.
Within a single apache2 server you may define zero, one, or more named WSGIDaemonProcess groups, and each application can run in one of these groups (WSGIProcessGroup <name>) or in embedded mode with WSGIProcessGroup %{GLOBAL}.
You can check multiprocessing/multithreading by inspecting the wsgi.multithread and wsgi.multiprocess variables.
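A quick way to see those two flags for a given configuration is a throwaway WSGI app that echoes them back. This is only a sketch; the two environ keys are defined by the WSGI spec (PEP 3333), everything else here is illustrative:

```python
def application(environ, start_response):
    # Both keys are set by the WSGI server for every request
    lines = [
        "wsgi.multithread = %r" % environ["wsgi.multithread"],
        "wsgi.multiprocess = %r" % environ["wsgi.multiprocess"],
    ]
    body = "\n".join(lines).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```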
With your configuration, WSGIDaemonProcess example processes=5 threads=1, you have 5 independent processes, each with a single thread of execution: no global data and no shared memory, because mod_wsgi spawns the processes for you and you are not in control of how they are created. To share global state, you already listed some possible options: a DB that your processes talk to, some sort of file-system-based persistence, or a daemon process (started outside Apache) with socket-based IPC.
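Of the options listed, the database one is probably the quickest to try. A minimal sketch, assuming an SQLite file that all the processes can read and write; the table layout, the increment function and the path you pass in are made up for illustration:

```python
import sqlite3


def increment(db_path):
    """Add 1 to a counter stored in an SQLite file and return the new value."""
    conn = sqlite3.connect(db_path, timeout=10)  # wait up to 10 s for a lock
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS counter ("
            "id INTEGER PRIMARY KEY CHECK (id = 1), value INTEGER NOT NULL)"
        )
        with conn:  # commits on success, rolls back on an exception
            conn.execute("INSERT OR IGNORE INTO counter (id, value) VALUES (1, 0)")
            conn.execute("UPDATE counter SET value = value + 1 WHERE id = 1")
            (value,) = conn.execute(
                "SELECT value FROM counter WHERE id = 1"
            ).fetchone()
        return value
    finally:
        conn.close()
```

The update and the read happen inside one transaction, so SQLite's file locking serializes concurrent writers and no two processes should get the same value back. For anything busier than a toy counter a real database server is likely the better choice.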
As pointed out by Roland Smith, the latter could be implemented using the high-level API of multiprocessing.managers: outside Apache you create and start a BaseManager server process:
import multiprocessing.managers
# authkey must be a bytes object on Python 3
m = multiprocessing.managers.BaseManager(address=('', 12345), authkey=b'secret')
m.get_server().serve_forever()
and inside your apps you connect:
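import multiprocessing.managers
# connect to the server on the same machine; authkey must match and be bytes on Python 3
m = multiprocessing.managers.BaseManager(address=('localhost', 12345), authkey=b'secret')
m.connect()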
The example above is a dummy, since m has no useful method registered, but the Python docs for multiprocessing.managers show how to create and proxy an object (like the counter in your example) among your processes.
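For a rough idea of what that looks like, here is a sketch that follows the "remote manager" pattern from the Python docs, assuming Python 3. The Counter class, the two manager subclasses, get_counter, make_server and connect_counter are all hypothetical names; the port and the key are the ones from the snippets above, with the key written as bytes.

```python
import threading
from multiprocessing.managers import BaseManager

ADDRESS = ("127.0.0.1", 12345)  # port from the snippets above
AUTHKEY = b"secret"             # must be bytes on Python 3


class Counter:
    """Hypothetical shared object; it lives only in the server process."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment(self):
        # The manager server handles each client connection in its own thread
        with self._lock:
            self._value += 1
            return self._value


class ServerManager(BaseManager):
    pass


class ClientManager(BaseManager):
    pass


_counter = Counter()
# Server side: every call to get_counter hands out the same Counter instance
ServerManager.register("get_counter", callable=lambda: _counter)
# Client side: only the name is registered; calls are forwarded to the server
ClientManager.register("get_counter")


def make_server(address=ADDRESS):
    """Outside Apache: make_server().serve_forever() blocks and serves clients."""
    return ServerManager(address=address, authkey=AUTHKEY).get_server()


def connect_counter(address=ADDRESS):
    """Inside each WSGI process: returns a proxy to the single shared Counter."""
    manager = ClientManager(address=address, authkey=AUTHKEY)
    manager.connect()
    return manager.get_counter()
```

Each WSGI process would call connect_counter() once and then call increment() on the proxy per request. Every call is a round trip over a socket, so this trades speed for having one authoritative counter.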
A final comment on your example, with processes=5 threads=1. I understand that this is just an example, but in real-world applications I suspect that performance will be comparable to processes=1 threads=5: you should go into the intricacies of sharing data across processes only if the expected performance boost over the 'single process, many threads' model is significant.
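With processes=1 threads=5 a module-level counter is visible to every request, and all it needs is a lock. A minimal sketch, where next_count and the response format are made up for illustration:

```python
import threading

_counter = 0
_counter_lock = threading.Lock()


def next_count():
    """Increment the process-wide counter and return the new value."""
    global _counter
    # The lock makes the read-modify-write atomic across request threads
    with _counter_lock:
        _counter += 1
        return _counter


def application(environ, start_response):
    body = ("count = %d\n" % next_count()).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

This only holds as long as there really is a single process; the moment processes goes above 1, each process gets its own _counter again.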
From the mod_wsgi docs on processes and threading:
When Apache is run in a mode whereby there are multiple child processes, each child process will contain sub interpreters for each WSGI application.
This means that in your configuration, 5 processes with 1 thread each, there will be 5 interpreters and no shared data. Your counter object will be unique to each interpreter. You would need to either build some custom solution to count sessions (one common process you can communicate with, some kind of persistence based solution, etc.) OR, and this is definitely my recommendation, use a prebuilt solution (Google Analytics and Chartbeat are fantastic options).
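You can see this for yourself with a small test app, sketched below under the assumption that it is mounted under the 5-process configuration. The _hits global and the response format are made up for illustration:

```python
import os

_hits = 0  # one copy of this variable per process/interpreter


def application(environ, start_response):
    global _hits
    _hits += 1
    # Report which process answered and what that process has counted so far
    body = ("pid=%d hits=%d\n" % (os.getpid(), _hits)).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

Reloading the page a few times should show different pids, each with its own independent hits value, rather than one number that grows with every request.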
I tend to think of using globals to share data as a big form of global abuse. It's a source of bugs and a portability issue in most of the environments I've done parallel processing in. What if your application suddenly had to run on multiple virtual machines? That would break your code no matter what the sharing model of threads and processes is.