I'd like to know how multiprocessing is done right. Assume I have a list [1,2,3,4,5] generated by function f1 and written to a Queue. I use concurrent.futures and three pools, which are connected together via future.add_done_callback. Then I wait for the whole process to end by calling shutdown on each pool.
from concurrent.futures import ProcessPoolExecutor
import time
import random

def worker1(arg):
    time.sleep(random.random())
    return arg

def pipe12(future):
    pool2.submit(worker2, future.result()).add_done_callback(pipe23)

def worker2(arg):
    time.sleep(random.random())
    return arg

def pipe23(future):
    pool3.submit(worker3, future.result()).add_done_callback(spout)

def worker3(arg):
    time.sleep(random.random())
    return arg

def spout(future):
    print(future.result())

if __name__ == "__main__":
    __spec__ = None  # Fix multiprocessing in Spyder's IPython
    pool1 = ProcessPoolExecutor(2)
    pool2 = ProcessPoolExecutor(2)
    pool3 = ProcessPoolExecutor(2)
    for i in range(10):
        pool1.submit(worker1, i).add_done_callback(pipe12)
    pool1.shutdown()
    pool2.shutdown()
    pool3.shutdown()
What would be wrong with using idea 1, but with each worker process (f2) putting a custom object with its identifier on the queue when it is done? Then f3 would just terminate that worker, until there is no worker process left.
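A minimal sketch of that variation might look like this (the Done class, the per-worker ids, and the doubling work are illustrative assumptions; here the parent joins each worker as its marker arrives rather than forcibly terminating it):

import multiprocessing as mp

class Done:
    # Illustrative marker a worker puts on the output queue when it finishes.
    def __init__(self, worker_id):
        self.worker_id = worker_id

def f2(worker_id, inq, outq):
    while True:
        val = inq.get()
        if val is None:                      # stop signal from the feeder
            outq.put(Done(worker_id))        # announce this worker is finished
            break
        outq.put(val * 2)                    # illustrative work

def f3(outq, workers):
    # Runs in the parent: joins each worker as its Done marker arrives,
    # and stops once no worker is left.
    while workers:
        val = outq.get()
        if isinstance(val, Done):
            workers.pop(val.worker_id).join()
        else:
            print(val)

if __name__ == '__main__':
    inq, outq = mp.Queue(), mp.Queue()
    for i in [1, 2, 3, 4, 5]:
        inq.put(i)
    workers = {wid: mp.Process(target=f2, args=(wid, inq, outq)) for wid in range(2)}
    for w in workers.values():
        w.start()
        inq.put(None)                        # one stop signal per worker
    f3(outq, workers)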
Also, new in Python 3.2 is the concurrent.futures package in the standard library, which should do what you are trying to do the "right way" (tm) - http://docs.python.org/dev/library/concurrent.futures.html
Maybe it is possible to find a backport of concurrent.futures to the Python 2.x series.
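For reference, a minimal sketch of such a two-stage pipeline with concurrent.futures might look like this (the stage functions, pool sizes and input data are illustrative assumptions):

from concurrent.futures import ProcessPoolExecutor

def stage1(x):
    return x * 2

def stage2(x):
    return x + 1

if __name__ == '__main__':
    # The with-blocks call shutdown(wait=True) on both pools on exit.
    with ProcessPoolExecutor(2) as pool1, ProcessPoolExecutor(2) as pool2:
        intermediate = pool1.map(stage1, [1, 2, 3, 4, 5])
        for result in pool2.map(stage2, intermediate):
            print(result)

Executor.map feeds the second stage as the first stage's results become available, in order, and the with-blocks take care of the shutdown calls.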
For Idea 1, how about:
import multiprocessing as mp

sentinel = None

def f2(inq, outq):
    while True:
        val = inq.get()
        if val is sentinel:
            break
        outq.put(val * 2)

def f3(outq):
    while True:
        val = outq.get()
        if val is sentinel:
            break
        print(val)

def f1():
    num_workers = 2
    inq = mp.Queue()
    outq = mp.Queue()
    for i in range(5):
        inq.put(i)
    for i in range(num_workers):
        inq.put(sentinel)
    workers = [mp.Process(target=f2, args=(inq, outq)) for _ in range(num_workers)]
    printer = mp.Process(target=f3, args=(outq,))
    for w in workers:
        w.start()
    printer.start()
    for w in workers:
        w.join()
    outq.put(sentinel)
    printer.join()

if __name__ == '__main__':
    f1()
The only difference from the description of Idea 1 is that f2 breaks out of the while-loop when it receives the sentinel (thus terminating itself). f1 blocks until the workers are done (using w.join()) and then sends f3 the sentinel (signaling that it should break out of its while-loop).
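As a small refinement, the worker loop can also be written with the two-argument form of iter(), using the same sentinel as above (the doubling work is again just illustrative):

def f2(inq, outq):
    # iter(inq.get, sentinel) keeps calling inq.get() until it returns the sentinel
    for val in iter(inq.get, sentinel):
        outq.put(val * 2)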
The easiest way to do exactly that is to use stop markers in your queues.
F1
F1 populates your Queue with the data you want to process. At the end of this push, you put n 'STOP' keywords into the queue, where n is the number of workers involved (2 in your example). The code would look like:
for n in range(no_of_processes):
    tasks.put('STOP')
F2
F2 pulls elements from the provided queue with get; each call takes an element off the queue and removes it. You can then put the get into a loop that watches for the stop signal:
for elem in iter(tasks.get, 'STOP'):
    ...  # do something with elem
F3
This one is a bit tricky. You could generate a semaphore in F2 that acts as a signal to F3, but you do not know when this signal arrives and you may lose data. However, F3 pulls from its queue the same way F2 does, so you can wrap the pull in a try...except statement: queue.get raises queue.Empty when no element arrives within its timeout (or immediately with block=False). So your pull in F3 would look like:
import queue  # the Empty exception lives in the standard queue module

control = True
while control:
    try:
        results.get(timeout=1)  # use a timeout so queue.Empty can actually be raised
    except queue.Empty:
        control = False
Here tasks and results are the queues, so you do not need anything that is not already included in Python.
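Putting the pieces above together, a minimal runnable sketch might look like this (the doubling work in F2, the 1-second timeout, and the wiring in the main block are illustrative assumptions):

import multiprocessing as mp
import queue  # for queue.Empty

def f1(tasks, no_of_processes):
    # Fill the task queue, then append one 'STOP' marker per worker.
    for i in [1, 2, 3, 4, 5]:
        tasks.put(i)
    for n in range(no_of_processes):
        tasks.put('STOP')

def f2(tasks, results):
    # Pull until this worker's 'STOP' marker arrives, then exit.
    for elem in iter(tasks.get, 'STOP'):
        results.put(elem * 2)  # illustrative work

def f3(results):
    control = True
    while control:
        try:
            print(results.get(timeout=1))
        except queue.Empty:
            control = False

if __name__ == '__main__':
    no_of_processes = 2
    tasks, results = mp.Queue(), mp.Queue()
    f1(tasks, no_of_processes)
    workers = [mp.Process(target=f2, args=(tasks, results))
               for _ in range(no_of_processes)]
    for w in workers:
        w.start()
    printer = mp.Process(target=f3, args=(results,))
    printer.start()
    for w in workers:
        w.join()
    printer.join()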
With the MPipe module, simply do this:
from mpipe import OrderedStage, Pipeline

def f1(value):
    return value * 2

def f2(value):
    print(value)

s1 = OrderedStage(f1, size=2)
s2 = OrderedStage(f2)
p = Pipeline(s1.link(s2))

for task in 1, 2, 3, 4, 5, None:
    p.put(task)
The above runs 4 processes: two for the first stage (f1, since size=2), one for the second stage (f2), and one for the main program.
The MPipe cookbook offers some explanation of how processes are shut down internally using None
as the last task.
To run the code, install MPipe:
virtualenv venv
venv/bin/pip install mpipe
venv/bin/python prog.py
Output:
2
4
6
8
10
Pypeline does this for you. You can even choose between using Processes, Threads or async Tasks. For what you want, just use Processes, e.g.:
import pypeln as pl
data = some_iterable()
data = pl.process.map(f2, data, workers = 3)
data = list(data)
You can do more complex stuff:
import pypeln as pl
data = some_iterable()
data = pl.process.map(f2, data, workers = 3)
data = pl.process.filter(f3, data, workers = 1)
data = pl.process.flat_map(f4, data, workers = 5)
data = list(data)