multiprocessing queue full

Submitted by ぃ、小莉子 on 2020-01-01 04:21:06

Question


I'm using concurrent.futures to implement multiprocessing. I am getting a queue.Full error, which is odd because I am only assigning 10 jobs.

import numpy as np
from concurrent.futures import ProcessPoolExecutor

A_list = [np.random.rand(2000, 2000) for i in range(10)]

with ProcessPoolExecutor() as pool:
    pool.map(np.linalg.svd, A_list)

error:

Exception in thread Thread-9:
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/threading.py", line 921, in _bootstrap_inner
    self.run()
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/threading.py", line 869, in run
    self._target(*self._args, **self._kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/concurrent/futures/process.py", line 251, in _queue_management_worker
    shutdown_worker()
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/concurrent/futures/process.py", line 209, in shutdown_worker
    call_queue.put_nowait(None)
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/multiprocessing/queues.py", line 131, in put_nowait
    return self.put(obj, False)
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/multiprocessing/queues.py", line 82, in put
    raise Full
queue.Full

Answer 1:


Short Answer
I believe pipe size limits are the underlying cause. There isn't much you can do about this except break up your data into smaller chunks and deal with them iteratively. This means you may need to find a new algorithm that can work on small portions of your 2000x2000 array at a time to find the Singular Value Decomposition.

Details
Let's get one thing straight right away: you're dealing with a lot of information. Just because you're working with only ten items doesn't mean it's trivial. Each of those items is a 2000x2000 array of 4,000,000 floats, which are usually 64 bits (8 bytes) each, so you're looking at around 32 MB (about 30.5 MiB, or roughly 244 megabits) per array, plus the other data that tags along in NumPy's ndarrays.
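
As a quick sanity check on the sizes involved, the raw payload of an array is just element count times element size. The sketch below assumes float64 elements (8 bytes, which is what np.random.rand produces) and ignores the ndarray header and pickle framing; array_nbytes is a hypothetical helper name, not a NumPy function.

```python
def array_nbytes(rows, cols, itemsize=8):
    """Raw data size in bytes of a rows x cols array of itemsize-byte elements."""
    return rows * cols * itemsize


MIB = 1024 * 1024

big = array_nbytes(2000, 2000)   # one array from the question
small = array_nbytes(200, 200)   # the smaller case that reportedly works

print(f"2000x2000 float64: {big} bytes = {big / MIB:.1f} MiB")
print(f"200x200 float64:   {small} bytes = {small / 1024:.1f} KiB")
print(f"ten 2000x2000 arrays: {10 * big / MIB:.0f} MiB")
```

Either way, a single array is several orders of magnitude larger than a typical pipe buffer, so it can only pass through while a consumer is draining the other end.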

The ProcessPoolExecutor works by launching a separate thread to manage the worker processes. The management thread uses a multiprocessing.Queue, called _call_queue, to pass jobs to the workers. These multiprocessing.Queues are actually just fancy wrappers around pipes, and the ndarrays you're trying to pass to the workers are likely too large for the pipes to handle properly.

Reading over Python Issue 8426 shows that figuring out exactly how big your pipes are can be difficult, even when you can look up some nominal pipe size limit for your OS. There are too many variables to make it simple. Even the order in which things are pulled off the queue can induce race conditions in the underlying pipe that trigger odd errors.

I suspect that one of your workers is getting an incomplete or corrupted object off of its _call_queue, because that queue's pipe is full of your giant objects. That worker dies in an unclean way, and the work queue manager detects this failure, so it gives up on the work and tells the remaining workers to exit. But it does this by passing them poison pills over _call_queue, which is still full of your giant ndarrays. This is why you got the queue.Full exception: your data filled up the queue, then the management thread tried to use the same queue to pass control messages to the other workers.

I think this is a classic example of the potential dangers of mixing data and control flows between different entities in a program. Your large data not only blocked more data from being received by the workers, it also blocked the manager's control communications with the workers because they use the same path.
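
The shared-path problem can be reproduced in miniature. The sketch below is a toy, not the executor's actual internals: it fills a small bounded multiprocessing.Queue with "data" items and then tries to send a None poison pill with put_nowait, the same call that appears in the traceback. The function name, the queue size, and the tuple payloads are all assumptions made for illustration.

```python
import multiprocessing
import queue


def control_message_blocked(maxsize=2):
    """Return True if a poison pill cannot be enqueued behind pending data."""
    q = multiprocessing.Queue(maxsize=maxsize)

    # Fill every slot with "data" that no consumer has picked up yet.
    for i in range(maxsize):
        q.put(("data", i))

    try:
        # Non-blocking put, as in shutdown_worker() in the traceback.
        q.put_nowait(None)
        blocked = False
    except queue.Full:
        blocked = True
    finally:
        # Drain and close the queue so the feeder thread can exit cleanly.
        for _ in range(maxsize):
            q.get(timeout=5)
        q.close()
        q.join_thread()
    return blocked


if __name__ == "__main__":
    print("poison pill blocked:", control_message_blocked())
```

With a separate channel for control messages, the poison pill would not have to wait behind unconsumed data.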

I haven't been able to recreate your failure, so I can't be sure that all of this is correct. But the fact that you can make this code work with a 200x200 array (about 320 KB, or roughly 2.5 megabits) seems to support this theory. Nominal pipe size limits seem to be measured in KB or a few MB at most, depending on the OS and architecture. The fact that this amount of data can get through the pipes isn't surprising, especially when you consider that not all of the 320 KB needs to actually fit in the pipe at once if a consumer is continuously receiving the data. It suggests a reasonable upper bound on the amount of data that you could get serially through a pipe.




Answer 2:


I recently stumbled upon this while debugging a Python 3.6 program that sends several GB of data over the pipes. This is what I found (hoping it could save someone else's time!).

As skrrgwasme said, if the queue manager is unable to acquire a semaphore while sending a poison pill, it raises a queue.Full error. The acquire call on the semaphore is non-blocking, so the manager fails: it cannot send a 'control' command because data and control flow share the same Queue. Note that the links above refer to Python 3.6.0.

Now I was wondering why my queue manager would send the poison pill. There must have been some other failure! Apparently some exception had happened (in some other subprocess? in the parent?), and the queue manager was trying to clean up and shut down all the subprocesses. At this point I was interested in finding this root cause.

Debugging the root cause

I initially tried logging all exceptions in the subprocesses but apparently no explicit error happened there. From issue 3895:

Note that multiprocessing.Pool is also broken when a result fails at unpickle.

It seems that the multiprocessing module is broken in Python 3.6, in that it won't catch and handle a serialization error correctly.

Unfortunately, due to time constraints I didn't manage to replicate and verify the problem myself, preferring to jump to the action points and better programming practices (don't send all that data through pipes :). Here's a couple of ideas:

  1. Try to pickle the data supposed to run through the pipes. Due to the huge nature of my data (hundreds of GBs) and time constraints, I didn't manage to find which records were unserializable.
  2. Put a debugger into python3.6 and print the original exception.
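
The first idea can be sketched as a round-trip check run in the parent before anything is submitted to the pool. This is a minimal sketch, assuming the records fit in memory one at a time; find_unpicklable is a hypothetical helper name, and catching Exception broadly is a deliberate choice because pickling failures surface as several different exception types.

```python
import pickle


def find_unpicklable(records):
    """Return (index, error) pairs for records that fail a pickle round trip."""
    failures = []
    for index, record in enumerate(records):
        try:
            # Serialize and deserialize, since either direction can fail.
            pickle.loads(pickle.dumps(record))
        except Exception as exc:
            failures.append((index, repr(exc)))
    return failures


if __name__ == "__main__":
    sample = [1, "text", lambda x: x, [1, 2, 3]]  # the lambda cannot be pickled
    for index, error in find_unpicklable(sample):
        print(f"record {index} failed: {error}")
```

For very large inputs, running this over a sample of the records first may be more practical than checking everything.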

Action points

  1. Remodel your program to send less data through the pipes if possible.

  2. After reading issue 3895 it appears the problem arises with pickling errors. An alternative (and good programming practice) could be to transfer the data using different means. For example one could have the subprocesses write to files and return the paths to the parent process (this would be just a small string, probably a few bytes).

  3. Wait for future Python versions. Apparently this was fixed at Python version tag v3.7.0b3 in the context of issue 3895. The Full exception will be handled inside shutdown_worker. The current maintenance version of Python at the time of writing is 3.6.5.
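
The file-based approach from the second action point might look like the sketch below. It is an illustration under assumptions, not a drop-in fix: compute_to_file, load_result, and run_in_pool are hypothetical names, the list computation stands in for real heavy work such as an SVD, and pickle files stand in for whatever format suits the data (np.save would be a natural choice for ndarrays). Only small seeds and short path strings cross the pipes.

```python
import os
import pickle
import tempfile
from concurrent.futures import ProcessPoolExecutor


def compute_to_file(seed, out_dir):
    """Worker: do the heavy work, write the result to disk, return its path."""
    # Stand-in for an expensive computation that yields a large result.
    result = [(seed * i) % 97 for i in range(100_000)]
    fd, path = tempfile.mkstemp(suffix=".pkl", dir=out_dir)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(result, f)
    return path  # a short string, cheap to send back through the pipe


def load_result(path):
    """Parent side: read a worker's result back from disk."""
    with open(path, "rb") as f:
        return pickle.load(f)


def run_in_pool(seeds, out_dir):
    """Run one job per seed; only seeds and paths travel between processes."""
    with ProcessPoolExecutor() as pool:
        return list(pool.map(compute_to_file, seeds, [out_dir] * len(seeds)))
```

A caller would invoke run_in_pool([1, 2, 3], some_directory) from under an if __name__ == "__main__": guard and then call load_result on each returned path. The same idea applies to inputs: passing a seed or a file path and building the large array inside the worker keeps the big payload out of the call queue as well.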



Source: https://stackoverflow.com/questions/31552716/multiprocessing-queue-full
