Multiprocessing - Pipe vs Queue

滥情空心 2020-11-28 00:53

What are the fundamental differences between queues and pipes in Python's multiprocessing package?

In what scenarios should one choose one over the other? When is

2 Answers
  • 2020-11-28 01:19
    • A Pipe() can only have two endpoints.

    • A Queue() can have multiple producers and consumers.

    When to use them

    If you need more than two points to communicate, use a Queue().

    If you need absolute performance, a Pipe() is much faster because Queue() is built on top of Pipe().
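
    As a minimal sketch of the multiple-producer case (not part of the original answer), the example below has three producer processes share one Queue() while the parent collects everything. The names producer, collect, and SENTINEL are hypothetical, and the sketch assumes each producer signals completion by putting one sentinel on the queue.

```python
from multiprocessing import Process, Queue

SENTINEL = None  # hypothetical end-of-stream marker, one per producer


def producer(queue, producer_id, count):
    """Put `count` tagged items on the shared queue, then a sentinel."""
    for n in range(count):
        queue.put((producer_id, n))
    queue.put(SENTINEL)


def collect(queue, n_producers):
    """Read from the queue until every producer has sent its sentinel."""
    items, finished = [], 0
    while finished < n_producers:
        msg = queue.get()
        if msg is SENTINEL:
            finished += 1
        else:
            items.append(msg)
    return items


if __name__ == "__main__":
    queue = Queue()
    producers = [Process(target=producer, args=(queue, pid, 5))
                 for pid in range(3)]
    for p in producers:
        p.start()
    results = collect(queue, n_producers=len(producers))
    for p in producers:          # join only after the queue has been drained
        p.join()
    print("collected {0} items".format(len(results)))
```

    A single Pipe() end should not be shared like this: the multiprocessing documentation warns that data in a pipe may become corrupted if two processes or threads read from or write to the same end at the same time.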

    Performance Benchmarking

    Let's assume you want to spawn two processes and send messages between them as quickly as possible. These are the timing results of a drag race between similar tests using Pipe() and Queue(). This was run on a ThinkPad T61 running Ubuntu 11.10 and Python 2.7.2.

    FYI, I threw in results for JoinableQueue() as a bonus; JoinableQueue() accounts for tasks when queue.task_done() is called (it doesn't even know about the specific task, it just counts unfinished tasks in the queue), so that queue.join() knows the work is finished.
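
    The counting behaviour described above can be sketched as follows. This is an illustration under assumptions, not the benchmark code: handle_one and worker are hypothetical names, and the "work" is a placeholder multiplication.

```python
from multiprocessing import Process, JoinableQueue


def handle_one(queue):
    """Take one item, 'process' it, and mark it finished."""
    item = queue.get()
    try:
        return item * 2          # placeholder for real work
    finally:
        queue.task_done()        # decrements the unfinished-task counter


def worker(queue):
    """Run forever; started as a daemon so it dies with the parent."""
    while True:
        handle_one(queue)


if __name__ == "__main__":
    queue = JoinableQueue()
    Process(target=worker, args=(queue,), daemon=True).start()
    for item in range(10):
        queue.put(item)          # each put() increments the counter
    queue.join()                 # returns once the counter is back to zero
    print("all tasks accounted for")
```

    Because only a counter is kept, calling task_done() more often than put() raises ValueError rather than silently going negative.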

    The benchmark code for each is at the bottom of this answer.

    mpenning@mpenning-T61:~$ python multi_pipe.py 
    Sending 10000 numbers to Pipe() took 0.0369849205017 seconds
    Sending 100000 numbers to Pipe() took 0.328398942947 seconds
    Sending 1000000 numbers to Pipe() took 3.17266988754 seconds
    mpenning@mpenning-T61:~$ python multi_queue.py 
    Sending 10000 numbers to Queue() took 0.105256080627 seconds
    Sending 100000 numbers to Queue() took 0.980564117432 seconds
    Sending 1000000 numbers to Queue() took 10.1611330509 seconds
    mpenning@mpenning-T61:~$ python multi_joinablequeue.py 
    Sending 10000 numbers to JoinableQueue() took 0.172781944275 seconds
    Sending 100000 numbers to JoinableQueue() took 1.5714070797 seconds
    Sending 1000000 numbers to JoinableQueue() took 15.8527247906 seconds
    mpenning@mpenning-T61:~$
    

    In summary, Pipe() is about three times faster than a Queue(). Don't even think about the JoinableQueue() unless you really must have its benefits.

    BONUS MATERIAL 2

    Multiprocessing introduces subtle changes in information flow that make debugging hard unless you know some shortcuts. For instance, you might have a script that works fine when indexing through a dictionary under many conditions, but infrequently fails with certain inputs.

    Normally we get clues to the failure when the entire python process crashes; however, you don't get unsolicited crash tracebacks printed to the console if the multiprocessing function crashes. Tracking down unknown multiprocessing crashes is hard without a clue to what crashed the process.

    The simplest way I have found to track down multiprocessing crash information is to wrap the entire multiprocessing function in a try / except and use traceback.print_exc():

    import traceback

    def run(self, args):
        try:
            # Insert stuff to be multiprocessed here
            return args[0]['that']
        except Exception:
            # Report the failure from inside the worker before it exits
            print("FATAL: reader({0}) exited while multiprocessing".format(args))
            traceback.print_exc()
    

    Now, when you find a crash you see something like:

    FATAL: reader([{'crash': 'this'}]) exited while multiprocessing
    Traceback (most recent call last):
      File "foo.py", line 19, in __init__
        self.run(args)
      File "foo.py", line 46, in run
        KeyError: 'that'
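
    The same idea can be packaged as a decorator so that any worker function gets the wrapper. This is only a sketch: log_crashes and reader are hypothetical names, and unlike the snippet above it re-raises the exception after printing, so the failure is not swallowed.

```python
import functools
import sys
import traceback


def log_crashes(func):
    """Print a FATAL line and the traceback if `func` raises, then re-raise."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            print("FATAL: {0}{1} exited while multiprocessing".format(
                func.__name__, args), file=sys.stderr)
            traceback.print_exc()
            raise
    return wrapper


@log_crashes
def reader(args):
    # Stand-in for the real work done in the child process
    return args[0]['that']
```

    A function decorated this way can be passed as the target of a Process as usual, for example Process(target=reader, args=([{'that': 1}],)).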
    

    Source Code:


    """
    multi_pipe.py
    """
    from multiprocessing import Process, Pipe
    import time
    
    def reader_proc(pipe):
        ## Read from the pipe; this will be spawned as a separate Process
        p_output, p_input = pipe
        p_input.close()    # We are only reading
        while True:
            msg = p_output.recv()    # Read from the output pipe and do nothing
            if msg=='DONE':
                break
    
    def writer(count, p_input):
        for ii in range(0, count):
            p_input.send(ii)             # Write 'count' numbers into the input pipe
        p_input.send('DONE')
    
    if __name__=='__main__':
        for count in [10**4, 10**5, 10**6]:
            # Pipe() is duplex by default; this test only sends one way:  p_input ------> p_output
            p_output, p_input = Pipe()  # writer() writes to p_input from _this_ process
            reader_p = Process(target=reader_proc, args=((p_output, p_input),))
            reader_p.daemon = True
            reader_p.start()     # Launch the reader process
    
            p_output.close()       # We no longer need this part of the Pipe()
            _start = time.time()
            writer(count, p_input) # Send a lot of stuff to reader_proc()
            p_input.close()
            reader_p.join()
            print("Sending {0} numbers to Pipe() took {1} seconds".format(count,
                (time.time() - _start)))
    

    """
    multi_queue.py
    """
    
    from multiprocessing import Process, Queue
    import time
    import sys
    
    def reader_proc(queue):
        ## Read from the queue; this will be spawned as a separate Process
        while True:
            msg = queue.get()         # Read from the queue and do nothing
            if (msg == 'DONE'):
                break
    
    def writer(count, queue):
        ## Write to the queue
        for ii in range(0, count):
            queue.put(ii)             # Write 'count' numbers into the queue
        queue.put('DONE')
    
    if __name__=='__main__':
        pqueue = Queue() # writer() writes to pqueue from _this_ process
        for count in [10**4, 10**5, 10**6]:             
            ### reader_proc() reads from pqueue as a separate process
            reader_p = Process(target=reader_proc, args=((pqueue),))
            reader_p.daemon = True
            reader_p.start()        # Launch reader_proc() as a separate python process
    
            _start = time.time()
            writer(count, pqueue)    # Send a lot of stuff to reader()
            reader_p.join()         # Wait for the reader to finish
            print("Sending {0} numbers to Queue() took {1} seconds".format(count, 
                (time.time() - _start)))
    

    """
    multi_joinablequeue.py
    """
    from multiprocessing import Process, JoinableQueue
    import time
    
    def reader_proc(queue):
        ## Read from the queue; this will be spawned as a separate Process
        while True:
            msg = queue.get()         # Read from the queue and do nothing
            queue.task_done()
    
    def writer(count, queue):
        for ii in range(0, count):
            queue.put(ii)             # Write 'count' numbers into the queue
    
    if __name__=='__main__':
        for count in [10**4, 10**5, 10**6]:
            jqueue = JoinableQueue() # writer() writes to jqueue from _this_ process
            # reader_proc() reads from jqueue as a different process...
            reader_p = Process(target=reader_proc, args=((jqueue),))
            reader_p.daemon = True
            reader_p.start()     # Launch the reader process
            _start = time.time()
            writer(count, jqueue) # Send a lot of stuff to reader_proc() (in different process)
            jqueue.join()         # Wait for the reader to finish
            print("Sending {0} numbers to JoinableQueue() took {1} seconds".format(count, 
                (time.time() - _start)))
    
  • 2020-11-28 01:24

    One additional feature of Queue() worth noting is the feeder thread. The multiprocessing documentation notes: "When a process first puts an item on the queue a feeder thread is started which transfers objects from a buffer into the pipe." Because of this buffer, many items (up to maxsize, if you set one) can be put on a Queue() without queue.put() blocking, even when nothing is reading yet. This lets you store multiple items in a Queue() until your program is ready to process them.
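
    A small sketch of that buffering, run in a single process so that no reader exists while the items are being put. The function names are hypothetical, and 10,000 small integers is an arbitrary count chosen only to exceed a typical OS pipe buffer.

```python
from multiprocessing import Queue
from queue import Full


def buffer_then_drain(n):
    """Put n items with no reader running, then read them all back."""
    q = Queue()
    for i in range(n):
        q.put(i)      # goes into the in-process buffer; the feeder thread
                      # moves items into the pipe in the background
    return [q.get() for _ in range(n)]


def third_put_is_rejected():
    """With maxsize=2, a third non-blocking put raises queue.Full."""
    q = Queue(maxsize=2)
    q.put_nowait("a")
    q.put_nowait("b")
    try:
        q.put_nowait("c")
    except Full:
        rejected = True
    else:
        rejected = False
    q.get()           # drain so nothing is left behind at exit
    q.get()
    return rejected


if __name__ == "__main__":
    print(len(buffer_then_drain(10000)))
    print(third_put_is_rejected())
```

    Items come back in the order they were put, and a blocking put() on the full queue would wait instead of raising.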

    Pipe(), on the other hand, has a finite amount of storage for items that have been sent to one connection but not yet received from the other. Once this storage is used up, calls to connection.send() block until there is space to write the entire item. This stalls the thread doing the writing until some other thread or process reads from the pipe.

    Connection objects give you access to the underlying file descriptor. On *nix systems, you can prevent connection.send() calls from blocking by using the os.set_blocking() function. However, this will cause problems if you try to send a single item that does not fit in the pipe's buffer. Recent versions of Linux allow you to increase the size of that buffer, but the maximum size allowed varies with system configuration. You should therefore never rely on Pipe() to buffer data: calls to connection.send() could block until the data gets read from the pipe somewhere else.
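
    The blocking behaviour can be observed with a writer thread and a deliberately late reader. This is a sketch under assumptions: the function name is hypothetical, and the 8 MiB payload is assumed to be larger than the operating system's pipe buffer, which holds on typical Linux defaults.

```python
import threading
from multiprocessing import Pipe


def send_blocks_until_read(payload_size=8 * 1024 * 1024, wait=0.5):
    """Return (was the writer still blocked after `wait` s?, bytes received)."""
    receiver, sender = Pipe(duplex=False)   # one-way: sender ---> receiver
    writer = threading.Thread(
        target=sender.send_bytes,
        args=(b"x" * payload_size,),
        daemon=True,
    )
    writer.start()
    writer.join(timeout=wait)               # nobody is reading yet
    blocked = writer.is_alive()             # still stuck inside send_bytes()
    data = receiver.recv_bytes()            # reading frees the writer
    writer.join()
    return blocked, len(data)


if __name__ == "__main__":
    print(send_blocks_until_read())
```

    With a Queue() in place of the Pipe(), the put() would return at once and the feeder thread would be the one left waiting on the pipe.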

    In conclusion, Queue() is a better choice than Pipe() when you need to buffer data, even if you only need to communicate between two points.
