Question
I want to download thousands of files from S3. To speed up the process I tried Python's multiprocessing.Pool, but the performance is very unreliable. Sometimes it works and is much faster than the single-core version, but often some files take several seconds, so the multiprocessing run ends up taking longer than the single-process one. A few times I even got an ssl.SSLError: The read operation timed out.
What could be the reason for that?
from time import time
from boto.s3.connection import S3Connection
from boto.s3.key import Key
from multiprocessing import Pool
import pickle

access_key = 'xxx'   # placeholder
secret_key = 'xxx'   # placeholder
bucket_name = 'xxx'  # placeholder

path_list = pickle.load(open('filelist.pickle', 'r'))
conn = S3Connection(access_key, secret_key)
bucket = conn.get_bucket(bucket_name)

def read_file_from_s3(path):
    starttime = time()
    k = Key(bucket)
    k.key = path
    content = k.get_contents_as_string()
    print int((time() - starttime) * 1000)  # per-file download time in ms
    return content

pool = Pool(32)  # create the pool after defining the function it will run
results = pool.map(read_file_from_s3, path_list)
# or results = map(read_file_from_s3, path_list) for a single-process comparison
pool.close()
pool.join()
[Update] I ended up only adding timeouts with retry (imap + .next(timeout)) to my multiprocessing code, but only because I did not want to change too much at the moment. If you want to do it right, use Jan-Philip's approach with gevent.
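For reference, a minimal sketch of that timeout-plus-retry idea. Instead of driving imap's .next(timeout) by hand, this uses the equivalent apply_async/get(timeout) pattern, which is a bit easier to retry from; the function name, timeout, and retry count are arbitrary placeholders, and read_file_from_s3 and pool are the ones defined above:

from multiprocessing import TimeoutError

def fetch_with_retry(pool, paths, timeout=10.0, max_retries=2):
    # Submit every download up front, then collect each result with a
    # per-file timeout; paths that time out are resubmitted. Note that
    # get(timeout) only stops waiting -- the slow worker call itself
    # keeps running in the background.
    pending = [(p, pool.apply_async(read_file_from_s3, (p,)), 0) for p in paths]
    results = {}
    while pending:
        path, async_res, attempts = pending.pop(0)
        try:
            results[path] = async_res.get(timeout)
        except TimeoutError:
            if attempts < max_retries:
                pending.append((path, pool.apply_async(read_file_from_s3, (path,)), attempts + 1))
            else:
                results[path] = None  # give up on this file
    return results

# results = fetch_with_retry(pool, path_list)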
Answer 1:
"What could be the reason for that?"
Not enough detail. One reason could be that your private Internet connection is saturated by too many concurrent connections. But since you did not specify in which environment you execute this piece of code, this is pure speculation.
What is not speculation, however, is that your approach to this problem is very inefficient. multiprocessing is for solving CPU-bound problems. Retrieving data via multiple TCP connections at once is not a CPU-bound problem. Spawning one process per TCP connection is a waste of resources.
The reason this seems slow is that, in your case, each process spends most of its time waiting for system calls to return (while the operating system, in turn, waits on the networking stack, and the networking stack waits for packets to arrive over the wire).
You do not need multiple processes to make your computer spend less time waiting. You do not even need multiple threads. You can pull data from many TCP connections within a single OS-level thread, using cooperative scheduling. In Python, this is often done with greenlets. A higher-level module building on greenlets is gevent.
The web is full of gevent-based examples for firing off many HTTP requests concurrently. Given a proper Internet connection, a single OS-level thread can deal with hundreds, thousands, or even tens of thousands of concurrent connections. At those orders of magnitude, the problem becomes either I/O-bound or CPU-bound, depending on the exact purpose of your application: either the network connection, the CPU-memory bus, or a single CPU core limits your application.
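As a rough illustration only, here is what the question's boto code might look like on top of gevent. The monkey patch must run before boto is imported, so that its httplib/ssl calls become cooperative; the pool size of 50 is an arbitrary cap, and the credentials are placeholders as in the question:

import gevent.monkey
gevent.monkey.patch_all()  # must happen before boto is imported

import pickle
from gevent.pool import Pool
from boto.s3.connection import S3Connection
from boto.s3.key import Key

access_key = 'xxx'   # placeholders, as in the question
secret_key = 'xxx'
bucket_name = 'xxx'

conn = S3Connection(access_key, secret_key)
bucket = conn.get_bucket(bucket_name)

def read_file_from_s3(path):
    k = Key(bucket)
    k.key = path
    return k.get_contents_as_string()

path_list = pickle.load(open('filelist.pickle', 'r'))
pool = Pool(50)  # cap the number of concurrent downloads
results = pool.map(read_file_from_s3, path_list)  # one greenlet per file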
Regarding errors like ssl.SSLError: The read operation timed out: in the world of networking, you have to account for such things happening from time to time and decide (depending on the details of your application) how you want to deal with them. Often, a simple retry attempt is a good solution.
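A minimal retry wrapper might look like this; the exception caught, the attempt count, and the backoff are assumptions you would tune for your workload:

import ssl
from time import sleep

def with_retries(func, path, attempts=3, backoff=1.0):
    # Retry transient network failures a few times with a growing pause.
    # attempts and backoff are arbitrary placeholders, not recommendations.
    for i in range(attempts):
        try:
            return func(path)
        except ssl.SSLError:
            if i == attempts - 1:
                raise  # give up after the last attempt
            sleep(backoff * (i + 1))  # simple linear backoff

# e.g. content = with_retries(read_file_from_s3, some_path)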
Source: https://stackoverflow.com/questions/27016077/unreliable-performance-downloading-files-from-s3-with-boto-and-multiprocessing-p