I was looking for a way to optimize my code when I heard good things about threads and urllib3. Apparently, people disagree about which solution is best.
Consider using something like workerpool. Referring to the Mass Downloader example, combining it with urllib3 would look something like this:
import workerpool
import urllib3

URL_LIST = []  # Fill this from somewhere

NUM_SOCKETS = 3
NUM_WORKERS = 5

# We want a few more workers than sockets so that they have extra
# time to parse things and such.

http = urllib3.PoolManager(maxsize=NUM_SOCKETS)
workers = workerpool.WorkerPool(size=NUM_WORKERS)

class MyJob(workerpool.Job):
    def __init__(self, url):
        self.url = url

    def run(self):
        r = http.request('GET', self.url)
        # ... do parsing stuff here

for url in URL_LIST:
    workers.put(MyJob(url))

# Send shutdown jobs to all threads, and wait until all the jobs
# have been completed. (If you don't do this, the script might hang
# due to a rogue undead thread.)
workers.shutdown()
workers.wait()
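If you want a concrete picture of what the elided parsing step could look like, here's a minimal sketch. The TitleJob name, the title regex, and the results queue are my own illustrative assumptions, not part of workerpool or urllib3:

import queue
import re

results = queue.Queue()  # Thread-safe container for parsed output

class TitleJob(workerpool.Job):
    def __init__(self, url):
        self.url = url

    def run(self):
        r = http.request('GET', self.url)
        # Hypothetical parsing step: pull out the <title> tag, if any.
        match = re.search(rb'<title>(.*?)</title>', r.data,
                          re.IGNORECASE | re.DOTALL)
        if match:
            results.put((self.url, match.group(1).strip()))

Using a queue.Queue rather than a plain list keeps the result collection safe when multiple worker threads finish at the same time.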
You may note from the Mass Downloader examples that there are multiple ways of doing this. I chose this particular example just because it's less magical, but any of the other strategies are also valid; one alternative is sketched below.
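For instance, here is a rough sketch of the same download loop using the standard library's concurrent.futures instead of workerpool. The fetch function is my own naming, and this is an assumption-laden sketch rather than one of the Mass Downloader examples:

import concurrent.futures
import urllib3

URL_LIST = []  # Fill this from somewhere

http = urllib3.PoolManager(maxsize=3)

def fetch(url):
    # All threads share the same pool manager, so it keeps up to
    # 3 sockets per host around, as in the workerpool version.
    r = http.request('GET', url)
    # ... do parsing stuff here
    return r.status

# The with-block waits for every submitted job to finish, which takes
# the place of the explicit shutdown()/wait() calls above.
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    list(executor.map(fetch, URL_LIST))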
Disclaimer: I am the author of both urllib3 and workerpool.