Urllib2 & BeautifulSoup: Nice couple but too slow - urllib3 & threads?

醉酒成梦  2021-01-31 06:38

I was looking to find a way to optimize my code when I heard some good things about threads and urllib3. Apparently, people disagree which solution is the best.

The pro

3 Answers
  •  星月不相逢
    2021-01-31 07:19

    Consider using something like workerpool. Referring to the Mass Downloader example, combining it with urllib3 would look something like this:

    import workerpool
    import urllib3
    
    URL_LIST = [] # Fill this from somewhere
    
    NUM_SOCKETS = 3
    NUM_WORKERS = 5
    
    # We want a few more workers than sockets so that they have extra
    # time to parse things and such.
    
    http = urllib3.PoolManager(maxsize=NUM_SOCKETS)
    workers = workerpool.WorkerPool(size=NUM_WORKERS)
    
    class MyJob(workerpool.Job):
        def __init__(self, url):
            self.url = url
    
        def run(self):
            r = http.request('GET', self.url)
            # ... do parsing stuff here
    
    
    for url in URL_LIST:
        workers.put(MyJob(url))
    
    # Send shutdown jobs to all threads, and wait until all the jobs have been completed
    # (If you don't do this, the script might hang due to a rogue undead thread.)
    workers.shutdown()
    workers.wait()
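
    Since the question pairs the fetching with BeautifulSoup, the "do parsing stuff" placeholder in run() above might be filled in along these lines. This is only a sketch: the bs4 import and the link-collecting step are illustrative assumptions, not part of the pattern itself.

    from bs4 import BeautifulSoup  # assumes beautifulsoup4 is installed

    class MyParsingJob(workerpool.Job):  # hypothetical variant of MyJob above
        def __init__(self, url):
            self.url = url

        def run(self):
            r = http.request('GET', self.url)
            # r.data holds the raw response body as bytes.
            soup = BeautifulSoup(r.data, 'html.parser')
            # Illustrative step: collect every hyperlink on the page.
            links = [a.get('href') for a in soup.find_all('a')]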
    

    You may note from the Mass Downloader examples that there are multiple ways of doing this. I chose this particular one because it's less magical, but any of the other strategies are also valid.
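
    For instance, a comparable sketch using only the standard library's concurrent.futures in place of workerpool (assuming Python 3.2+; the fetch helper is illustrative):

    import concurrent.futures
    import urllib3

    URL_LIST = []  # Fill this from somewhere

    http = urllib3.PoolManager(maxsize=3)

    def fetch(url):
        # Each call borrows a connection from the shared pool.
        return http.request('GET', url)

    # The executor replaces the WorkerPool; the with-block handles
    # shutdown and waiting for outstanding jobs.
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        for response in executor.map(fetch, URL_LIST):
            pass  # ... do parsing stuff here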

    Disclaimer: I am the author of both urllib3 and workerpool.
