Urllib2 & BeautifulSoup: Nice couple but too slow - urllib3 & threads?

醉酒成梦  2021-01-31 06:38

I was looking to find a way to optimize my code when I heard some good things about threads and urllib3. Apparently, people disagree which solution is the best.

The pro

3 Answers
  •  星月不相逢
    2021-01-31 07:19

    Consider using something like workerpool. Referring to the Mass Downloader example, combining it with urllib3 would look something like this:

    import workerpool
    import urllib3
    
    URL_LIST = [] # Fill this from somewhere
    
    NUM_SOCKETS = 3
    NUM_WORKERS = 5
    
    # We want a few more workers than sockets so that they have extra
    # time to parse things and such.
    
    http = urllib3.PoolManager(maxsize=NUM_SOCKETS)
    workers = workerpool.WorkerPool(size=NUM_WORKERS)
    
    class MyJob(workerpool.Job):
        def __init__(self, url):
            self.url = url
    
        def run(self):
            r = http.request('GET', self.url)
            # ... do parsing stuff here
    
    
    for url in URL_LIST:
        workers.put(MyJob(url))
    
    # Send shutdown jobs to all threads, and wait until all the jobs have been completed
    # (If you don't do this, the script might hang due to a rogue undead thread.)
    workers.shutdown()
    workers.wait()
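
    Since the question pairs the fetching with BeautifulSoup, the "do parsing stuff" placeholder in run() above might be filled in along these lines. This is only a sketch: the bs4 import and the link-collecting step are illustrative assumptions, not part of the pattern itself.

    from bs4 import BeautifulSoup  # assumes beautifulsoup4 is installed

    class MyParsingJob(workerpool.Job):  # hypothetical variant of MyJob above
        def __init__(self, url):
            self.url = url

        def run(self):
            r = http.request('GET', self.url)
            # r.data holds the raw response body as bytes.
            soup = BeautifulSoup(r.data, 'html.parser')
            # Illustrative step: collect every hyperlink on the page.
            links = [a.get('href') for a in soup.find_all('a')]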
    

    You may note from the Mass Downloader examples that there are multiple ways of doing this. I chose this particular one because it's less magical, but any of the other strategies are also valid.
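
    For instance, a comparable sketch using only the standard library's concurrent.futures in place of workerpool (assuming Python 3.2+; the fetch helper is illustrative):

    import concurrent.futures
    import urllib3

    URL_LIST = []  # Fill this from somewhere

    http = urllib3.PoolManager(maxsize=3)

    def fetch(url):
        # Each call borrows a connection from the shared pool.
        return http.request('GET', url)

    # The executor replaces the WorkerPool; the with-block handles
    # shutdown and waiting for outstanding jobs.
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        for response in executor.map(fetch, URL_LIST):
            pass  # ... do parsing stuff here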

    Disclaimer: I am the author of both urllib3 and workerpool.
