Speeding up urllib.urlretrieve

Posted by 笑着哭i on 2019-12-06 12:49:37

My code above was very naive, as I did not take advantage of multi-threading. It obviously takes time for URL requests to be answered, but there is no reason why the computer cannot make further requests while the proxy server responds.

With the following adjustments you can improve efficiency by around 10x, and there are further ways to improve efficiency, with packages such as scrapy.

To add multi-threading, do something like the following, using the multiprocessing package:

1) Encapsulate the URL retrieval in a function:

import urllib.request

def geturl(link, i):
    # Save the resource at `link` as "<i>.jpg"; skip it if the download fails.
    try:
        urllib.request.urlretrieve(link, str(i) + ".jpg")
    except Exception:
        pass

2) Then create a collection with all the URLs, as well as the names you want for the downloaded pictures:

urls = [url1, url2, url3, urln]  # url1 ... urln stand for your actual URL strings
names = list(range(len(urls)))

3) Import the Pool class from the multiprocessing.dummy module (the thread-based version of multiprocessing) and create a pool object from it (obviously, in a real program you would put all imports at the top of your code):

from multiprocessing.dummy import Pool as ThreadPool
pool = ThreadPool(100)

Then use the pool.starmap() method, passing it the function and an iterable of argument tuples for that function:

results = pool.starmap(geturl, zip(urls, names))

Note: pool.starmap() is available only in Python 3 (3.3 and later).
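Putting the three steps together, a minimal sketch could look like the following. The download_all helper, the fetch parameter (there only so the download call can be swapped out for testing), the True/False return values, and the default of 8 workers are all assumptions added for illustration; they are not part of the steps above.

```python
import urllib.request
from multiprocessing.dummy import Pool as ThreadPool  # thread-based Pool


def geturl(link, name, fetch=urllib.request.urlretrieve):
    """Download one link to '<name>.jpg'. Return True on success, False on failure."""
    try:
        fetch(link, str(name) + ".jpg")
        return True
    except Exception:
        return False


def download_all(urls, workers=8, fetch=urllib.request.urlretrieve):
    """Download every URL concurrently; files are named 0.jpg, 1.jpg, ..."""
    # One argument tuple per call to geturl(link, name, fetch).
    jobs = [(link, name, fetch) for name, link in enumerate(urls)]
    if not jobs:
        return []
    # Never start more threads than there are jobs.
    with ThreadPool(min(workers, len(jobs))) as pool:
        # starmap blocks until every job has finished and keeps input order.
        return pool.starmap(geturl, jobs)
```

Calling download_all(urls) with your own list of URL strings returns one boolean per URL, in the same order, so failed downloads can be retried instead of being silently dropped.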

When a program enters I/O wait, the execution is paused so that the kernel can perform the low-level operations associated with the I/O request (this is called a context switch) and is not resumed until the I/O operation is completed.

Context switching is quite a heavy operation. It requires us to save the state of our program (losing any sort of caching we had at the CPU level) and give up the use of the CPU. Later, when we are allowed to run again, we must spend time reinitializing our program on the motherboard and getting ready to resume (of course, all this happens behind the scenes).
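A rough way to see this pause from inside Python is to compare wall-clock time with CPU time around a blocking call. This is only a sketch: time.sleep stands in for a real network request, and the names fake_io_request and measure are made up for the illustration.

```python
import time


def fake_io_request(seconds):
    # Stand-in for a blocking network call: the thread is parked, not computing.
    time.sleep(seconds)


def measure(n_requests=3, wait=0.1):
    """Run the fake requests one after another; return (wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()   # elapsed real time
    cpu_start = time.process_time()    # CPU time used by this process (excludes sleep)
    for _ in range(n_requests):
        fake_io_request(wait)
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu
```

With the default arguments the wall-clock figure should be close to the sum of the waits, while the CPU figure stays near zero: the program spends almost all of its time paused rather than working.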

With concurrency, on the other hand, we typically have a thing called an “event loop” running that manages what gets to run in our program, and when. In essence, an event loop is simply a list of functions that need to be run. The function at the top of the list gets run, then the next, etc.

The following shows a simple example of an event loop:

from Queue import Queue
from functools import partial

eventloop = None

class EventLoop(Queue):
    def start(self):
        while True:
            function = self.get()
            function()

def do_hello():
    global eventloop
    print "Hello"
    eventloop.put(do_world)

def do_world():
    global eventloop
    print "world"
    eventloop.put(do_hello)

if __name__ == "__main__":
    eventloop = EventLoop()
    eventloop.put(do_hello)
    eventloop.start()
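The listing above is written for Python 2 (the Queue module and the print statement). A rough Python 3 adaptation is sketched below; it is not taken from the book. The max_steps bound and the run helper are additions made here so that the loop terminates instead of running forever.

```python
from queue import Queue

eventloop = None


class EventLoop(Queue):
    def start(self, max_steps=4):
        # Run at most max_steps queued functions, oldest first.
        for _ in range(max_steps):
            function = self.get()
            function()


def do_hello():
    print("Hello")
    eventloop.put(do_world)  # schedule the next function


def do_world():
    print("world")
    eventloop.put(do_hello)  # schedule the next function


def run(max_steps=4):
    """Create the loop, seed it with do_hello, and run a bounded number of steps."""
    global eventloop
    eventloop = EventLoop()
    eventloop.put(do_hello)
    eventloop.start(max_steps)
```

Calling run(4) prints Hello, world, Hello, world on separate lines. Each function finishes quickly and hands control back to the loop by queuing its successor, which is the behaviour the event loop description relies on.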

If the above seems like something you may use, and you'd also like to see how gevent, tornado, and AsyncIO can help with your issue, then head to your (university) library, check out High Performance Python by Micha Gorelick and Ian Ozsvald, and read pp. 181-202.

Note: the book's event loop listing and the accompanying text above are taken from the book mentioned.
