Speeding up urllib.urlretrieve

Question

I am downloading pictures from the internet and, as it turns out, I need to download a lot of them. I am using a version of the following code fragment (in practice, I loop over the links I intend to download and save each picture):

import urllib
urllib.urlretrieve(link, filename)

I am downloading roughly 1000 pictures every 15 minutes, which is awfully slow given the number of pictures I need to download.

For efficiency, I set a timeout of 5 seconds (still, many downloads take much longer):

import socket
socket.setdefaulttimeout(5)

Besides running a job on a computer cluster to parallelize downloads, is there a way to make the picture download faster / more efficient?

Answer 1:

My code above was very naive, as I did not take advantage of multi-threading. It obviously takes time for a URL request to be answered, but there is no reason why the computer cannot make further requests while it waits for the server to respond.

With the following adjustments, you can improve efficiency by about 10x, and there are further ways to improve efficiency, with packages such as scrapy.

To add multi-threading, do something like the following, using the multiprocessing package:

1) Encapsulate the URL retrieval in a function:

import urllib.request

def geturl(link, i):
    # Save the picture at `link` as "<i>.jpg"; skip it silently if the download fails.
    try:
        urllib.request.urlretrieve(link, str(i) + ".jpg")
    except Exception:
        pass

2) Then create a collection with all the URLs, as well as the names you want for the downloaded pictures:

urls = [url1, url2, url3, urln]  # url1 ... urln are placeholders for your links
names = list(range(len(urls)))   # 0, 1, 2, ... used as file names

3) Import the Pool class from multiprocessing.dummy (a thread-based version of the multiprocessing API) and create a pool object from it (in a real program, you would of course put all imports at the top of your code):

from multiprocessing.dummy import Pool as ThreadPool
pool = ThreadPool(100)

Then use the pool.starmap() method, passing it the function and an iterable of argument tuples for that function:

results = pool.starmap(geturl, zip(urls, names))

Note: pool.starmap() works only in Python 3 (it was added in Python 3.3).
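
Putting the three steps together, a minimal sketch might look like the following. It is only an illustration, not a definitive implementation: the download_all helper, the retrieve parameter (added so the download function can be swapped out, for example to try the code without network access), the True/False success flags, and the worker count of 20 are assumptions that are not part of the steps above.

```python
import urllib.request
from multiprocessing.dummy import Pool as ThreadPool  # thread-based pool


def geturl(link, name, retrieve=urllib.request.urlretrieve):
    """Download `link` to "<name>.jpg"; return True on success, False on failure."""
    try:
        retrieve(link, str(name) + ".jpg")
        return True
    except Exception:
        # Report the failure to the caller instead of hiding it completely.
        return False


def download_all(urls, workers=20, retrieve=urllib.request.urlretrieve):
    """Download every URL using a pool of `workers` threads (hypothetical helper).

    Returns one success flag per URL, in the same order as `urls`.
    """
    # Each task is a (link, name, retrieve) tuple; names are 0, 1, 2, ...
    tasks = [(link, name, retrieve) for name, link in enumerate(urls)]
    with ThreadPool(workers) as pool:
        # starmap blocks until every task is done and unpacks each tuple
        # into the positional arguments of geturl.
        return pool.starmap(geturl, tasks)
```

Calling results = download_all(urls) would then give a list of flags, so the links that failed can be collected and retried rather than lost. The number of threads is worth tuning: more threads only help while the time is dominated by waiting on the network.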

Answer 2:

When a program enters I/O wait, the execution is paused so that the kernel can perform the low-level operations associated with the I/O request (this is called a context switch) and is not resumed until the I/O operation is completed.

Context switching is quite a heavy operation. It requires us to save the state of our program (losing any sort of caching we had at the CPU level) and give up the use of the CPU. Later, when we are allowed to run again, we must spend time reinitializing our program on the motherboard and getting ready to resume (of course, all this happens behind the scenes).

With concurrency, on the other hand, we typically have a thing called an “event loop” running that manages what gets to run in our program, and when. In essence, an event loop is simply a list of functions that need to be run. The function at the top of the list gets run, then the next, etc.

The following shows a simple example of an event loop:

from Queue import Queue
from functools import partial

eventloop = None

class EventLoop(Queue):
    def start(self):
        while True:
            function = self.get()
            function()

def do_hello():
    global eventloop
    print "Hello"
    eventloop.put(do_world)

def do_world():
    global eventloop
    print "world"
    eventloop.put(do_hello)

if __name__ == "__main__":
    eventloop = EventLoop()
    eventloop.put(do_hello)
    eventloop.start()

If the above seems like something you may use, and you'd also like to see how gevent, tornado, and AsyncIO can help with your issue, then head out to your (university) library, check out High Performance Python by Micha Gorelick and Ian Ozsvald, and read pp. 181-202.

Note: the code and text above are from the book mentioned.
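
The listing above is written for Python 2 (print statements and the Queue module). As a rough Python 3 illustration of the same idea, here is a small sketch that uses the event loop from the standard library's asyncio module. It does not come from the book, and the function names, the file names, and the use of asyncio.sleep as a stand-in for a real network wait are all assumptions made for illustration.

```python
import asyncio


async def fake_download(name, delay):
    """Pretend to download one picture (hypothetical stand-in for real I/O)."""
    # While this coroutine waits, the event loop is free to run the others.
    await asyncio.sleep(delay)
    return name


async def simulate_downloads(names, delay):
    """Start all simulated downloads and wait for them together."""
    coroutines = [fake_download(name, delay) for name in names]
    # gather runs the coroutines concurrently and returns their results
    # in the same order as the inputs.
    return await asyncio.gather(*coroutines)


def run_demo(count=5, delay=0.1):
    """Run the simulation on a fresh event loop and return the file names."""
    names = [f"{i}.jpg" for i in range(count)]
    return asyncio.run(simulate_downloads(names, delay))
```

Because each simulated download yields to the event loop while it waits, the total run time stays close to a single delay rather than the sum of all of them. For real downloads, the sleep would have to be replaced by a non-blocking HTTP client; the urllib calls block, so calling them directly inside a coroutine would stall the event loop.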

Source: https://stackoverflow.com/questions/40166757/speeding-up-urlib-urlretrieve

Tags

python

urllib