Question
In an attempt to learn multithreaded file download I wrote this piece of code:
import urllib2
import os
import sys
import time
import threading

urls = ["http://broadcast.lds.org/churchmusic/MP3/1/2/nowords/271.mp3",
        "http://s1.fans.ge/mp3/201109/08/John_Legend_So_High_Remix(fans_ge).mp3",
        "http://megaboon.com/common/preview/track/786203.mp3"]

url = urls[1]

def downloadFile(url, saveTo=None):
    file_name = url.split('/')[-1]
    if not saveTo:
        saveTo = '/Users/userName/Desktop'
    try:
        u = urllib2.urlopen(url)
    except urllib2.URLError, er:
        print("%s" % er.reason)
    else:
        f = open(os.path.join(saveTo, file_name), 'wb')
        meta = u.info()
        file_size = int(meta.getheaders("Content-Length")[0])
        print "Downloading: %s Bytes: %s" % (file_name, file_size)
        file_size_dl = 0
        block_sz = 8192
        while True:
            buffer = u.read(block_sz)
            if not buffer:
                break
            file_size_dl += len(buffer)
            f.write(buffer)
            status = r"%10d [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
            status = status + chr(8)*(len(status)+1)
            sys.stdout.write('%s\r' % status)
            time.sleep(.2)
            sys.stdout.flush()
        if file_size_dl == file_size:
            print r"Download Completed %s%% for file %s, saved to %s" % (file_size_dl * 100. / file_size, file_name, saveTo,)
        f.close()
        return

def synchronusDownload():
    urls_saveTo = {urls[0]: None, urls[1]: None, urls[2]: None}
    for url, saveTo in urls_saveTo.iteritems():
        th = threading.Thread(target=downloadFile, args=(url, saveTo), name="%s_Download_Thread" % os.path.basename(url))
        th.start()

synchronusDownload()
But it seems that the second download does not start until the first thread has finished; the downloads appear one after another in the shell output as well.
My plan was to start all downloads simultaneously and print the updated progress of each file as it downloads.
Any help will be greatly appreciated. Thanks.
Answer 1:
This is a common problem, and these are the steps typically taken (a minimal sketch of the pattern follows the list):
1.) Use Queue.Queue to create a queue of all the urls you would like to visit.
2.) Create a class that inherits from threading.Thread. It should have a run method that grabs a url from the queue and gets the data.
3.) Create a pool of threads based on your class to be the "workers".
4.) Don't exit the program until queue.join() has completed.
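Below is one minimal sketch of that worker-pool pattern, not a definitive implementation: it is written for Python 2 to match the question, reuses the question's downloadFile function and urls list, and the pool size of 3 is an arbitrary choice.

import Queue
import threading

class DownloadWorker(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.queue = queue
        self.daemon = True                   # workers exit with the main program

    def run(self):
        while True:
            url, saveTo = self.queue.get()   # grab a url from the queue
            try:
                downloadFile(url, saveTo)    # fetch the data (function from the question)
            finally:
                self.queue.task_done()       # mark this queue item as processed

q = Queue.Queue()
for link in urls:                            # urls list from the question
    q.put((link, None))

for _ in range(3):                           # pool of 3 worker threads
    DownloadWorker(q).start()

q.join()                                     # block until every queued url has been processed

Because the workers are daemon threads, the program can only end after q.join() returns, which is exactly the "don't exit until queue.join() has completed" step above.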
Answer 2:
Your functions are actually running in parallel. You can verify this by printing at the start of each function: three outputs will appear as soon as your program starts.
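For example, a tiny standalone snippet (not from the answer; the worker function and sleep time are made up purely for illustration) shows all three threads starting at once:

import threading
import time

def worker(name):
    print "%s started" % name        # all three "started" lines appear immediately
    time.sleep(2)                    # pretend to do some work
    print "%s finished" % name

for i in range(3):
    threading.Thread(target=worker, args=("thread-%d" % i,)).start()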
What's happening is that your files are so small that each one is completely downloaded before the scheduler switches threads. Try putting bigger files in your list:
urls = [
    "http://www.wswd.net/testdownloadfiles/50MB.zip",
    "http://www.wswd.net/testdownloadfiles/20MB.zip",
    "http://www.wswd.net/testdownloadfiles/100MB.zip",
]
Program output:
Downloading: 100MB.zip Bytes: 104857600
Downloading: 20MB.zip Bytes: 20971520
Downloading: 50MB.zip Bytes: 52428800
Download Completed 100.0% for file 20MB.zip, saved to .
Download Completed 100.0% for file 50MB.zip, saved to .
Download Completed 100.0% for file 100MB.zip, saved to .
Source: https://stackoverflow.com/questions/24216760/multithreaded-file-download-in-python-and-updating-in-shell-with-download-progre