I'm creating a Python script which accepts a path to a remote file and a number of threads, n. The file's size will be divided by the number of threads, when each thread c
For "download several files simultaneously", I recommend this article: Practical threaded programming with Python. It provides a simultaneous-download example that combines threads with Queues; I think it's worth reading.
You can use a thread safe "semaphore", like this:
    import threading

    class Counter:
        counter = 0
        _lock = threading.Lock()

        @classmethod
        def inc(cls):
            with cls._lock:  # read-increment-write is not atomic on its own
                cls.counter += 1
                return cls.counter
Each call to Counter.inc() returns the next number, unique across threads, which you can use to keep track of the current block of bytes.
That being said, there's no need to split a file download across several threads, because the download stream is far slower than writing to disk, so each block is written long before the next one has finished downloading.
The best and least resource-hungry way is simply to copy the download stream directly into a file object on disk.
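A minimal sketch of that single-stream approach, assuming the standard library's shutil.copyfileobj does the block-by-block copying. The helper name stream_to_file and the buffer size are illustrative choices, and an in-memory stream stands in for the network response so the example runs offline.

```python
import io
import os
import shutil
import tempfile


def stream_to_file(src, dest_path, bufsize=64 * 1024):
    """Copy a readable binary stream to dest_path in fixed-size blocks."""
    with open(dest_path, "wb") as dest:
        # copyfileobj reads and writes bufsize bytes at a time, so the
        # whole download never has to fit in memory.
        shutil.copyfileobj(src, dest, bufsize)


# For a real download, src would be the HTTP response object, e.g.:
#     from urllib.request import urlopen
#     with urlopen(url) as response:
#         stream_to_file(response, dest_path)
# Here an in-memory stream plays the role of the response.
dest_path = os.path.join(tempfile.mkdtemp(), "download.bin")
stream_to_file(io.BytesIO(b"x" * 200_000), dest_path)
print(os.path.getsize(dest_path))
```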
You could coordinate the work with locks etc., but I recommend instead using Queue -- usually the best way to coordinate multi-threading (and multi-processing) in Python.
I would have the main thread spawn as many worker threads as you think appropriate (you may want to calibrate between performance and load on the remote server by experimenting); every worker thread waits at the same global Queue.Queue instance (queue.Queue in Python 3), call it workQ for example, for "work requests" (wr = workQ.get() will do it properly -- each work request is obtained by a single worker thread, no fuss, no muss).
A "work request" can in this case simply be a triple (tuple with three items): identification of the remote file (URL or whatever), offset from which to start reading, and number of bytes to get (note that this works just as well for one file or for multiple files to fetch).
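As a rough illustration, the triples for one file could be generated like this. The function name make_work_requests, the URL and the sizes are made up for the example; in practice the file size would come from the server (for instance from a Content-Length header).

```python
def make_work_requests(url, file_size, chunk_size):
    """Return (url, offset, numbytes) triples that together cover file_size bytes."""
    work_requests = []
    for offset in range(0, file_size, chunk_size):
        # The last chunk may be shorter than chunk_size.
        numbytes = min(chunk_size, file_size - offset)
        work_requests.append((url, offset, numbytes))
    return work_requests


# Hypothetical URL and sizes, for illustration only.
work_requests = make_work_requests("http://example.com/big.bin", 2500, 1000)
for wr in work_requests:
    print(wr)
```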
The main thread pushes all work requests to the workQ (just workQ.put((url, offset, numbytes)) for each request) and waits for results to come to another Queue instance, call it resultQ (each result will also be a triple: identifier of the file, starting offset, string of bytes that are the results from that file at that offset).
As each working thread satisfies the request it's doing, it puts the results into resultQ and goes back to fetch another work request (or wait for one). Meanwhile the main thread (or a separate dedicated "writing thread" if needed -- i.e. if the main thread has other work to do, for example on the GUI) gets results from resultQ and performs the needed open, seek, and write operations to place the data at the right spot.
There are several ways to terminate the operation: for example, a special work request may be asking the thread receiving it to terminate -- the main thread puts on workQ just as many of those as there are working threads, after all the actual work requests, then joins all the worker threads when all data have been received and written (many alternatives exist, such as joining the queue directly, having the worker threads daemonic so they just go away when the main thread terminates, and so forth).
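The whole scheme might look roughly like the sketch below. It is only one way to wire it up, under some assumptions: it uses the Python 3 module name queue, the fetch function and the REMOTE dictionary are stand-ins for real ranged network reads, None is used as the "terminate" work request, and a bytearray takes the place of the output file that would normally be filled with seek and write.

```python
import queue
import threading

# Stand-in for the remote side; a real fetch would request a byte range
# from the server instead of slicing an in-memory bytes object.
REMOTE = {"file-a": bytes(range(256)) * 10}  # 2560 bytes of sample data

STOP = None  # special work request that tells a worker to terminate


def fetch(file_id, offset, numbytes):
    """Hypothetical fetch: return numbytes bytes of file_id starting at offset."""
    return REMOTE[file_id][offset:offset + numbytes]


def worker(workQ, resultQ):
    while True:
        wr = workQ.get()  # each request goes to exactly one worker
        if wr is STOP:
            break
        file_id, offset, numbytes = wr
        resultQ.put((file_id, offset, fetch(file_id, offset, numbytes)))


def download(file_id, size, chunk, nthreads):
    workQ, resultQ = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(workQ, resultQ))
               for _ in range(nthreads)]
    for t in threads:
        t.start()

    offsets = range(0, size, chunk)
    for offset in offsets:
        workQ.put((file_id, offset, min(chunk, size - offset)))
    for _ in threads:
        workQ.put(STOP)  # one terminate request per worker, after the real work

    out = bytearray(size)  # plays the role of the output file
    for _ in offsets:  # exactly one result arrives per work request
        _, offset, data = resultQ.get()
        out[offset:offset + len(data)] = data  # place data at the right spot
    for t in threads:
        t.join()
    return bytes(out)


data = download("file-a", 2560, 300, nthreads=4)
print(data == REMOTE["file-a"])
```

Results can arrive in any order, which is why each one carries its own offset.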
You need to fetch completely separate parts of the file on each thread. Calculate the chunk start and end positions based on the number of threads. The chunks must not overlap.
For example, if the target file is 3000 bytes long and you want to fetch it using three threads: thread 1 fetches bytes 0 to 999, thread 2 fetches bytes 1000 to 1999, and thread 3 fetches bytes 2000 to 2999.
You would pre-allocate an empty file of the original size, and write each chunk back to its respective position within the file.
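The splitting and pre-allocation can be sketched as follows. The helper names (chunk_ranges, preallocate, write_chunk) are invented for the example, the code assumes the file has at least as many bytes as there are threads, and dummy bytes stand in for downloaded data. The (start, end) pairs are inclusive, which matches the form of an HTTP Range header such as bytes=0-999.

```python
import os
import tempfile


def chunk_ranges(file_size, nthreads):
    """Split file_size bytes into nthreads contiguous (start, end) ranges, end inclusive."""
    base, extra = divmod(file_size, nthreads)
    ranges, start = [], 0
    for i in range(nthreads):
        length = base + (1 if i < extra else 0)  # spread any remainder over the first chunks
        ranges.append((start, start + length - 1))
        start += length
    return ranges


def preallocate(path, file_size):
    """Create a file of the final size so chunks can be written in any order."""
    with open(path, "wb") as f:
        f.truncate(file_size)


def write_chunk(path, start, data):
    """Write one chunk at its own offset without touching the rest of the file."""
    with open(path, "r+b") as f:
        f.seek(start)
        f.write(data)


ranges = chunk_ranges(3000, 3)
print(ranges)

path = os.path.join(tempfile.mkdtemp(), "target.bin")
preallocate(path, 3000)
# Write the chunks out of order, as concurrent threads might.
for start, end in reversed(ranges):
    write_chunk(path, start, b"\x01" * (end - start + 1))
print(os.path.getsize(path))
```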