Question
I am writing a simple synchronous download manager that downloads a video file in 10 sections. I use requests to read Content-Length from the response headers, split the file into 10 byte ranges, download each range, and merge them into a complete video. The code below is supposed to work this way, but the merged file only plays for a few seconds and is corrupted after that. What is wrong with my code?
import requests
import os

def intervals(parts, duration):
    part_duration = duration // parts
    return [(i * part_duration, (i + 1) * part_duration) for i in range(parts)]

home = os.path.expanduser("~")
if not os.path.exists(home+'/Desktop/temp'):
    os.makedirs(home+'/Desktop/temp')
PATH = home+"/Desktop/temp/tmp.mp4"
example_file_url = "https://file-examples-com.github.io/uploads/2017/04/file_example_MP4_1280_10MG.mp4"
req = requests.head(example_file_url)
size = int(req.headers['Content-Length'])
content_section = 10
section_intervals = intervals(content_section,size)
with open(PATH, "wb") as file:
    for i,(start,end) in enumerate(section_intervals):
        headers = {"Range": "bytes="+str(start)+"-"+str(end)}
        print(headers)
        r = requests.get(example_file_url, headers=headers)
        file.write(r.content)
Answer 1:
The problem
Your ranges are wrong because the interval specified by a Range header gives the first and the last offset, both inclusive. For example, bytes=0-10 means 11 bytes, from 0 to 10 (unlike how slices work in Python), so bytes=0-10 and bytes=10-20 are overlapping ranges. You would need bytes=0-9 followed by bytes=10-19 instead.
See the example in this documentation:
header requesting the first 1024 bytes ...
Range: bytes=0-1023
(whereas [0:1023] in a Python slice would have length 1023).
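The difference between inclusive Range offsets and Python slices can be sketched without any network access. In the snippet below, fetch_range is a hypothetical helper (not part of requests) that mimics what a server returns for bytes=first-last, and the 30-byte buffer is a stand-in for the remote file.

```python
def fetch_range(data: bytes, first: int, last: int) -> bytes:
    """Mimic an HTTP 'Range: bytes=first-last' request on an in-memory buffer.

    Both offsets are inclusive, so the slice end is last + 1.
    """
    return data[first:last + 1]


data = bytes(range(30))  # stand-in for the remote file

# Ranges as the question builds them: the boundary offset is requested twice.
overlapping = fetch_range(data, 0, 10) + fetch_range(data, 10, 20)
# Ranges with the last offset one below the next first offset.
contiguous = fetch_range(data, 0, 9) + fetch_range(data, 10, 19)

print(len(fetch_range(data, 0, 10)))  # 11 bytes, not 10
print(overlapping == data[:21])       # False: byte 10 appears twice
print(contiguous == data[:20])        # True
```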
Where you say that it "works for seconds and after that gets corrupted", I assume that you mean that it is valid for the first few seconds of decoded MP4 output. The point where it breaks will be the end of the first downloaded part, where the final byte of the first part is duplicated at the start of the second part.
Another problem is that your total length is wrong: you do integer division by parts, so by the time you multiply back up again, you have lost the remainder, and the final few bytes of the file are never requested.
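As a rough worked example of both problems, the arithmetic below uses the size of 9840497 bytes reported for the example file in this answer. It assumes the server returns exactly the bytes requested for each range, so treat the numbers as an illustration rather than a measurement.

```python
size = 9840497   # Content-Length of the example file, as reported in this answer
parts = 10

part = size // parts                # integer division discards the remainder
remainder = size % parts
last_end = parts * part             # sent as the inclusive last offset of the final range

bytes_written = parts * (part + 1)  # each inclusive range returns part + 1 bytes
duplicated = parts - 1              # one shared byte at every internal boundary
missing = (size - 1) - last_end     # bytes after the last requested offset

print(part, remainder)              # 984049 7
print(bytes_written)                # 9840500
print(duplicated, missing)          # 9 6
```

The merged file therefore ends up 3 bytes longer than the original, with 9 duplicated bytes in the middle and 6 bytes missing from the end.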
The fix
Change your intervals function to this, and it works:
import math

def intervals(parts, duration):
    part_duration = math.ceil(duration / parts)
    return [(start, min(start + part_duration - 1, duration - 1))
            for start in range(0, duration, part_duration)]
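One way to gain confidence in the corrected function is to check that its ranges tile the file exactly. The sketch below repeats the intervals function from above so that it runs on its own; covers_exactly is a hypothetical helper added here for illustration.

```python
import math


def intervals(parts, duration):
    part_duration = math.ceil(duration / parts)
    return [(start, min(start + part_duration - 1, duration - 1))
            for start in range(0, duration, part_duration)]


def covers_exactly(ranges, duration):
    """Return True if the inclusive (first, last) ranges cover offsets
    0..duration-1 in order, with no gaps and no overlaps."""
    expected_start = 0
    for first, last in ranges:
        if first != expected_start or last < first:
            return False
        expected_start = last + 1
    return expected_start == duration


print(covers_exactly(intervals(10, 9840497), 9840497))  # True
print(covers_exactly([(0, 10), (10, 20)], 21))           # False: offset 10 is repeated
```

Note that for very small inputs this intervals function can return fewer than parts ranges (for example, 6 ranges for intervals(10, 11)), although the coverage is still exact.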
Inspecting the ranges
Inserting print statements:
print("Size = ", size)
print(section_intervals)
now gives:
Size = 9840497
[(0, 984049), (984050, 1968099), (1968100, 2952149), (2952150, 3936199), (3936200, 4920249), (4920250, 5904299), (5904300, 6888349), (6888350, 7872399), (7872400, 8856449), (8856450, 9840496)]
whereas using your original intervals function, it gives:
Size = 9840497
[(0, 984049), (984049, 1968098), (1968098, 2952147), (2952147, 3936196), (3936196, 4920245), (4920245, 5904294), (5904294, 6888343), (6888343, 7872392), (7872392, 8856441), (8856441, 9840490)]
Note the overlapping ranges and the bytes missing from the end.
Verifying output using md5sum
We can verify the download at the end by calculating a checksum. In this example, I use md5sum from the Linux command line (cksum would also work, as there is no need for a cryptographic checksum for this purpose). I called the output myoutput.
$ md5sum myoutput
10c918b1d01aea85864ee65d9e0c2305 myoutput
Now I also download a copy directly with wget <url> and see that it has the same checksum.
$ wget https://file-examples-com.github.io/uploads/2017/04/file_example_MP4_1280_10MG.mp4
--2020-07-21 08:26:52-- https://file-examples-com.github.io/uploads/2017/04/file_example_MP4_1280_10MG.mp4
$ md5sum file_example_MP4_1280_10MG.mp4
10c918b1d01aea85864ee65d9e0c2305 file_example_MP4_1280_10MG.mp4
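If md5sum is not available, the same comparison can be done from Python with the standard hashlib module. This is a minimal sketch: md5_of_file is a hypothetical helper name, and the 1 MiB chunk size is an arbitrary choice to avoid loading the whole file into memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at path, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

With both files in the current directory, md5_of_file("myoutput") == md5_of_file("file_example_MP4_1280_10MG.mp4") should then evaluate to True for a correct download.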
Source: https://stackoverflow.com/questions/63008887/downloading-files-in-chunks-in-python