Download file using partial download (HTTP)

Is there a way to download a huge, still-growing file over HTTP using the partial-download feature?

It seems that this code downloads the file from scratch every time it is executed:

import urllib  # Python 2; in Python 3 urlretrieve lives in urllib.request
urllib.urlretrieve("http://www.example.com/huge-growing-file", "huge-growing-file")

I'd like to:

Fetch just the newly written data
Download from scratch only if the source file becomes smaller (for example, because it has been rotated)

A partial download is possible with the Range request header. The following requests a selected range of bytes:

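import urllib2  # Python 2; in Python 3 this is urllib.request

# start and end are byte offsets; both ends of the range are inclusive
req = urllib2.Request('http://www.python.org/')
req.headers['Range'] = 'bytes=%s-%s' % (start, end)
f = urllib2.urlopen(req)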

For example:

>>> req = urllib2.Request('http://www.python.org/')
>>> req.headers['Range'] = 'bytes=%s-%s' % (100, 150)
>>> f = urllib2.urlopen(req)
>>> f.read()
'l1-transitional.dtd">\n\n\n<html xmlns="http://www.w3.'

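Using this header you can resume partial downloads. In your case, all you have to do is keep track of how many bytes you have already downloaded and request the range that starts at that offset.

Keep in mind that the server needs to support this header for this to work. A server that honours it replies with status 206 (Partial Content); one that ignores it replies with 200 and the whole file.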
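The bookkeeping described above might look roughly like the following. This is only a sketch, written for Python 3 (`urllib.request` rather than `urllib2`); the names `resume_download` and `parse_total` and the `opener` parameter are made up for illustration. It assumes the local file's size is the number of bytes already fetched, and that the server reports the current total in `Content-Range` when it rejects a range with 416. A rotation that leaves the remote file larger than the local copy cannot be detected by size alone.

```python
import os
import shutil
import urllib.error
import urllib.request


def parse_total(content_range):
    """Return the total size from 'bytes 0-9/100' or 'bytes */100', else None."""
    if not content_range:
        return None
    total = content_range.rpartition("/")[2].strip()
    return int(total) if total.isdigit() else None


def resume_download(url, path, opener=urllib.request.urlopen):
    """Fetch only the bytes missing from `path`; return how many were written."""
    offset = os.path.getsize(path) if os.path.exists(path) else 0
    req = urllib.request.Request(url)
    if offset:
        req.add_header("Range", "bytes=%d-" % offset)  # open-ended range
    try:
        resp = opener(req)
    except urllib.error.HTTPError as err:
        if err.code != 416:
            raise
        # 416: our offset is at or past the end of the remote file.
        total = parse_total(err.headers.get("Content-Range"))
        if total is not None and total < offset:
            os.remove(path)  # remote file shrank (e.g. rotated): start over
            return resume_download(url, path, opener)
        return 0  # nothing new yet (or the server did not report a total)
    with resp:
        # 206 means the range was honoured; anything else is the whole file.
        append = resp.status == 206
        with open(path, "ab" if append else "wb") as out:
            shutil.copyfileobj(resp, out)
    return os.path.getsize(path) - (offset if append else 0)
```

Calling `resume_download("http://www.example.com/huge-growing-file", "huge-growing-file")` periodically would then append whatever has been written since the previous call.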

This is quite easy to do using TCP sockets and raw HTTP. The relevant request header is "Range".

An example request might look like:

import socket

mysock = socket.create_connection(("www.example.com", 80))
# Replace XXXX with the byte offset before sending.
mysock.sendall(
  b"GET /huge-growing-file HTTP/1.1\r\n"
  b"Host: www.example.com\r\n"
  b"Range: bytes=XXXX-\r\n"
  b"Connection: close\r\n\r\n")

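Where XXXX is the number of bytes you've already retrieved. Then you can read the response headers and any content from the server. If the server returns a header like:

Content-Length: 0

you know you've got the entire file. Note that a standards-compliant server answers a range that starts at or past the end of the file with status 416 (Range Not Satisfiable) instead, so treat that status the same way.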
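To make the raw-HTTP route a little more concrete, here is a rough Python 3 sketch of building the request for a given offset and splitting the reply into status, headers and body. The helper names `build_request`, `recv_all` and `split_response` are invented for this example. It assumes the server closes the connection after replying (as requested with `Connection: close`) and does not use chunked transfer encoding, which a real client would have to handle.

```python
import socket


def build_request(host, path, offset):
    """Build a GET request asking for everything from byte `offset` onward."""
    return (
        "GET %s HTTP/1.1\r\n"
        "Host: %s\r\n"
        "Range: bytes=%d-\r\n"
        "Connection: close\r\n\r\n" % (path, host, offset)
    ).encode("ascii")


def recv_all(sock):
    """Read from the socket until the server closes the connection."""
    chunks = []
    while True:
        chunk = sock.recv(65536)
        if not chunk:
            return b"".join(chunks)
        chunks.append(chunk)


def split_response(raw):
    """Split a complete raw response into (status_code, headers, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    status = int(lines[0].split(" ", 2)[1])  # e.g. "HTTP/1.1 206 Partial Content"
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status, headers, body


# Usage (needs network access, so left commented out):
# sock = socket.create_connection(("www.example.com", 80))
# sock.sendall(build_request("www.example.com", "/huge-growing-file", 1024))
# status, headers, body = split_response(recv_all(sock))
```

A status of 206 means `body` holds the new bytes; 416 or an empty body means there is nothing new to append.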

If you want to be particularly nice as an HTTP client, you can look into "Connection: keep-alive". Perhaps there is a Python library that does everything I have described (perhaps even urllib2 does it!) but I'm not familiar with one.
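For what it's worth, the standard library does have a layer between raw sockets and urllib: `http.client` in Python 3 (`httplib` in Python 2). It speaks HTTP/1.1 and can reuse one connection for several requests as long as each response body is read completely and the server keeps the connection open. A minimal sketch, where the function name `fetch_tail` is hypothetical:

```python
import http.client


def fetch_tail(conn, path, offset):
    """Return the bytes from `offset` onward, using an already open connection."""
    conn.request("GET", path, headers={"Range": "bytes=%d-" % offset})
    resp = conn.getresponse()
    body = resp.read()  # drain the body so the connection can be reused
    if resp.status == 206:
        return body  # the server honoured the range
    if resp.status == 416:
        return b""  # offset is at or past the end of the file: nothing new
    raise RuntimeError("server did not honour Range (status %d)" % resp.status)


# Usage (needs network access, so left commented out):
# conn = http.client.HTTPConnection("www.example.com")
# new_bytes = fetch_tail(conn, "/huge-growing-file", 1024)
# more_bytes = fetch_tail(conn, "/huge-growing-file", 1024 + len(new_bytes))
```

Raising on any other status is a deliberate choice in this sketch: a plain 200 would mean the whole file was sent, and appending that to existing data would corrupt the local copy.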

If I understand your question correctly, the file is not changing during download, but is updated regularly. If that is the question, rsync is the answer.

If the file is being updated continually, including during the download, you'll need to modify rsync or a BitTorrent program. They split files into separate chunks and download or update the chunks independently. When you get to the end of the file from the first iteration, repeat to get the appended chunk; continue as necessary. Less efficiently, you could just run rsync repeatedly.
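The chunk-at-a-time idea can also be expressed with plain HTTP ranges. The sketch below is not how rsync or BitTorrent work internally; it only illustrates cutting the not-yet-downloaded part of a file into fixed-size byte ranges that can be requested one by one. The function name `chunk_ranges` and the assumption that the remote size `total` is already known (for example from a `Content-Length` or `Content-Range` header) are mine.

```python
def chunk_ranges(start, total, chunk_size):
    """Yield Range header values covering bytes start..total-1 in fixed-size chunks."""
    pos = start
    while pos < total:
        last = min(pos + chunk_size, total) - 1  # Range ends are inclusive
        yield "bytes=%d-%d" % (pos, last)
        pos = last + 1


# Each value can be sent as the Range header of one request.
for value in chunk_ranges(0, 10, 4):
    print(value)
```

Once the last chunk has been fetched, checking the remote size again and calling `chunk_ranges` with the new `start` picks up whatever was appended in the meantime.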

Source: https://stackoverflow.com/questions/1798879/download-file-using-partial-download-http

Tags: python, http, partial