Credit: Chris Moffitt
Large downloads are sometimes interrupted. However, a good HTTP server that supports the Range header lets you resume the download from where it was interrupted. The standard Python module urllib lets you access this functionality almost seamlessly. You need only add the Range header and intercept the 206 status code that the server sends to confirm that it will respond with a partial file:
import urllib, os

class myURLOpener(urllib.FancyURLopener):
    """ Subclass to override error 206 (partial file being sent); okay for us """
    def http_error_206(self, url, fp, errcode, errmsg, headers, data=None):
        pass    # Ignore the expected "non-error" code

def getrest(dlFile, fromUrl, verbose=0):
    existSize = 0
    myUrlclass = myURLOpener()
    if os.path.exists(dlFile):
        outputFile = open(dlFile, "ab")
        existSize = os.path.getsize(dlFile)
        # If the file exists, then download only the remainder
        myUrlclass.addheader("Range", "bytes=%s-" % (existSize))
    else:
        outputFile = open(dlFile, "wb")

    webPage = myUrlclass.open(fromUrl)

    if verbose:
        for k, v in webPage.headers.items():
            print k, "=", v

    # If we already have the whole file, there is no need to download it again
    numBytes = 0
    webSize = int(webPage.headers['Content-Length'])
    if webSize == existSize:
        if verbose: print "File (%s) was already downloaded from URL (%s)" % (
            dlFile, fromUrl)
    else:
        if verbose: print "Downloading %d more bytes" % (webSize-existSize)
        # Append the response body to the local file in 8 KB chunks
        while 1:
            data = webPage.read(8192)
            if not data:
                break
            outputFile.write(data)
            numBytes = numBytes + len(data)

    webPage.close()
    outputFile.close()

    if verbose:
        print "downloaded", numBytes, "bytes from", webPage.url
    return numBytes
The HTTP Range header lets the web server know that you want only a certain range of data to be downloaded, and this recipe takes advantage of this header. Of course, the server needs to support the Range header, but since the header is part of the HTTP 1.1 specification, it’s widely supported. This recipe has been tested with Apache 1.3 as the server, but I expect no problems with other reasonably modern servers.
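To make the header format concrete, here is a minimal sketch of both sides of a range request: the Range value the client sends and the Content-Range value a server returns with a 206 response. It is written in Python 3 syntax, whereas the recipe itself is Python 2. The helper names `range_header` and `parse_content_range` are hypothetical and not part of urllib, and the parser handles only the common `bytes start-end/total` form.

```python
import re

def range_header(offset):
    """Return the (name, value) pair that asks for everything from offset on."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return ("Range", "bytes=%d-" % offset)

# Matches values such as "bytes 1024-2047/2048" or "bytes 0-9/*"
_CONTENT_RANGE = re.compile(r"^bytes (\d+)-(\d+)/(\d+|\*)$")

def parse_content_range(value):
    """Parse a Content-Range value sent with a 206 response.

    Returns (start, end, total); total is None when the server sends '*'
    because it does not know the full length.
    """
    match = _CONTENT_RANGE.match(value.strip())
    if match is None:
        raise ValueError("unrecognized Content-Range: %r" % value)
    start, end, total = match.groups()
    return int(start), int(end), (None if total == "*" else int(total))
```

In a 206 response, Content-Length counts only the bytes of the part being sent, so the total after the slash in Content-Range is the figure to compare with the size of the local file.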
The recipe lets urllib.FancyURLopener do all the hard work of adding a new header, as well as the normal handshaking. I had to subclass it to make it known that the error 206 is not really an error in this case, so you can proceed normally. I also do some extra checks to quit the download if I’ve already downloaded the whole file.
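The status-code handling described above can be sketched as a small decision function. This is an illustration in Python 3 syntax under stated assumptions, not the recipe's own code: `resume_action` is a hypothetical helper, and its three outcomes follow the HTTP/1.1 meanings of the 200, 206, and 416 status codes for a request that carried `Range: bytes=N-`.

```python
def resume_action(status, exist_size):
    """Decide what to do with the response to a 'Range: bytes=N-' request.

    Returns one of:
      "append"  - 206: the body is the missing tail, so append it
      "restart" - 200: the server sent the whole file, so overwrite
      "nothing" - 416: the offset is at or past the end of the file
    exist_size is the number of bytes already on disk (N above).
    """
    if status == 206:
        return "append"
    if status == 200:
        return "restart"
    if status == 416 and exist_size > 0:
        return "nothing"
    raise ValueError("unexpected HTTP status %d" % status)
```

A caller would open the output file in append mode only for "append". For "restart" it should truncate the file first, because a server that ignores Range sends the file from byte 0.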
Check out the HTTP 1.1 RFC (2616) to learn more about what all of the headers mean. You may find a header that is very useful, and Python’s urllib lets you send any header you want. This recipe should probably check that the web server accepts Range before relying on it, which is simple to do.
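One way to perform that check is to look for an `Accept-Ranges: bytes` header in the server's response to a HEAD request. The sketch below uses Python 3's `urllib.request` rather than the Python 2 urllib module of the recipe, and both function names are hypothetical. A server may honor Range without advertising it, so treat a missing header as "unknown" rather than as proof that resuming will fail.

```python
import urllib.request

def advertises_byte_ranges(headers):
    """Return True if the headers include 'Accept-Ranges: bytes'.

    headers is any mapping with a .get() method, such as the
    response.headers object returned by urllib.request.urlopen.
    """
    value = headers.get("Accept-Ranges", "")
    tokens = [token.strip().lower() for token in value.split(",")]
    return "bytes" in tokens

def server_accepts_ranges(url):
    """Send a HEAD request to url and inspect its Accept-Ranges header."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        return advertises_byte_ranges(response.headers)
```

Calling `server_accepts_ranges` before starting a resume lets the caller fall back to a full download when the server answers `Accept-Ranges: none`.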
Documentation of the standard library module urllib in the Library Reference; the HTTP 1.1 RFC (http://www.ietf.org/rfc/rfc2616.txt).