The protocol portion of a URL can be anything that the processing
software understands. Perhaps the most common URL protocols (also
called schemes) are HTTP, FTP, and FILE. HTTP is used to connect to web
servers, FTP is used to transfer files to and from FTP servers, and
FILE is used to read a file on the local filesystem. All three are easy
to work with using Python’s urllib module.
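To make the idea of a scheme concrete, here is a minimal sketch that pulls the scheme out of a few URLs. It assumes Python 3, where URL parsing lives in urllib.parse (in the Python 2 of this text, the equivalent is the urlparse module). The helper name scheme_of and the /tmp/order.xml path are illustrative only.

```python
from urllib.parse import urlsplit

def scheme_of(url):
    """Return the scheme (protocol) part of a URL, or '' if there is none."""
    return urlsplit(url).scheme

# A bare filename has no scheme at all, so the result is an empty string.
for url in ("http://www.example.com/pyxml.xml",
            "ftp://ftp.oreilly.com",
            "file:///tmp/order.xml",
            "order.xml"):
    print(repr(scheme_of(url)), url)
```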
The urllib.urlopen function takes care of opening URLs of all kinds and
gives you back a file-like object to work with. To retrieve a local
file, just use the filename. For example, to open an XML document in
the current directory, you can use the following syntax:
>>> from urllib import urlopen
>>> fd = urlopen("order.xml")
>>> print fd.read()
<?xml version="1.0"?>
<!DOCTYPE order SYSTEM "order.dtd">
<order>
  <customer_name>eDonkey Enterprises</customer_name>
  <sku>343-3940938</sku>
  <unit_price>39.95</unit_price>
</order>
>>> fd.close()
The urlopen function returns a file-like object, which you can read
like any other file to retrieve and display its contents. When you
close the object, urlopen cleans up as well, releasing its connection
to the remote server or its handle on the local file.
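For readers on Python 3, the same idea can be sketched as follows. This is an assumption-laden equivalent, not the code used in this text: in Python 3, urlopen lives in urllib.request and expects a full URL rather than a bare filename, so the sketch converts the path to a file: URL first. The helper name read_local and the temporary stand-in document are illustrative.

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlopen

def read_local(path):
    """Read a local file through urlopen by converting its path to a file: URL."""
    url = Path(path).resolve().as_uri()   # e.g. file:///.../order.xml
    with urlopen(url) as fd:              # the object is closed on leaving the block
        return fd.read()                  # Python 3 returns bytes, not str

# Create a small stand-in document so the sketch is self-contained.
workdir = tempfile.mkdtemp()
sample = os.path.join(workdir, "order.xml")
with open(sample, "w", encoding="utf-8") as f:
    f.write('<?xml version="1.0"?>\n<order><sku>343-3940938</sku></order>\n')

print(read_local(sample).decode("utf-8"))
```

Using a with block takes the place of the explicit fd.close() call, so the cleanup happens even if reading raises an exception.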
The urlopen function works for remote files just as easily as it does
for local files, provided you’re connected to the Internet. For
example, if you supply a URL for an FTP server’s root directory, you
may be able to pull back its contents, as shown here:
>>> fd = urlopen("ftp://ftp.oreilly.com")
>>> print fd.read()
total 64
drwxr-xr-x 3 61 512 Aug 29 2000 bin
drwxr-xr-x 2 3 512 Aug 30 2000 dev
drwxr-xr-x 4 61 512 Oct 16 2000 etc
lrwxrwxrwx 1 1 12 Aug 31 2000 examples -> pub/examples
drwxrwx-wx 2 100 512 May 7 22:22 incoming
drwxrws--x 48 61 17408 May 6 04:00 intl
drwxr-xr-x 2 1 512 Sep 1 2000 lost+found
drwxrws--x 55 61 4608 May 7 22:22 outgoing
drwxrwsr-x 21 61 512 Mar 30 21:47 pub
drwxr-xr-x 2 61 512 Aug 31 2000 published
drwxr-sr-x 4 100 512 Apr 17 17:17 software
dr-xr-xr-x 5 61 512 Aug 30 2000 usr
>>> fd.close()
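urlretrieve is similar to urlopen, but it saves the resource to a local
file instead of handing you a file-like object. It optionally accepts a
filename if you wish to choose where the remote file is stored locally.
The function returns a tuple of the local filename and the headers of
the response as a MIME message object. The data itself is not in the
tuple; it is in the file, as shown here: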
>>> from urllib import urlretrieve
>>> ob = urlretrieve("ftp://ftp.oreilly.com", "menu.txt")
>>> ob
('menu.txt', <mimetools.Message instance at 007F382C>)
The first argument is the actual URL to connect to, while the second argument is the name of a local file to hold the data.
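The shape of the return value is easy to check without a network connection. The following sketch assumes Python 3, where urlretrieve is found in urllib.request (the documentation there describes it as a legacy interface) and the headers object is an email-style message rather than a mimetools.Message. The file names order.xml and copy.xml are stand-ins created in a temporary directory.

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlretrieve

# Build a small source file to "download" via a file: URL.
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "order.xml")
with open(source, "w", encoding="utf-8") as f:
    f.write("<order/>\n")

target = os.path.join(workdir, "copy.xml")
filename, headers = urlretrieve(Path(source).as_uri(), target)

print(filename)                        # the local path the data was saved to
print(headers.get("Content-Length"))   # a header value, not the file contents
with open(filename, encoding="utf-8") as f:
    print(f.read())                    # the data itself lives in the file
```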
One of the most useful features of urlretrieve is its callback
functionality. When retrieving a document, you can supply a callback
function as an optional third parameter to receive progress reports as
the resource is downloaded. If you supply a callback, urlretrieve calls
it with three arguments. The first argument is the number of blocks
transferred so far. The second argument is the size of the blocks being
used, in bytes, and the third is the total size of the file.
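Example 8-1 shows a simple routine that reports on its progress.
"""
retrieve.py example
"""
from urllib import urlretrieve

def callback(blocknum, blocksize, totalsize):
    # urlretrieve calls this once when the connection is made and again
    # after each block is read. The trailing comma on the first print
    # keeps both parts of the report on one output line.
    print "Downloaded " + str((blocknum * blocksize)),
    print " of ", totalsize

urlretrieve("http://www.example.com/pyxml.xml", "px.xml", callback)
print "Download Complete"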
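Running the example produces the following output. Because progress is
computed as blocknum * blocksize, the last few lines overshoot the
actual file size: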
C:\WINDOWS\Desktop\oreilly\pythonxml\c8>python retrieve.py
Downloaded 0 of 116063
Downloaded 8192 of 116063
Downloaded 16384 of 116063
Downloaded 24576 of 116063
Downloaded 32768 of 116063
Downloaded 40960 of 116063
Downloaded 49152 of 116063
Downloaded 57344 of 116063
Downloaded 65536 of 116063
Downloaded 73728 of 116063
Downloaded 81920 of 116063
Downloaded 90112 of 116063
Downloaded 98304 of 116063
Downloaded 106496 of 116063
Downloaded 114688 of 116063
Downloaded 122880 of 116063
Downloaded 131072 of 116063
Download Complete
The callback functionality is excellent for keeping track of FTP transfers, and it is useful any time you need to keep tabs on a long download or communicate progress information to a frustrated, busy end user.
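The final lines of the output above report more bytes than the file contains, because the callback multiplies whole blocks. One possible refinement, sketched here in Python 3 syntax, is to clamp the estimate to the total. The helper name bytes_so_far is hypothetical, and the sketch also assumes the total may be reported as -1 when a server does not supply a size, in which case no clamping is attempted.

```python
def bytes_so_far(blocknum, blocksize, totalsize):
    """Estimate bytes transferred, never reporting more than a known total."""
    transferred = blocknum * blocksize
    if totalsize > 0:                  # the total may be unknown (reported as -1)
        transferred = min(transferred, totalsize)
    return transferred

def callback(blocknum, blocksize, totalsize):
    """Progress hook with the same three-argument signature as Example 8-1."""
    done = bytes_so_far(blocknum, blocksize, totalsize)
    if totalsize > 0:
        percent = done * 100 // totalsize
        print("Downloaded %d of %d (%d%%)" % (done, totalsize, percent))
    else:
        print("Downloaded %d bytes" % done)

# Simulate the last few calls from the run above without touching the network.
for n in (14, 15, 16):
    callback(n, 8192, 116063)
```

The function keeps the three-argument signature, so it can be passed as the third parameter to urlretrieve in place of the original callback.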