The URL contains a great deal of Internet information in a single string. It tells you the name of the server, the name of the file on the server, any data that you are supplying to generate a dynamic response, and even the protocol to use to retrieve the information. In basic form, URLs look like this:
http://www.oreilly.com/oreilly/about.html
This URL has three elements. The first section tells you (or your software) the protocol in use for this resource. In this case, it is HTTP, shown by http:. The next section indicates the server name and its corresponding domain. In this case the server is named www, and the domain is oreilly.com, coming together as //www.oreilly.com. What follows is a pathname (/oreilly/) and a filename (about.html). Your browser uses this information as it comes to the brilliant conclusion to use HTTP in connecting with www in oreilly.com, and retrieves the /oreilly/about.html file.
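The breakdown above can be reproduced in code. The following is a minimal sketch, assuming Python 3, where the URL helpers live in urllib.parse rather than directly in urllib; the variable names are illustrative only.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://www.oreilly.com/oreilly/about.html")

scheme = parts.scheme    # the protocol
host = parts.hostname    # the server name plus its domain
path = parts.path        # the pathname and filename together

# Split the path into its directory and filename portions.
directory, _, filename = path.rpartition("/")
directory += "/"

print(scheme, host, directory, filename)
```

Note that urlsplit reports the scheme without the trailing colon and the host without the leading slashes.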
Of course, URLs can become more complicated. If you type “Python” into a search box and click Submit, your browser may go after a URL similar to the following:
http://search.oreilly.com/cgi-bin/search?term=Python&category=All&pref=all
Now there are several more items to examine. First, the server has changed from www to search. Second, the path has changed from /oreilly/ to /cgi-bin/. The filename about.html has been replaced with a target named search. But most interesting is the question mark and the data that follows:
?term=Python&category=All&pref=all
This portion of the URL is known as the query string. If search is a CGI program (or something similar inside an application server), the query string is supplied to it in the form of an environment variable. The CGI program can pick the string apart to realize that a variable named term is set to Python, and that category and pref are equal to All and all, respectively. As you can imagine, this information is relevant to the O’Reilly database, and appropriate product information is returned to your browser.
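As a rough illustration of how a program might pick the query string apart, here is a minimal sketch assuming Python 3 and its urllib.parse.parse_qs function. QUERY_STRING is the environment variable name CGI uses; the read_query helper and the fake_environ dictionary are hypothetical stand-ins for a real server environment.

```python
from urllib.parse import parse_qs

def read_query(environ):
    """Return the query parameters found in a CGI-style environment mapping."""
    raw = environ.get("QUERY_STRING", "")
    # parse_qs maps each name to a list, because a name may appear more than once.
    return parse_qs(raw)

# Hypothetical environment, standing in for os.environ under a real web server.
fake_environ = {"QUERY_STRING": "term=Python&category=All&pref=all"}
params = read_query(fake_environ)

print(params["term"][0], params["category"][0], params["pref"][0])
```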
However, suppose that instead of searching the O’Reilly site for “Python”, you searched it for “Python!”. What does the URL look like now? Well, the only difference is that the exclamation point is URL-encoded. That is, only a few special characters are allowed within a URL; all others are escaped to their respective hexadecimal code, prefixed with a percent (%) sign. This time, the query string looks slightly different:
?term=Python%21&category=All&pref=all
The exclamation point is now replaced with %21, which is its URL-encoded cousin.
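To see where %21 comes from, the escape can be built by hand from the character's numeric code and compared with what a library produces. This is a small sketch, assuming Python 3 and urllib.parse.quote; the variable names are illustrative.

```python
from urllib.parse import quote

ch = "!"
# The character code of "!" is 33, which is 21 in hexadecimal.
manual = "%{:02X}".format(ord(ch))

# The library function escapes the same character in the same way.
encoded_term = quote("Python!")
query = "?term=" + encoded_term + "&category=All&pref=all"

print(manual, encoded_term, query)
```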
If you are constructing a URL programmatically for submission to a web site, you may find yourself needing to supply parameters in the query string, as shown in the previous section.
Programmatic construction of URLs may be necessary when integrating your Python program with a dynamic web site expecting query parameters in the query string.
The Python urllib module features the function urlencode. This function accepts a dictionary of key/value pairs and returns a properly formatted query string that you could tag onto a URL. For example, if you have an arbitrarily sized dictionary, you could call urlencode with the dictionary as a parameter, as shown here:
>>> from urllib import urlencode
>>> myDict = {
... "Name" : "Chris Jones",
... "Address" : "Woodinville, WA",
... "Favorite Characters" : "#, @, $, and %"
... }
>>> strUrl = urlencode(myDict)
>>> print strUrl
Address=Woodinville%2c+WA&Name=Chris+Jones&Favorite+Characters=%23%2c+%40%2c+%24%2c+and+%25
What constitutes strUrl here is not a complete URL. It’s just the query data that comes at the end of the URL. The first half of the URL needs to include the protocol, as well as the server and domain pairing:
http://www.example.com/search.cgi?Address=Woodinville%2c+WA&Name=Chris+Jones&Favorite+Characters=%23%2c+%40%2c+%24%2c+and+%25
The urlencode function takes care of escaping special characters as it translates #, @, $, and % into %23%2c+%40%2c+%24%2c+and+%25. Not only are the special characters translated, but the commas have also been converted to their hexadecimal values, and the spaces have been replaced with plus signs.
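For readers on a newer interpreter, the same idea can be sketched as follows. This assumes Python 3, where urlencode is imported from urllib.parse; the host www.example.com and the search.cgi path are placeholders, as in the text.

```python
from urllib.parse import urlencode

my_dict = {
    "Name": "Chris Jones",
    "Address": "Woodinville, WA",
    "Favorite Characters": "#, @, $, and %",
}

# urlencode escapes both keys and values and joins the pairs with "&".
query = urlencode(my_dict)

# The query data still needs a protocol, server, and path in front of it.
full_url = "http://www.example.com/search.cgi?" + query
print(full_url)
```

On recent Python 3 releases the pairs come out in the dictionary's insertion order and the hexadecimal digits are uppercase, so the result may not match the session above character for character, although it decodes to the same data.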
The quote function of urllib takes a single string of data and performs the necessary encoding. It is related to urlencode, which takes a dictionary as a parameter. The primary difference is that quote does not automatically generate key/value pairings based on a dictionary. The quote function exists to convert a single string into a URL-compliant syntax. For example, if a URL you are constructing consists of http://www.example.com/addQuotation.cgi?myquote=, but you need to add a URL-compliant value to it, you could use the quote function to encode it:
>>> from urllib import quote
>>> quote('Famous Quote: "I think, therefore I am."')
'Famous%20Quote%3a%20%22I%20think%2c%20therefore%20I%20am.%22'
Perhaps the most important thing to remember is that quote should be used to encode a single string, not a key/value pair. In any key=value combination, only the value should be encoded with the quote function. If you were to include the myquote= (or the key) portion of the query string when calling quote, the equal sign would also be encoded, rendering the URL worthless.
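The difference between quoting only the value and quoting the whole pair can be sketched like this, assuming Python 3 and urllib.parse.quote. The names good and bad are illustrative only.

```python
from urllib.parse import quote

base = "http://www.example.com/addQuotation.cgi?"
value = 'Famous Quote: "I think, therefore I am."'

# Correct: only the value is quoted, so the "=" separator survives.
good = base + "myquote=" + quote(value)

# Incorrect: quoting key and value together also escapes the "=" sign.
bad = base + quote("myquote=" + value)

print(good)
print(bad)
```

In the second URL the equal sign becomes %3D, so a server would no longer see a parameter named myquote.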
What goes up must come down. If you are encoding URLs programmatically, the odds are that you are going to need to decode one at some point or another. The unquote function of urllib takes an encoded string (such as that generated by quote) and returns the decoded version of it:
>>> from urllib import unquote
>>> unquote("Famous%20Quote%3a%20%22I%20think%2c%20therefore%20I%20am.%22")
'Famous Quote: "I think, therefore I am."'
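A quick round trip shows that decoding undoes encoding. This sketch assumes Python 3, where unquote and the related unquote_plus are imported from urllib.parse. unquote_plus is not discussed in the text; it is included here only because strings produced by urlencode use plus signs for spaces.

```python
from urllib.parse import quote, unquote, unquote_plus

original = 'Famous Quote: "I think, therefore I am."'

# Encoding and then decoding should return the starting string.
round_trip = unquote(quote(original))

# unquote leaves "+" untouched; unquote_plus turns it back into a space.
plain_decoded = unquote("Chris+Jones")
plus_decoded = unquote_plus("Chris+Jones")

print(round_trip)
print(plain_decoded, "|", plus_decoded)
```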
If you are constructing and deconstructing URLs programmatically, it follows that actually connecting to those URLs and retrieving their content is also of value.