While urllib is suitable for working with Internet files, you may still need to perform more intricate communication with an HTTP server. For example, if you are writing a Python program to communicate between two web sites, you may need to adjust the headers to include any cookies the site may require. You may need to emulate a certain browser type (by placing its name in your User-Agent header) if the site requires the latest version of Internet Explorer. Working with httplib as opposed to urllib in cases such as these allows for finer control.
HTTP conversations between browsers and servers involve headers and data. The interaction between a web browser and a web server reveals a great deal of information about both parties. The HTTP headers that precede content from the server and precede requests from the browser contain a lot of metadata about both client and server. For example, when you type a URL into your browser and press return, a complete HTTP request is sent to the remote server that can look something like this:
GET /c7/favquote.cgi HTTP/1.1
Host: www.python.org
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg
Accept-Language: en-us
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)
Connection: Keep-Alive
The headers tell the web server a great deal about the capabilities of the client browser. From the first line of the headers (GET /c7/favquote.cgi HTTP/1.1), we can tell that the request type is a GET, the target file is /c7/favquote.cgi, and the HTTP version in use is 1.1. Beyond this essential information is data telling the server what file types the browser can accept, what the browser is, and what type of HTTP connection to use.
The Accept lines tell the server that your browser can handle .gif and .jpeg files, among other image types. Notice there are three lines that start with Accept in the HTTP headers. They also show that the browser accepts en-us as its language, and both gzip and deflate as encodings:
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg
Accept-Language: en-us
Accept-Encoding: gzip, deflate
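To make the structure of these headers concrete, here is a minimal sketch that splits the request shown above into its request line and a dictionary of headers. It is written for Python 3 (an assumption; the text targets Python 2.x), and the helper name parse_request_head is hypothetical, not part of any library.

```python
# Sketch: split a raw HTTP request head into request line and headers.
# RAW_REQUEST reproduces the example request from the text.
RAW_REQUEST = (
    "GET /c7/favquote.cgi HTTP/1.1\r\n"
    "Host: www.python.org\r\n"
    "Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg\r\n"
    "Accept-Language: en-us\r\n"
    "Accept-Encoding: gzip, deflate\r\n"
    "User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)\r\n"
    "Connection: Keep-Alive\r\n"
    "\r\n"
)


def parse_request_head(raw):
    """Return (method, target, version, headers) for a raw request head."""
    # Everything before the first blank line is the head.
    head = raw.split("\r\n\r\n", 1)[0]
    lines = head.split("\r\n")
    # First line: request type, target file, HTTP version.
    method, target, version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so store them lowercased.
        headers[name.strip().lower()] = value.strip()
    return method, target, version, headers


if __name__ == "__main__":
    method, target, version, headers = parse_request_head(RAW_REQUEST)
    print(method, target, version)
    for name in ("accept", "accept-language", "accept-encoding"):
        print(name, "=", headers[name])
```

Lowercasing the header names is a design choice of this sketch: HTTP treats header names as case-insensitive, so a single canonical form keeps lookups simple.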
The User-Agent informs the server which browser or HTTP client you’re using. Every browser (and nearly every HTTP library) populates this field with one thing or another, letting web sites know how people are visiting. Some web sites are designed to specifically utilize the features of either Netscape Navigator or Internet Explorer and may redirect browsers to one set of pages or another based on what they see in the User-Agent string:
User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)
This User-Agent is Internet Explorer 5.5 running on Microsoft’s desktop platform. The last header in the previous example tells the server what type of connection to use. In this case, it’s Keep-Alive:
Connection: Keep-Alive
The Keep-Alive connection type tells the server to keep the socket between the client and the server open for additional resources. Typically, when downloading a web page, an initial request is made to retrieve the HTML page itself, and then a series of subsequent requests are made to retrieve images referenced with the <img> tag, as well as linked stylesheets and framesets. Initiating a new connection for each one of these resources would be very time-consuming, especially considering the graphics-laden web pages in use today. The Keep-Alive option lets the browser use the same channel it has already established to bring down all of the additional resources.
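The effect of a kept-alive connection can be observed with a small experiment. The sketch below assumes Python 3, where httplib is named http.client, and runs a throwaway server on the local machine. The handler class, function name, and paths are all hypothetical. The server records the client port of every request; if the socket is reused, every request arrives from the same port.

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer

seen_client_ports = []  # client-side port of each request, as seen by the server


class KeepAliveHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # lets the server keep the socket open

    def do_GET(self):
        seen_client_ports.append(self.client_address[1])
        body = ("you asked for " + self.path).encode("ascii")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        # A known length tells the client where this response ends,
        # which is what makes reusing the socket possible.
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch_over_one_connection(paths):
    """Fetch every path over a single connection; return the status codes."""
    server = HTTPServer(("127.0.0.1", 0), KeepAliveHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    conn = HTTPConnection("127.0.0.1", server.server_address[1], timeout=5)
    statuses = []
    try:
        for path in paths:
            conn.request("GET", path)
            response = conn.getresponse()
            response.read()  # drain the body before reusing the socket
            statuses.append(response.status)
    finally:
        conn.close()
        server.shutdown()
        server.server_close()
    return statuses


if __name__ == "__main__":
    print(fetch_over_one_connection(["/index.html", "/images/logo.gif"]))
    print("distinct client ports:", len(set(seen_client_ports)))
```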
In addition to the GET request, three other request types are covered here. Basically, a browser can do four things with a web server. It can GET a file. It can POST data to a file as well, such as sending a form to a CGI script. The two other lesser-known methods are HEAD and PUT. HEAD is used just to retrieve web server headers, and PUT is used to actually send a file to the server.
GET
Requests a file, and optionally contains a query string used by the file (if it’s a CGI or other executable) to generate dynamic data.
POST
Sends URL-encoded data to the server in a large chunk. Frequently used to send form fields to the server. Anything POSTed can also be sent via GET, but the difference is the query string becomes large and may look unsightly or be unmanageable in the browser when doing a GET. A POST is not carried on the query string, and so is not visible to an end user. Some servers may allow less data on a query string, and accept bigger chunks of data in a POST request.
HEAD
Similar to GET, but returns only the headers.
PUT
A seldom-used HTTP method to place files on the server. When using this method, the filename that normally goes after a GET or a POST request is used as the filename for the content being delivered to the server. In other words, instead of telling the server you want to GET a page called /index.html, tell it you are going to PUT a file named /index.html on the server. During a PUT operation, the contents of the file are sent after the headers, just like in a POST operation.
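The difference between carrying data on the query string and carrying it in the request body can be sketched as follows. This assumes Python 3 (urllib.parse.urlencode); the field values and the localhost Host header are illustrative assumptions.

```python
from urllib.parse import urlencode

# Hypothetical form fields to deliver to the server.
fields = {"id": "12345", "favquote": "I think, therefore I am."}
encoded = urlencode(fields)

# GET: the data rides on the request line as a query string; no body follows.
get_request = "\r\n".join([
    "GET /handler.cgi?" + encoded + " HTTP/1.1",
    "Host: localhost",
    "",
    "",
])

# POST: the request line names only the file; the same data is sent
# after the blank line, with its size announced in Content-Length.
post_request = "\r\n".join([
    "POST /handler.cgi HTTP/1.1",
    "Host: localhost",
    "Content-Type: application/x-www-form-urlencoded",
    "Content-Length: " + str(len(encoded)),
    "",
    encoded,
])

if __name__ == "__main__":
    print(get_request)
    print(post_request)
```

The encoded data is identical in both cases; only its position in the request changes.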
To manually use HTTP from Python, use httplib. The httplib module is standard and ships with Python 2.x. The HTTP class within httplib features several important methods for connecting to the server:
Req = HTTP(address, [port])
Returns an instance of the HTTP class for use as a request object in this connection. The connection is also made to the given address and optional port. Most web servers run on port 80, and you would not need to supply this argument. However, some web sites are kept on different ports and this option then gives you the ability to select a specific port.
Req.putrequest(method, file)
Performs the initial HTTP request method with its accompanying filename and HTTP version indicator. This is the first line of the headers, as in:
GET /c7/favquote.cgi HTTP/1.1
Req.putheader(header-type, value)
Adds a new header to the request. This method would be used to add your custom User-Agent string, Accept header, or cookies.
Req.endheaders( )
Instructs the module to finish off the headers sent to the server. In HTTP, the headers are separated from data with a blank line. That is, when the server is sending back an HTML page, it gives the browser its headers, followed by a blank line, followed by the HTML document. Conversely, when the browser is making a form POST to a server, it does the same thing and separates its request headers from its post data with a blank line.
Req.send(data)
Sends data after your request. The send method must be called after endheaders to speak proper HTTP. You use this method when making a POST or PUT.
ErrorCode, ErrorMessage, Headers = Req.getreply( )
Gives you the server’s response code and headers in one swoop. ErrorCode and ErrorMessage should be 200 and OK, respectively, if everything is going well. The Headers object is actually an instance of mimetools.Message.
fp = Req.getfile( )
Returns a file-like object that you can use to access the actual HTML (or other) document.
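To see how the call sequence above maps onto the bytes of a request, consider the following sketch. RequestSketch is a hypothetical stand-in written for illustration in Python 3; it is not httplib's implementation, and the real module may add headers of its own. It only records what each call contributes to the request.

```python
class RequestSketch:
    """Hypothetical stand-in: records what each call puts on the wire."""

    def __init__(self):
        self._lines = []    # request line and headers collected so far
        self._wire = b""    # bytes that would have been sent

    def putrequest(self, method, path, version="HTTP/1.1"):
        # First line of the request: method, file, HTTP version.
        self._lines.append("%s %s %s" % (method, path, version))

    def putheader(self, name, value):
        self._lines.append("%s: %s" % (name, value))

    def endheaders(self):
        # One CRLF per line, then a blank line to close the headers.
        self._wire += ("\r\n".join(self._lines) + "\r\n\r\n").encode("ascii")
        self._lines = []

    def send(self, data):
        # Body bytes follow the blank line untouched.
        self._wire += data

    def wire_bytes(self):
        return self._wire


if __name__ == "__main__":
    req = RequestSketch()
    req.putrequest("GET", "/index.html")
    req.putheader("Accept", "text/html")
    req.putheader("User-Agent", "MyPythonScript")
    req.endheaders()
    print(req.wire_bytes().decode("ascii"))
```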
Using the HTTP class from httplib is simple. Example 8-2 shows how to connect to a server and retrieve the index.html page.
>>> from httplib import HTTP
>>> req = HTTP("www.example.com")
>>> req.putrequest("GET", "/index.html")
>>> req.putheader("Accept", "text/html")
>>> req.putheader("User-Agent", "MyPythonScript")
>>> req.endheaders( )
>>> ec, em, h = req.getreply( )
>>> print ec, em
200 OK
>>> fd = req.getfile( )
>>> textlines = fd.read( )
>>> fd.close( )
The steps taken in Example 8-2 are straightforward. The HTTP class is called with an argument indicating the server, www.example.com. Next, headers are added describing some minimal information for the server including the type of data expected and the name of your user agent as MyPythonScript. The error code and error message are retrieved with a call to getreply and the result is printed to the console:
>>> print ec, em
200 OK
Next, getfile is used to retrieve the file-like object containing the document contents. The getfile method returns a file-like object that you can read from. After a call to read, the result is assigned to the variable textlines, which now contains the actual document. Calling close finishes off the request. You can print textlines to see what you have retrieved:
>>> print textlines
<html>
<head>
<link rel="stylesheet" type="text/css" href="/zpath.css">
</head>
<body BGCOLOR="#CDFF00">
<p>
<table WIDTH=100% height=100%>
<tr>
<td VALIGN="top" ALIGN="left">
<a HREF="/cgi-bin/start.cgi?page=top">
<img SRC="images/zplogo.gif" WIDTH=457 HEIGHT=144 BORDER=0>
</a>
</td>
</tr>
</table>
</p>
</body>
</html>
Of special note in this request is the User-Agent string. Most web site administrators run access reports and generate neat sets of statistics detailing the browser types in use. By writing your own Python Internet programs, you can add to the statistics. In Example 8-2, we set the User-Agent string to MyPythonScript by calling:
req.putheader("User-Agent", "MyPythonScript")
This is captured in the server logs, and will most likely show up in the less-than-one-percent category of the site administrator’s browser statistics.
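For readers on Python 3, where httplib became http.client, the same request can be sketched roughly as follows. To keep it self-contained, the sketch talks to a throwaway local server instead of www.example.com. That server, the handler class, and the function name are assumptions made for illustration. The server echoes back the User-Agent it received, which is what an access log would record.

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoAgentHandler(BaseHTTPRequestHandler):
    """Throwaway local server that reports the User-Agent it was sent."""

    def do_GET(self):
        agent = self.headers.get("User-Agent", "unknown")
        body = ("<html><body>Hello, " + agent + "</body></html>").encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch_index(user_agent):
    """GET /index.html with a custom User-Agent; return (code, message, text)."""
    server = HTTPServer(("127.0.0.1", 0), EchoAgentHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    conn = HTTPConnection("127.0.0.1", server.server_address[1], timeout=5)
    try:
        conn.putrequest("GET", "/index.html")
        conn.putheader("Accept", "text/html")
        conn.putheader("User-Agent", user_agent)
        conn.endheaders()
        # getresponse() plays the role of getreply() plus getfile().
        response = conn.getresponse()
        return response.status, response.reason, response.read().decode("utf-8")
    finally:
        conn.close()
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    ec, em, text = fetch_index("MyPythonScript")
    print(ec, em)
    print(text)
```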
Example 8-2 shows how to request a specific file. Say you’d like to also add a query string to your GET request. The second argument to HTTP.putrequest is the filename you’re after. To add a query string to the HTTP request, you could couple the filename with your data, as shown here, properly URL-encoded:
req.putrequest("GET", "/handler.cgi?id=12345")
If you need to encode your data because it contains special characters, you could use urllib’s quote function described earlier in this chapter, in Section 8.2.2. Quote only the value; quoting the whole key=value pair would also encode the = sign:
req.putrequest("GET", "/handler.cgi?numbers=" + quote("1/2/3/4/5"))
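How quote treats the characters in such a value is easy to check. The sketch below uses the Python 3 spelling, urllib.parse.quote (an assumption; in Python 2.x the function lives directly in urllib). By default the slash is considered safe and left alone, while the = sign is not.

```python
from urllib.parse import quote

value = "1/2/3/4/5"

# "/" is in quote's default safe set, so it passes through unchanged.
default_quoted = quote(value)

# Passing safe="" asks quote to encode "/" as well.
strict_quoted = quote(value, safe="")

# Quoting the whole pair encodes the "=" and destroys the key.
whole_pair = quote("numbers=" + value)

# Quote only the value, then attach it to the key.
path = "/handler.cgi?numbers=" + strict_quoted

if __name__ == "__main__":
    print(default_quoted)
    print(strict_quoted)
    print(whole_pair)
    print(path)
```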
Any hungry server administrator may be disappointed to learn that the cookies your browser sends to his web site are electronic. Cookies are frequently delivered to web servers by browsers to indicate a special identification for your browser. Your browser keeps the cookie and returns it whenever the same web site or document is requested. This lets the web server personalize site content for you, or connect you with some specific data that may be held in a database, such as your profile information or virtual shopping cart. If you are writing Python scripts to go between web sites, you may need to send cookies in your headers. You use the putheader method of the HTTP class to do so, as shown here:
req.putheader("Cookie", "key=value")
Conversely, when the server is sending cookies to your browser, a Set-Cookie header is thrown in the mix with the other headers and digested by your browser.
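A script that goes between web sites has to do this digesting itself: read the cookie out of the server's header and hand it back on the next request. One possible sketch uses http.cookies.SimpleCookie from the Python 3 standard library (an assumption, as the text targets Python 2.x). The header values shown are made up for illustration.

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie header values received from a server.
received = [
    "session=abc123; Path=/",
    "cart=42; Path=/shop",
]

jar = SimpleCookie()
for header_value in received:
    jar.load(header_value)  # parses name, value, and attributes such as Path

# Only name=value pairs go back to the server; attributes stay with the client.
cookie_header = "; ".join(
    "%s=%s" % (name, morsel.value) for name, morsel in jar.items()
)

if __name__ == "__main__":
    # The string below is what you would pass to putheader("Cookie", ...).
    print(cookie_header)
```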
If you are manually using HTTP from Python, odds are you’re moving documents around. You may be hitting one URL to get information from a database, constructing a form, and submitting that data to another web site via the POST operation. Creating a POST with httplib is straightforward, but more intricate than the examples shown thus far in this chapter. This method is detailed in the following sections.
Any example illustrating a POST is of no value without something to post to. So, for the purpose of this example, you can create a simple CGI script that echoes back your posted data. To use it, place this ten-line file in a CGI-capable directory of your web server as favquote.cgi, shown in Example 8-3.
#!/usr/bin/python
import cgi
form = cgi.FieldStorage( )
favquote = form["favquote"].value
print "Content-type: text/html"
print ""
print "<html><body>"
print "Your quote is: "
print favquote
print "</body></html>"
This simple CGI uses the cgi module to retrieve the data sent in the post. We make a post to this CGI in the next section.
One of the more interesting points of making a POST is ensuring that your data is properly URL-encoded. This means ensuring the favquote key is not encoded in the data if your CGI script is looking for a variable named favquote. For example, a proper key=value pair should be:
favquote=This%20is%20my%20quote%3a%20%22I%20think%20therefore%20I%20am.%22
However, if in your enthusiasm you quote the entire string and not just the value portion, you wind up with:
favquote%3dThis%20is%20my%20favorite%20quote%3a%20...
Unfortunately, the server will not know what to do with the second flawed scenario, as there is no key to associate the value with, as favquote= has been transformed into favquote%3d.
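The two scenarios can be compared from the receiving side. This sketch assumes Python 3 (urllib.parse) and uses parse_qs to play the role of the server decoding the posted fields.

```python
from urllib.parse import parse_qs, quote

myquote = 'This is my quote: "I think therefore I am."'

# Correct: quote only the value and leave "favquote=" alone.
good = "favquote=" + quote(myquote)

# Flawed: quoting the whole string encodes the "=" too.
bad = quote("favquote=" + myquote)

# Decode both the way a form-handling server would.
good_fields = parse_qs(good)
bad_fields = parse_qs(bad)

if __name__ == "__main__":
    print(good)
    print(good_fields)
    print(bad)
    print(bad_fields)  # no favquote key can be recovered
```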
In Example 8-2, we created a GET request using the methods of httplib. Performing a POST requires a couple of extra method calls, and a very precise order of events. The sequence of the HTTP calls is important, and making a POST requires extra headers. The start of a POST request is similar to a GET, as shown here:
req = HTTP("192.168.1.23")
req.putrequest("POST", "/c7/favquote.cgi")
req.putheader("Accept", "text/html")
req.putheader("User-Agent", "MyPythonScript")
Note that this time, the first argument to putrequest is POST. Beyond the change from GET to POST, the call to putrequest looks the same. When posting data to the server, it’s important that the server know exactly how many bytes of data the HTTP client is sending. While the HTTP headers rely on line breaks and a blank line as field delimiters, posted data may contain all sorts of special characters, binary data, or other nonprintable characters. Therefore, instead of relying on line breaks, the server requires that you specify how many bytes you’re sending, and then reads that number of bytes from your request. Specify the content length using putheader (note that you must know the number of bytes):
myquote = 'This is my quote: "I think therefore I am."'
postdata = "favquote=" + quote(myquote)
req.putheader("Content-Length", str(len(postdata)))
In these calls, you assemble the post data by concatenating the key portion (favquote=) with the quoted value. Use the len function to size up your URL-encoded string named postdata. Finally, since putheader expects a string as a second argument and not a number, convert the length with the str function.
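One detail worth checking is that Content-Length counts bytes, not characters. The sketch below assumes Python 3, where strings are Unicode; the helper name content_length_header is hypothetical. It shows why measuring the URL-encoded string is safe: after quoting, every character is plain ASCII, so characters and bytes agree.

```python
from urllib.parse import quote


def content_length_header(postdata):
    """Return the Content-Length value (a string) for a URL-encoded body."""
    # URL-encoded data is pure ASCII; encoding it makes the byte count explicit.
    return str(len(postdata.encode("ascii")))


raw_value = "caf\u00e9"          # "cafe" with an accented e: 4 characters
raw_bytes = raw_value.encode("utf-8")   # but 5 bytes in UTF-8
postdata = "favquote=" + quote(raw_value)

if __name__ == "__main__":
    print(len(raw_value), len(raw_bytes))
    print(postdata, content_length_header(postdata))
```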
The HTTP.send method is used to submit the data after ending the headers:
req.endheaders( )
req.send(postdata)
Now, you can get the reply and read the results as you did with your GET request earlier in Example 8-2. The result of the POST may be dynamically generated data such as search results, or it could be an HTML page detailing a problem associated with the POST.
As you can see, performing a POST (rather than a GET) requires learning a few more steps. The file post.py, shown in Example 8-4, pulls these ideas together and illustrates a complete POST operation. If you copied Example 8-3, favquote.cgi, to a CGI-capable directory on your web server, you should be able to run post.py from the command line. Be sure to put the appropriate IP address or localhost in the call to the HTTP constructor!
""" post.py """
from httplib import HTTP
from urllib import quote
# establish POST data
myquote = 'This is my quote: "I think therefore I am."'
# be sure not to quote the key= sequence...
postdata = "favquote=" + quote(myquote)
print "Will POST ", len(postdata), "bytes:"
print postdata
# begin HTTP request
req = HTTP("127.0.0.1") # change to a different IP if needed
req.putrequest("POST", "/c7/favquote.cgi")
req.putheader("Accept", "text/html")
req.putheader("User-Agent", "MyPythonScript")
# Set Content-length to length of postdata
req.putheader("Content-Length", str(len(postdata)))
req.endheaders( )
# send post data after ending headers,
# CGI script will receive it on STDIN
req.send(postdata)
ec, em, h = req.getreply( )
print "HTTP RESPONSE: ", ec, em
# get file-like object from HTTP response
# and print received HTML to screen
fd = req.getfile( )
textlines = fd.read( )
fd.close( )
print " Received following HTML: "
print textlines
The raw HTTP headers and post data of a comparable POST, here as submitted by a browser (note the Internet Explorer User-Agent), are shown below:
POST /c7/favquote.cgi HTTP/1.1
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg
Accept-Language: en-us
Content-Type: application/x-www-form-urlencoded
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)
Host: 192.168.1.45
Content-Length: 70
Connection: Keep-Alive

favquote=This+is+my+favorite+quote%3A+%22I+think%2C+therefore+I+am.%22
As you can see, the content length is specified and the data follows exactly one blank line after the headers.
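For experimenting without a CGI-capable web server, the whole exchange can be sketched in Python 3 (an assumption; httplib is http.client there). FavQuoteHandler below is a hypothetical local stand-in for favquote.cgi, not the CGI script itself. It reads exactly Content-Length bytes from the request, as described earlier, and echoes the favquote field back.

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, quote


class FavQuoteHandler(BaseHTTPRequestHandler):
    """Local stand-in for favquote.cgi: echoes the posted favquote field."""

    def do_POST(self):
        # Read exactly as many bytes as the client announced.
        length = int(self.headers.get("Content-Length", "0"))
        fields = parse_qs(self.rfile.read(length).decode("ascii"))
        favquote = fields.get("favquote", [""])[0]
        html = "<html><body>Your quote is: " + favquote + "</body></html>"
        body = html.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def post_quote(myquote):
    """POST a quote to the local stand-in; return (status, html)."""
    # Quote only the value, never the key= sequence.
    postdata = ("favquote=" + quote(myquote)).encode("ascii")
    server = HTTPServer(("127.0.0.1", 0), FavQuoteHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    conn = HTTPConnection("127.0.0.1", server.server_address[1], timeout=5)
    try:
        conn.putrequest("POST", "/c7/favquote.cgi")
        conn.putheader("Accept", "text/html")
        conn.putheader("User-Agent", "MyPythonScript")
        conn.putheader("Content-Type", "application/x-www-form-urlencoded")
        conn.putheader("Content-Length", str(len(postdata)))
        conn.endheaders()
        conn.send(postdata)  # the body goes out after the blank line
        response = conn.getresponse()
        return response.status, response.read().decode("utf-8")
    finally:
        conn.close()
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    status, html = post_quote('This is my quote: "I think therefore I am."')
    print("HTTP RESPONSE:", status)
    print(html)
```

Like the CGI script it imitates, the handler echoes the value without HTML-escaping it, which is acceptable for a local experiment but not for a public server.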
Thus far in this chapter, we’ve encountered urllib and httplib, and have retrieved generic URLs and created custom HTTP requests. Now we are going to take a look at Python’s support for implementing the server side of the connection.