Credit: Andy McKay
The Web has been a key technology for many years now, and it has become unusual to develop an application that doesn’t involve some aspects of the Web. From showing a help file in a browser to using web services, the Web has become an integral part of most applications.
I came to Python through a rather tortuous path of ASP (Active Server Pages), then Perl, some Zope, and then Python. Looking back, it seems strange that I didn’t find Python earlier, but the dominance of Perl and ASP (and later PHP) in this area makes it difficult for new developers to see the advantages of Python shining through all the other languages.
Unsurprisingly, Python is an excellent language for web development, and, as a “batteries included” language, Python comes with most of the modules you need. The relatively recent inclusion of xmlrpclib in the Python Standard Library is a reassuring indication that batteries continue to be added as the march of technology requires, making the standard libraries even more useful. One of the modules I often use is urllib, which demonstrates the power of a simple, well-designed module—saving a file from the Web in two lines (using urlretrieve) is easy. The cgi module is another example of a module that has enough functionality to work with, but not too much to make your scripts slow and bloated.
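For instance, here is the two-line download just mentioned (the URL and local filename are, of course, just placeholders for illustration):

import urllib
urllib.urlretrieve('http://www.python.org/index.html', 'index.html')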
Compared to other languages, Python seems to have an unusually large number of application servers and templating languages. While it’s easy to develop anything for the Web in Python “from scratch”, it would be peculiar and unwise to do so without first looking at the application servers available. Rather than continually recreating dynamic pages and scripts, the community has taken on the task of building these application servers to allow other users to create the content in easy-to-use templating systems.
Zope is the most well-known product in this space and provides an object-oriented interface to web publishing. With features too numerous to mention, Zope allows a robust and powerful object-publishing environment. The new, revolutionary major release, Zope 3, makes Zope more Pythonic and powerful than ever. Quixote and WebWare are two other application servers with similar, highly modular designs. Any of these can be a real help to the overworked web developer who needs to reuse components and to give other users the ability to create web sites. The Twisted network-programming framework, increasingly acknowledged as the best-of-breed Python framework for asynchronous network programming, is also starting to expand into the web application server field, with its newer “Nevow” offshoot, which you’ll also find used in some of the recipes in this chapter.
For all that, an application server is just too much at times, and a simple CGI script is really all you need. Indeed, the very first recipe, Recipe 14.1, demonstrates all the ingredients you need to make sure that your web server and Python CGI scripting setup are working correctly. Writing a CGI script doesn’t get much simpler than this, although, as the recipe’s discussion points out, you could use the cgi.test function to make it even shorter.
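Indeed, the whole cgi.test-based version is a sketch of just three lines (cgi.test emits its own minimal headers and dumps the CGI environment as HTML):

#!/usr/local/bin/python
import cgi
cgi.test()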
Another common web-related task is the parsing of HTML, either on your own site or on other web sites. Parsing HTML tags correctly is not as simple as many developers first think, as they optimistically assume a few regular expressions or string searches will see them through. However, we have decided to deal with such issues in other chapters, such as Chapter 1, rather than in this one. After all, while HTML was born with and for the Web, these days HTML is also often used in other contexts, such as for distributing documentation. In any case, most web developers create more than just web pages, so, even if you, the reader, primarily identify as a web developer, and maybe turned to this chapter as your first one in the book, you definitely should peruse the rest of the book, too: many relevant, useful recipes in other chapters describe parsing XML, reading network resources, performing systems administration, dealing with images, and many great ideas about developing in Python, testing your programs, and debugging them!
Credit: Jeff Bauer, Carey Evans
You want a simple CGI (Common Gateway Interface) program to use as a starting point for your own CGI programming or to determine whether your setup is functioning properly.
The cgi module is normally used in Python CGI programming, but here we use only its escape function to ensure that the value of an environment variable doesn’t accidentally look to the browser like HTML markup. We do all of the real work ourselves in the following script:
#!/usr/local/bin/python
print "Content-type: text/html"
print
print "<html><head><title>Situation snapshot</title></head><body><pre>"
import sys
sys.stderr = sys.stdout
import os
from cgi import escape
print "<strong>Python %s</strong>" % sys.version
keys = os.environ.keys()
keys.sort()
for k in keys:
    print "%s %s" % (escape(k), escape(os.environ[k]))
print "</pre></body></html>"
CGI is a standard that specifies how a web server runs a separate program (often known as a CGI script) that generates a web page dynamically. The protocol specifies how the server provides input and environment data to the script and how the script generates output in return. You can use any language to write your CGI scripts, and Python is well suited for the task.
This recipe is a simple CGI program that takes no input and just displays the current version of Python and the environment values. CGI programmers should always have some simple code handy to drop into their cgi-bin directories. You should run this script before wasting time slogging through your Apache configuration files (or whatever other web server you want to use for CGI work). Of course, cgi.test does all this and more, but it may, in fact, do too much. It does so much, and so much is hidden inside cgi’s innards, that it’s hard to tweak it to reproduce any specific problems you may be encountering in true scripts. Tweaking the program in this recipe, on the other hand, is very easy, since it’s such a simple program, and all the parts are exposed.
Besides, this little script is already quite instructive in its own way. The starting line, #!/usr/local/bin/python, must give the absolute path to the Python interpreter with which you want to run your CGI scripts, so you may need to edit it accordingly. A popular solution for non-CGI scripts is to have a first line (the so-called shebang line) that looks something like this:
#!/usr/bin/env python
However, this approach puts you at the mercy of the PATH environment setting, since it runs the first program named python that it finds on the PATH, and that may well not be what you want under CGI, where you don’t fully control the environment. Incidentally, many web servers implement the shebang line even when running under non-Unix systems, so that, for CGI use specifically, it’s not unusual to see Python scripts on Windows start with a first line such as:
#!c:/python23/python.exe
Another issue you may be contemplating is why the import statements are not right at the start of the script, as is the usual Python style, but are preceded by a few print statements. The reason is that import could fail if the Python installation is terribly misconfigured. In case of failure, Python emits diagnostics to standard error (which is typically directed to your web server logs, depending on how you set up and configured your web server), and nothing will go to standard output. The CGI standard demands that all output be on standard output, so we first ensure that a minimal quantity of output will display a result to a visiting browser. Then, assuming that import sys succeeds (if it fails, the whole Python installation is so badly broken that you can do very little about it!), we immediately perform the following assignment:
sys.stderr = sys.stdout
This assignment statement ensures that error output will go to standard output, so that you’ll have a chance to see it in the visiting browser. You can perform other import operations or do further work in the script only when this is done. Another option makes getting tracebacks for errors in CGI scripts much simpler. Simply add the following at the start of your script:
import cgitb; cgitb.enable()
and the standard Python library module cgitb takes care of whatever else is needed to get error tracebacks on the browser. However, as already stated, the point of this recipe is to show how everything is done, rather than just reusing prepackaged functionality.
One last reflection is that, in Python 2.4, instead of the three lines:
keys = os.environ.keys()
keys.sort()
for k in keys:
used in the recipe, you could use the single line:
for k in sorted(os.environ):
Unfortunately, since CGI scripts must often run in environments you do not control, I cannot suggest you code to a specific, recent version of Python in this particular case—particularly not a script such as this one, which is meant to let you examine and check out the exact circumstances under which your CGI runs.
Yet another consideration, not strictly related to Python, is that this script is coded to emit correct HTML. Just about all known browsers let you get away with skipping most of the HTML tags that this script outputs, but why skimp on correctness, relying on the browser to patch your holes? It costs little to emit correct HTML, so you should get into the habit of doing things right, when the cost is so modest. (I wish more authors of web pages, and of programs producing web pages, shared this sentiment. If they did, there would be a lot less broken HTML out on the Web!)
Documentation on the cgi and cgitb standard library modules in the Library Reference and Python in a Nutshell; a basic introduction to the CGI protocol is available at http://hoohoo.ncsa.uiuc.edu/cgi/overview.html.
Credit: Jürgen Hermann
To build a URL within a script, you need information such as the hostname and script name. According to the CGI standard, the web server sets up a lot of useful information in the process environment of a script before it runs the script itself. In a Python script, we can access the process environment as the dictionary os.environ, an attribute of the standard Python library os module, and through accesses to the process environment build our own module of useful helper functions:
import os

def isSSL():
    """ Return true if we are on an SSL (https) connection. """
    return os.environ.get('SSL_PROTOCOL', '') != ''

def getScriptname():
    """ Return the scriptname part of the URL ("/path/to/my.cgi"). """
    return os.environ.get('SCRIPT_NAME', '')

def getPathinfo():
    """ Return the remaining part of the URL. """
    pathinfo = os.environ.get('PATH_INFO', '')
    # Fix for a well-known bug in IIS/4.0
    if os.name == 'nt':
        scriptname = getScriptname()
        if pathinfo.startswith(scriptname):
            pathinfo = pathinfo[len(scriptname):]
    return pathinfo

def getQualifiedURL(uri=None):
    """ Return a full URL starting with schema, servername, and port.
        Specifying uri causes it to be appended to the server root URL
        (uri must then start with a slash). """
    schema, stdport = (('http', '80'), ('https', '443'))[isSSL()]
    host = os.environ.get('HTTP_HOST', '')
    if not host:
        host = os.environ.get('SERVER_NAME', 'localhost')
        port = os.environ.get('SERVER_PORT', '80')
        if port != stdport:
            host = host + ":" + port
    result = "%s://%s" % (schema, host)
    if uri:
        result = result + uri
    return result

def getBaseURL():
    """ Return a fully qualified URL to this script. """
    return getQualifiedURL(getScriptname())
URLs can be manipulated in numerous ways, but many CGI scripts have common needs. This recipe collects a few typical high-level functional needs for URL synthesis from within CGI scripts. You should never hard-code hostnames or absolute paths in your scripts. Doing so makes it difficult to port the scripts elsewhere or rename a virtual host. The CGI environment has sufficient information available to avoid such hard-coding. By importing this recipe’s code as a module, you can avoid duplicating code in your scripts to collect and use that information in typical ways.
The recipe works by accessing information in os.environ, the attribute of Python’s standard os module that collects the process environment of the current process and lets your script access it as if it were a normal Python dictionary. In particular, os.environ has a get method, just like a normal dictionary does, that returns either the mapping for a given key or, if that key is missing, a default value that you supply in the call to get. This recipe performs all accesses through os.environ.get, thus ensuring sensible behavior even if the relevant environment variables have been left undefined by your web server (which should never happen—but not all web servers are free of bugs).
Among the functions presented in this recipe, getQualifiedURL is the one you’ll use most often. It transforms a URI (Universal Resource Identifier) into a URL on the same host (and with the same schema) used by the CGI script that calls it. It gets the information from the environment variables HTTP_HOST, SERVER_NAME, and SERVER_PORT. Furthermore, it can handle secure (https) as well as normal (http) connections, and selects between the two by using the isSSL function, which is also part of this recipe.
Suppose you need to redirect a visiting browser to another location on this same host. Here’s how you can use a function from this recipe, hard-coding only the redirect location on the host itself, but not the hostname, port, and normal or secure schema:
# example redirect header:
print "Location:", getQualifiedURL("/go/here")
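Similarly, a script that displays a self-posting form can use getBaseURL to point the form back at itself. This sketch assumes you have saved the recipe's module as, say, cgiurls.py (a hypothetical name):

import cgiurls   # hypothetical name for this recipe's module
print "Content-type: text/html"
print
print '<form action="%s" method="POST">' % cgiurls.getBaseURL()
print '<input name="q"> <input type="submit"></form>'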
Documentation on the os standard library module in the Library Reference and Python in a Nutshell; a basic introduction to the CGI protocol is available at http://hoohoo.ncsa.uiuc.edu/cgi/overview.html.
Credit: Noah Spurrier, Georgy Pruss
Net of any security checks, safeguards against denial of service (DOS) attacks, and the like, the task boils down to what’s exemplified in the following CGI script:
#!/usr/local/bin/python
import cgi
import cgitb; cgitb.enable()
import os, sys
try:
    import msvcrt            # are we on Windows?
except ImportError:
    pass                     # nope, no problem
else:                        # yep, need to set I/O to binary mode
    for fd in (0, 1):
        msvcrt.setmode(fd, os.O_BINARY)

UPLOAD_DIR = "/tmp"

HTML_TEMPLATE = """<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html><head><title>Upload Files</title>
</head><body><h1>Upload Files</h1>
<form action="%(SCRIPT_NAME)s" method="POST" enctype="multipart/form-data">
File name: <input name="file_1" type="file"><br>
File name: <input name="file_2" type="file"><br>
File name: <input name="file_3" type="file"><br>
<input name="submit" type="submit">
</form>
</body>
</html>"""

def print_html_form():
    """ print the form to stdout, with action set to this very script (a
        'self-posting form': script both displays AND processes the form). """
    print "content-type: text/html; charset=iso-8859-1\n"
    print HTML_TEMPLATE % {'SCRIPT_NAME': os.environ['SCRIPT_NAME']}

def save_uploaded_file(form_field, upload_dir):
    """ Save to disk a file just uploaded, form_field being the name of the
        file input field on the form.  No-op if field or file is missing. """
    form = cgi.FieldStorage()
    if not form.has_key(form_field):
        return
    fileitem = form[form_field]
    if not fileitem.file:
        return
    fout = open(os.path.join(upload_dir, fileitem.filename), 'wb')
    while True:
        chunk = fileitem.file.read(100000)
        if not chunk:
            break
        fout.write(chunk)
    fout.close()

save_uploaded_file("file_1", UPLOAD_DIR)
save_uploaded_file("file_2", UPLOAD_DIR)
save_uploaded_file("file_3", UPLOAD_DIR)
print_html_form()
The CGI script shown in this recipe is very bare-bones, but it does get the job done. It’s a self-posting script: it displays the upload form, and it processes the form when the user submits it, complete with any uploaded files. The script just saves files to an upload directory, which in the recipe is simply set to /tmp.
The script as presented takes no precaution against DOS attacks, so a user could try to fill up your disk with endless uploads. If you deploy this script on a system that is accessible to the public, do add checks to limit the number and size of files written to disk, perhaps depending, also, on how much disk space is still available. A version that might perhaps be more to your liking can be found at http://zxw.nm.ru/test_w_upload.py.htm.
Documentation on the cgi, cgitb, and msvcrt standard library modules in the Library Reference and Python in a Nutshell.
Credit: James Thiele, Rogier Steehouder
You want to check whether an HTTP URL corresponds to an existing web page.
Using httplib allows you to easily check for a page’s existence without actually downloading the page itself, just its headers. Here’s a module implementing a function to perform this task:
""" httpExists.py A quick and dirty way to check whether a web file is there. Usage: >>> import httpExists >>> httpExists.httpExists('http://www.python.org/') True >>> httpExists.httpExists('http://www.python.org/PenguinOnTheTelly') Status 404 Not Found : http://www.python.org/PenguinOnTheTelly False """ import httplib, urlparse def httpExists(url): host, path = urlparse.urlsplit(url)[1:3] if ':' in host: # port specified, try to use it host, port = host.split(':', 1) try: port = int(port) except ValueError: print 'invalid port number %r' % (port,) return False else: # no port specified, use default port port = None try: connection = httplib.HTTPConnection(host, port=port) connection.request("HEAD", path) resp = connection.getresponse( ) if resp.status == 200: # normal 'found' status found = True elif resp.status == 302: # recurse on temporary redirect found = httpExists(urlparse.urljoin(url, resp.getheader('location', ''))) else: # everything else -> not found print "Status %d %s : %s" % (resp.status, resp.reason, url) found = False except Exception, e: print e._ _class_ _, e, url found = False return found def _test( ): import doctest, httpExists return doctest.testmod(httpExists) if _ _name_ _ == "_ _main_ _": _test( )
While this recipe is very simple and runs quite fast (thanks to the ability to use the HTTP command HEAD to get just the headers, not the body, of the page), it may be too simplistic for your specific needs: the HTTP result codes you might need to deal with may go beyond the simple 200 success code, and 302 temporary redirect, to include permanent redirects, temporary inaccessibility, permission problems, and so on.
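For example, here is one possible extension (a sketch, not part of the original recipe) that treats the common permanent and temporary redirect codes uniformly; for brevity, it omits the recipe's explicit port handling:

import httplib, urlparse

REDIRECT_CODES = 301, 302, 303, 307   # permanent and temporary redirects

def httpExistsLenient(url):
    ''' sketch: like httpExists, but follows any common redirect status '''
    host, path = urlparse.urlsplit(url)[1:3]
    try:
        connection = httplib.HTTPConnection(host)
        connection.request("HEAD", path)
        resp = connection.getresponse()
        if resp.status == 200:
            return True
        if resp.status in REDIRECT_CODES:
            return httpExistsLenient(
                urlparse.urljoin(url, resp.getheader('location', '')))
        print "Status %d %s : %s" % (resp.status, resp.reason, url)
        return False
    except Exception:
        return False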
In my case, I needed to check the correctness of a huge number of mutual links among pages of a site generated by a complex web application on an intranet, so I knew I had the privilege of relying on a simple check for “200 or bust.” At any rate, you can use this simple recipe as a starting point to which to add any refinements you determine you actually need.
Documentation on the urlparse and httplib standard library modules in the Library Reference and Python in a Nutshell.
Credit: Bob Stockwell
You need to determine whether a URL, or an open file, obtained from urllib.urlopen on a URL, is of a particular content type (such as 'text' for HTML or 'image' for GIF).
The content type of any resource can easily be checked through the pseudo-file that urllib.urlopen returns for the resource. Here is a function to show how to perform such checks:
import urllib

def isContentType(URLorFile, contentType='text'):
    """ Tells whether the URL (or pseudofile from urllib.urlopen) is of
        the required content type (default 'text'). """
    try:
        if isinstance(URLorFile, str):
            thefile = urllib.urlopen(URLorFile)
        else:
            thefile = URLorFile
        result = thefile.info().getmaintype() == contentType.lower()
        if thefile is not URLorFile:
            thefile.close()
    except IOError:
        result = False    # if we couldn't open it, it's of _no_ type!
    return result
For greater flexibility, this recipe accepts either the result of a previous call to urllib.urlopen, or a URL in string form. In the latter case, the Solution opens the URL with urllib and, at the end, closes the resulting pseudo-file again. If the attempt to open the URL fails, the recipe catches the IOError and returns a result of False, considering that a URL that cannot be opened is of no type at all, and therefore in particular is not of the type the caller was checking for. (Alternatively, you might prefer to propagate the exception; if that’s what you want, remove the try and except clause headers and the result = False assignment that is the body of the except clause.)
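If you do prefer propagation, the variant looks like this minimal sketch (the new name is just for illustration):

import urllib

def isContentTypeStrict(URLorFile, contentType='text'):
    ''' like isContentType, but lets IOError propagate to the caller '''
    if isinstance(URLorFile, str):
        thefile = urllib.urlopen(URLorFile)
    else:
        thefile = URLorFile
    result = thefile.info().getmaintype() == contentType.lower()
    if thefile is not URLorFile:
        thefile.close()
    return result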
Whether the pseudo-file was passed in or opened locally from a URL string, the info method of the pseudo-file gives as its result an instance of mimetools.Message (which doesn’t mean you need to import mimetools yourself—urllib does all that’s needed). On that object, we can call any of several methods to get the content type, depending on what exactly we want—gettype to get both main and subtype with a slash in between (as in 'text/plain'), getmaintype to get the main type (as in 'text'), or getsubtype to get the subtype (as in 'plain'). In this recipe, we want the main content type.
The string result from all of the type interrogation methods is always lowercase, so we take the precaution of calling the lower method on parameter contentType as well, before comparing for equality.
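For example, here is how you might call the function, assuming its code is saved as a module named contenttype.py (a hypothetical name):

from contenttype import isContentType   # hypothetical module name
if isContentType('http://www.python.org/', 'text'):
    print 'that URL serves text (e.g., HTML)'
if not isContentType('http://www.python.org/', 'image'):
    print 'and, indeed, it is not an image'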
Documentation on the urllib and mimetools standard library modules in the Library Reference and Python in a Nutshell; a list of important content types is at http://www.utoronto.ca/ian/books/html4ed/appb/mimetype.html; a helpful explanation of the significance of content types is at http://ppewww.ph.gla.ac.uk/~flavell/www/content-type.html.
Credit: Chris Moffitt
Downloads of large files are sometimes interrupted. However, a good HTTP server that supports the Range header lets you resume the download from where it was interrupted. The standard Python module urllib lets you access this functionality almost seamlessly: you just have to add the required header and intercept the error code that the server sends to confirm that it will respond with a partial file. Here is a function, with a little helper class, to perform this task:
import urllib, os

class myURLOpener(urllib.FancyURLopener):
    """ Subclass to override err 206 (partial file being sent); okay for us """
    def http_error_206(self, url, fp, errcode, errmsg, headers, data=None):
        pass    # Ignore the expected "non-error" code

def getrest(dlFile, fromUrl, verbose=0):
    myUrlclass = myURLOpener()
    if os.path.exists(dlFile):
        outputFile = open(dlFile, "ab")
        existSize = os.path.getsize(dlFile)
        # If the file exists, then download only the remainder
        myUrlclass.addheader("Range", "bytes=%s-" % (existSize))
    else:
        outputFile = open(dlFile, "wb")
        existSize = 0
    webPage = myUrlclass.open(fromUrl)
    if verbose:
        for k, v in webPage.headers.items():
            print k, "=", v
    # If we already have the whole file, there is no need to download it again
    numBytes = 0
    webSize = int(webPage.headers['Content-Length'])
    if webSize == existSize:
        if verbose:
            print "File (%s) was already downloaded from URL (%s)" % (
                dlFile, fromUrl)
    else:
        if verbose:
            print "Downloading %d more bytes" % (webSize - existSize)
        while True:
            data = webPage.read(8192)
            if not data:
                break
            outputFile.write(data)
            numBytes = numBytes + len(data)
    webPage.close()
    outputFile.close()
    if verbose:
        print "downloaded", numBytes, "bytes from", webPage.url
    return numBytes
The HTTP Range header lets the web server know that you want only a certain range of data to be downloaded, and this recipe takes advantage of this header. Of course, the server needs to support the Range header, but since the header is part of the HTTP 1.1 specification, it’s widely supported. This recipe has been tested with Apache 1.3 as the server, but I expect no problems with other reasonably modern servers.
The recipe lets urllib.FancyURLopener do all the hard work of adding a new header, as well as the normal handshaking. I had to subclass the standard class from urllib only to make it known that the error 206 is not really an error in this case—so you can proceed normally. In the function, I also perform extra checks to quit the download if I’ve already downloaded the entire file.
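A typical use, assuming you have saved the recipe's code as a module named resume.py (a hypothetical name), might be:

from resume import getrest   # hypothetical module name for the recipe's code
# run once, interrupt, then run again: only the missing bytes get fetched
getrest('bigfile.zip', 'http://www.example.com/bigfile.zip', verbose=1)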
Check out the HTTP 1.1 RFC (2616) to learn more about the meaning of the headers. You may find a header that is especially useful, and Python’s urllib lets you send any header you want.
Documentation on the urllib standard library module in the Library Reference and Python in a Nutshell; the HTTP 1.1 RFC (http://www.ietf.org/rfc/rfc2616.txt).
Credit: Mike Foord, Nikos Kouremenos
You need to fetch web pages (or other resources from the web) that require you to handle cookies (e.g., save cookies you receive and also reload and send cookies you had previously received from the same site).
The Python 2.4 Standard Library provides a cookielib module exactly for this task. For Python 2.3, a third-party ClientCookie module works similarly. We can write our code to ensure usage of the best available cookie-handling module—including none at all, in which case our program will still run but without saving and resending cookies. (In some cases, this might still be OK, just maybe slower.) Here is a script to show how this concept works in practice:
import os.path, urllib2
from urllib2 import urlopen, Request

COOKIEFILE = 'cookies.lwp'     # "cookiejar" file for cookie saving/reloading

# first try getting the best possible solution, cookielib:
try:
    import cookielib
except ImportError:            # no cookielib, try ClientCookie instead
    cookielib = None
    try:
        import ClientCookie
    except ImportError:        # nope, no cookies today
        cj = None              # so, in particular, no cookie jar
    else:                      # using ClientCookie, prepare everything
        urlopen = ClientCookie.urlopen
        cj = ClientCookie.LWPCookieJar()
        Request = ClientCookie.Request
else:                          # we do have cookielib, prepare the jar
    cj = cookielib.LWPCookieJar()

# Now load the cookies, if any, and build+install an opener using them
if cj is not None:
    if os.path.isfile(COOKIEFILE):
        cj.load(COOKIEFILE)
    if cookielib:
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
        urllib2.install_opener(opener)
    else:
        opener = ClientCookie.build_opener(ClientCookie.HTTPCookieProcessor(cj))
        ClientCookie.install_opener(opener)

# for example, try a URL that sets a cookie
theurl = 'http://www.diy.co.uk'
txdata = None   # or, for POST instead of GET, txdata=urllib.urlencode(somedict)
txheaders = {'User-agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}

try:
    req = Request(theurl, txdata, txheaders)   # create a request object
    handle = urlopen(req)                      # and open it
except IOError, e:
    print 'Failed to open "%s".' % theurl
    if hasattr(e, 'code'):
        print 'Error code: %s.' % e.code
else:
    print 'Here are the headers of the page:'
    print handle.info()
# you can also use handle.read() to get the page, handle.geturl() to get the
# true URL (could be different from `theurl' if there have been redirects)

if cj is None:
    print "Sorry, no cookie jar, can't show you any cookies today"
else:
    print 'Here are the cookies received so far:'
    for index, cookie in enumerate(cj):
        print index, ': ', cookie
    cj.save(COOKIEFILE)        # save the cookies again
The third-party module ClientCookie, available for download at http://wwwsearch.sourceforge.net/ClientCookie/, was so successful that, in Python 2.4, its functionality has been added to the Python Standard Library—specifically, the cookie-handling parts in the new module cookielib, the rest in the current version of urllib2.
So, you do need to be careful if you want your code to work just as well on any 2.4 installation (using the latest and greatest cookielib) or an installation of Python 2.3 with ClientCookie on top. As long as we’re at it, we might as well handle running on a 2.3 installation that does not have ClientCookie—run anyway, just don’t save and resend cookies when we lack library code to do so. On some sites, the inability to handle cookies will just be a bother and perhaps a performance hit due to the loss of session continuity, but the site will still work. Other sites, of course, will be completely unusable without cookies.
The recipe’s code is an exercise in the careful management of an idiom that’s an essential part of making your Python code portable among releases and installations, while ensuring minimal graceful degradation when third-party modules you’d like to use just aren’t there. The idiom is known as conditional import and is expressed as follows:
try:
    import something
except ImportError:   # 'something' not available
    ...code to do without, degrading gracefully...
else:                 # 'something' IS available, hooray!
    ...code to run only when something is there...
# and then, go on with the rest of your program
...code able to run with or w/o `something'...
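For instance, a tiny concrete use of the same idiom, which you have probably seen in much Python 2 code, picks the fastest available StringIO implementation:

try:
    import cStringIO as StringIO    # fast C implementation, when available
except ImportError:
    import StringIO                 # pure-Python fallback, always present
# from here on, code just uses StringIO.StringIO(...) either way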
The use of “conditional import” is particularly delicate in this recipe because ClientCookie and cookielib aren’t drop-in replacements for each other—therefore, careful management is indeed necessary. But, if you study this recipe, you will see that it is not rocket science—it just requires attention.
One key technique is to make double use of a small number of names as “flags”, with value None when the object to which they would normally refer is not available. In this recipe, we do that for cookielib (which refers to the module of that name when there is one, and otherwise to None) and cj (which refers to a cookie-jar object when there is any, and otherwise to None). Even better, when feasible, is to assign names appropriately to refer to the best available object under the circumstances: the recipe does that for variables urlopen and Request. Note how crucial it is for this purpose that Python treats all objects as first class: urlopen is a function, Request is a class, cookielib (if any) a module, cj (if any) an instance object. The distinction, however, doesn’t matter in the least: the name-object reference concept is exactly the same in every case, with total uniformity, simplicity, and power.
When either cookielib or ClientCookie is available, the cookies are saved in a file in cookie jar format (a useful plain-text format that is automatically handled by either module but can also be examined and modified with text editors and other programs). If the file already exists when the program runs, cookies are loaded from the file, ready to be sent back to the appropriate sites.
My reason for developing this code is that I’m developing a cgi-proxy, approx.py (http://www.voidspace.org.uk/atlantibots/pythonutils.html#cgiproxy), which needs to be able to handle cookies when feasible. To keep the proxy usable on various versions of Python, and ensure it degrades gracefully when no cookie-handling library is available, I needed to develop the carefully managed conditional imports that are shown in the recipe’s Solution. I decided to share them in this recipe since, besides the importance of cookie handling, conditional imports are such a generally important Python idiom. Particularly when installing your code on a server you don’t control, it is unfortunately quite common to have little say in which version of Python is running, nor in which third-party extensions are installed—exactly the kind of situation that requires the conditional import technique to ensure your code does the best it can under the circumstances.
Documentation on the cookielib and urllib2 standard library modules in the Library Reference for Python 2.4; ClientCookie is at http://wwwsearch.sourceforge.net/ClientCookie/.
Credit: John Nielsen
You need to use httplib for HTTPS navigation through a proxy that requires basic authentication, but httplib out of the box supports HTTPS only through proxies that do not require authentication.
Unfortunately, it takes a wafer-thin amount of trickery to achieve this recipe’s task. Here is a script that is just tricky enough:
import httplib, base64, socket

# parameters for the script
user = 'proxy_login'; passwd = 'proxy_pass'
host = 'login.yahoo.com'; port = 443
phost = 'proxy_host'; pport = 80

# setup basic authentication
user_pass = base64.encodestring(user+':'+passwd)
proxy_authorization = 'Proxy-authorization: Basic '+user_pass+'\r\n'
proxy_connect = 'CONNECT %s:%s HTTP/1.0\r\n' % (host, port)
user_agent = 'User-Agent: python\r\n'
proxy_pieces = proxy_connect+proxy_authorization+user_agent+'\r\n'

# connect to the proxy
proxy_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
proxy_socket.connect((phost, pport))
proxy_socket.sendall(proxy_pieces)
response = proxy_socket.recv(8192)
status = response.split()[1]
if status != '200':
    raise IOError, 'Connecting to proxy: status=%s' % status

# trivial setup for SSL socket
ssl = socket.ssl(proxy_socket, None, None)
sock = httplib.FakeSocket(proxy_socket, ssl)

# initialize httplib and replace the connection's socket with the SSL one
h = httplib.HTTPConnection('localhost')
h.sock = sock

# and finally, use the now-HTTPS httplib connection as you wish
h.request('GET', '/')
r = h.getresponse()
print r.read()
HTTPS is essentially HTTP spoken on top of an SSL connection rather than a plain socket. So, this recipe connects to the proxy with basic authentication at the very lowest level of Python socket programming, wraps an SSL socket around the proxy connection thus secured, and finally plays a little trick under httplib’s nose to use that laboriously constructed SSL socket in place of the plain socket in an HTTPConnection instance. From that point onwards, you can use the normal httplib approach as you wish.
Documentation on the socket and httplib standard library modules in the Library Reference and Python in a Nutshell.
Credit: Brian Zhou
Java (and Jython) are most often deployed server-side, and thus servlets are a typical way of deploying your code. Jython makes servlets very easy to use. Here is a tiny “hello world” example servlet:
import java, javax, sys

class hello(javax.servlet.http.HttpServlet):
    def doGet(self, request, response):
        response.setContentType("text/html")
        out = response.getOutputStream()
        print >>out, """<html>
<head><title>Hello World</title></head>
<body>Hello World from Jython Servlet at %s!
</body>
</html>
""" % (java.util.Date(),)
        out.close()
        return
This recipe is no worse than a typical JSP (Java Server Page) (see http://jywiki.sourceforge.net/index.php?JythonServlet for setup instructions). Compare this recipe to the equivalent Java code: with Python, you’re finished coding in the same time it takes to set up the framework in Java. Most of your setup work will be strictly related to Tomcat or whichever servlet container you use. The Jython-specific work is limited to copying jython.jar to the WEB-INF/lib subdirectory of your chosen servlet context and editing WEB-INF/web.xml to add <servlet> and <servlet-mapping> tags so that org.python.util.PyServlet serves the *.py <url-pattern>.
The key to this recipe (like most other Jython uses) is that your Jython scripts and modules can import and use Java packages and classes just as if the latter were Python code or extensions. In other words, all of the Java libraries that you could use with Java code are similarly usable with Python (i.e., Jython) code. This example servlet first uses the standard Java servlet response object to set the resulting page’s content type (to text/html) and to get the output stream. Afterwards, it can print to the output stream, since the latter is a Python file-like object. To further show off your seamless access to the Java libraries, you can also use the Date class of the java.util package, incidentally demonstrating how it can be printed as a string from Jython.
Information on Java servlets at http://java.sun.com/products/servlet/; information on JythonServlet at http://jywiki.sourceforge.net/index.php?JythonServlet.
Credit: Andy McKay
Cookies that your browser has downloaded contain potentially useful information, so it’s important to know how to get at them. With Internet Explorer (IE), one simple approach is to access the registry to find where the cookies are, then read them as files. Here is a module with the function you need for that purpose:
import re, os, glob
import win32api, win32con

def _getLocation():
    """ Examines the registry to find the cookie folder IE uses """
    key = r'Software\Microsoft\Windows\CurrentVersion\Explorer\Shell Folders'
    regkey = win32api.RegOpenKey(win32con.HKEY_CURRENT_USER, key, 0,
                                 win32con.KEY_ALL_ACCESS)
    num = win32api.RegQueryInfoKey(regkey)[1]
    for x in range(num):
        k = win32api.RegEnumValue(regkey, x)
        if k[0] == 'Cookies':
            return k[1]

def _getCookieFiles(location, name):
    """ Rummages through cookie folder, returns filenames including `name'.
        `name' is normally the domain, e.g 'activestate' to get cookies for
        activestate.com (also e.g. for activestate.foo.com, but you can
        filter out such accidental hits later). """
    filemask = os.path.join(location, '*%s*' % name)
    return glob.glob(filemask)

def _findCookie(filenames, cookie_re):
    """ Look through a group of files for a cookie that satisfies a
        given compiled RE, returning first such cookie found, or None. """
    for file in filenames:
        data = open(file, 'r').read()
        m = cookie_re.search(data)
        if m:
            return m.group(1)

def findIECookie(domain, cookie):
    """ Finds the cookie for a given domain from IE cookie files """
    try:
        l = _getLocation()
    except Exception, err:
        # Print a debug message
        print "Error pulling registry key:", err
        return None
    # Found the key; now find the files and look through them
    f = _getCookieFiles(l, domain)
    if f:
        cookie_re = re.compile('%s\n(.*?)\n' % cookie)
        return _findCookie(f, cookie_re)
    else:
        print "No cookies for domain (%s) found" % domain
        return None

if __name__ == '__main__':
    print findIECookie(domain='kuro5hin', cookie='k5-new_session')
While Netscape cookies are in a text file, IE keeps cookies as files in a directory, and you need to access the registry to find which directory that is. To access the Windows registry, this recipe uses the PyWin32 Windows-specific Python extensions; as an alternative, you could use the _winreg module that is part of Python’s standard distribution for Windows. This recipe’s code has been tested and works on IE 5 and 6.
In the recipe, the _getLocation function accesses the registry and finds and returns the directory that IE is using for cookie files. The _getCookieFiles function receives the directory as an argument and uses standard module glob to return all filenames in the directory whose names include a particular requested domain name. The _findCookie function opens and reads all such files in turn, until it finds one whose contents satisfy a compiled regular expression that the function receives as an argument. It then returns the substring of the file’s contents corresponding to the first parenthesized group in the regular expression, or None when no satisfactory file is found. As the leading underscore in the names indicates, these are all internal functions, used only as implementation details of the only function this module is meant to expose, namely findIECookie, which uses the other functions to locate and return the value of a specific cookie for a given domain.
An alternative to this recipe would be to write a Python extension, or use calldll or ctypes, to access the InternetGetCookie API function in Wininet.DLL, as documented on MSDN (Microsoft Developer Network).
The Unofficial Cookie FAQ (http://www.cookiecentral.com/faq/) is chock-full of information on cookies; documentation for win32api and win32con in PyWin32 (http://starship.python.net/crew/mhammond/win32/Downloads.html) or ActivePython (http://www.activestate.com/ActivePython/); Windows API documentation available from Microsoft (http://msdn.microsoft.com); Mark Hammond and Andy Robinson, Python Programming on Win32 (O’Reilly); calldll is available at Sam Rushing’s page (http://www.nightmare.com/~rushing/dynwin/); ctypes is at http://sourceforge.net/projects/ctypes.
Credit: Moshe Zadka, Premshree Pillai, Anna Martelli Ravenscroft
OPML (Outline Processor Markup Language) is a standard file format for sharing subscription lists used by RSS (Really Simple Syndication) feed readers and aggregators. You want to share your subscription list, but your blogging site provides only a FOAF (Friend-Of-A-Friend) page, not one in the standard OPML format.
Use urllib2 to open and read the FOAF page and xml.dom to parse the data received; then, output the data in the proper OPML format to a file. For example, LiveJournal is a popular blogging site that provides FOAF pages; here’s a module with the functions you need to turn those pages into OPML files:
#!/usr/bin/python
import sys
import urllib2
import HTMLParser
from xml.dom import minidom, Node

def getElements(node, uri, name):
    ''' recursively yield all elements w/given namespace URI and name '''
    if (node.nodeType == Node.ELEMENT_NODE and
            node.namespaceURI == uri and
            node.localName == name):
        yield node
    for node in node.childNodes:
        for node in getElements(node, uri, name):
            yield node

class LinkGetter(HTMLParser.HTMLParser):
    ''' HTML parser subclass which collects attributes of link tags '''
    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == 'link':
            self.links.append(attrs)

def getRSS(page):
    ''' given a `page' URL, returns the HREF to the RSS link '''
    contents = urllib2.urlopen(page)
    lg = LinkGetter()
    try:
        lg.feed(contents.read(1000))
    except HTMLParser.HTMLParseError:
        pass
    links = map(dict, lg.links)
    for link in links:
        if (link.get('rel') == 'alternate' and
                link.get('type') == 'application/rss+xml'):
            return link.get('href')

def getNicks(doc):
    ''' given an XML document's DOM, `doc', yields a triple of info for
        each contact: nickname, blog URL, RSS URL '''
    for element in getElements(doc, 'http://xmlns.com/foaf/0.1/', 'knows'):
        person, = getElements(element, 'http://xmlns.com/foaf/0.1/', 'Person')
        nick, = getElements(person, 'http://xmlns.com/foaf/0.1/', 'nick')
        text, = nick.childNodes
        nickText = text.toxml()
        blog, = getElements(person, 'http://xmlns.com/foaf/0.1/', 'weblog')
        blogLocation = blog.getAttributeNS(
            'http://www.w3.org/1999/02/22-rdf-syntax-ns#', 'resource')
        rss = getRSS(blogLocation)
        if rss:
            yield nickText, blogLocation, rss

def nickToOPMLFragment((nick, blogLocation, rss)):
    ''' given a triple (nickname, blog URL, RSS URL), returns a string
        that's the proper OPML outline tag representing that info '''
    return '''
    <outline text="%(nick)s"
             htmlUrl="%(blogLocation)s"
             type="rss"
             xmlUrl="%(rss)s"/>
    ''' % dict(nick=nick, blogLocation=blogLocation, rss=rss)

def nicksToOPML(fout, nicks):
    ''' writes to file `fout' the OPML document representing the
        iterable of contact information `nicks' '''
    fout.write('''<?xml version="1.0" encoding="utf-8"?>
<opml version="1.0">
<head><title>Subscriptions</title></head>
<body><outline title="Subscriptions">
''')
    for nick in nicks:
        print nick
        fout.write(nickToOPMLFragment(nick))
    fout.write("</outline></body></opml>\n")

def docToOPML(fout, doc):
    ''' writes to file `fout' the OPML for XML DOM `doc' '''
    nicksToOPML(fout, getNicks(doc))

def convertFOAFToOPML(foaf, opml):
    ''' given URL `foaf' to a FOAF page, writes its OPML equivalent
        to a file named by string `opml' '''
    f = urllib2.urlopen(foaf)
    doc = minidom.parse(f)
    docToOPML(file(opml, 'w'), doc)

def getLJUser(user):
    ''' writes an OPML file `user'.opml for livejournal's FOAF page '''
    convertFOAFToOPML('http://www.livejournal.com/users/%s/data/foaf' % user,
                      user + ".opml")

if __name__ == '__main__':
    # example, when this module is run as a main script
    getLJUser('moshez')
RSS feeds have become extremely popular for reading news, blogs, wikis, and so on. OPML is one of the standard file formats used to share subscription lists among RSS fans. This recipe generates an OPML file that can be opened with any RSS reader. With an OPML file, you can share your favorite subscriptions with anyone you like, publish it to the Web, and so on.
getElements is a convenience function that gets written in almost every XML DOM-processing application. It recursively scans the document, finding nodes that satisfy certain criteria. This version of getElements is somewhat quick and dirty, but it is good enough for our purposes. getNicks is where the heart of the parsing brains lies. It calls getElements to look for “foaf:knows” nodes, and inside those, it looks for the “foaf:nick” element, which contains the LiveJournal nickname of the user, and uses a generator to yield the nicknames in this FOAF document.
Note an important idiom used four times in the body of getNicks:

name, = some iterable
The key is the comma after name, which turns the left-hand side of this assignment into a one-item tuple, making the assignment into what’s technically known as an unpacking assignment. Unpacking assignments are of course very popular in Python (see Recipe 19.4 for a technique to make them even more widely applicable) but normally with at least two names on the left of the assignment, such as:

aname, another = iterable yielding 2 items
The idiom used in getNicks has exactly the same function, but it demands that the iterable yield exactly one item (otherwise, Python raises a ValueError exception). Therefore, the idiom has the same semantics as:

_templist = some iterable
if len(_templist) != 1:
    raise ValueError, 'too many values to unpack'
name = _templist[0]
del _templist
Obviously, the name, = ... idiom is much cleaner and more compact than this equivalent snippet, which is worth keeping in mind for the next time you need to express the same semantics.
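A quick interactive session shows both the success and the failure mode of the one-item unpacking idiom:

>>> name, = [23]          # iterable yields exactly one item: fine
>>> name
23
>>> name, = [23, 45]      # two items: the unpacking fails
Traceback (most recent call last):
  ...
ValueError: too many values to unpack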
nicksToOPML, together with its helper function nickToOPMLFragment, generates the OPML, while docToOPML ties together getNicks and nicksToOPML into a FOAF->OPML converter. convertFOAFToOPML is the main function, which actually interacts with the operating system (accessing the network to get the FOAF, and using a file to save the OPML).
The recipe has a specific function getLJUser(user) to work with the LiveJournal (http://www.livejournal.com) friends lists. However, the point is that the main convertFOAFToOPML function is general enough to use for other sites as well. The various helper functions can also come in handy in your own different but related tasks. For example, the getRSS function (with some aid from its helper class LinkGetter) finds and returns a link to the RSS feed (if one exists) for a given web site.
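For instance, assuming the recipe's code is saved as foaf2opml.py (a hypothetical module name), you could call getRSS on its own to probe any blog's homepage:

from foaf2opml import getRSS   # hypothetical name for the recipe's module
print getRSS('http://www.livejournal.com/users/moshez/')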
About OPML, http://feeds.scripting.com/whatIsOpml; for more on RSS readers, http://blogspace.com/rss/readers; for FOAF Vocabulary Specification, http://xmlns.com/foaf/0.1/.
Credit: Valentino Volonghi, Peter Cogolo
You need to aggregate potentially very high numbers of RSS feeds, with top performance and scalability.
Parsing RSS feeds in Python is best done with Mark Pilgrim’s Universal Feed Parser from http://www.feedparser.org, but aggregation requires a lot of network activity, in addition to parsing.
As for any network task demanding high performance, Twisted is a good starting point. Say that you have in out.py a module that binds a huge list of RSS feed names to a variable named rss_feed, each feed name represented as a tuple consisting of a URL and a description (e.g., you can download a module exactly like this from http://xoomer.virgilio.it/dialtone/out.py). You can then build an aggregator server on top of that list, as follows:
#!/usr/bin/python
from twisted.internet import reactor, protocol, defer
from twisted.web import client
import feedparser, time, sys, cStringIO
from out import rss_feed as rss_feeds

DEFERRED_GROUPS = 60     # Number of simultaneous connections
INTER_QUERY_TIME = 300   # Max Age (in seconds) of each feed in the cache
TIMEOUT = 30             # Timeout in seconds for the web request

# dict cache's structure will be the following: { 'URL': (TIMESTAMP, value) }
cache = {}

class FeederProtocol(object):
    def __init__(self):
        self.parsed = 0
        self.error_list = []
    def isCached(self, site):
        ''' do we have site's feed cached (from not too long ago)? '''
        # how long since we last cached it (if never cached, since Jan 1 1970)
        elapsed_time = time.time() - cache.get(site, (0, 0))[0]
        return elapsed_time < INTER_QUERY_TIME
    def gotError(self, traceback, extra_args):
        ''' an error has occurred, print traceback info then go on '''
        print traceback, extra_args
        self.error_list.append(extra_args)
    def getPageFromMemory(self, data, addr):
        ''' callback for a cached page: ignore data, get feed from cache '''
        return defer.succeed(cache[addr][1])
    def parseFeed(self, feed):
        ''' wrap feedparser.parse to parse a string '''
        try:
            feed + ''
        except TypeError:
            feed = str(feed)
        return feedparser.parse(cStringIO.StringIO(feed))
    def memoize(self, feed, addr):
        ''' cache result from feedparser.parse, and pass it on '''
        cache[addr] = time.time(), feed
        return feed
    def workOnPage(self, parsed_feed, addr):
        ''' just provide some logged feedback on a channel feed '''
        chan = parsed_feed.get('channel', None)
        if chan:
            print chan.get('title', '(no channel title?)')
        return parsed_feed
    def stopWorking(self, data=None):
        ''' just for testing: we close after parsing a number of feeds.
            Override depending on protocol/interface you use to communicate
            with this RSS aggregator server. '''
        print "Closing connection number %d..." % self.parsed
        print "=-" * 20
        self.parsed += 1
        print 'Parsed', self.parsed, 'of', self.END_VALUE
        if self.parsed >= self.END_VALUE:
            print "Closing all..."
            if self.error_list:
                print 'Observed', len(self.error_list), 'errors'
                for i in self.error_list:
                    print i
            reactor.stop()
    def getPage(self, data, args):
        return client.getPage(args, timeout=TIMEOUT)
    def printStatus(self, data=None):
        print "Starting feed group..."
    def start(self, data=None, standalone=True):
        d = defer.succeed(self.printStatus())
        for feed in data:
            if self.isCached(feed):
                d.addCallback(self.getPageFromMemory, feed)
                d.addErrback(self.gotError, (feed, 'getting from memory'))
            else:
                # not cached, go and get it from the web directly
                d.addCallback(self.getPage, feed)
                d.addErrback(self.gotError, (feed, 'getting'))
            # once gotten, parse the feed and diagnose possible errors
            d.addCallback(self.parseFeed)
            d.addErrback(self.gotError, (feed, 'parsing'))
            # put the parsed structure in the cache and pass it on
            d.addCallback(self.memoize, feed)
            d.addErrback(self.gotError, (feed, 'memoizing'))
            # now one way or another we have the parsed structure, to
            # use or display in whatever way is most appropriate
            d.addCallback(self.workOnPage, feed)
            d.addErrback(self.gotError, (feed, 'working on page'))
            # for testing purposes only, stop working on each feed at once
            if standalone:
                d.addCallback(self.stopWorking)
                d.addErrback(self.gotError, (feed, 'while stopping'))
        if not standalone:
            return d

class FeederFactory(protocol.ClientFactory):
    protocol = FeederProtocol()
    def __init__(self, standalone=False):
        self.feeds = self.getFeeds()
        self.standalone = standalone
        self.protocol.factory = self
        self.protocol.END_VALUE = len(self.feeds)   # this is just for testing
        if standalone:
            self.start(self.feeds)
    def start(self, addresses):
        # Divide into groups all the feeds to download
        if len(addresses) > DEFERRED_GROUPS:
            url_groups = [[] for x in xrange(DEFERRED_GROUPS)]
            for i, addr in enumerate(addresses):
                url_groups[i % DEFERRED_GROUPS].append(addr[0])
        else:
            url_groups = [[addr[0]] for addr in addresses]
        for group in url_groups:
            if not self.standalone:
                return self.protocol.start(group, self.standalone)
            else:
                self.protocol.start(group, self.standalone)
    def getFeeds(self, where=None):
        # used for a complete refresh of the feeds, or for testing purposes
        if where is None:
            return rss_feeds
        return None

if __name__ == "__main__":
    f = FeederFactory(standalone=True)
    reactor.run()
RSS is a lightweight XML format designed for sharing headlines, news, blogs, and other web contents. Mark Pilgrim’s Universal Feed Parser (http://www.feedparser.org) does a great job of parsing “feeds” that can be in various dialects of RSS format into a uniform memory representation based on Python dictionaries. This recipe builds on top of feedparser to provide a full-featured RSS aggregator.
This recipe is scalable to very high numbers of feeds and is usable in multiclient environments. Both characteristics depend essentially on this recipe being built with the powerful Twisted framework for asynchronous network programming. A simple web interface built with Nevow (from http://www.nevow.com) is also part of the latest complete package for this aggregator, which you can download from my blog at http://vvolonghi.blogspot.com/.
An important characteristic of this recipe’s code is that you can easily set the following operating parameters to improve performance:
Number of parallel connections to use for feed downloading
Timeout for each feed request
Maximum age of a feed in the aggregator’s cache
Being able to set these parameters helps you balance performance, network load, and load on the machine on which you’re running the aggregator.
Universal Feed Parser is at http://www.feedparser.org; the latest version of this RSS aggregator is at http://vvolonghi.blogspot.com/; Twisted is at http://twistedmatrix.com/.
Credit: Valentino Volonghi
You need to turn some Python data into web pages based on templates, meaning files or strings of HTML code in which the data gets suitably inserted.
Templating with Python can be accomplished in an incredible number of ways, but my favorite is Nevow.
The Nevow web toolkit works with the Twisted networking framework to provide excellent templating capabilities to web sites that are coded on the basis of Twisted’s powerful asynchronous model. For example, here’s one way to render a list of dictionaries into a web page according to a template, with Nevow and Twisted:
from twisted.application import service, internet
from nevow import rend, loaders, appserver

dct = [{'name':'Mark', 'surname':'White', 'age':'45'},
       {'name':'Valentino', 'surname':'Volonghi', 'age':'21'},
       {'name':'Peter', 'surname':'Parker', 'age':'Unknown'},
      ]

class Pg(rend.Page):
    docFactory = loaders.htmlstr("""
    <html><head><title>Names, Surnames and Ages</title></head>
    <body>
        <ul nevow:data="dct" nevow:render="sequence">
            <li nevow:pattern="item" nevow:render="mapping">
                <span><nevow:slot name="name"/> </span>
                <span><nevow:slot name="surname"/> </span>
                <span><nevow:slot name="age"/></span>
            </li>
        </ul>
    </body>
    </html>
    """)
    def __init__(self, dct):
        self.data_dct = dct
        rend.Page.__init__(self)

site = appserver.NevowSite( Pg(dct) )
application = service.Application("example")
internet.TCPServer(8080, site).setServiceParent(application)
Save this code to nsa.tac. Now, entering at a shell command prompt twistd -noy nsa.tac serves up the data, formatted into HTML as the template specifies, as a tiny web site. You can visit the site, at http://localhost:8080, by running a browser on the same computer where the twistd command is running. On the command window where twistd is running, you’ll see a lot of information, roughly equivalent to a typical web server’s log file.
This recipe uses Twisted (http://www.twistedmatrix.com) for serving a little web site built with Nevow (http://nevow.com/). Twisted is a large and powerful framework for writing all kinds of Python programs that interact with the network (including, of course, web servers). Nevow is a web application construction kit, normally used in cooperation with a Twisted server but usable in other ways. For example, you could write Nevow CGI scripts that can run with any web server. (Unfortunately, CGI scripts’ performance might prove unsatisfactory for many applications, while Twisted’s performance and scalability are outstanding.)
A vast range of choices is available for packages you can use to perform templating with Python. You can look up some of them at http://www.webwareforpython.org/Papers/Templates/ (which lists a dozen packages suitable for use with the Webware web development toolkit), and specific ones at http://htmltmpl.sourceforge.net/, http://freespace.virgin.net/hamish.sanderson/htmltemplate.html, http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52305, http://www.alcyone.com/pyos/empy/, http://www.entrian.com/PyMeld/... and many, many more besides. I definitely don’t claim to have thoroughly tried each and every one of these dozens of templating systems in production situations, and I wonder whether anyone can truthfully make such a claim! However, out of all I have tried, my favorite is Nevow.
Nevow builds web pages by working on the HTML DOM tree. Recipe 14.14 shows how you can build such a DOM tree from within your program by using the stan subsystem of Nevow. This recipe shows that you can also build a DOM tree from HTML source, known as a template. In this case, for simplicity, we keep the template source in a string in our code and load the DOM for it by calling loaders.htmlstr; more commonly, we would keep the template source in a separate .html file and load the DOM for it by calling loaders.htmlfile.
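For example, had we saved the template markup into its own file, the page class might be coded as follows (a sketch, assuming the markup from this recipe's script has been saved to a file named template.html next to the .tac file):

class PgFromFile(rend.Page):
    # load the template's DOM from an external file rather than a string
    docFactory = loaders.htmlfile('template.html')
    def __init__(self, dct):
        self.data_dct = dct
        rend.Page.__init__(self)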
Examining the HTML string, you will notice that it contains, besides standard HTML tags and attributes, a few attributes and one tag from the 'nevow:' namespace, such as 'nevow:slot', 'nevow:data', and 'nevow:render'. These additions are in accord with the HTML standards and, in practice, work with all browsers. They amount to Nevow defining its own small supplementary namespace, so that HTML templates can express directives to Nevow for building a dynamic page from the template together with data coming from Python code. Note that the attributes and tags in the 'nevow:' namespace do not remain in the HTML output from Nevow: you can verify that, as you visit the web page served by this recipe's script, by asking your browser to "view source". Nevertheless, it's important that template files be perfectly correct HTML: this means those files can be edited with all kinds of specialized HTML editor programs! So, like many other templating systems, Nevow chooses to have correct HTML as its input, as well as (of course) as its output.
The 'nevow:data' directive defines the source of the data for the page: in this case, we use the data_dct attribute of the Pg class instance that is building the page. The 'nevow:render' directive defines the method to use for rendering the data into HTML strings. In this case, we use two standard rendering methods supplied by Nevow itself: sequence, for rendering the items of a sequence, such as a list, one after the other; and mapping, for rendering the items of a mapping, such as a dictionary, based on the items' keys appearing as name attributes of nevow:slot tags. More generally, we could code our own rendering methods in any class that subclasses rend.Page.
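Here is a minimal sketch of what such a custom rendering method might look like (the render_shout method and this tiny page are illustrative assumptions, not part of the recipe): conventionally, a template directive such as nevow:render="shout" gets looked up as a method named render_shout on the page.

class ShoutPg(rend.Page):
    docFactory = loaders.htmlstr("""
    <html><body>
      <p nevow:data="dct" nevow:render="shout" />
    </body></html>
    """)
    data_dct = 'hello there'          # data for the nevow:data directive
    def render_shout(self, ctx, data):
        # 'data' is whatever nevow:data supplied; we return the
        # template's tag (ctx.tag) filled with an upper-cased version
        return ctx.tag[str(data).upper()]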
After defining the Pg class, the recipe continues by building a site object, then an application object, then a TCP server on port 8080 using that site and application; all of this building makes up a common Twisted idiom. The source file nsa.tac, into which you save the code from this recipe, is not meant to be run with the usual python interpreter. Rather, you should run nsa.tac with the twistd command that you installed as part of Twisted's own installation procedure: twistd handles all the startup, daemonization, and logging issues, depending on the flags we pass to it. That is exactly why, by convention, one should normally use the file extension .tac, rather than .py, for source files that are meant to be run with twistd rather than directly with python: this choice avoids any confusion.
Given the experimental, toy-like nature of this recipe, you should pass the flags -noy, which ask twistd to run in the foreground and to "log" information to standard output rather than to some file. An even better idea is to read up on twistd in the Twisted documentation, to learn about all the available flags and options.
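For instance (illustrative invocations; twistd --help shows the full, authoritative set of options):

twistd -noy nsa.tac     # run in foreground, log to standard output
twistd -y nsa.tac       # daemonize; by default, log to file twistd.log
twistd --help           # list all of twistd's flags and options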
Twisted is at http://www.twistedmatrix.com; Nevow is at http://nevow.com/.
Credit: Valentino Volonghi, Matt Goodall
You’re writing a web application that uses the Twisted networking framework and the Nevow subsystem for web rendering. You need to be able to render some arbitrary Python objects to a web page.
Interfaces and adapters are the Twisted and Nevow approach to this task. Here is a toy example web server script to show how they work:
from twisted.application import internet, service
from nevow import appserver, compy, inevow, loaders, rend
from nevow import tags as T

# Define some simple classes to be the example's "application data"
class Person(object):
    def __init__(self, firstName, lastName, nickname):
        self.firstName = firstName
        self.lastName = lastName
        self.nickname = nickname

class Bookmark(object):
    def __init__(self, name, url):
        self.name = name
        self.url = url

# Adapter subclasses are the right way to join application data to the web:
class PersonView(compy.Adapter):
    """ Render a full view of a Person. """
    __implements__ = inevow.IRenderer
    attrs = 'firstName', 'lastName', 'nickname'
    def rend(self, data):
        return T.div(_class="View person")[
            T.p['Person'],
            T.dl[ [(T.dt[attr], T.dd[getattr(self.original, attr)])
                   for attr in self.attrs] ]
            ]

class BookmarkView(compy.Adapter):
    """ Render a full view of a Bookmark. """
    __implements__ = inevow.IRenderer
    attrs = 'name', 'url'
    def rend(self, data):
        return T.div(_class="View bookmark")[
            T.p['Bookmark'],
            T.dl[ [(T.dt[attr], T.dd[getattr(self.original, attr)])
                   for attr in self.attrs] ]
            ]

# register the rendering adapters (could be done from a config text file)
compy.registerAdapter(PersonView, Person, inevow.IRenderer)
compy.registerAdapter(BookmarkView, Bookmark, inevow.IRenderer)

# some example data instances for the 'application'
objs = [ Person('Valentino', 'Volonghi', 'dialtone'),
         Person('Matt', 'Goodall', 'mg'),
         Bookmark('Nevow', 'http://www.nevow.com'),
         Person('Alex', 'Martelli', 'aleax'),
         Bookmark('Alex', 'http://www.aleax.it/'),
         Bookmark('Twisted', 'http://twistedmatrix.com/'),
         Bookmark('Python', 'http://www.python.org'),
       ]

# a simple Page that renders a list of objects
class Page(rend.Page):
    def render_item(self, ctx, data):
        return inevow.IRenderer(data)
    docFactory = loaders.stan(
        T.html[
            T.body[
                T.ul(data=objs, render=rend.sequence)[
                    T.li(pattern='item')[render_item],
                    ],
                ],
            ]
        )

# start this very-special-purpose tiny toy webserver:
application = service.Application('irenderer')
httpd = internet.TCPServer(8000, appserver.NevowSite(Page()))
httpd.setServiceParent(application)
This recipe's purpose is to provide an example of how to get Nevow to render instances of application classes directly to a web page. To supply this example, the recipe shows two classes, Person and Bookmark, whose instances contain information that, one can suppose, comes from a database, from a file, or from some other site on the Web.
A key point is that the application classes do not get altered in any way to allow their instances to be rendered onto web pages: rather, adaptation is used to allow instances of such classes to be rendered through separate renderer-adapter classes.
We need two different adapters, one each for Person and Bookmark. We code the two adapters as classes PersonView and BookmarkView, each inheriting from compy.Adapter and overriding the rend method. compy.Adapter is an abstract superclass intended just for this purpose: it accepts as its constructor argument an object that must be adapted to another interface, and it holds that object as self.original for its subclasses' benefit. Each subclass asserts that it implements inevow.IRenderer by listing that interface in its class-level __implements__ attribute.
inevow.IRenderer is an interface that supplies a rend method. The Nevow rendering pipeline knows about IRenderer and calls the interface's rend method to serialize objects to HTML. Objects that implement the interface (on their own behalf or as adapters of other objects) can thus directly become part of the rendering pipeline.
The two key statements of this recipe are the two calls to the registerAdapter function of Nevow's compy module:

compy.registerAdapter(PersonView, Person, inevow.IRenderer)
compy.registerAdapter(BookmarkView, Bookmark, inevow.IRenderer)
These calls tell Nevow that PersonView is the class to use to adapt any instance of Person to the IRenderer interface, and similarly for BookmarkView and Bookmark. So, when the IRenderer interface is called with an instance p of Person as its argument, it automatically returns an adapter that is an instance of PersonView with p as its self.original (and, again, similarly for Bookmark).
Note how neatly this approach distributes appropriate knowledge to the various parts of the software, minimizing coupling among them while strengthening cohesion within each. Nevow itself has no built-in knowledge of any application class, nor of any specific adapter, and it needs no such knowledge: Nevow just specifies the IRenderer interface it needs for rendering and the registerAdapter function used to inform the framework about adaptation connections. Application-level classes neither have nor need any knowledge of the framework at all. Each adapter class knows about the application-level class it's adapting, the interface it's implementing, and such framework-supplied utilities as the Adapter base class (which just factors out a little repetitive coding that would otherwise be needed) and the tags mechanism. (The tags mechanism eases dynamic generation of HTML output. However, you could code adapters to return strings with HTML markup directly, if that suited the needs of your specific application better than the tags mechanism does.)
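For instance, such a string-building adapter might look like this sketch (the BookmarkLinkView class is hypothetical, and we wrap the string in T.xml on the assumption that ready-made markup must be explicitly flagged as such, lest the pipeline escape it as plain text):

class BookmarkLinkView(compy.Adapter):
    """ Render a Bookmark as a bare hyperlink, built as an HTML string. """
    __implements__ = inevow.IRenderer
    def rend(self, data):
        markup = '<a href="%s">%s</a>' % (self.original.url,
                                          self.original.name)
        return T.xml(markup)   # T.xml marks the string as ready-made markup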
Finally, the recipe includes an example Page class that ties everything together, again using tags, for convenience, to generate the output. Page uses (explicitly) the rend.sequence renderer provided by Nevow to loop over a sequence and render each item, and (implicitly) the various adapters, by "casting" each item to the IRenderer interface. The recipe ends with three lines that build Twisted application and service objects and put them together, so that running this recipe's script with Twisted's twistd general-purpose daemon provides a small, one-page demonstration web site running on the local host at port 8000.
A more complete (and complicated) version of this recipe can be found as part of the Nevow 0.3 distribution, downloadable from http://www.nevow.com, as examples/irenderer.tac.
Nevow is at http://www.nevow.com; Twisted is at http://twistedmatrix.com/.