Credit: Andy McKay
The Web has been a key technology for many years now, and it has become unusual to develop an application that doesn’t involve some aspects of the Web. From showing a help file in a browser to using web services, the Web has become an integral part of most applications.
I came to Python through a rather tortuous path of ASP (Active Server Pages), then Perl, some Zope, and then Python. Looking back, it seems strange that I didn’t find Python earlier, but the dominance of Perl and ASP (and later PHP) in this area makes it difficult for new developers to see the advantages of Python shining through all the other languages.
Unsurprisingly, Python is an excellent language for web development, and, as a “batteries included” language, Python comes with most of the modules you need. The relatively recent inclusion of xmlrpclib in the Python Standard Library is a reassuring indication that batteries continue to be added as the march of technology requires, making the standard libraries even more useful. One of the modules I often use is urllib, which demonstrates the power of a simple, well-designed module—saving a file from the Web in two lines (using urlretrieve) is easy. The cgi module is another example of a module that has enough functionality to work with, but not too much to make your scripts slow and bloated.
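For instance, here is the two-line download just mentioned (the URL and local filename are, of course, just placeholders for illustration):

import urllib
urllib.urlretrieve('http://www.python.org/index.html', 'index.html')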
Compared to other languages, Python seems to have an unusually large number of application servers and templating languages. While it’s easy to develop anything for the Web in Python “from scratch”, it would be peculiar and unwise to do so without first looking at the application servers available. Rather than continually recreating dynamic pages and scripts, the community has taken on the task of building these application servers to allow other users to create the content in easy-to-use templating systems.
Zope is the most well-known product in this space and provides an object-oriented interface to web publishing. With features too numerous to mention, Zope allows a robust and powerful object-publishing environment. The new, revolutionary major release, Zope 3, makes Zope more Pythonic and powerful than ever. Quixote and WebWare are two other application servers with similar, highly modular designs. Any of these can be a real help to the overworked web developer who needs to reuse components and to give other users the ability to create web sites. The Twisted network-programming framework, increasingly acknowledged as the best-of-breed Python framework for asynchronous network programming, is also starting to expand into the web application server field, with its newer “Nevow” offshoot, which you’ll also find used in some of the recipes in this chapter.
For all that, an application server is just too much at times, and a simple CGI script is really all you need. Indeed, the very first recipe, Recipe 14.1, demonstrates all the ingredients you need to make sure that your web server and Python CGI scripting setup are working correctly. Writing a CGI script doesn’t get much simpler than this, although, as the recipe’s discussion points out, you could use the cgi.test function to make it even shorter.
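Indeed, the whole cgi.test-based version is a sketch of just three lines (cgi.test emits its own minimal headers and dumps the CGI environment as HTML):

#!/usr/local/bin/python
import cgi
cgi.test()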
Another common web-related task is the parsing of HTML, either on your own site or on other web sites. Parsing HTML tags correctly is not as simple as many developers first think, as they optimistically assume a few regular expressions or string searches will see them through. However, we have decided to deal with such issues in other chapters, such as Chapter 1, rather than in this one. After all, while HTML was born with and for the Web, these days HTML is also often used in other contexts, such as for distributing documentation. In any case, most web developers create more than just web pages, so, even if you, the reader, primarily identify as a web developer, and maybe turned to this chapter as your first one in the book, you definitely should peruse the rest of the book, too: many relevant, useful recipes in other chapters describe parsing XML, reading network resources, performing systems administration, dealing with images, and many great ideas about developing in Python, testing your programs, and debugging them!
Credit: Jeff Bauer, Carey Evans
You want a simple CGI (Common Gateway Interface) program to use as a starting point for your own CGI programming or to determine whether your setup is functioning properly.
The cgi module is normally used in Python CGI programming, but here we use only its escape function to ensure that the value of an environment variable doesn’t accidentally look to the browser like HTML markup. We do all of the real work ourselves in the following script:
#!/usr/local/bin/python
print "Content-type: text/html"
print
print "<html><head><title>Situation snapshot</title></head><body><pre>"
import sys
sys.stderr = sys.stdout
import os
from cgi import escape
print "<strong>Python %s</strong>" % sys.version
keys = os.environ.keys()
keys.sort()
for k in keys:
    print "%s %s" % (escape(k), escape(os.environ[k]))
print "</pre></body></html>"
CGI is a standard that specifies how a web server runs a separate program (often known as a CGI script) that generates a web page dynamically. The protocol specifies how the server provides input and environment data to the script and how the script generates output in return. You can use any language to write your CGI scripts, and Python is well suited for the task.
This recipe is a simple CGI program that takes no input and just displays the current version of Python and the environment values. CGI programmers should always have some simple code handy to drop into their cgi-bin directories. You should run this script before wasting time slogging through your Apache configuration files (or whatever other web server you want to use for CGI work). Of course, cgi.test does all this and more, but it may, in fact, do too much. It does so much, and so much is hidden inside cgi’s innards, that it’s hard to tweak it to reproduce any specific problems you may be encountering in true scripts. Tweaking the program in this recipe, on the other hand, is very easy, since it’s such a simple program, and all the parts are exposed.
Besides, this little script is already quite instructive in its own way. The starting line, #!/usr/local/bin/python, must give the absolute path to the Python interpreter with which you want to run your CGI scripts, so you may need to edit it accordingly. A popular solution for non-CGI scripts is to have a first line (the so-called shebang line) that looks something like this:
#!/usr/bin/env python
However, this approach puts you at the mercy of the PATH environment setting, since it runs the first program named python that it finds on the PATH, and that may well not be what you want under CGI, where you don’t fully control the environment. Incidentally, many web servers implement the shebang line even when running under non-Unix systems, so that, for CGI use specifically, it’s not unusual to see Python scripts on Windows start with a first line such as:
#!c:/python23/python.exe
Another issue you may be contemplating is why the import statements are not right at the start of the script, as is the usual Python style, but are preceded by a few print statements. The reason is that import could fail if the Python installation is terribly misconfigured. In case of failure, Python emits diagnostics to standard error (which is typically directed to your web server logs, depending on how you set up and configured your web server), and nothing will go to standard output. The CGI standard demands that all output be on standard output, so we first ensure that a minimal quantity of output will display a result to a visiting browser. Then, assuming that import sys succeeds (if it fails, the whole Python installation is so badly broken that you can do very little about it!), we immediately perform the following assignment:
sys.stderr = sys.stdout
This assignment statement ensures that error output will go to standard output, so that you’ll have a chance to see it in the visiting browser. You can perform other import operations or do further work in the script only when this is done. Another option makes getting tracebacks for errors in CGI scripts much simpler. Simply add the following at the start of your script:
import cgitb; cgitb.enable()
and the standard Python library module cgitb takes care of whatever else is needed to get error tracebacks on the browser. However, as already stated, the point of this recipe is to show how everything is done, rather than just reusing prepackaged functionality.
One last reflection is that, in Python 2.4, instead of the three lines:
keys = os.environ.keys()
keys.sort()
for k in keys:
used in the recipe, you could use the single line:
for k in sorted(os.environ):
Unfortunately, since CGI scripts must often run in environments you do not control, I cannot suggest you code to a specific, recent version of Python in this particular case—particularly not a script such as this one, which is meant to let you examine and check out the exact circumstances under which your CGI runs.
Yet another consideration, not strictly related to Python, is that this script is coded to emit correct HTML. Just about all known browsers let you get away with skipping most of the HTML tags that this script outputs, but why skimp on correctness, relying on the browser to patch your holes? It costs little to emit correct HTML, so you should get into the habit of doing things right, when the cost is so modest. (I wish more authors of web pages, and of programs producing web pages, shared this sentiment. If they did, there would be a lot less broken HTML out on the Web!)
Documentation on the cgi and cgitb standard library modules in the Library Reference and Python in a Nutshell; a basic introduction to the CGI protocol is available at http://hoohoo.ncsa.uiuc.edu/cgi/overview.html.
Credit: Jürgen Hermann
To build a URL within a script, you need information such as the hostname and script name. According to the CGI standard, the web server sets up a lot of useful information in the process environment of a script before it runs the script itself. In a Python script, we can access the process environment as the dictionary os.environ, an attribute of the standard Python library os module, and through accesses to the process environment build our own module of useful helper functions:
import os

def isSSL():
    """ Return true if we are on an SSL (https) connection. """
    return os.environ.get('SSL_PROTOCOL', '') != ''

def getScriptname():
    """ Return the scriptname part of the URL ("/path/to/my.cgi"). """
    return os.environ.get('SCRIPT_NAME', '')

def getPathinfo():
    """ Return the remaining part of the URL. """
    pathinfo = os.environ.get('PATH_INFO', '')
    # Fix for a well-known bug in IIS/4.0
    if os.name == 'nt':
        scriptname = getScriptname()
        if pathinfo.startswith(scriptname):
            pathinfo = pathinfo[len(scriptname):]
    return pathinfo

def getQualifiedURL(uri=None):
    """ Return a full URL starting with schema, servername, and port.
        Specifying uri causes it to be appended to the server root URL
        (uri must then start with a slash). """
    schema, stdport = (('http', '80'), ('https', '443'))[isSSL()]
    host = os.environ.get('HTTP_HOST', '')
    if not host:
        host = os.environ.get('SERVER_NAME', 'localhost')
        port = os.environ.get('SERVER_PORT', '80')
        if port != stdport:
            host = host + ":" + port
    result = "%s://%s" % (schema, host)
    if uri:
        result = result + uri
    return result

def getBaseURL():
    """ Return a fully qualified URL to this script. """
    return getQualifiedURL(getScriptname())
URLs can be manipulated in numerous ways, but many CGI scripts have common needs. This recipe collects a few typical high-level functional needs for URL synthesis from within CGI scripts. You should never hard-code hostnames or absolute paths in your scripts. Doing so makes it difficult to port the scripts elsewhere or rename a virtual host. The CGI environment has sufficient information available to avoid such hard-coding. By importing this recipe’s code as a module, you can avoid duplicating code in your scripts to collect and use that information in typical ways.
The recipe works by accessing information in os.environ, the attribute of Python’s standard os module that collects the process environment of the current process and lets your script access it as if it were a normal Python dictionary. In particular, os.environ has a get method, just like a normal dictionary does, that returns either the mapping for a given key or, if that key is missing, a default value that you supply in the call to get. This recipe performs all accesses through os.environ.get, thus ensuring sensible behavior even if the relevant environment variables have been left undefined by your web server (which should never happen—but not all web servers are free of bugs).
Among the functions presented in this recipe, getQualifiedURL is the one you’ll use most often. It transforms a URI (Universal Resource Identifier) into a URL on the same host (and with the same schema) used by the CGI script that calls it. It gets the information from the environment variables HTTP_HOST, SERVER_NAME, and SERVER_PORT. Furthermore, it can handle secure (https) as well as normal (http) connections, and selects between the two by using the isSSL function, which is also part of this recipe.
Suppose you need to redirect a visiting browser to another location on this same host. Here’s how you can use a function from this recipe, hard-coding only the redirect location on the host itself, but not the hostname, port, and normal or secure schema:
# example redirect header:
print "Location:", getQualifiedURL("/go/here")
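Similarly, a script that displays a self-posting form can use getBaseURL to point the form back at itself. This sketch assumes you have saved the recipe's module as, say, cgiurls.py (a hypothetical name):

import cgiurls   # hypothetical name for this recipe's module
print "Content-type: text/html"
print
print '<form action="%s" method="POST">' % cgiurls.getBaseURL()
print '<input name="q"> <input type="submit"></form>'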
Documentation on the os standard library module in the Library Reference and Python in a Nutshell; a basic introduction to the CGI protocol is available at http://hoohoo.ncsa.uiuc.edu/cgi/overview.html.
Credit: Noah Spurrier, Georgy Pruss
Net of any security checks, safeguards against denial of service (DOS) attacks, and the like, the task boils down to what’s exemplified in the following CGI script:
#!/usr/local/bin/python
import cgi
import cgitb; cgitb.enable()
import os, sys
try:
    import msvcrt            # are we on Windows?
except ImportError:
    pass                     # nope, no problem
else:                        # yep, need to set I/O to binary mode
    for fd in (0, 1):
        msvcrt.setmode(fd, os.O_BINARY)

UPLOAD_DIR = "/tmp"

HTML_TEMPLATE = """<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html><head><title>Upload Files</title>
</head><body><h1>Upload Files</h1>
<form action="%(SCRIPT_NAME)s" method="POST" enctype="multipart/form-data">
File name: <input name="file_1" type="file"><br>
File name: <input name="file_2" type="file"><br>
File name: <input name="file_3" type="file"><br>
<input name="submit" type="submit">
</form>
</body>
</html>"""

def print_html_form():
    """ print the form to stdout, with action set to this very script (a
        'self-posting form': script both displays AND processes the form). """
    print "content-type: text/html; charset=iso-8859-1\n"
    print HTML_TEMPLATE % {'SCRIPT_NAME': os.environ['SCRIPT_NAME']}

def save_uploaded_file(form_field, upload_dir):
    """ Save to disk a file just uploaded, form_field being the name of the
        file input field on the form.  No-op if field or file is missing. """
    form = cgi.FieldStorage()
    if not form.has_key(form_field):
        return
    fileitem = form[form_field]
    if not fileitem.file:
        return
    fout = open(os.path.join(upload_dir, fileitem.filename), 'wb')
    while True:
        chunk = fileitem.file.read(100000)
        if not chunk:
            break
        fout.write(chunk)
    fout.close()

save_uploaded_file("file_1", UPLOAD_DIR)
save_uploaded_file("file_2", UPLOAD_DIR)
save_uploaded_file("file_3", UPLOAD_DIR)
print_html_form()
The CGI script shown in this recipe is very bare-bones, but it does get the job done. It’s a self-posting script: it displays the upload form, and it processes the form when the user submits it, complete with any uploaded files. The script just saves files to an upload directory, which in the recipe is simply set to /tmp.
The script as presented takes no precaution against DOS attacks, so a user could try to fill up your disk with endless uploads. If you deploy this script on a system that is accessible to the public, do add checks to limit the number and size of files written to disk, perhaps depending, also, on how much disk space is still available. A version that might perhaps be more to your liking can be found at http://zxw.nm.ru/test_w_upload.py.htm.
Documentation on the cgi, cgitb, and msvcrt standard library modules in the Library Reference and Python in a Nutshell.
Credit: James Thiele, Rogier Steehouder
You want to check whether an HTTP URL corresponds to an existing web page.
Using httplib allows you to easily check for a page’s existence without actually downloading the page itself, just its headers. Here’s a module implementing a function to perform this task:
""" httpExists.py A quick and dirty way to check whether a web file is there. Usage: >>> import httpExists >>> httpExists.httpExists('http://www.python.org/') True >>> httpExists.httpExists('http://www.python.org/PenguinOnTheTelly') Status 404 Not Found : http://www.python.org/PenguinOnTheTelly False """ import httplib, urlparse def httpExists(url): host, path = urlparse.urlsplit(url)[1:3] if ':' in host: # port specified, try to use it host, port = host.split(':', 1) try: port = int(port) except ValueError: print 'invalid port number %r' % (port,) return False else: # no port specified, use default port port = None try: connection = httplib.HTTPConnection(host, port=port) connection.request("HEAD", path) resp = connection.getresponse( ) if resp.status == 200: # normal 'found' status found = True elif resp.status == 302: # recurse on temporary redirect found = httpExists(urlparse.urljoin(url, resp.getheader('location', ''))) else: # everything else -> not found print "Status %d %s : %s" % (resp.status, resp.reason, url) found = False except Exception, e: print e._ _class_ _, e, url found = False return found def _test( ): import doctest, httpExists return doctest.testmod(httpExists) if _ _name_ _ == "_ _main_ _": _test( )
While this recipe is very simple and runs quite fast (thanks to the ability to use the HTTP command HEAD to get just the headers, not the body, of the page), it may be too simplistic for your specific needs: the HTTP result codes you might need to deal with may go beyond the simple 200 success code, and 302 temporary redirect, to include permanent redirects, temporary inaccessibility, permission problems, and so on.
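For example, here is one possible extension (a sketch, not part of the original recipe) that treats the common permanent and temporary redirect codes uniformly; for brevity, it omits the recipe's explicit port handling:

import httplib, urlparse

REDIRECT_CODES = 301, 302, 303, 307   # permanent and temporary redirects

def httpExistsLenient(url):
    ''' sketch: like httpExists, but follows any common redirect status '''
    host, path = urlparse.urlsplit(url)[1:3]
    try:
        connection = httplib.HTTPConnection(host)
        connection.request("HEAD", path)
        resp = connection.getresponse()
        if resp.status == 200:
            return True
        if resp.status in REDIRECT_CODES:
            return httpExistsLenient(
                urlparse.urljoin(url, resp.getheader('location', '')))
        print "Status %d %s : %s" % (resp.status, resp.reason, url)
        return False
    except Exception:
        return False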
In my case, I needed to check the correctness of a huge number of mutual links among pages of a site generated by a complex web application on an intranet, so I knew I had the privilege of relying on a simple check for “200 or bust.” At any rate, you can use this simple recipe as a starting point to which to add any refinements you determine you actually need.
Documentation on the urlparse and httplib standard library modules in the Library Reference and Python in a Nutshell.
Credit: Bob Stockwell
You need to determine whether a URL, or an open file, obtained from urllib.urlopen on a URL, is of a particular content type (such as 'text' for HTML or 'image' for GIF).
The content type of any resource can easily be checked through the pseudo-file that urllib.urlopen returns for the resource. Here is a function to show how to perform such checks:
import urllib

def isContentType(URLorFile, contentType='text'):
    """ Tells whether the URL (or pseudofile from urllib.urlopen) is of
        the required content type (default 'text'). """
    try:
        if isinstance(URLorFile, str):
            thefile = urllib.urlopen(URLorFile)
        else:
            thefile = URLorFile
        result = thefile.info().getmaintype() == contentType.lower()
        if thefile is not URLorFile:
            thefile.close()
    except IOError:
        result = False    # if we couldn't open it, it's of _no_ type!
    return result
For greater flexibility, this recipe accepts either the result of a previous call to urllib.urlopen, or a URL in string form. In the latter case, the Solution opens the URL with urllib and, at the end, closes the resulting pseudo-file again. If the attempt to open the URL fails, the recipe catches the IOError and returns a result of False, considering that a URL that cannot be opened is of no type at all, and therefore in particular is not of the type the caller was checking for. (Alternatively, you might prefer to propagate the exception; if that’s what you want, remove the try and except clause headers and the result = False assignment that is the body of the except clause.)
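If you do prefer propagation, the variant looks like this minimal sketch (the new name is just for illustration):

import urllib

def isContentTypeStrict(URLorFile, contentType='text'):
    ''' like isContentType, but lets IOError propagate to the caller '''
    if isinstance(URLorFile, str):
        thefile = urllib.urlopen(URLorFile)
    else:
        thefile = URLorFile
    result = thefile.info().getmaintype() == contentType.lower()
    if thefile is not URLorFile:
        thefile.close()
    return result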
Whether the pseudo-file was passed in or opened locally from a URL string, the info method of the pseudo-file gives as its result an instance of mimetools.Message (which doesn’t mean you need to import mimetools yourself—urllib does all that’s needed). On that object, we can call any of several methods to get the content type, depending on what exactly we want—gettype to get both main and subtype with a slash in between (as in 'text/plain'), getmaintype to get the main type (as in 'text'), or getsubtype to get the subtype (as in 'plain'). In this recipe, we want the main content type.
The string result from all of the type interrogation methods is always lowercase, so we take the precaution of calling the lower method on parameter contentType as well, before comparing for equality.
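For example, here is how you might call the function, assuming its code is saved as a module named contenttype.py (a hypothetical name):

from contenttype import isContentType   # hypothetical module name
if isContentType('http://www.python.org/', 'text'):
    print 'that URL serves text (e.g., HTML)'
if not isContentType('http://www.python.org/', 'image'):
    print 'and, indeed, it is not an image'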
Documentation on the urllib and mimetools standard library modules in the Library Reference and Python in a Nutshell; a list of important content types is at http://www.utoronto.ca/ian/books/html4ed/appb/mimetype.html; a helpful explanation of the significance of content types is at http://ppewww.ph.gla.ac.uk/~flavell/www/content-type.html.
Credit: Chris Moffitt
Downloads of large files are sometimes interrupted. However, a good HTTP server that supports the Range header lets you resume the download from where it was interrupted. The standard Python module urllib lets you access this functionality almost seamlessly: you just have to add the required header and intercept the error code that the server sends to confirm that it will respond with a partial file. Here is a function, with a little helper class, to perform this task:
import urllib, os

class myURLOpener(urllib.FancyURLopener):
    """ Subclass to override err 206 (partial file being sent); okay for us """
    def http_error_206(self, url, fp, errcode, errmsg, headers, data=None):
        pass    # Ignore the expected "non-error" code

def getrest(dlFile, fromUrl, verbose=0):
    myUrlclass = myURLOpener()
    if os.path.exists(dlFile):
        outputFile = open(dlFile, "ab")
        existSize = os.path.getsize(dlFile)
        # If the file exists, then download only the remainder
        myUrlclass.addheader("Range", "bytes=%s-" % (existSize))
    else:
        outputFile = open(dlFile, "wb")
        existSize = 0
    webPage = myUrlclass.open(fromUrl)
    if verbose:
        for k, v in webPage.headers.items():
            print k, "=", v
    # If we already have the whole file, there is no need to download it again
    numBytes = 0
    webSize = int(webPage.headers['Content-Length'])
    if webSize == existSize:
        if verbose:
            print "File (%s) was already downloaded from URL (%s)" % (
                dlFile, fromUrl)
    else:
        if verbose:
            print "Downloading %d more bytes" % (webSize - existSize)
        while True:
            data = webPage.read(8192)
            if not data:
                break
            outputFile.write(data)
            numBytes = numBytes + len(data)
    webPage.close()
    outputFile.close()
    if verbose:
        print "downloaded", numBytes, "bytes from", webPage.url
    return numBytes
The HTTP Range header lets the web server know that you want only a certain range of data to be downloaded, and this recipe takes advantage of this header. Of course, the server needs to support the Range header, but since the header is part of the HTTP 1.1 specification, it’s widely supported. This recipe has been tested with Apache 1.3 as the server, but I expect no problems with other reasonably modern servers.
The recipe lets urllib.FancyURLopener do all the hard work of adding a new header, as well as the normal handshaking. I had to subclass the standard class from urllib only to make it known that the error 206 is not really an error in this case—so you can proceed normally. In the function, I also perform extra checks to quit the download if I’ve already downloaded the entire file.
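A typical use, assuming you have saved the recipe's code as a module named resume.py (a hypothetical name), might be:

from resume import getrest   # hypothetical module name for the recipe's code
# run once, interrupt, then run again: only the missing bytes get fetched
getrest('bigfile.zip', 'http://www.example.com/bigfile.zip', verbose=1)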
Check out the HTTP 1.1 RFC (2616) to learn more about the meaning of the headers. You may find a header that is especially useful, and Python’s urllib lets you send any header you want.
Documentation on the urllib standard library module in the Library Reference and Python in a Nutshell; the HTTP 1.1 RFC (http://www.ietf.org/rfc/rfc2616.txt).
Credit: Mike Foord, Nikos Kouremenos
You need to fetch web pages (or other resources from the web) that require you to handle cookies (e.g., save cookies you receive and also reload and send cookies you had previously received from the same site).
The Python 2.4 Standard Library provides a cookielib module exactly for this task. For Python 2.3, a third-party ClientCookie module works similarly. We can write our code to ensure usage of the best available cookie-handling module—including none at all, in which case our program will still run but without saving and resending cookies. (In some cases, this might still be OK, just maybe slower.) Here is a script to show how this concept works in practice:
import os.path, urllib2
from urllib2 import urlopen, Request

COOKIEFILE = 'cookies.lwp'     # "cookiejar" file for cookie saving/reloading

# first try getting the best possible solution, cookielib:
try:
    import cookielib
except ImportError:            # no cookielib, try ClientCookie instead
    cookielib = None
    try:
        import ClientCookie
    except ImportError:        # nope, no cookies today
        cj = None              # so, in particular, no cookie jar
    else:                      # using ClientCookie, prepare everything
        urlopen = ClientCookie.urlopen
        cj = ClientCookie.LWPCookieJar()
        Request = ClientCookie.Request
else:                          # we do have cookielib, prepare the jar
    cj = cookielib.LWPCookieJar()

# Now load the cookies, if any, and build+install an opener using them
if cj is not None:
    if os.path.isfile(COOKIEFILE):
        cj.load(COOKIEFILE)
    if cookielib:
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
        urllib2.install_opener(opener)
    else:
        opener = ClientCookie.build_opener(ClientCookie.HTTPCookieProcessor(cj))
        ClientCookie.install_opener(opener)

# for example, try a URL that sets a cookie
theurl = 'http://www.diy.co.uk'
txdata = None   # or, for POST instead of GET, txdata=urllib.urlencode(somedict)
txheaders = {'User-agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}

try:
    req = Request(theurl, txdata, txheaders)   # create a request object
    handle = urlopen(req)                      # and open it
except IOError, e:
    print 'Failed to open "%s".' % theurl
    if hasattr(e, 'code'):
        print 'Error code: %s.' % e.code
else:
    print 'Here are the headers of the page:'
    print handle.info()
# you can also use handle.read() to get the page, handle.geturl() to get the
# true URL (could be different from `theurl' if there have been redirects)

if cj is None:
    print "Sorry, no cookie jar, can't show you any cookies today"
else:
    print 'Here are the cookies received so far:'
    for index, cookie in enumerate(cj):
        print index, ': ', cookie
    cj.save(COOKIEFILE)        # save the cookies again
The third-party module ClientCookie, available for download at http://wwwsearch.sourceforge.net/ClientCookie/, was so successful that, in Python 2.4, its functionality has been added to the Python Standard Library—specifically, the cookie-handling parts in the new module cookielib, the rest in the current version of urllib2.
So, you do need to be careful if you want your code to work just as well on any 2.4 installation (using the latest and greatest cookielib) or an installation of Python 2.3 with ClientCookie on top. As long as we’re at it, we might as well handle running on a 2.3 installation that does not have ClientCookie—run anyway, just don’t save and resend cookies when we lack library code to do so. On some sites, the inability to handle cookies will just be a bother and perhaps a performance hit due to the loss of session continuity, but the site will still work. Other sites, of course, will be completely unusable without cookies.
The recipe’s code is an exercise in the careful management of an idiom that’s an essential part of making your Python code portable among releases and installations, while ensuring minimal graceful degradation when third-party modules you’d like to use just aren’t there. The idiom is known as conditional import and is expressed as follows:
try:
    import something
except ImportError:   # 'something' not available
    ...code to do without, degrading gracefully...
else:                 # 'something' IS available, hooray!
    ...code to run only when something is there...
# and then, go on with the rest of your program
...code able to run with or w/o `something'...
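For instance, a tiny concrete use of the same idiom, which you have probably seen in much Python 2 code, picks the fastest available StringIO implementation:

try:
    import cStringIO as StringIO    # fast C implementation, when available
except ImportError:
    import StringIO                 # pure-Python fallback, always present
# from here on, code just uses StringIO.StringIO(...) either way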
The use of “conditional import” is particularly delicate in this recipe because ClientCookie and cookielib aren’t drop-in replacements for each other—therefore, careful management is indeed necessary. But, if you study this recipe, you will see that it is not rocket science—it just requires attention.
One key technique is to make double use of a small number of names as “flags”, with value None when the object to which they would normally refer is not available. In this recipe, we do that for cookielib (which refers to the module of that name when there is one, and otherwise to None) and cj (which refers to a cookie-jar object when there is any, and otherwise to None). Even better, when feasible, is to assign names appropriately to refer to the best available object under the circumstances: the recipe does that for variables urlopen and Request. Note how crucial it is for this purpose that Python treats all objects as first class: urlopen is a function, Request is a class, cookielib (if any) a module, cj (if any) an instance object. The distinction, however, doesn’t matter in the least: the name-object reference concept is exactly the same in every case, with total uniformity, simplicity, and power.
When either cookielib or ClientCookie is available, the cookies are saved in a file in cookie jar format (a useful plain-text format that is automatically handled by either module but can also be examined and modified with text editors and other programs). If the file already exists when the program runs, cookies are loaded from the file, ready to be sent back to the appropriate sites.
My reason for developing this code is that I’m developing a cgi-proxy, approx.py (http://www.voidspace.org.uk/atlantibots/pythonutils.html#cgiproxy), which needs to be able to handle cookies when feasible. To keep the proxy usable on various versions of Python, and ensure it degrades gracefully when no cookie-handling library is available, I needed to develop the carefully managed conditional imports that are shown in the recipe’s Solution. I decided to share them in this recipe since, besides the importance of cookie handling, conditional imports are such a generally important Python idiom. Particularly when installing your code on a server you don’t control, it is unfortunately quite common to have little say in which version of Python is running, nor in which third-party extensions are installed—exactly the kind of situation that requires the conditional import technique to ensure your code does the best it can under the circumstances.
Documentation on the cookielib and urllib2 standard library modules in the Library Reference for Python 2.4; ClientCookie is at http://wwwsearch.sourceforge.net/ClientCookie/.
Credit: John Nielsen
You need to use httplib for HTTPS navigation through a proxy that requires basic authentication, but httplib out of the box supports HTTPS only through proxies that do not require authentication.
Unfortunately, it takes a wafer-thin amount of trickery to achieve this recipe’s task. Here is a script that is just tricky enough:
import httplib, base64, socket

# parameters for the script
user = 'proxy_login'; passwd = 'proxy_pass'
host = 'login.yahoo.com'; port = 443
phost = 'proxy_host'; pport = 80

# setup basic authentication
user_pass = base64.encodestring(user+':'+passwd)
proxy_authorization = 'Proxy-authorization: Basic '+user_pass+'\r\n'
proxy_connect = 'CONNECT %s:%s HTTP/1.0\r\n' % (host, port)
user_agent = 'User-Agent: python\r\n'
proxy_pieces = proxy_connect+proxy_authorization+user_agent+'\r\n'

# connect to the proxy
proxy_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
proxy_socket.connect((phost, pport))
proxy_socket.sendall(proxy_pieces)
response = proxy_socket.recv(8192)
status = response.split()[1]
if status != '200':
    raise IOError, 'Connecting to proxy: status=%s' % status

# trivial setup for SSL socket
ssl = socket.ssl(proxy_socket, None, None)
sock = httplib.FakeSocket(proxy_socket, ssl)

# initialize httplib and replace the connection's socket with the SSL one
h = httplib.HTTPConnection('localhost')
h.sock = sock

# and finally, use the now-HTTPS httplib connection as you wish
h.request('GET', '/')
r = h.getresponse()
print r.read()
HTTPS is essentially HTTP spoken on top of an SSL connection rather than a plain socket. So, this recipe connects to the proxy with basic authentication at the very lowest level of Python socket programming, wraps an SSL socket around the proxy connection thus secured, and finally plays a little trick under httplib’s nose to use that laboriously constructed SSL socket in place of the plain socket in an HTTPConnection instance. From that point onwards, you can use the normal httplib approach as you wish.
Documentation on the socket and httplib standard library modules in the Library Reference and Python in a Nutshell.
Credit: Brian Zhou
Java (and Jython) are most often deployed server-side, and thus servlets are a typical way of deploying your code. Jython makes servlets very easy to use. Here is a tiny “hello world” example servlet:
import java, javax, sys

class hello(javax.servlet.http.HttpServlet):
    def doGet(self, request, response):
        response.setContentType("text/html")
        out = response.getOutputStream()
        print >>out, """<html>
<head><title>Hello World</title></head>
<body>Hello World from Jython Servlet at %s!
</body>
</html>
""" % (java.util.Date(),)
        out.close()
        return
This recipe is no worse than a typical JSP (Java Server Page) (see http://jywiki.sourceforge.net/index.php?JythonServlet for setup instructions). Compare this recipe to the equivalent Java code: with Python, you’re finished coding in the same time it takes to set up the framework in Java. Most of your setup work will be strictly related to Tomcat or whichever servlet container you use. The Jython-specific work is limited to copying jython.jar to the WEB-INF/lib subdirectory of your chosen servlet context and editing WEB-INF/web.xml to add <servlet> and <servlet-mapping> tags so that org.python.util.PyServlet serves the *.py <url-pattern>.
The key to this recipe (like most other Jython uses) is that your Jython scripts and modules can import and use Java packages and classes just as if the latter were Python code or extensions. In other words, all of the Java libraries that you could use with Java code are similarly usable with Python (i.e., Jython) code. This example servlet first uses the standard Java servlet response object to set the resulting page’s content type (to text/html) and to get the output stream. Afterwards, it can print to the output stream, since the latter is a Python file-like object. To further show off your seamless access to the Java libraries, you can also use the Date class of the java.util package, incidentally demonstrating how it can be printed as a string from Jython.
Information on Java servlets at http://java.sun.com/products/servlet/; information on JythonServlet at http://jywiki.sourceforge.net/index.php?JythonServlet.
Credit: Andy McKay
Cookies that your browser has downloaded contain potentially useful information, so it’s important to know how to get at them. With Internet Explorer (IE), one simple approach is to access the registry to find where the cookies are, then read them as files. Here is a module with the function you need for that purpose:
import re, os, glob
import win32api, win32con

def _getLocation():
    """ Examines the registry to find the cookie folder IE uses """
    key = r'Software\Microsoft\Windows\CurrentVersion\Explorer\Shell Folders'
    regkey = win32api.RegOpenKey(win32con.HKEY_CURRENT_USER, key, 0,
                                 win32con.KEY_ALL_ACCESS)
    num = win32api.RegQueryInfoKey(regkey)[1]
    for x in range(num):
        k = win32api.RegEnumValue(regkey, x)
        if k[0] == 'Cookies':
            return k[1]

def _getCookieFiles(location, name):
    """ Rummages through cookie folder, returns filenames including `name'.
        `name' is normally the domain, e.g 'activestate' to get cookies for
        activestate.com (also e.g. for activestate.foo.com, but you can
        filter out such accidental hits later). """
    filemask = os.path.join(location, '*%s*' % name)
    return glob.glob(filemask)

def _findCookie(filenames, cookie_re):
    """ Look through a group of files for a cookie that satisfies a
        given compiled RE, returning first such cookie found, or None. """
    for file in filenames:
        data = open(file, 'r').read()
        m = cookie_re.search(data)
        if m:
            return m.group(1)

def findIECookie(domain, cookie):
    """ Finds the cookie for a given domain from IE cookie files """
    try:
        l = _getLocation()
    except Exception, err:
        # Print a debug message
        print "Error pulling registry key:", err
        return None
    # Found the key; now find the files and look through them
    f = _getCookieFiles(l, domain)
    if f:
        cookie_re = re.compile('%s\n(.*?)\n' % cookie)
        return _findCookie(f, cookie_re)
    else:
        print "No cookies for domain (%s) found" % domain
        return None

if __name__ == '__main__':
    print findIECookie(domain='kuro5hin', cookie='k5-new_session')
While Netscape cookies are in a text file, IE keeps cookies as files in a directory, and you need to access the registry to find which directory that is. To access the Windows registry, this recipe uses the PyWin32 Windows-specific Python extensions; as an alternative, you could use the _winreg module that is part of Python’s standard distribution for Windows. This recipe’s code has been tested and works on IE 5 and 6.
In the recipe, the _getLocation function accesses the registry and finds and returns the directory that IE is using for cookie files. The _getCookieFiles function receives the directory as an argument and uses standard module glob to return all filenames in the directory whose names include a particular requested domain name. The _findCookie function opens and reads all such files in turn, until it finds one whose contents satisfy a compiled regular expression that the function receives as an argument. It then returns the substring of the file’s contents corresponding to the first parenthesized group in the regular expression, or None when no satisfactory file is found. As the leading underscore in the names indicates, these are all internal functions, used only as implementation details of the only function this module is meant to expose, namely findIECookie, which uses the other functions to locate and return the value of a specific cookie for a given domain.
An alternative to this recipe would be to write a Python extension, or use calldll or ctypes, to access the InternetGetCookie API function in Wininet.DLL, as documented on MSDN (Microsoft Developer Network).
The Unofficial Cookie FAQ (http://www.cookiecentral.com/faq/) is chock-full of information on cookies; documentation for win32api and win32con in PyWin32 (http://starship.python.net/crew/mhammond/win32/Downloads.html) or ActivePython (http://www.activestate.com/ActivePython/); Windows API documentation available from Microsoft (http://msdn.microsoft.com); Mark Hammond and Andy Robinson, Python Programming on Win32 (O’Reilly); calldll is available at Sam Rushing’s page (http://www.nightmare.com/~rushing/dynwin/); ctypes is at http://sourceforge.net/projects/ctypes.
Credit: Moshe Zadka, Premshree Pillai, Anna Martelli Ravenscroft
OPML (Outline Processor Markup Language) is a standard file format for sharing subscription lists used by RSS (Really Simple Syndication) feed readers and aggregators. You want to share your subscription list, but your blogging site provides only a FOAF (Friend-Of-A-Friend) page, not one in the standard OPML format.
Use urllib2 to open and read the FOAF page and xml.dom to parse the data received; then, output the data in the proper OPML format to a file. For example, LiveJournal is a popular blogging site that provides FOAF pages; here’s a module with the functions you need to turn those pages into OPML files:
#!/usr/bin/python
import sys
import urllib2
import HTMLParser
from xml.dom import minidom, Node

def getElements(node, uri, name):
    ''' recursively yield all elements w/given namespace URI and name '''
    if (node.nodeType == Node.ELEMENT_NODE and
            node.namespaceURI == uri and
            node.localName == name):
        yield node
    for node in node.childNodes:
        for node in getElements(node, uri, name):
            yield node

class LinkGetter(HTMLParser.HTMLParser):
    ''' HTML parser subclass which collects attributes of link tags '''
    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == 'link':
            self.links.append(attrs)

def getRSS(page):
    ''' given a `page' URL, returns the HREF to the RSS link '''
    contents = urllib2.urlopen(page)
    lg = LinkGetter()
    try:
        lg.feed(contents.read(1000))
    except HTMLParser.HTMLParseError:
        pass
    links = map(dict, lg.links)
    for link in links:
        if (link.get('rel') == 'alternate' and
                link.get('type') == 'application/rss+xml'):
            return link.get('href')

def getNicks(doc):
    ''' given an XML document's DOM, `doc', yields a triple of info for
        each contact: nickname, blog URL, RSS URL '''
    for element in getElements(doc, 'http://xmlns.com/foaf/0.1/', 'knows'):
        person, = getElements(element, 'http://xmlns.com/foaf/0.1/', 'Person')
        nick, = getElements(person, 'http://xmlns.com/foaf/0.1/', 'nick')
        text, = nick.childNodes
        nickText = text.toxml()
        blog, = getElements(person, 'http://xmlns.com/foaf/0.1/', 'weblog')
        blogLocation = blog.getAttributeNS(
            'http://www.w3.org/1999/02/22-rdf-syntax-ns#', 'resource')
        rss = getRSS(blogLocation)
        if rss:
            yield nickText, blogLocation, rss

def nickToOPMLFragment((nick, blogLocation, rss)):
    ''' given a triple (nickname, blog URL, RSS URL), returns a string
        that's the proper OPML outline tag representing that info '''
    return '''
    <outline text="%(nick)s"
             htmlUrl="%(blogLocation)s"
             type="rss"
             xmlUrl="%(rss)s"/>
    ''' % dict(nick=nick, blogLocation=blogLocation, rss=rss)

def nicksToOPML(fout, nicks):
    ''' writes to file `fout' the OPML document representing the
        iterable of contact information `nicks' '''
    fout.write('''<?xml version="1.0" encoding="utf-8"?>
<opml version="1.0">
<head><title>Subscriptions</title></head>
<body><outline title="Subscriptions">
''')
    for nick in nicks:
        print nick
        fout.write(nickToOPMLFragment(nick))
    fout.write("</outline></body></opml>\n")

def docToOPML(fout, doc):
    ''' writes to file `fout' the OPML for XML DOM `doc' '''
    nicksToOPML(fout, getNicks(doc))

def convertFOAFToOPML(foaf, opml):
    ''' given URL `foaf' to a FOAF page, writes its OPML equivalent
        to a file named by string `opml' '''
    f = urllib2.urlopen(foaf)
    doc = minidom.parse(f)
    docToOPML(file(opml, 'w'), doc)

def getLJUser(user):
    ''' writes an OPML file `user'.opml for livejournal's FOAF page '''
    convertFOAFToOPML('http://www.livejournal.com/users/%s/data/foaf' % user,
                      user + ".opml")

if __name__ == '__main__':
    # example, when this module is run as a main script
    getLJUser('moshez')
RSS feeds have become extremely popular for reading news, blogs, wikis, and so on. OPML is one of the standard file formats used to share subscription lists among RSS fans. This recipe generates an OPML file that can be opened with any RSS reader. With an OPML file, you can share your favorite subscriptions with anyone you like, publish it to the Web, and so on.
getElements is a convenience function that gets written in almost every XML DOM-processing application. It recursively scans the document, finding nodes that satisfy certain criteria. This version of getElements is somewhat quick and dirty, but it is good enough for our purposes. getNicks is where the heart of the parsing brains lies. It calls getElements to look for “foaf:knows” nodes, and inside those, it looks for the “foaf:nick” element, which contains the LiveJournal nickname of the user, and uses a generator to yield the nicknames in this FOAF document.
Note an important idiom used four times in the body of getNicks:

name, = some iterable
The key is the comma after name, which turns the left-hand side of this assignment into a one-item tuple, making the assignment into what’s technically known as an unpacking assignment. Unpacking assignments are of course very popular in Python (see Recipe 19.4 for a technique to make them even more widely applicable) but normally with at least two names on the left of the assignment, such as:

aname, another = iterable yielding 2 items
The idiom used in getNicks has exactly the same function, but it demands that the iterable yield exactly one item (otherwise, Python raises a ValueError exception). Therefore, the idiom has the same semantics as:

_templist = some iterable
if len(_templist) != 1:
    raise ValueError, 'too many values to unpack'
name = _templist[0]
del _templist
Obviously, the name, = ... idiom is much cleaner and more compact than this equivalent snippet, which is worth keeping in mind for the next time you need to express the same semantics.
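A quick interactive session shows both the success and the failure mode of the one-item unpacking idiom:

>>> name, = [23]          # iterable yields exactly one item: fine
>>> name
23
>>> name, = [23, 45]      # two items: the unpacking fails
Traceback (most recent call last):
  ...
ValueError: too many values to unpack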
nicksToOPML, together with its helper function nickToOPMLFragment, generates the OPML, while docToOPML ties together getNicks and nicksToOPML into a FOAF->OPML converter. convertFOAFToOPML is the main function, which actually interacts with the operating system (accessing the network to get the FOAF, and using a file to save the OPML).
The recipe has a specific function getLJUser(user) to work with the LiveJournal (http://www.livejournal.com) friends lists. However, the point is that the main convertFOAFToOPML function is general enough to use for other sites as well. The various helper functions can also come in handy in your own different but related tasks. For example, the getRSS function (with some aid from its helper class LinkGetter) finds and returns a link to the RSS feed (if one exists) for a given web site.
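For instance, assuming the recipe's code is saved as foaf2opml.py (a hypothetical module name), you could call getRSS on its own to probe any blog's homepage:

from foaf2opml import getRSS   # hypothetical name for the recipe's module
print getRSS('http://www.livejournal.com/users/moshez/')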
About OPML, http://feeds.scripting.com/whatIsOpml; for more on RSS readers, http://blogspace.com/rss/readers; for FOAF Vocabulary Specification, http://xmlns.com/foaf/0.1/.
Credit: Valentino Volonghi, Peter Cogolo
You need to aggregate potentially very high numbers of RSS feeds, with top performance and scalability.
Parsing RSS feeds in Python is best done with Mark Pilgrim’s Universal Feed Parser from http://www.feedparser.org, but aggregation requires a lot of network activity, in addition to parsing.
As for any network task demanding high performance, Twisted is a good starting point. Say that you have in out.py a module that binds a huge list of RSS feed names to a variable named rss_feed, each feed name represented as a tuple consisting of a URL and a description (e.g., you can download a module exactly like this from http://xoomer.virgilio.it/dialtone/out.py). You can then build an aggregator server on top of that list, as follows:
#!/usr/bin/python
from twisted.internet import reactor, protocol, defer
from twisted.web import client
import feedparser, time, sys, cStringIO
from out import rss_feed as rss_feeds

DEFERRED_GROUPS = 60     # Number of simultaneous connections
INTER_QUERY_TIME = 300   # Max Age (in seconds) of each feed in the cache
TIMEOUT = 30             # Timeout in seconds for the web request

# dict cache's structure will be the following: { 'URL': (TIMESTAMP, value) }
cache = {}

class FeederProtocol(object):
    def __init__(self):
        self.parsed = 0
        self.error_list = []
    def isCached(self, site):
        ''' do we have site's feed cached (from not too long ago)? '''
        # how long since we last cached it (if never cached, since Jan 1 1970)
        elapsed_time = time.time() - cache.get(site, (0, 0))[0]
        return elapsed_time < INTER_QUERY_TIME
    def gotError(self, traceback, extra_args):
        ''' an error has occurred, print traceback info then go on '''
        print traceback, extra_args
        self.error_list.append(extra_args)
    def getPageFromMemory(self, data, addr):
        ''' callback for a cached page: ignore data, get feed from cache '''
        return defer.succeed(cache[addr][1])
    def parseFeed(self, feed):
        ''' wrap feedparser.parse to parse a string '''
        try:
            feed + ''
        except TypeError:
            feed = str(feed)
        return feedparser.parse(cStringIO.StringIO(feed))
    def memoize(self, feed, addr):
        ''' cache result from feedparser.parse, and pass it on '''
        cache[addr] = time.time(), feed
        return feed
    def workOnPage(self, parsed_feed, addr):
        ''' just provide some logged feedback on a channel feed '''
        chan = parsed_feed.get('channel', None)
        if chan:
            print chan.get('title', '(no channel title?)')
        return parsed_feed
    def stopWorking(self, data=None):
        ''' just for testing: we close after parsing a number of feeds.
            Override depending on protocol/interface you use to communicate
            with this RSS aggregator server. '''
        print "Closing connection number %d..." % self.parsed
        print "=-" * 20
        self.parsed += 1
        print 'Parsed', self.parsed, 'of', self.END_VALUE
        if self.parsed >= self.END_VALUE:
            print "Closing all..."
            if self.error_list:
                print 'Observed', len(self.error_list), 'errors'
                for i in self.error_list:
                    print i
            reactor.stop()
    def getPage(self, data, args):
        return client.getPage(args, timeout=TIMEOUT)
    def printStatus(self, data=None):
        print "Starting feed group..."
    def start(self, data=None, standalone=True):
        d = defer.succeed(self.printStatus())
        for feed in data:
            if self.isCached(feed):
                d.addCallback(self.getPageFromMemory, feed)
                d.addErrback(self.gotError, (feed, 'getting from memory'))
            else:
                # not cached, go and get it from the web directly
                d.addCallback(self.getPage, feed)
                d.addErrback(self.gotError, (feed, 'getting'))
            # once gotten, parse the feed and diagnose possible errors
            d.addCallback(self.parseFeed)
            d.addErrback(self.gotError, (feed, 'parsing'))
            # put the parsed structure in the cache and pass it on
            d.addCallback(self.memoize, feed)
            d.addErrback(self.gotError, (feed, 'memoizing'))
            # now one way or another we have the parsed structure, to
            # use or display in whatever way is most appropriate
            d.addCallback(self.workOnPage, feed)
            d.addErrback(self.gotError, (feed, 'working on page'))
            # for testing purposes only, stop working on each feed at once
            if standalone:
                d.addCallback(self.stopWorking)
                d.addErrback(self.gotError, (feed, 'while stopping'))
        if not standalone:
            return d

class FeederFactory(protocol.ClientFactory):
    protocol = FeederProtocol()
    def __init__(self, standalone=False):
        self.feeds = self.getFeeds()
        self.standalone = standalone
        self.protocol.factory = self
        self.protocol.END_VALUE = len(self.feeds)   # this is just for testing
        if standalone:
            self.start(self.feeds)
    def start(self, addresses):
        # Divide into groups all the feeds to download
        if len(addresses) > DEFERRED_GROUPS:
            url_groups = [[] for x in xrange(DEFERRED_GROUPS)]
            for i, addr in enumerate(addresses):
                url_groups[i % DEFERRED_GROUPS].append(addr[0])
        else:
            url_groups = [[addr[0]] for addr in addresses]
        for group in url_groups:
            if not self.standalone:
                return self.protocol.start(group, self.standalone)
            else:
                self.protocol.start(group, self.standalone)
    def getFeeds(self, where=None):
        # used for a complete refresh of the feeds, or for testing purposes
        if where is None:
            return rss_feeds
        return None

if __name__ == "__main__":
    f = FeederFactory(standalone=True)
    reactor.run()
RSS is a lightweight XML format designed for sharing headlines, news, blogs, and other web contents. Mark Pilgrim’s Universal Feed Parser (http://www.feedparser.org) does a great job of parsing “feeds” that can be in various dialects of RSS format into a uniform memory representation based on Python dictionaries. This recipe builds on top of feedparser to provide a full-featured RSS aggregator.
This recipe is scalable to very high numbers of feeds and is usable in multiclient environments. Both characteristics depend essentially on this recipe being built with the powerful Twisted framework for asynchronous network programming. A simple web interface built with Nevow (from http://www.nevow.com) is also part of the latest complete package for this aggregator, which you can download from my blog at http://vvolonghi.blogspot.com/.
An important characteristic of this recipe’s code is that you can easily set the following operating parameters to improve performance:
Number of parallel connections to use for feed downloading
Timeout for each feed request
Maximum age of a feed in the aggregator’s cache
Being able to set these parameters helps you balance performance, network load, and load on the machine on which you’re running the aggregator.
Universal Feed Parser is at http://www.feedparser.org; the latest version of this RSS aggregator is at http://vvolonghi.blogspot.com/; Twisted is at http://twistedmatrix.com/.
Credit: Valentino Volonghi
You need to turn some Python data into web pages based on templates, meaning files or strings of HTML code in which the data gets suitably inserted.
Templating with Python can be accomplished in an incredible number of ways, but my favorite is Nevow.
The Nevow web toolkit works with the Twisted networking framework to provide excellent templating capabilities to web sites that are coded on the basis of Twisted’s powerful asynchronous model. For example, here’s one way to render a list of dictionaries into a web page according to a template, with Nevow and Twisted:
from twisted.application import service, internet
from nevow import rend, loaders, appserver

dct = [{'name':'Mark', 'surname':'White', 'age':'45'},
       {'name':'Valentino', 'surname':'Volonghi', 'age':'21'},
       {'name':'Peter', 'surname':'Parker', 'age':'Unknown'},
      ]

class Pg(rend.Page):
    docFactory = loaders.htmlstr("""
    <html><head><title>Names, Surnames and Ages</title></head>
    <body>
        <ul nevow:data="dct" nevow:render="sequence">
            <li nevow:pattern="item" nevow:render="mapping">
                <span><nevow:slot name="name"/> </span>
                <span><nevow:slot name="surname"/> </span>
                <span><nevow:slot name="age"/></span>
            </li>
        </ul>
    </body>
    </html>
    """)
    def __init__(self, dct):
        self.data_dct = dct
        rend.Page.__init__(self)

site = appserver.NevowSite( Pg(dct) )
application = service.Application("example")
internet.TCPServer(8080, site).setServiceParent(application)
Save this code to nsa.tac. Now, entering at a shell command prompt twistd -noy nsa.tac serves up the data, formatted into HTML as the template specifies, as a tiny web site. You can visit the site, at http://localhost:8080, by running a browser on the same computer where the twistd command is running. On the command window where twistd is running, you’ll see a lot of information, roughly equivalent to a typical web server’s log file.
This recipe uses Twisted (http://www.twistedmatrix.com) for serving a little web site built with Nevow (http://nevow.com/). Twisted is a large and powerful framework for writing all kinds of Python programs that interact with the network (including, of course, web servers). Nevow is a web application construction kit, normally used in cooperation with a Twisted server but usable in other ways. For example, you could write Nevow CGI scripts that can run with any web server. (Unfortunately, CGI scripts’ performance might prove unsatisfactory for many applications, while Twisted’s performance and scalability are outstanding.)
A vast range of choices is available for packages you can use to perform templating with Python. You can look up some of them at http://www.webwareforpython.org/Papers/Templates/ (which lists a dozen packages suitable for use with the Webware web development toolkit), and specific ones at http://htmltmpl.sourceforge.net/, http://freespace.virgin.net/hamish.sanderson/htmltemplate.html, http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52305, http://www.alcyone.com/pyos/empy/, http://www.entrian.com/PyMeld/... and many, many more besides. I definitely don’t claim to have thoroughly tried each and every one of these dozens of templating systems in production situations, and I wonder whether anyone can truthfully make such a claim! However, out of all I have tried, my favorite is Nevow.
Nevow builds web pages by working on the HTML DOM tree. Recipe 14.14 shows how you can build such a DOM tree from within your program by using the stan subsystem of Nevow. This recipe shows that you can also build a DOM tree from HTML source, known as a template. In this case, for simplicity, we keep the template source in a string in our code and load the DOM for it by calling loaders.htmlstr; more commonly, we would keep the template source in a separate .html file and load the DOM for it by calling loaders.htmlfile.
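For example, had we saved the template markup into its own file, the page class might be coded as follows (a sketch, assuming the markup from this recipe's script has been saved to a file named template.html next to the .tac file):

class PgFromFile(rend.Page):
    # load the template's DOM from an external file rather than a string
    docFactory = loaders.htmlfile('template.html')
    def __init__(self, dct):
        self.data_dct = dct
        rend.Page.__init__(self)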
Examining the HTML string, you will notice that it contains, besides standard HTML tags and attributes, a few attributes and one tag from the 'nevow:' namespace, such as 'nevow:slot', 'nevow:data', and 'nevow:render'. These additions are in accord with the HTML standards and, in practice, work with all browsers. They amount to Nevow defining its own small supplementary namespace, so that HTML templates can express directives to Nevow for building a dynamic page from the template together with data coming from Python code. Note that the attributes and tags in the 'nevow:' namespace do not remain in the HTML output from Nevow: you can verify that, as you visit the web page served by this recipe's script, by asking your browser to "view source". Nevertheless, it's important that template files be perfectly correct HTML: this means those files can be edited with all kinds of specialized HTML editor programs! So, like many other templating systems, Nevow chooses to have correct HTML as its input, as well as (of course) as its output.
The 'nevow:data' directive defines the source of the data for the page: in this case, we use the data_dct attribute of the Pg class instance that is building the page. The 'nevow:render' directive defines the method to use for rendering the data into HTML strings. In this case, we use two standard rendering methods supplied by Nevow itself: sequence, for rendering the items of a sequence, such as a list, one after the other; and mapping, for rendering the items of a mapping, such as a dictionary, based on the items' keys appearing as name attributes of nevow:slot tags. More generally, we could code our own rendering methods in any class that subclasses rend.Page.
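Here is a minimal sketch of what such a custom rendering method might look like (the render_shout method and this tiny page are illustrative assumptions, not part of the recipe): conventionally, a template directive such as nevow:render="shout" gets looked up as a method named render_shout on the page.

class ShoutPg(rend.Page):
    docFactory = loaders.htmlstr("""
    <html><body>
      <p nevow:data="dct" nevow:render="shout" />
    </body></html>
    """)
    data_dct = 'hello there'          # data for the nevow:data directive
    def render_shout(self, ctx, data):
        # 'data' is whatever nevow:data supplied; we return the
        # template's tag (ctx.tag) filled with an upper-cased version
        return ctx.tag[str(data).upper()]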
After defining the Pg class, the recipe continues by building a site object, then an application object, then a TCP server on port 8080 using that site and application; all of this building makes up a common Twisted idiom. The source file nsa.tac, into which you save the code from this recipe, is not meant to be run with the usual python interpreter. Rather, you should run nsa.tac with the twistd command that you installed as part of Twisted's own installation procedure: twistd handles all the startup, daemonization, and logging issues, depending on the flags we pass to it. That is exactly why, by convention, one should normally use the file extension .tac, rather than .py, for source files that are meant to be run with twistd rather than directly with python: this choice avoids any confusion.
Given the experimental, toy-like nature of this recipe, you should pass the flags -noy, which ask twistd to run in the foreground and to "log" information to standard output rather than to some file. An even better idea is to read up on twistd in the Twisted documentation, to learn about all the available flags and options.
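For instance (illustrative invocations; twistd --help shows the full, authoritative set of options):

twistd -noy nsa.tac     # run in foreground, log to standard output
twistd -y nsa.tac       # daemonize; by default, log to file twistd.log
twistd --help           # list all of twistd's flags and options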
Twisted is at http://www.twistedmatrix.com; Nevow is at http://nevow.com/.
Credit: Valentino Volonghi, Matt Goodall
You’re writing a web application that uses the Twisted networking framework and the Nevow subsystem for web rendering. You need to be able to render some arbitrary Python objects to a web page.
Interfaces and adapters are the Twisted and Nevow approach to this task. Here is a toy example web server script to show how they work:
from twisted.application import internet, service
from nevow import appserver, compy, inevow, loaders, rend
from nevow import tags as T

# Define some simple classes to be the example's "application data"
class Person(object):
    def __init__(self, firstName, lastName, nickname):
        self.firstName = firstName
        self.lastName = lastName
        self.nickname = nickname

class Bookmark(object):
    def __init__(self, name, url):
        self.name = name
        self.url = url

# Adapter subclasses are the right way to join application data to the web:
class PersonView(compy.Adapter):
    """ Render a full view of a Person. """
    __implements__ = inevow.IRenderer
    attrs = 'firstName', 'lastName', 'nickname'
    def rend(self, data):
        return T.div(_class="View person")[
            T.p['Person'],
            T.dl[ [(T.dt[attr], T.dd[getattr(self.original, attr)])
                   for attr in self.attrs] ]
            ]

class BookmarkView(compy.Adapter):
    """ Render a full view of a Bookmark. """
    __implements__ = inevow.IRenderer
    attrs = 'name', 'url'
    def rend(self, data):
        return T.div(_class="View bookmark")[
            T.p['Bookmark'],
            T.dl[ [(T.dt[attr], T.dd[getattr(self.original, attr)])
                   for attr in self.attrs] ]
            ]

# register the rendering adapters (could be done from a config text file)
compy.registerAdapter(PersonView, Person, inevow.IRenderer)
compy.registerAdapter(BookmarkView, Bookmark, inevow.IRenderer)

# some example data instances for the 'application'
objs = [ Person('Valentino', 'Volonghi', 'dialtone'),
         Person('Matt', 'Goodall', 'mg'),
         Bookmark('Nevow', 'http://www.nevow.com'),
         Person('Alex', 'Martelli', 'aleax'),
         Bookmark('Alex', 'http://www.aleax.it/'),
         Bookmark('Twisted', 'http://twistedmatrix.com/'),
         Bookmark('Python', 'http://www.python.org'),
       ]

# a simple Page that renders a list of objects
class Page(rend.Page):
    def render_item(self, ctx, data):
        return inevow.IRenderer(data)
    docFactory = loaders.stan(
        T.html[
            T.body[
                T.ul(data=objs, render=rend.sequence)[
                    T.li(pattern='item')[render_item],
                    ],
                ],
            ]
        )

# start this very-special-purpose tiny toy webserver:
application = service.Application('irenderer')
httpd = internet.TCPServer(8000, appserver.NevowSite(Page()))
httpd.setServiceParent(application)
This recipe's purpose is to provide an example of how to get Nevow to render instances of application classes directly to a web page. To supply this example, the recipe shows two classes, Person and Bookmark, whose instances contain information that, one can suppose, comes from a database, from a file, or from some other site on the Web.
A key point is that the application classes do not get altered in any way to allow their instances to be rendered onto web pages: rather, adaptation is used to allow instances of such classes to be rendered through separate renderer-adapter classes.
We need two different adapters, one each for Person and Bookmark. We code the two adapters as classes PersonView and BookmarkView, each inheriting from compy.Adapter and overriding the rend method. compy.Adapter is an abstract superclass intended just for this purpose: it accepts as its constructor argument an object that must be adapted to another interface, and it holds that object as self.original for its subclasses' benefit. Each subclass asserts that it implements inevow.IRenderer by listing that interface in its class-level __implements__ attribute.
inevow.IRenderer is an interface that supplies a rend method. The Nevow rendering pipeline knows about IRenderer and calls the interface's rend method to serialize objects to HTML. Objects that implement the interface (on their own behalf or as adapters of other objects) can thus directly become part of the rendering pipeline.
The two key statements of this recipe are the two calls to the registerAdapter function of Nevow's compy module:

compy.registerAdapter(PersonView, Person, inevow.IRenderer)
compy.registerAdapter(BookmarkView, Bookmark, inevow.IRenderer)
These calls tell Nevow that PersonView is the class to use to adapt any instance of Person to the IRenderer interface, and similarly for BookmarkView and Bookmark. So, when the IRenderer interface is called with an instance p of Person as its argument, it automatically returns an adapter that is an instance of PersonView with p as its self.original (and, again, similarly for Bookmark).
Note how neatly this approach distributes appropriate knowledge to the various parts of the software, minimizing coupling among them while strengthening cohesion within each. Nevow itself has no built-in knowledge of any application class, nor of any specific adapter, and it needs no such knowledge: Nevow just specifies the IRenderer interface it needs for rendering and the registerAdapter function used to inform the framework about adaptation connections. Application-level classes neither have nor need any knowledge of the framework at all. Each adapter class knows about the application-level class it's adapting, the interface it's implementing, and such framework-supplied utilities as the Adapter base class (which just factors out a little repetitive coding that would otherwise be needed) and the tags mechanism. (The tags mechanism eases dynamic generation of HTML output. However, you could code adapters to return strings with HTML markup directly, if that suited the needs of your specific application better than the tags mechanism does.)
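For instance, such a string-building adapter might look like this sketch (the BookmarkLinkView class is hypothetical, and we wrap the string in T.xml on the assumption that ready-made markup must be explicitly flagged as such, lest the pipeline escape it as plain text):

class BookmarkLinkView(compy.Adapter):
    """ Render a Bookmark as a bare hyperlink, built as an HTML string. """
    __implements__ = inevow.IRenderer
    def rend(self, data):
        markup = '<a href="%s">%s</a>' % (self.original.url,
                                          self.original.name)
        return T.xml(markup)   # T.xml marks the string as ready-made markup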
Finally, the recipe includes an example Page class that ties everything together, again using tags, for convenience, to generate the output. Page uses (explicitly) the rend.sequence renderer provided by Nevow to loop over a sequence and render each item, and (implicitly) the various adapters, by "casting" each item to the IRenderer interface. The recipe ends with three lines that build Twisted application and service objects and put them together, so that running this recipe's script with Twisted's twistd general-purpose daemon provides a small, one-page demonstration web site running on the local host at port 8000.
A more complete (and complicated) version of this recipe can be found as part of the Nevow 0.3 distribution, downloadable from http://www.nevow.com, as examples/irenderer.tac.
Nevow is at http://www.nevow.com; Twisted is at http://twistedmatrix.com/.