The PyXML package contains XML parsers, including PyExpat, as well as support for SAX and DOM, and much more. While learning the ropes of the PyXML package, it would be nice to have a comprehensive list of all the classes and methods. Since this is a programming book, it seems appropriate to write a Python program to extract the information we need—and in XML, no less!
Let’s generate an XML file that details each of the files in the PyXML package, the classes therein, and the methods of each class. This process allows us to generate quick, usable XML. Example 3-8 is not a replacement for all the snazzy code-to-documentation generators out there; it shows a simple, quick way to generate XML that we can experiment with and use throughout the examples in this chapter. After all, when manipulating XML, it helps to have a few hundred thousand bytes of it sitting around to play with.
(This program also demonstrates how simple it is to examine all the files in a directory tree using the os.path.walk function.)

    """
    genxml.py

    Descends PyXML tree, indexing source files and creating
    XML tags for use in navigating the source.
    """
    import os
    import sys

    from xml.sax.saxutils import escape

    def process(filename, fp):
        print "* Processing:", filename,

        # parse the file
        pyFile = open(filename)
        fp.write("<file name='" + filename + "'>\n")
        inClass = 0
        line = pyFile.readline()
        while line:
            line = line.strip()
            if line.startswith("class") and line[-1] == ":":
                if inClass:
                    fp.write("  </class>\n")
                inClass = 1
                fp.write("  <class name='" + line[:-1] + "'>\n")
            elif line.startswith("def") and line[-1] == ":" and inClass:
                fp.write("    <method name='" + escape(line[:-1]) + "'/>\n")
            line = pyFile.readline()
        pyFile.close()
        if inClass:
            fp.write("  </class>\n")
            inClass = 0
        fp.write("</file>\n")

    def finder(fp, dirname, names):
        """Process each Python source file in the directory dirname."""
        for name in names:
            if name.endswith(".py"):
                path = os.path.join(dirname, name)
                if os.path.isfile(path):
                    process(path, fp)

    def main():
        print "[genxml.py started]"
        xmlFd = open("pyxml.xml", "w")
        xmlFd.write("<?xml version=\"1.0\"?>\n")
        xmlFd.write("<pyxml>\n")
        os.path.walk(sys.argv[1], finder, xmlFd)
        xmlFd.write("</pyxml>")
        xmlFd.close()
        print "[genxml.py finished]"

    if __name__ == "__main__":
        main()
The main function in Example 3-8 uses the os.path.walk function to search your PyXML directory for Python files. As it visits each directory beneath the starting directory (the argument to the script), os.path.walk calls finder, which in turn calls the process function for each Python source file to extract class information. That function writes the extracted information into the open XML file.
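Note that os.path.walk exists only in Python 2; it was removed in Python 3, where os.walk is the usual replacement. The following is a minimal sketch of the same traversal, assuming Python 3. The helper name find_python_files is hypothetical and is not part of genxml.py.

```python
import os

def find_python_files(top):
    """Yield the path of every .py file beneath the directory top."""
    # os.walk yields one (dirname, subdirectories, filenames) tuple
    # for each directory it visits, starting with top itself.
    for dirname, dirnames, filenames in os.walk(top):
        for name in sorted(filenames):
            if name.endswith(".py"):
                yield os.path.join(dirname, name)
```

Because os.walk is a generator, no callback such as finder is needed; a caller could simply loop over find_python_files(sys.argv[1]) and pass each path to a function like process.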
The process function then scans each Python source file line by line, picking out the classes and methods it contains by checking each line for the keywords class and def:
    def process(filename, fp):
        print "* Processing:", filename,

        # parse the file
        pyFile = open(filename)
        fp.write("<file name='" + filename + "'>\n")
        inClass = 0
        line = pyFile.readline()
        while line:
            line = line.strip()
When the program finds a class declaration, it creates the appropriate class tag and attributes within the XML document:
            if line.startswith("class") and line[-1] == ":":
                if inClass:
                    fp.write("  </class>\n")
                inClass = 1
                fp.write("  <class name='" + line[:-1] + "'>\n")
When the program encounters a method definition, it replaces special characters with entities so they don’t cause problems in the XML. The method definition string is trimmed, and then surrounded with the appropriate markup:
            elif line.startswith("def") and line[-1] == ":" and inClass:
                fp.write("    <method name='" + escape(line[:-1]) + "'/>\n")
            line = pyFile.readline()
After a file is complete, the program closes out the last class it was in, if any, and closes out the file tag as well:
        pyFile.close()
        if inClass:
            fp.write("  </class>\n")
            inClass = 0
        fp.write("</file>\n")
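The scanning logic walked through above can be tried out without touching the filesystem. The following is a minimal sketch, assuming Python 3, that applies the same class and def tests to a list of source lines and returns the XML as a string. The names index_source and sample are hypothetical, and the sketch swaps in quoteattr from xml.sax.saxutils to quote attribute values rather than writing the quotation marks by hand as the example does.

```python
from xml.sax.saxutils import quoteattr

def index_source(filename, lines):
    """Return XML describing the classes and methods found in lines."""
    out = ["<file name=%s>" % quoteattr(filename)]
    in_class = False
    for line in lines:
        line = line.strip()
        # Only lines ending in a colon can open a class or a method.
        if not line.endswith(":"):
            continue
        if line.startswith("class "):
            if in_class:
                out.append("  </class>")  # close the previous class
            in_class = True
            out.append("  <class name=%s>" % quoteattr(line[:-1]))
        elif line.startswith("def ") and in_class:
            out.append("    <method name=%s/>" % quoteattr(line[:-1]))
    if in_class:
        out.append("  </class>")
    out.append("</file>")
    return "\n".join(out)

sample = [
    "class Greeter:",
    "    def greet(self, name):",
    "        return 'hi ' + name",
]
print(index_source("greeter.py", sample))
```

Like the original loop, this sketch ignores functions defined outside a class and has no notion of where a class body ends, so a module-level function that follows a class is still reported as one of its methods.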
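Python simplifies the work of parsing the text. Each line is stripped of surrounding whitespace and its trailing colon, special characters in method definitions are replaced with entities (using the escape function from the xml.sax.saxutils module), and XML tags are placed around class definitions and method names. By default, escape replaces only &, <, and >; quotation marks are replaced only if you pass it an additional mapping of entities.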
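The behavior of escape is easy to check interactively. The short sketch below, assuming Python 3, shows its default replacements, the optional entities mapping, and the related quoteattr function from the same module, which escapes a value and also wraps it in suitable quotation marks. The input strings are made up for illustration.

```python
from xml.sax.saxutils import escape, quoteattr

# By default, only &, <, and > are replaced.
plain = escape("if a < b & c > d")
print(plain)        # if a &lt; b &amp; c &gt; d

# Quotation marks pass through unchanged unless a mapping is supplied.
unquoted = escape('say "hi"')
print(unquoted)     # say "hi"

quoted = escape('say "hi"', {'"': "&quot;"})
print(quoted)       # say &quot;hi&quot;

# quoteattr picks a quoting style that is safe for the value.
attr = quoteattr("def f(self, sep=', ')")
print(attr)         # "def f(self, sep=', ')"
```

A method definition with a quoted default argument, such as the last string here, is the kind of input that would break an attribute delimited by hand-written single quotes, which is why quoteattr can be the safer choice for attribute values.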
To run this program from the shell:
$> python genxml.py /home/chris/PyXML/xml
The parameter to the script is the path to your PyXML source directory (including the xml subdirectory).
The generated XML is placed in a file called pyxml.xml. Each file element looks something like this:

    <file name="../xml/dom/ext/reader/Sax2Lib.py">
      <class name="class LexicalHandler">
        <method name="def xmlDecl(self, version, encoding, standalone)"/>
        <method name="def startDTD(self, doctype, publicID, systemID)"/>
        <method name="def endDTD(self)"/>
        <method name="def startEntity(self, name)"/>
        <method name="def endEntity(self, name)"/>
        <method name="def comment(self, text)"/>
        <method name="def startCDATA(self)"/>
        <method name="def endCDATA(self)"/>
      </class>
      <class name="class EntityRefList">
        <method name="def getLength(self)"/>
        <method name="def getEntityName(self, index)"/>
        <method name="def getEntityRefStart(self, index)"/>
        <method name="def getEntityRefEnd(self, index)"/>
        <method name="def __len__(self)"/>
      </class>
      <class name="class NamespaceHandler">
        <method name="def startNamespaceDeclScope(prefix, uri)"/>
        <method name="def endNamespaceDeclScope(prefix)"/>
      </class>
      <class name="class SAXNotSupportedException(Exception)">
      </class>
    </file>
Note that the name attribute of the file tag varies with the parameter you pass to the script (your PyXML source path).
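Once the file exists, it can be read back with the same SAX machinery used elsewhere in this chapter. The following is a minimal sketch, assuming Python 3, that counts the method elements inside each class element. The handler name MethodCounter is hypothetical, and SAMPLE is an abbreviated copy of the file element shown above rather than the full output.

```python
import xml.sax
from xml.sax.handler import ContentHandler

SAMPLE = b"""<file name="../xml/dom/ext/reader/Sax2Lib.py">
  <class name="class NamespaceHandler">
    <method name="def startNamespaceDeclScope(prefix, uri)"/>
    <method name="def endNamespaceDeclScope(prefix)"/>
  </class>
  <class name="class SAXNotSupportedException(Exception)">
  </class>
</file>"""

class MethodCounter(ContentHandler):
    """Count the method elements inside each class element."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.counts = {}     # class name -> number of methods seen
        self.current = None  # name of the class we are inside, if any

    def startElement(self, name, attrs):
        if name == "class":
            self.current = attrs.get("name", "")
            self.counts[self.current] = 0
        elif name == "method" and self.current is not None:
            self.counts[self.current] += 1

    def endElement(self, name):
        if name == "class":
            self.current = None

counter = MethodCounter()
xml.sax.parseString(SAMPLE, counter)
for class_name, count in counter.counts.items():
    print(class_name, count)
```

To run the same handler over the real file, xml.sax.parse("pyxml.xml", counter) could be used in place of the parseString call.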
Functions not defined as methods in a class are not included by the
simple parsing loop (hey, this isn’t a compiler!), but you should be
aware that the XML support provided by both the standard library and
the PyXML package includes many useful functions—read the reference
documentation for more information on those. The escape
function we use in this script is a
perfect example of this. If you’re new to Python, you’ll find that
little helper functions are characteristic of Python libraries; most
of the small utilities needed to make the larger facilities easier to
use have already been included, allowing you to concentrate on your
application.
If you spend some time reviewing this XML file, you will start to become familiar with the scope of the PyXML toolkit. A script is provided a little later in this chapter that converts this XML to HTML using the SAX API and Python string manipulation features. Figure 3-2 shows the XML within a browser.
You can finish off this program by implementing the PyXMLConversionHandler class. This class generates HTML from the XML file we created earlier. The process allows you to load the HTML file into your browser and see all of the files, classes, and methods within PyXML as formatted text. Create this class, as shown in Example 3-9, in the file handlers.py.
    from xml.sax import ContentHandler

    class PyXMLConversionHandler(ContentHandler):
        """A simple handler implementing 3 methods of the SAX interface."""

        def __init__(self, fp):
            """Save the file object that we generate HTML into."""
            self.fp = fp

        def startDocument(self):
            """Write out the start of the HTML document."""
            self.fp.write("<html><body><b>\n")

        def startElement(self, name, attrs):
            if name == "file":
                # generate start of HTML
                s = attrs.get('name', "")
                self.fp.write("<p>File: %s<br>\n" % s)
                print "* Processing:", s

            elif name == "class":
                self.fp.write("&nbsp;" * 3 + "Class: "
                              + attrs.get('name', "") + "<br>\n")

            elif name == "method":
                self.fp.write("&nbsp;" * 6 + "Method: "
                              + attrs.get('name', "") + "<br>\n")

        def endDocument(self):
            """End the HTML document we're generating."""
            self.fp.write("</b></body></html>")
While the conversion itself is very straightforward, one interesting thing to note is that this class writes its output to a file object passed to the constructor instead of building a string of HTML text in memory. This avoids holding a potentially large buffer in memory and growing it incrementally with many memory copies. If the output is needed as an in-memory string when the process is complete, the creator can provide a StringIO instance as the file to write to; the StringIO implementation is more efficient at building a large string than many string concatenations. This is a Python idiom that has proven its utility over a wide range of projects.
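The file-object idiom can be sketched as follows, assuming Python 3, where StringIO lives in the io module. The handler name ListingHandler and the tiny inline document are hypothetical stand-ins; the handler accepts any object with a write method, so the caller decides whether the output goes to a file, to standard output, or to an in-memory buffer.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler

class ListingHandler(ContentHandler):
    """Write one line of text per class element to the file object fp."""

    def __init__(self, fp):
        ContentHandler.__init__(self)
        self.fp = fp  # anything with a write() method will do

    def startElement(self, name, attrs):
        if name == "class":
            self.fp.write("Class: %s\n" % attrs.get("name", ""))

document = b"<file name='a.py'><class name='class A'/><class name='class B'/></file>"

# Collect the output in memory instead of writing it to a real file.
buffer = io.StringIO()
xml.sax.parseString(document, ListingHandler(buffer))
text = buffer.getvalue()
print(text, end="")
```

Passing sys.stdout or an open file in place of buffer would require no change to the handler itself.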
The main script really isn’t any different from the others we’ve looked at so far. We create the parser and instantiate our handler class, register the handler, and set the parser in motion. This process is shown in Example 3-10.
    #!/usr/bin/env python
    #
    # generates HTML from pyxml.xml

    import sys

    from xml.sax import make_parser
    from handlers import PyXMLConversionHandler

    dh = PyXMLConversionHandler(sys.stdout)
    parser = make_parser()

    parser.setContentHandler(dh)
    parser.parse(sys.stdin)
The output from this script is written to the standard output stream.