Credit: Paul Prescod, co-author of XML Handbook (Prentice-Hall)
XML has become a central technology for all kinds of information exchange. Today, most new file formats and most new protocols are based on XML. It simply isn’t possible to work with the emerging Internet infrastructure without supporting XML. Luckily, Python has supported XML for many releases, and that support has kept growing and maturing year after year.
Python and XML are perfect complements. XML is an open standards way of exchanging information. Python is an open source language that processes the information. Python excels at text processing and at handling complicated data structures. XML is text based and is, above all, a way of exchanging complicated data structures.
That said, working with XML is not so seamless that it requires no effort. There is always some mismatch between the needs of a particular programming language and a language-independent information representation. So you often need to write code that reads (i.e., deserializes or parses) and writes (i.e., serializes) XML.
Parsing XML can be done with code written purely in Python, or with a module that is a C/Python mix. Python comes with the fast Expat parser written in C. Many XML applications use the Expat parser, and one of these recipes accesses Expat directly to build its own concept of an ideal in-memory Python representation of an XML document as a tree of “element” objects (an alternative to the standard DOM approach, which I will mention later in this introduction).
However, although Expat is ubiquitous in the XML world, it is far from being the only parser available, or necessarily the best one for any given application. A standard API called SAX allows any XML parser to be plugged into a Python program. The SAX API is demonstrated in several recipes that perform typical tasks such as checking that an XML document is well formed, extracting text from a document, or counting the tags in a document. These recipes should give you a good understanding of how SAX works. One more advanced recipe shows how to use one of SAX’s several auxiliary features, “filtering”, to normalize “text events” that might otherwise happen to get “fragmented”.
XML-RPC is a protocol built on top of XML for sending data structures from one program to another, typically across the Internet. XML-RPC allows programmers to completely hide the implementation languages of the two communicating components. Two components running on different operating systems, written in different languages, can still communicate easily. XML-RPC is built into Python. This chapter does not deal with XML-RPC, because, together with other alternatives for distributed programming, XML-RPC is covered in Chapter 15.
Other recipes in this chapter are a bit more eclectic, dealing with issues that range from interfacing to proprietary XML parsers and document formats to representing an entire XML document in memory as a Python object. One, in particular, shows how to auto-detect the Unicode encoding that an XML document uses without parsing the document. Unicode is central to the definition of XML, so it’s important to understand Python’s Unicode support if you will be doing any sophisticated work with XML.
The PyXML extension package supplies a variety of useful tools for working with XML. PyXML offers a full implementation of the Document Object Model (DOM)—as opposed to the subset bundled with Python itself—and a validating XML parser written entirely in Python. The DOM is a standard API that loads an entire XML document into memory. This can make XML processing easier for complicated structures in which there are many references from one part of the document to another, or when you need to correlate (i.e., compare) more than one XML document. One recipe shows how to use PyXML’s validating parser to validate and process an XML document, and another shows how to remove whitespace-only text nodes from an XML document’s DOM. You’ll find many other examples in the documentation of the PyXML package (http://pyxml.sourceforge.net/).
Other advanced tools that you can find in PyXML or, in some cases, in FourThought’s open source 4Suite package (http://www.4suite.org/) from which much of PyXML derives, include implementations of a variety of XML-related standards, such as XPath, XSLT, XLink, XPointer, and RDF. If PyXML is already an excellent resource for XML power users in Python, 4Suite is even richer and more powerful.
XML has become so pervasive that, inevitably, you will also find XML-related recipes in other chapters of this book. Recipe 2.26 strips XML markup in a very rough and ready way. Recipe 1.23 shows how to insert XML character references while encoding Unicode text. Recipe 10.17 parses a Mac OS X pinfo-format XML stream to get detailed system information. Recipe 11.10 uses Tkinter to display an XML DOM as a GUI Tree widget. Recipe 14.11 deals with two XML file formats related to RSS[1] feeds, fetching and parsing a FOAF[2]-format input to produce an OPML[3]-format result: quite a typical XML-related task in today’s programming, and a good general example of how Python can help you with such tasks.
For more information on using Python and XML together, see Python and XML by Christopher A. Jones and Fred L. Drake, Jr. (O’Reilly).
Credit: Paul Prescod, Farhad Fouladi
You need to check whether an XML document is well formed (not whether it conforms to a given DTD or schema), and you need to do this check quickly.
SAX (presumably using a fast parser such as Expat underneath) offers a fast, simple way to perform this task. Here is a script to check well-formedness on every file you mention on the script’s command line:
from xml.sax.handler import ContentHandler
from xml.sax import make_parser
from glob import glob
import sys
def parsefile(filename):
    # a do-nothing ContentHandler: we only care whether parsing succeeds
    parser = make_parser()
    parser.setContentHandler(ContentHandler())
    parser.parse(filename)
for arg in sys.argv[1:]:
    for filename in glob(arg):
        try:
            parsefile(filename)
            print "%s is well-formed" % filename
        except Exception, e:
            print "%s is NOT well-formed! %s" % (filename, e)
A text is a well-formed XML document if it adheres to all the basic syntax rules for XML documents. In other words, its XML declaration (if it has one) is correct, it has a single root element, all tags are properly nested, tag attributes are quoted, and so on.
This recipe uses the SAX API with a dummy ContentHandler that does nothing. Generally, when we parse an XML document with SAX, we use a ContentHandler instance to process the document’s contents. But in this case, we only want to know whether the document meets the most fundamental syntax constraints of XML; therefore, we need not do any processing, and the do-nothing handler suffices.
The parsefile function parses the whole document and throws an exception if an error is found. The recipe’s main code catches any such exception and prints it out like this:
$ python wellformed.py test.xml
test.xml is NOT well-formed! test.xml:1002:2: mismatched tag
This means that character 2 on line 1,002 has a mismatched tag.
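The same check can be applied to data already held in memory. The following is a minimal sketch in Python 3 syntax (unlike the recipe, which targets Python 2); the function name is_well_formed is hypothetical, and the sketch assumes the input is a bytestring:

```python
import xml.sax
from xml.sax.handler import ContentHandler


def is_well_formed(data):
    """Return (True, None) if data parses, else (False, error message)."""
    try:
        # A bare ContentHandler ignores every event; only errors matter here.
        xml.sax.parseString(data, ContentHandler())
    except xml.sax.SAXParseException as err:
        message = "%s:%s: %s" % (
            err.getLineNumber(), err.getColumnNumber(), err.getMessage())
        return False, message
    return True, None
```

Calling is_well_formed(b"<a><b></a>") should report a failure together with the position at which the parser gave up, much like the script's command-line output.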
This recipe does not check adherence to a DTD or schema, which is a separate procedure called validation. The performance of the script should be quite good, precisely because it focuses on performing a minimal irreducible core task. However, sometimes you need to squeeze out the last drop of performance because you’re checking the well-formedness of truly huge files. If you know for sure that you do have Expat, specifically, installed on your system, you may alternatively choose to use Expat directly instead of SAX. To try this approach, you can change function parsefile to the following code:
import xml.parsers.expat
def parsefile(file):
    parser = xml.parsers.expat.ParserCreate()
    parser.ParseFile(open(file, "r"))
Don’t expect all that much of an improvement in performance when using Expat directly instead of SAX. However, you might gain a little bit.
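When calling Expat directly, errors surface as ExpatError exceptions rather than SAX exceptions. A minimal sketch in Python 3 syntax follows, assuming a file-like object opened in binary mode; the name expat_check is hypothetical:

```python
import io
from xml.parsers import expat


def expat_check(stream):
    """Parse a binary file-like object.

    Returns None if the document is well formed, else an error string.
    """
    parser = expat.ParserCreate()
    try:
        parser.ParseFile(stream)
    except expat.ExpatError as err:
        # lineno and offset locate the problem; code identifies its kind.
        return "line %d, column %d: %s" % (
            err.lineno, err.offset, expat.ErrorString(err.code))
    return None
```

For a file on disk, something like expat_check(open(filename, "rb")) would be the equivalent of the parsefile variant shown above.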
See Recipe 12.2 and Recipe 12.3 for other uses of SAX. The PyXML package (http://pyxml.sourceforge.net/) includes the pure-Python validating parser xmlproc, which checks the conformance of XML documents to specific DTDs. The PyRXP package from ReportLab is a wrapper around the fast validating parser RXP (http://www.reportlab.com/xml/pyrxp.html), which is available under the GPL license.
Credit: Paul Prescod
You want to get a sense of how often particular elements occur in an XML document, and the relevant counts must be extracted rapidly.
You can subclass SAX’s ContentHandler to make your own specialized classes for any kind of task, including the collection of such statistics:
from xml.sax.handler import ContentHandler
import xml.sax
class countHandler(ContentHandler):
    def __init__(self):
        self.tags = {}
    def startElement(self, name, attr):
        self.tags[name] = 1 + self.tags.get(name, 0)
parser = xml.sax.make_parser()
handler = countHandler()
parser.setContentHandler(handler)
parser.parse("test.xml")
tags = handler.tags.keys()
tags.sort()
for tag in tags:
    print tag, handler.tags[tag]
When I start working with a new XML content set, I like to get a sense of which elements are in it and how often they occur. For this purpose, I use several small variants of this recipe. I could also collect attributes just as easily, as you can see, since attributes are also passed to the startElement method that I’m overriding. If you add a stack, you can also keep track of which elements occur within other elements (for this, of course, you also have to override the endElement method so you can pop the stack).
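The attribute and stack variants just described might look like the following sketch. It uses Python 3 syntax rather than the recipe's Python 2, and the names NestingCounter and count_nesting are hypothetical:

```python
import xml.sax
from xml.sax.handler import ContentHandler


class NestingCounter(ContentHandler):
    """Count tags, (tag, attribute) pairs, and (parent, child) pairs."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.tags = {}
        self.attributes = {}
        self.pairs = {}
        self.stack = []   # names of the elements that are currently open

    def startElement(self, name, attrs):
        self.tags[name] = self.tags.get(name, 0) + 1
        for attr_name in attrs.getNames():
            key = (name, attr_name)
            self.attributes[key] = self.attributes.get(key, 0) + 1
        if self.stack:
            pair = (self.stack[-1], name)
            self.pairs[pair] = self.pairs.get(pair, 0) + 1
        self.stack.append(name)

    def endElement(self, name):
        # Leaving an element: its parent becomes the innermost one again.
        self.stack.pop()


def count_nesting(data):
    """Parse a bytestring and return the handler holding the counts."""
    handler = NestingCounter()
    xml.sax.parseString(data, handler)
    return handler
```

After count_nesting(data), the pairs dictionary tells you how often each element occurs directly inside each other element.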
This recipe also works well as a simple example of a SAX application, usable as the basis for any SAX application. Alternatives to SAX include pulldom and minidom. For any simple processing (including this example), these alternatives would be overkill, particularly if the document you are processing is very large. DOM approaches are generally justified only when you need to perform complicated editing and alteration on an XML document, when the document itself is made complicated by references that go back and forth inside it, or when you need to correlate (i.e., compare) multiple documents.
ContentHandler subclasses offer many other options, and the online Python documentation does a pretty good job of explaining them. This recipe’s countHandler class overrides ContentHandler’s startElement method, which the parser calls at the start of each element, passing as arguments the element’s tag name as a Unicode string and the collection of attributes. Our override of this method counts the number of times each tag name occurs. In the end, we extract the dictionary used for counting and emit it (in alphabetical order, which we easily obtain by sorting the keys).
See Recipe 12.3 for other uses of SAX.
Credit: Paul Prescod
Once again, subclassing SAX’s ContentHandler makes this task quite easy:
from xml.sax.handler import ContentHandler
import xml.sax
import sys
class textHandler(ContentHandler):
    def characters(self, ch):
        sys.stdout.write(ch.encode("Latin-1"))
parser = xml.sax.make_parser()
handler = textHandler()
parser.setContentHandler(handler)
parser.parse("test.xml")
Sometimes you want to get rid of XML tags—for example, to re-key a document or to spell-check it. This recipe performs this task and works with any well-formed XML document. It is quite efficient.
In this recipe’s textHandler class, we override ContentHandler’s characters method, which the parser calls for each string of text in the XML document (excluding tags, XML comments, and processing instructions), passing as the only argument the piece of text as a Unicode string. We have to encode this Unicode before we can emit it to standard output. (See Recipe 1.22 for more information about emitting Unicode to standard output.) In this recipe, we’re using the Latin-1 (also known as ISO-8859-1) encoding, which covers all western European alphabets and is supported by many popular output devices (e.g., printers and terminal-emulation windows). However, you should use whatever encoding is most appropriate for the documents you’re handling, as long, of course, as that encoding is supported by the devices you need to use. The configuration of your devices may depend on your operating system’s concepts of locale and code page. Unfortunately, these issues vary too much between operating systems for me to go into further detail.
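One way to keep the choice of output encoding separate from the parsing is to collect the text first and encode it once at the end. The sketch below uses Python 3 syntax; the names TextCollector and extract_text, and the choice of replacing unencodable characters rather than raising an error, are assumptions made for illustration:

```python
import xml.sax
from xml.sax.handler import ContentHandler


class TextCollector(ContentHandler):
    """Accumulate every piece of character data the parser reports."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.pieces = []

    def characters(self, content):
        self.pieces.append(content)


def extract_text(data, encoding="latin-1", errors="replace"):
    """Return the text content of an XML bytestring, encoded as requested."""
    handler = TextCollector()
    xml.sax.parseString(data, handler)
    text = "".join(handler.pieces)
    # Characters the target encoding cannot represent become "?".
    return text.encode(encoding, errors)
```

With errors="replace", a document containing characters outside Latin-1 still produces output, at the cost of losing those characters; passing errors="strict" would raise an exception instead.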
A simple alternative, if you know that handling Unicode is not going to be a problem, is to use sgmllib. It’s not quite as fast but somewhat more robust against XML of dubious well-formedness:
from sgmllib import SGMLParser
class XMLJustText(SGMLParser):
    def handle_data(self, data):
        print data
XMLJustText().feed(open('text.xml').read())
An even simpler and rougher way to extract text from an XML document is shown in Recipe 2.26.
See Recipe 12.1 and Recipe 12.2 for other uses of SAX.
Credit: Paul Prescod
You have XML documents that may use a large variety of Unicode encodings, and you need to find out which encoding each document is using.
This task is one that we need to code ourselves, rather than getting an existing package to perform it, if we want complete generality:
import codecs, encodings
""" Caller will hand this library a buffer string, and ask us to convert
    the buffer, or autodetect what codec the buffer probably uses. """
# 'None' stands for a potentially variable byte ("##" in the XML spec...)
autodetect_dict = { # byte pattern           : encoding name
                    (0x00, 0x00, 0xFE, 0xFF) : ("ucs4_be"),
                    (0xFF, 0xFE, 0x00, 0x00) : ("ucs4_le"),
                    (0xFE, 0xFF, None, None) : ("utf_16_be"),
                    (0xFF, 0xFE, None, None) : ("utf_16_le"),
                    (0x00, 0x3C, 0x00, 0x3F) : ("utf_16_be"),
                    (0x3C, 0x00, 0x3F, 0x00) : ("utf_16_le"),
                    (0x3C, 0x3F, 0x78, 0x6D) : ("utf_8"),
                    (0x4C, 0x6F, 0xA7, 0x94) : ("EBCDIC"),
                  }
def autoDetectXMLEncoding(buffer):
    """ buffer -> encoding_name
        The buffer string should be at least four bytes long.
        Returns None if encoding cannot be detected.
        Note that encoding_name might not have an installed
        decoder (e.g., EBCDIC)
    """
    # A more efficient implementation would not decode the whole
    # buffer at once, but then we'd have to decode a character at
    # a time looking for the quote character, and that's a pain
    encoding = "utf_8" # According to the XML spec, this is the default
                       # This code successively tries to refine the default:
                       # Whenever it fails to refine, it falls back to
                       # the last place encoding was set
    # the dictionary's keys are tuples, so we need a tuple (not a list) here
    bytes = tuple(map(ord, buffer[0:4]))
    enc_info = autodetect_dict.get(bytes, None)
    if not enc_info: # Try autodetection again, removing potentially
                     # variable bytes
        bytes = bytes[0], bytes[1], None, None
        enc_info = autodetect_dict.get(bytes)
    if enc_info:
        encoding = enc_info # We have a guess...these are
                            # the new defaults
    # Try to find a more precise encoding using XML declaration
    secret_decoder_ring = codecs.lookup(encoding)[1]
    decoded, length = secret_decoder_ring(buffer)
    first_line = decoded.split("\n", 1)[0]
    if first_line and first_line.startswith(u"<?xml"):
        encoding_pos = first_line.find(u"encoding")
        if encoding_pos != -1:
            # Look for double quotes
            quote_pos = first_line.find('"', encoding_pos)
            if quote_pos == -1:
                # Look for single quote
                quote_pos = first_line.find("'", encoding_pos)
            if quote_pos > -1:
                quote_char = first_line[quote_pos]
                rest = first_line[quote_pos+1:]
                encoding = rest[:rest.find(quote_char)]
    return encoding
The XML specification describes the outline of an algorithm for detecting the Unicode encoding that an XML document uses. This recipe implements that algorithm and helps your XML-processing programs determine which encoding is being used by a specific document.
The default encoding (unless we can determine another one specifically) must be UTF-8, as it is part of the specifications that define XML. Certain byte patterns in the first four, or sometimes even just the first two, bytes of the text can identify a different encoding. For example, if the text starts with the two bytes 0xFF, 0xFE we can be certain that these bytes are a byte-order mark that identifies the encoding type as little-endian (low byte before high byte in each character) and the encoding itself as UTF-16 (or the 32-bits-per-character UCS-4, if the next two bytes in the text are 0, 0).
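The byte-order-mark part of this reasoning can be sketched on its own with the BOM constants of the standard codecs module. This is an illustration in Python 3 syntax, not the recipe's code: the name sniff_bom is hypothetical, the codec names are Python's utf_32 spellings for the four-byte encodings that the recipe's table calls ucs4, and the UTF-8 byte-order mark is an addition that the recipe's table does not include:

```python
import codecs

# Longer marks are tested first: the UTF-32 little-endian mark begins
# with the same two bytes as the UTF-16 little-endian mark.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf_32_be"),
    (codecs.BOM_UTF32_LE, "utf_32_le"),
    (codecs.BOM_UTF8, "utf_8"),
    (codecs.BOM_UTF16_BE, "utf_16_be"),
    (codecs.BOM_UTF16_LE, "utf_16_le"),
]


def sniff_bom(data):
    """Return a codec name if data starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

A result of None only means that no byte-order mark is present; the text may still be in any encoding, which is why the declaration must be examined as well.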
If we get as far as this, we must also examine the first line of the text. For this purpose, we decode the text from a bytestring into Unicode, with the encoding determined so far, and detect the first line-end '\n' character. If the first line begins with u'<?xml', it’s an XML declaration and may explicitly specify an encoding by using the keyword encoding as an attribute. The nested if statements in the recipe check for that case, and, if they find an encoding thus specified, the recipe returns that encoding as its result. This step is absolutely crucial, since any text starting with the single-byte ASCII-like representation of the XML declaration, <?xml, would be otherwise erroneously identified as encoded in UTF-8, while its explicit encoding attribute may specify it as being, for example, one of the ISO-8859 standard encodings.
This recipe makes the assumption that, as the XML specs require, the XML declaration, if any, is terminated by an end-of-line character. If you need to deal with almost-XML documents that are malformed in this very specific way (i.e., an incorrect XML declaration that is not terminated by an end-of-line character), you may need to apply some heuristic adjustments, for example, through regular expressions. However, it’s impossible to offer precise suggestions, since malformedness may come in such a wide variety of errant forms.
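As one example of such a heuristic, a regular expression can look for the encoding attribute without relying on a line end. The sketch below is in Python 3 syntax and works on raw bytes, so it only applies when the declaration is in an ASCII-compatible encoding; the pattern and the name declared_encoding are assumptions for illustration, not part of the recipe:

```python
import re

# An XML declaration at the very start of the data, containing
# encoding="..." or encoding='...'. The [^>]* part keeps the search
# inside the declaration itself.
_DECL_ENCODING = re.compile(
    br"""^<\?xml(?=\s)[^>]*?\sencoding\s*=\s*(["'])([A-Za-z][A-Za-z0-9._\-]*)\1"""
)


def declared_encoding(data):
    """Return the declared encoding name as a string, or None."""
    match = _DECL_ENCODING.match(data)
    if match is None:
        return None
    return match.group(2).decode("ascii")
```

Because the pattern stops at the first > character, an encoding attribute on the root element is not mistaken for the one in the declaration.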
This code detects a variety of encodings, including some that are not yet supported by Python’s Unicode decoders. So, the fact that you can decipher the encoding does not guarantee that you can then decipher the document itself!
Unicode is a huge topic, but a recommended book is Unicode: A Primer, by Tony Graham (Hungry Minds, Inc.); details are available at http://www.menteith.com/unicode/primer/. Library Reference and Python in a Nutshell document the built-in str and unicode types, and modules unicodedata and codecs. See also Recipe 1.21 and Recipe 1.22.
Credit: John Bair, Christoph Dietze
You want to load an XML document into memory, but you don’t like the complicated access procedures of DOM. You’d prefer something more Pythonic—specifically, you’d like to map the document into a tree of Python objects.
To build our tree of objects, we can directly wrap the fast expat parser:
from xml.parsers import expat
class Element(object):
    ''' A parsed XML element '''
    def __init__(self, name, attributes):
        # Record tagname and attributes dictionary
        self.name = name
        self.attributes = attributes
        # Initialize the element's cdata and children to empty
        self.cdata = ''
        self.children = []
    def addChild(self, element):
        self.children.append(element)
    def getAttribute(self, key):
        return self.attributes.get(key)
    def getData(self):
        return self.cdata
    def getElements(self, name=''):
        if name:
            return [c for c in self.children if c.name == name]
        else:
            return list(self.children)
class Xml2Obj(object):
    ''' XML to Object converter '''
    def __init__(self):
        self.root = None
        self.nodeStack = []
    def StartElement(self, name, attributes):
        'Expat start element event handler'
        # Instantiate an Element object
        element = Element(name.encode(), attributes)
        # Push element onto the stack and make it a child of parent
        if self.nodeStack:
            parent = self.nodeStack[-1]
            parent.addChild(element)
        else:
            self.root = element
        self.nodeStack.append(element)
    def EndElement(self, name):
        'Expat end element event handler'
        # the element is complete: pop it off the stack of open elements
        self.nodeStack.pop()
    def CharacterData(self, data):
        'Expat character data event handler'
        if data.strip():
            data = data.encode()
            element = self.nodeStack[-1]
            element.cdata += data
    def Parse(self, filename):
        # Create an Expat parser
        Parser = expat.ParserCreate()
        # Set the Expat event handlers to our methods
        Parser.StartElementHandler = self.StartElement
        Parser.EndElementHandler = self.EndElement
        Parser.CharacterDataHandler = self.CharacterData
        # Parse the XML File
        ParserStatus = Parser.Parse(open(filename).read(), 1)
        return self.root
parser = Xml2Obj()
root_element = parser.Parse('sample.xml')
I saw Christoph Dietze’s recipe (http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/116539) about turning the structure of an XML document into a simple combination of dictionaries and lists and thought it was a really good idea. This recipe is a variation on that idea, with several differences.
For maximum speed, the recipe uses the low-level expat parser directly. It would get no real added value from the richer SAX interface, much less from the slow and memory-hungry DOM approach. Building the parent-children connections is not hard even with an event-driven interface, as this recipe shows by using a simple stack for the purpose.
The main difference with respect to Dietze’s original idea is that this recipe loads the XML document into a tree of Python objects (rather than a combination of dictionaries and lists), one per node, with nicely named attributes allowing access to each node’s characteristics—tagname, attributes (as a Python dictionary), character data (i.e., cdata in XML parlance) and children elements (as a Python list).
The various accessor methods of class Element are, of course, optional. You might prefer to access the attributes directly. I think they add no complexity and look nicer, but, obviously, your tastes may differ. This is, after all, just a recipe, so feel free to alter the mix of seasonings at will!
You can find other similar ideas (e.g., bypass the DOM, build something more Pythonic as the memory representation of an XML document) in many other excellent and more complete projects, such as PyRXP (http://www.reportlab.org/pyrxp.html), ElementTree (http://effbot.org/zone/element-index.htm), and XIST (http://www.livinglogic.de/Python/xist/).
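For comparison, here is what the same kind of access looks like with ElementTree. This sketch assumes a Python release that bundles ElementTree as xml.etree.ElementTree (releases more recent than the ones this recipe was written for) and uses Python 3 syntax; the sample document is invented for illustration:

```python
import xml.etree.ElementTree as ET

SAMPLE = """<inventory>
  <item sku="a1">Widget</item>
  <item sku="b2">Gadget</item>
  <note>Restock soon</note>
</inventory>"""

root = ET.fromstring(SAMPLE)

# Each element exposes its tag, its attributes, its text, and its
# children (by iterating over the element itself).
items = [(el.get("sku"), el.text) for el in root.findall("item")]
child_tags = [child.tag for child in root]
```

The roles of this recipe's getAttribute, getData, and getElements methods are played there by get, text, and findall respectively.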
Library Reference and Python in a Nutshell document the built-in XML support in the Python Standard Library, and xml.parsers.expat in particular. PyRXP is at http://www.reportlab.org/pyrxp.html; ElementTree is at http://effbot.org/zone/element-index.htm; XIST is at http://www.livinglogic.de/Python/xist/.
Credit: Brian Quinlan, David Wilson
You want to remove, from the DOM representation of an XML document, all the text nodes within a subtree that contain only whitespace.
XML parsers consider several complex conditions when deciding which whitespace-only text nodes to preserve during DOM construction. Unfortunately, the result is often not what you want, so it’s helpful to have a function to remove all whitespace-only text nodes from among a given node’s descendants:
from xml import dom
def remove_whitespace_nodes(node):
    """ Removes all of the whitespace-only text descendants of a DOM node. """
    # prepare the list of text nodes to remove (and recurse when needed)
    remove_list = []
    for child in node.childNodes:
        if child.nodeType == dom.Node.TEXT_NODE and not child.data.strip():
            # add this text node to the to-be-removed list
            remove_list.append(child)
        elif child.hasChildNodes():
            # recurse, it's the simplest way to deal with the subtree
            remove_whitespace_nodes(child)
    # perform the removals
    for node in remove_list:
        node.parentNode.removeChild(node)
        node.unlink()
This recipe’s code works with any correctly implemented Python XML DOM, including the xml.dom.minidom that is part of the Python Standard Library and the more complete DOM implementation that comes with PyXML.
The implementation of function remove_whitespace_nodes is quite simple but rather instructive: in the first for loop we build a list of all child nodes to remove, and then in a second, separate loop we do the removal. This precaution is a good example of a general rule in Python: do not alter the very container you’re looping on. Sometimes you can get away with it, but it is unwise to count on it in the general case. On the other hand, the function can perfectly well call itself recursively within its first for loop because such a call does not alter the very list node.childNodes on which the loop is iterating (it may alter some items in that list, but it does not alter the list object itself).
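Another common way to respect the same rule is to loop over a copy of the container, so that removals from the original cannot disturb the iteration. The following is a sketch of that variant in Python 3 syntax, tried here only against xml.dom.minidom:

```python
from xml.dom import Node, minidom


def remove_whitespace_nodes(node):
    """Remove all whitespace-only text descendants of a DOM node."""
    # list(...) takes a snapshot, so node.childNodes itself can shrink
    # safely while we loop.
    for child in list(node.childNodes):
        if child.nodeType == Node.TEXT_NODE and not child.data.strip():
            node.removeChild(child)
            child.unlink()
        elif child.hasChildNodes():
            remove_whitespace_nodes(child)
```

The snapshot costs one extra list per node visited, which is usually negligible next to the memory the DOM itself occupies.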
Library Reference and Python in a Nutshell document the built-in XML support in the Python Standard Library.
Credit: Thomas Guettler
You have Microsoft Excel spreadsheets saved in XML form, and want to parse them into memory as Python nested lists.
The XML form of Excel spreadsheets is quite simple: all text is in Cell tags, which are nested in Row tags nested in Table tags. SAX makes it quite simple to parse this kind of XML into memory:
import sys
from xml.sax import saxutils, parse
class ExcelHandler(saxutils.DefaultHandler):
    def __init__(self):
        self.chars = []
        self.cells = []
        self.rows = []
        self.tables = []
    def characters(self, content):
        self.chars.append(content)
    def startElement(self, name, atts):
        if name == "Cell":
            self.chars = []
        elif name == "Row":
            self.cells = []
        elif name == "Table":
            self.rows = []
    def endElement(self, name):
        if name == "Cell":
            self.cells.append(''.join(self.chars))
        elif name == "Row":
            self.rows.append(self.cells)
        elif name == "Table":
            self.tables.append(self.rows)
if __name__ == '__main__':
    excelHandler = ExcelHandler()
    parse(sys.argv[1], excelHandler)
    print excelHandler.tables
The structure of the parser presented in this recipe is pleasingly simple: at each of three logical nesting levels of data, we collect content into a list. Each time a tag of a given level begins, we start with an empty list for it; each time the tag ends, we append the tag’s contents to the list of the next upper level. The net result is that the top-level list, the one named tables, accumulates all of the spreadsheet’s contents with the proper structure (a triply nested list). At the lowest level, of course, we join all the text strings that are reported as being within the same cell into a single cell content text string, when we accumulate, because the division between the various strings is just an artefact of the XML parsing process.
For example, consider a tiny spreadsheet with one column and three rows, where the first two rows each hold the number 2 and the third one holds the number 4 obtained by summing the numbers in the first two rows with an Excel formula. The relevant snippet of the Excel XML output (XMLSS format, as Microsoft calls it) is then:
<Table ss:ExpandedColumnCount="1" ss:ExpandedRowCount="3"
       x:FullColumns="1" x:FullRows="1">
  <Row>
    <Cell><Data ss:Type="Number">2</Data></Cell>
  </Row>
  <Row>
    <Cell><Data ss:Type="Number">2</Data></Cell>
  </Row>
  <Row>
    <Cell ss:Formula="=SUM(R[-2]C, R[-1]C)"><Data ss:Type="Number">4</Data></Cell>
  </Row>
</Table>
and running the script in this recipe over this file emits:
[[[u'2'], [u'2'], [u'4']]]
As you can see, the XMLSS file also contains a lot of supplementary information that this recipe is not collecting—the attributes hold information about the type of data (number or string), the formula used for the computation (if any), and so on. If you need any or all of this supplemental information, it’s not hard to enrich this recipe to record and use it.
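One possible enrichment is sketched below in Python 3 syntax. It records, for each cell, the text together with the ss:Type and ss:Formula attributes seen in the snippet above. The names TypedCellHandler and parse_cells are hypothetical, and the sketch assumes that the document uses the ss: prefix exactly as in that snippet, since without namespace processing SAX reports attribute names with their prefixes as written:

```python
import xml.sax
from xml.sax.handler import ContentHandler


class TypedCellHandler(ContentHandler):
    """Collect a (text, type, formula) triple for each cell of each row."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.rows = []
        self.cells = []
        self.chars = []
        self.data_type = None
        self.formula = None

    def startElement(self, name, attrs):
        if name == "Row":
            self.cells = []
        elif name == "Cell":
            self.chars = []
            self.data_type = None
            self.formula = attrs.get("ss:Formula")   # None if absent
        elif name == "Data":
            self.data_type = attrs.get("ss:Type")

    def characters(self, content):
        self.chars.append(content)

    def endElement(self, name):
        if name == "Cell":
            text = "".join(self.chars).strip()
            self.cells.append((text, self.data_type, self.formula))
        elif name == "Row":
            self.rows.append(self.cells)


def parse_cells(data):
    """Parse an XML bytestring and return the list of rows."""
    handler = TypedCellHandler()
    xml.sax.parseString(data, handler)
    return handler.rows
```

Keeping the type next to the text lets later code decide, for example, whether to convert a cell's text to a number.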
Library Reference and Python in a Nutshell document the built-in XML support in the Python Standard Library and SAX in particular.
Credit: Paul Sholtz, Jeroen Jeroen, Marius Gedminas
You are handling XML documents and must check their validity with respect to either internal or external DTDs. You possibly also want to perform application-specific processing during the validation process.
You often want to validate an XML document file with respect to a DOCTYPE declaration that the document file contains. On occasion, though, you may want to force loading of an external DTD from a given file. Moreover, a frequent need is to also perform application-specific processing during validation. A function with optional parameters, using modules from the PyXML package, can accommodate all of these needs:
from xml.parsers.xmlproc import utils, xmlval, xmldtd
def validate_xml_file(xml_filename, app=None, dtd_filename=None):
    # build validating parser object with appropriate error handler
    parser = xmlval.Validator()
    parser.set_error_handler(utils.ErrorPrinter(parser))
    if dtd_filename is not None:
        # DTD file specified, load and set it as the DTD to use
        dtd = xmldtd.load_dtd(dtd_filename)
        parser.val.dtd = parser.dtd = parser.ent = dtd
    if app is not None:
        # Application processing requested, set application object
        parser.set_application(app)
    # everything being set correctly, finally perform the parsing
    parser.parse_resource(xml_filename)
If your XML data is in a string s, rather than in a file, instead of the parser.parse_resource call, you should use the following two statements in a variant of the previously shown function:
    parser.feed(s)
    parser.close()
Documentation on XML parsing in general, and xmlproc in particular, is easy enough to come by. However, XML is a very large subject, and PyXML is a correspondingly large package. The package’s documentation is often not entirely complete and up to date; even if it were, finding out how to perform specific tasks would still take quite a bit of digging. This recipe shows how to validate documents in a simple way that is easy to adapt to your specific needs.
If you need to perform application-specific processing, as well as validation, you need to make your own application object (an instance of some subclass of xmlproc.xmlproc.Application that appropriately overrides some or all of its various methods, most typically handle_start_tag, handle_end_tag, handle_data, and doc_end) and pass the application object as the app argument to the validate_xml_file function.
If you need to handle errors and warnings differently from the emitting of copious error messages that xmlproc.utils.ErrorPrinter performs, you need to subclass (either that class or its base xmlproc.xmlapp.ErrorHandler directly) to perform whatever tweaking you need. (See the sources of the utils.py module for examples; that module will usually be at relative path _xmlplus/parsers/xmlproc/utils.py in your Python library directory, after you have installed the PyXML package.) Then, you need to alter the call to the method set_error_handler that you see in this recipe’s validate_xml_file function so that it uses an instance of your own error-handling class. You might modify the validate_xml_file function to take yet another optional parameter err=None for the purpose, but this way overgeneralization lies. I’ve found ErrorHandler’s diagnostics normally cover my applications’ needs, so, in the code shown in this recipe’s Solution, I have not provided for this specific customization.
See the PyXML web site at http://pyxml.sourceforge.net/.
Credit: A.M. Kuchling
While parsing an XML document with SAX, you need to filter out all of the elements and attributes that belong to a particular namespace.
The SAX filter concept is just what we need here:
from xml import sax
from xml.sax import handler, saxutils, xmlreader
# the namespace we want to remove in our filter
RDF_NS = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'
class RDFFilter(saxutils.XMLFilterBase):
    def __init__(self, *args):
        saxutils.XMLFilterBase.__init__(self, *args)
        # initially, we're not in RDF, and just one stack level is needed
        self.in_rdf_stack = [False]
    def startElementNS(self, (uri, localname), qname, attrs):
        if uri == RDF_NS or self.in_rdf_stack[-1] == True:
            # skip elements with namespace, if that namespace is RDF or
            # the element is nested in an RDF one -- and grow the stack
            self.in_rdf_stack.append(True)
            return
        # Make a dict of attributes that DON'T belong to the RDF namespace
        # (separate names here, so the element's own uri and localname
        # are not overwritten before we pass them on)
        keep_attrs = {}
        for key, value in attrs.items():
            attr_uri, attr_localname = key
            if attr_uri != RDF_NS:
                keep_attrs[key] = value
        # prepare the cleaned-up bunch of non-RDF-namespace attributes
        attrs = xmlreader.AttributesNSImpl(keep_attrs, attrs.getQNames())
        # grow the stack by replicating the latest entry
        self.in_rdf_stack.append(self.in_rdf_stack[-1])
        # finally delegate the rest of the operation to our base class
        saxutils.XMLFilterBase.startElementNS(self,
                 (uri, localname), qname, attrs)
    def characters(self, content):
        # skip characters that are inside an RDF-namespaced tag being skipped
        if self.in_rdf_stack[-1]:
            return
        # delegate the rest of the operation to our base class
        saxutils.XMLFilterBase.characters(self, content)
    def endElementNS(self, (uri, localname), qname):
        # pop the stack -- nothing else to be done, if we were skipping
        if self.in_rdf_stack.pop() == True:
            return
        # delegate the rest of the operation to our base class
        saxutils.XMLFilterBase.endElementNS(self, (uri, localname), qname)
def filter_rdf(input, output):
    """ filter_rdf(input=some_input_filename, output=some_output_filename)
        Parses the XML input from the input stream, filtering out all
        elements and attributes that are in the RDF namespace.
    """
    output_gen = saxutils.XMLGenerator(output)
    parser = sax.make_parser()
    filter = RDFFilter(parser)
    filter.setFeature(handler.feature_namespaces, True)
    filter.setContentHandler(output_gen)
    filter.setErrorHandler(handler.ErrorHandler())
    filter.parse(input)
if __name__ == '__main__':
    import StringIO, sys
    TEST_RDF = '''<?xml version="1.0"?>
<metadata xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
          xmlns:dc="http://purl.org/dc/elements/1.1/">
   <title> This is non-RDF content </title>
   <rdf:RDF>
     <rdf:Description rdf:about="%s">
       <dc:Creator>%s</dc:Creator>
     </rdf:Description>
   </rdf:RDF>
   <element />
</metadata>
'''
    input = StringIO.StringIO(TEST_RDF)
    filter_rdf(input, sys.stdout)
This module, when run as a main script, emits something like:
    <?xml version="1.0" encoding="iso-8859-1"?>
    <metadata xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/">
    <title> This is non-RDF content </title>
    <element></element>
    </metadata>
My motivation for originally writing this recipe came from processing files of metadata, containing RDF mixed with other elements. I wanted to generate a version of the metadata with the RDF filtered out.
The filter_rdf function does the job, reading XML input from the input stream and writing it to the output stream. The standard XMLGenerator class in xml.sax.saxutils is used to produce the output. Function filter_rdf internally uses a filtering class called RDFFilter, also shown in this recipe’s Solution, pushing that filter on top of the XML parser to suppress elements and attributes belonging to the RDF_NS namespace.
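Non-RDF elements contained within an RDF element are also removed. To modify this behavior, change the first line of the startElementNS method to use just if uri == RDF_NS as the guard. Kept elements must then push False onto in_rdf_stack, rather than replicating its latest entry; otherwise the text and end tag of a non-RDF element nested in an RDF one would still be suppressed.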
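The variation just described can be sketched as follows. This is a minimal illustration rather than part of the recipe's Solution: the names ShallowRDFFilter, filter_rdf_shallow, and SAMPLE are hypothetical, and the stack here records only whether each open element is itself in the RDF namespace.

```python
import io
from xml import sax
from xml.sax import handler, saxutils, xmlreader

RDF_NS = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'

class ShallowRDFFilter(saxutils.XMLFilterBase):
    """Hypothetical variant: drop RDF elements, keep their non-RDF children."""

    def __init__(self, parent=None):
        saxutils.XMLFilterBase.__init__(self, parent)
        # one entry per open element: True if that element itself is RDF
        self.is_rdf_stack = [False]

    def startElementNS(self, name, qname, attrs):
        uri, localname = name
        self.is_rdf_stack.append(uri == RDF_NS)
        if uri == RDF_NS:
            return
        # keep only the attributes outside the RDF namespace
        keep_attrs = {}
        keep_qnames = {}
        for key, value in attrs.items():
            if key[0] != RDF_NS:
                keep_attrs[key] = value
                keep_qnames[key] = attrs.getQNameByName(key)
        attrs = xmlreader.AttributesNSImpl(keep_attrs, keep_qnames)
        saxutils.XMLFilterBase.startElementNS(self, name, qname, attrs)

    def characters(self, content):
        # drop only text whose immediate parent is an RDF element
        if not self.is_rdf_stack[-1]:
            saxutils.XMLFilterBase.characters(self, content)

    def endElementNS(self, name, qname):
        if not self.is_rdf_stack.pop():
            saxutils.XMLFilterBase.endElementNS(self, name, qname)

def filter_rdf_shallow(xml_bytes):
    """Run the filter over a bytes document and return the output text."""
    out = io.StringIO()
    flt = ShallowRDFFilter(sax.make_parser())
    flt.setFeature(handler.feature_namespaces, True)
    flt.setContentHandler(saxutils.XMLGenerator(out))
    flt.parse(io.BytesIO(xml_bytes))
    return out.getvalue()

SAMPLE = b'''<metadata xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
          xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:RDF><rdf:Description rdf:about="x"><dc:Creator>Ann</dc:Creator></rdf:Description></rdf:RDF>
</metadata>'''

result = filter_rdf_shallow(SAMPLE)
print(result)
```

With this variant, the dc:Creator element survives even though its RDF ancestors are dropped, which may or may not be what a given application wants.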
This code doesn’t delete the xmlns declaration for the RDF namespace; I’m willing to live with a little unnecessary but harmless cruft in the output rather than go to huge trouble to remove it.
Library Reference and Python in a Nutshell document the built-in XML support in the Python Standard Library.
Credit: Uche Ogbuji, James Kew, Peter Cogolo
A SAX parser can report contiguous text using multiple characters events (meaning, in practice, multiple calls to the characters method), and this multiplicity of events for a single text string may cause problems for SAX handlers. You want to insert a filter into the SAX handler chain to ensure that each text node in the document is reported as a single SAX characters event (meaning, in practice, that the characters method is called just once).
Module xml.sax.saxutils in the standard Python library includes a class XMLFilterBase that we can subclass to implement any XML filter we may need:
    from xml.sax.saxutils import XMLFilterBase

    class text_normalize_filter(XMLFilterBase):
        """ SAX filter to ensure that contiguous text nodes are merged into one
        """
        def __init__(self, upstream, downstream):
            XMLFilterBase.__init__(self, upstream)
            self._downstream = downstream
            self._accumulator = []

        def _complete_text_node(self):
            if self._accumulator:
                self._downstream.characters(''.join(self._accumulator))
                self._accumulator = []

        def characters(self, text):
            self._accumulator.append(text)

        def ignorableWhitespace(self, ws):
            self._accumulator.append(ws)

    def _wrap_complete(method_name):
        def method(self, *a, **k):
            self._complete_text_node()
            getattr(self._downstream, method_name)(*a, **k)
        # 2.4 only: method.__name__ = method_name
        setattr(text_normalize_filter, method_name, method)

    for n in '''startElement startElementNS endElement endElementNS
                processingInstruction comment'''.split():
        _wrap_complete(n)

    if __name__ == "__main__":
        import sys
        from xml import sax
        from xml.sax.saxutils import XMLGenerator
        parser = sax.make_parser()
        # XMLGenerator is a special predefined SAX handler that merely writes
        # SAX events back into an XML document
        downstream_handler = XMLGenerator()
        # upstream, the parser; downstream, the next handler in the chain
        filter_handler = text_normalize_filter(parser, downstream_handler)
        # The SAX filter base is designed so that the filter takes on much of the
        # interface of the parser itself, including the "parse" method
        filter_handler.parse(sys.argv[1])
A SAX parser can report contiguous text using multiple characters events (meaning, in practice, multiple calls to the characters method of the downstream handler). In other words, given an XML document whose content is 'abc', the text could technically be reported as up to three characters events: one for the 'a' character, one for the 'b', and a third for the 'c'. Such an extreme case of “fragmentation” of a text string into multiple events is unlikely in real life, but it is not impossible.
A typical reason that might cause a parser to report text nodes a bit at a time would be buffering of the XML input source. Most low-level parsers use a buffer of a certain number of characters that are read and parsed at a time. If a text node straddles such a buffer boundary, many parsers will just wrap up the current text event and start a new one to send characters from the next buffer. If you don’t account for this behavior in your SAX handlers, you may run into very obscure and hard-to-reproduce bugs. Even if the parser you usually use does combine text nodes for you, you never know when you may want to run your code in a situation where a different parser is selected. You’d need to write logic to accommodate the possibility, which can be rather cumbersome when mixed into typical SAX-style state machine logic.
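To see the effect of buffer boundaries without a large input file, one can feed a document to an incremental parser in small pieces. The sketch below is only an illustration: the class name ChunkRecorder is hypothetical, and how many characters events actually arrive depends on the parser and its version, so the code prints the count rather than assuming it.

```python
from xml import sax
from xml.sax.handler import ContentHandler

class ChunkRecorder(ContentHandler):
    """Records each characters() call, and joins text runs at the end tag."""
    def __init__(self):
        ContentHandler.__init__(self)
        self.events = []     # one entry per characters() call
        self.texts = []      # one entry per completed run of text
        self._pending = []

    def characters(self, content):
        self.events.append(content)
        self._pending.append(content)

    def endElement(self, name):
        # the kind of accumulate-and-flush logic a handler must carry
        # around when it cannot rely on a single characters event
        if self._pending:
            self.texts.append(''.join(self._pending))
            self._pending = []

recorder = ChunkRecorder()
parser = sax.make_parser()
parser.setContentHandler(recorder)
# feed the document in pieces, so that 'abc' straddles the chunk boundaries
for piece in (b'<doc>a', b'b', b'c</doc>'):
    parser.feed(piece)
parser.close()

print(len(recorder.events), 'characters event(s):', recorder.events)
print('joined text:', recorder.texts)
```

The accumulate-and-flush bookkeeping in ChunkRecorder is exactly the kind of logic that becomes cumbersome when it has to be repeated in every handler.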
The class text_normalize_filter presented in this recipe ensures that all text events are reported to downstream SAX handlers in the contiguous manner that most developers would expect. In this recipe’s example case, the filter would consolidate the three characters events into a single one for the entire text node 'abc'.
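The consolidation can be checked without relying on any particular parser's buffering, by driving the filter's event methods by hand. The following sketch uses a condensed copy of the filter (only the non-namespace element events are wrapped) and a hypothetical Recorder class standing in for the downstream handler.

```python
from xml.sax.handler import ContentHandler
from xml.sax.saxutils import XMLFilterBase

class text_normalize_filter(XMLFilterBase):
    """Condensed copy of the recipe's filter, so this sketch runs on its own."""
    def __init__(self, upstream, downstream):
        XMLFilterBase.__init__(self, upstream)
        self._downstream = downstream
        self._accumulator = []

    def _complete_text_node(self):
        if self._accumulator:
            self._downstream.characters(''.join(self._accumulator))
            self._accumulator = []

    def characters(self, text):
        self._accumulator.append(text)

    def startElement(self, name, attrs):
        self._complete_text_node()
        self._downstream.startElement(name, attrs)

    def endElement(self, name):
        self._complete_text_node()
        self._downstream.endElement(name)

class Recorder(ContentHandler):
    """Hypothetical downstream handler that logs the events it receives."""
    def __init__(self):
        ContentHandler.__init__(self)
        self.log = []

    def startElement(self, name, attrs):
        self.log.append(('start', name))

    def characters(self, content):
        self.log.append(('text', content))

    def endElement(self, name):
        self.log.append(('end', name))

recorder = Recorder()
flt = text_normalize_filter(None, recorder)   # no real parser is needed here
# simulate a parser that reports 'abc' one character at a time
flt.startElement('doc', {})
for ch in 'abc':
    flt.characters(ch)
flt.endElement('doc')
print(recorder.log)
```

The downstream handler sees a single text event carrying 'abc', placed between the start and end events for doc.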
For more information on SAX filters in general, see my article “Tip: SAX filters for flexible processing,” http://www-106.ibm.com/developerworks/xml/library/x-tipsaxflex.html.
Python’s XMLGenerator does not do anything with processing instructions, so, if you run the main code presented in this recipe on an XML document that uses them, you’ll have a gap in the output, along with other minor deviations between input and output. Comments are similar but worse, because XMLFilterBase does not even filter them; if you do need to get comments, your text_normalize_filter class must multiply inherit from xml.sax.saxlib.LexicalHandler, as well as from xml.sax.saxutils.XMLFilterBase, and it must override the parse method as follows:
    def parse(self, source):
        # force connection of self as the lexical handler
        # (property_lexical_handler is defined in module xml.sax.handler)
        self._parent.setProperty(property_lexical_handler, self)
        # Delegate to XMLFilterBase for the rest
        XMLFilterBase.parse(self, source)
This code is hairy enough, using the “internal” attribute self._parent, and the need to deal properly with XML comments is rare enough, to make this addition somewhat doubtful, which is why it is not part of this recipe’s Solution.
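For readers who do need comments, the underlying mechanism can be tried on its own, without any filter. This is a minimal sketch, assuming the default Expat-based parser returned by make_parser; the class name CommentCollector is hypothetical, and it spells out all of the lexical-handler callbacks itself rather than inheriting them.

```python
import io
from xml import sax
from xml.sax.handler import ContentHandler, property_lexical_handler

class CommentCollector(ContentHandler):
    """Hypothetical handler that also implements the lexical-handler callbacks."""
    def __init__(self):
        ContentHandler.__init__(self)
        self.comments = []

    # lexical events; only comment() does real work in this sketch
    def comment(self, content):
        self.comments.append(content)

    def startDTD(self, name, public_id, system_id):
        pass

    def endDTD(self):
        pass

    def startCDATA(self):
        pass

    def endCDATA(self):
        pass

collector = CommentCollector()
parser = sax.make_parser()
parser.setContentHandler(collector)
# register the same object as the parser's lexical handler
parser.setProperty(property_lexical_handler, collector)
parser.parse(io.BytesIO(b'<doc><!-- a note -->text</doc>'))
print(collector.comments)
```

Setting the property directly on the parser is what the parse override shown above does through self._parent.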
If you need ease of chaining to other filters, you may prefer not to take both upstream and downstream parameters in __init__. In this case, keep the same signature as XMLFilterBase.__init__:
    def __init__(self, parent):
        XMLFilterBase.__init__(self, parent)
        self._accumulator = []
and change the _wrap_complete factory function so that the wrapper, rather than calling methods on the downstream handler directly, delegates to the default implementations in XMLFilterBase, which in turn call out to handlers that have been set on the filter with such methods as setContentHandler and the like:
    def _wrap_complete(method_name):
        def method(self, *a, **k):
            self._complete_text_node()
            getattr(XMLFilterBase, method_name)(self, *a, **k)
        # 2.4 only: method.__name__ = method_name
        setattr(text_normalize_filter, method_name, method)
This is slightly less convenient for the typical simple case, but it pays back this inconvenience by letting you easily chain filters:
    parser = sax.make_parser()
    filtered_parser = text_normalize_filter(some_other_filter(parser))
as well as letting you use a filter in contexts that call the parse method on your behalf:
doc = xml.dom.minidom.parse(input_file, parser=filtered_parser)
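Putting the chainable variant together might look like the sketch below. Two details are assumptions rather than part of the recipe: _complete_text_node is taken to flush through XMLFilterBase.characters (this variant has no downstream attribute), and comment is left out of the wrapped names because XMLFilterBase has no such method to delegate to. TextRecorder is a hypothetical handler, and a plain XMLFilterBase stands in for some_other_filter.

```python
import io
from xml import sax
from xml.sax.handler import ContentHandler
from xml.sax.saxutils import XMLFilterBase

class text_normalize_filter(XMLFilterBase):
    """Chainable variant: same constructor signature as XMLFilterBase."""
    def __init__(self, parent=None):
        XMLFilterBase.__init__(self, parent)
        self._accumulator = []

    def _complete_text_node(self):
        if self._accumulator:
            # assumption: flush through the base class, which forwards to
            # whatever handler was set with setContentHandler
            XMLFilterBase.characters(self, ''.join(self._accumulator))
            self._accumulator = []

    def characters(self, text):
        self._accumulator.append(text)

    def ignorableWhitespace(self, ws):
        self._accumulator.append(ws)

def _wrap_complete(method_name):
    def method(self, *a, **k):
        self._complete_text_node()
        getattr(XMLFilterBase, method_name)(self, *a, **k)
    method.__name__ = method_name
    setattr(text_normalize_filter, method_name, method)

# 'comment' is omitted: XMLFilterBase defines no comment method
for n in '''startElement startElementNS endElement endElementNS
            processingInstruction'''.split():
    _wrap_complete(n)

class TextRecorder(ContentHandler):
    """Hypothetical handler that keeps every characters event it receives."""
    def __init__(self):
        ContentHandler.__init__(self)
        self.texts = []

    def characters(self, content):
        self.texts.append(content)

recorder = TextRecorder()
# two filters stacked on one parser; the inner one passes events through
filtered_parser = text_normalize_filter(XMLFilterBase(sax.make_parser()))
filtered_parser.setContentHandler(recorder)
filtered_parser.parse(io.BytesIO(b'<doc>abc<br/>def</doc>'))
print(recorder.texts)
```

Each run of text between two markup events reaches the recorder as one string, however the parser or the inner filter happened to split it.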
Library Reference and Python in a Nutshell document the built-in XML support in the Python Standard Library.
Credit: Bill Bell
Your Python application, running on Windows, needs to use the Microsoft MSHTML COM component, which is also the parser that Microsoft Internet Explorer uses to parse HTML and XML web pages.
As usual, PyWin32 lets our Python code access COM quite simply:
    from win32com.client import Dispatch
    html = Dispatch('htmlfile')  # the disguise for MSHTML as a COM server
    html.writeln(
        "<html><head><title>A title</title>"
        "<meta name='a name' content='page description'></head>"
        "<body>This is some of it. <span>And this is the rest.</span>"
        "</body></html>")
    print("Title: %s" % (html.title,))
    print("Bag of words from body of the page: %s" % (html.body.innerText,))
    print("URL associated with the page: %s" % (html.url,))
    print("Display of name:content pairs from the metatags: ")
    metas = html.getElementsByTagName("meta")
    for m in range(metas.length):
        print("    %s: %s" % (metas[m].name, metas[m].content,))
While Python offers many ways to parse HTML or XML, as long as you’re running your programs only on Windows, MSHTML is very speedy and simple to use. As the recipe shows, you can simply use the writeln method of the COM object to feed the page into MSHTML and then you can use the methods and properties of the components to get at all kinds of aspects of the page’s DOM. Of course, you can get the string of markup and text to feed into MSHTML in any way that suits your application, such as by using the Python Standard Library module urllib if you’re getting a page from some URL.
Since the DOM that MSHTML makes available is quite rich and complicated, I suggest you experiment with it in the PythonWin interactive environment that comes with PyWin32. The strength of PythonWin for such exploratory tasks is that it displays all of the properties and methods made available by each interface.
A detailed reference to MSHTML, albeit oriented to Visual Basic and C# users, can be found at http://www.xaml.net/articles/type.asp?o=MSHTML.