In this example, we look at how we can extract and use information from an XML document using SAX. The particular documents our script works with are simple news articles, but we’ll see how to work with elements, attributes, and textual content.
Some of the trade-offs of using SAX depend on what you’re trying to accomplish, and how the XML is structured. SAX treats XML as a continuous stream, firing events to your handler as they happen. Example 3-1 shows article.xml.
<?xml version="1.0"?>
<webArticle category="news" subcategory="technical">
  <header title="NASA Builds Warp Drive"
          length="3k"
          author="Joe Reporter"
          distribution="all"/>
  <body>Seattle, WA - Today an anonymous individual
announced that NASA has completed building a
Warp Drive and has parked a ship that uses the
drive in his back yard. This individual claims
that although he hasn't been contacted by NASA
concerning the parked space vessel, he assumes
that he will be launching it later this week to
mount an expedition to the Andromeda Galaxy.
  </body>
</webArticle>
Example 3-1 mixes several kinds of markup (nested elements, attributes, and text content), which makes it an interesting document to parse with SAX. We must understand how a document such as article.xml is structured before writing a handler for it, so the handler ends up tightly coupled to the document’s structure.
You can write the ArticleHandler class to a new file, handlers.py; we’ll keep adding new handlers to this file throughout the chapter. Keep it simple at first, just to see how SAX works:
from xml.sax.handler import ContentHandler

class ArticleHandler(ContentHandler):
    """
    A handler to deal with articles in XML
    """
    def startElement(self, name, attrs):
        # Called once for every opening tag the parser encounters.
        print("Start element:", name)
Now we need to create a script to instantiate the parser, assign the handler, and do the actual work.
No matter how complex your handler objects become, there is rarely much code involved in setting up the parser. The file art.py, shown in Example 3-2, uses only the ArticleHandler class just created and parses whatever arrives on the standard input stream.
#!/usr/bin/env python
# art.py
import sys

from xml.sax import make_parser
from handlers import ArticleHandler

ch = ArticleHandler()
saxparser = make_parser()

saxparser.setContentHandler(ch)
saxparser.parse(sys.stdin)
Once the script is saved, you can run it from the command line, using file redirection to populate standard input (this works on both Unix and Windows):
$> python art.py < article.xml
The output using the simple article handler appears as:
Start element: webArticle
Start element: header
Start element: body
The output reflects the simple rule in your ArticleHandler class, which just prints out the name of each tag it encounters. To really use the XML, you have to add more functionality to the handler class in the handlers.py file.
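Before extending the handler, it can help to see the other events SAX delivers besides element starts. The following is a minimal sketch, written for Python 3, that records start, end, and text events for a small in-memory document. The EventLogger class, the trace_events function, and the SAMPLE document are hypothetical names invented for this illustration; they are not part of handlers.py.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class EventLogger(ContentHandler):
    """Hypothetical handler that records every SAX event it receives."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        # Skip whitespace-only chunks to keep the trace short.
        if content.strip():
            self.events.append(("text", content.strip()))


def trace_events(xml_bytes):
    """Parse an in-memory document and return the recorded events."""
    logger = EventLogger()
    xml.sax.parseString(xml_bytes, logger)
    return logger.events


# Made-up, abbreviated stand-in for article.xml.
SAMPLE = b'<webArticle><header title="Demo"/><body>Hello</body></webArticle>'

if __name__ == "__main__":
    for kind, value in trace_events(SAMPLE):
        print(kind, value)
```

Note that the empty header element still produces both a start and an end event, which is why a handler can rely on endElement to know when it has left an element.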
XML allows information to be parsed for different purposes. If you create a news article in XML, one application can grab it and display it as HTML, while another can index it to a search database. It’s easy to imagine that a service might like to offer intelligent agents to scour Internet sources for news items, special offers, and other items of interest for you based on preferences that you set up. XML makes this process manageable, as opposed to the alternative of reliably parsing HTML for structured information, which is nearly impossible. HTML only communicates the appearance of a document and not its organizational structure. In HTML, two documents may look exactly alike in the browser, but use wildly different tags under the hood. Parsing the HTML for its information won’t work, unless of course the page designer had that goal in mind when setting out to create the page.
Your news agent is configured to go after technology stories, especially ones that relate to space travel. When it discovers such an article, it displays a message, the headline, and the first few words of the body text. You can add functionality to your handler class to support this.
Since SAX is stream-based, it’s sometimes necessary to set flags so that you can track when you’ve entered certain elements and when you haven’t. If you find that you’re setting too many different flags, you might consider using a DOM approach as opposed to SAX. SAX is perfect when doing bulk operations on a lengthy XML stream. However, if you are trying to pull a complex data structure out of the document, you may be better off using the DOM.
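For comparison, here is a rough sketch of the DOM alternative using the standard library’s xml.dom.minidom under Python 3. It loads the whole document into memory and then navigates to the pieces it wants, so no flags are needed. The summarize function and the abbreviated SAMPLE document are assumptions made for this illustration only.

```python
from xml.dom import minidom

# Made-up, abbreviated stand-in for article.xml.
SAMPLE = """<?xml version="1.0"?>
<webArticle category="news" subcategory="technical">
  <header title="NASA Builds Warp Drive" author="Joe Reporter"/>
  <body>Seattle, WA - Today an anonymous individual announced...</body>
</webArticle>"""


def summarize(xml_text):
    """Return (subcategory, title, body text) by walking the DOM tree."""
    doc = minidom.parseString(xml_text)
    root = doc.documentElement
    header = root.getElementsByTagName("header")[0]
    body = root.getElementsByTagName("body")[0]
    # Concatenate the text nodes that sit directly under <body>.
    text = "".join(
        node.data for node in body.childNodes
        if node.nodeType == node.TEXT_NODE
    )
    return (
        root.getAttribute("subcategory"),
        header.getAttribute("title"),
        text.strip(),
    )
```

The trade-off is memory: the DOM version holds the entire tree at once, while a SAX handler sees the document one event at a time.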
To keep our example simple, set a few flags as the events are propagated, and go after the desired information. In the startElement method, check whether you’re inside a news article and whether that article is technical. If both conditions hold, change a Boolean data member so that other methods start paying attention to the data they receive. Also set a property on the handler itself so that the main application knows the handler has found a technical article, as that was its assignment:
def startElement(self, name, attrs):
    if name == "webArticle":
        subcat = attrs.get("subcategory", "")
        if subcat.find("tech") > -1:
            self.inArticle = 1
            self.isMatch = 1

    elif self.inArticle:
        if name == "header":
            self.title = attrs.get("title", "")
        if name == "body":
            self.inBody = 1
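The attrs argument used above behaves much like a read-only mapping from attribute names to values. The sketch below, written for Python 3, illustrates this with a hypothetical AttributeDump handler that copies each element’s attributes into a dictionary; the priority attribute is made up to show what get returns when an attribute is absent.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class AttributeDump(ContentHandler):
    """Hypothetical handler that copies each element's attributes to a dict."""

    def __init__(self):
        super().__init__()
        self.seen = {}
        self.subcat = ""
        self.priority = ""

    def startElement(self, name, attrs):
        # items() yields (name, value) pairs for this element's attributes.
        self.seen[name] = dict(attrs.items())
        if name == "webArticle":
            # get() falls back to the supplied default when the
            # attribute is missing ("priority" is a made-up attribute).
            self.subcat = attrs.get("subcategory", "")
            self.priority = attrs.get("priority", "none")


handler = AttributeDump()
xml.sax.parseString(
    b'<webArticle category="news" subcategory="technical"><body/></webArticle>',
    handler,
)
```

Supplying a default to get, as the chapter’s handler does, means a document that lacks the attribute simply fails to match rather than raising an error.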
The last conditional test in startElement checks whether the parser has entered the body element of a relevant article. If so, the characters method knows to begin buffering data each time it is called:
def characters(self, characters):
    if self.inBody:
        if len(self.body) < 80:
            self.body += characters
        if len(self.body) > 80:
            self.body = self.body[:78] + "..."
            self.inBody = 0
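Buffering is needed because a SAX parser is free to deliver one run of text through several calls to characters. The following Python 3 sketch feeds a document to the parser in two pieces to mimic data arriving from a stream, and joins whatever chunks arrive. It is a variant for illustration, not the chapter’s handler: BodyCollector and summarize are hypothetical names, and summarize trims to exactly the limit rather than reproducing the slice used above.

```python
from xml.sax import make_parser
from xml.sax.handler import ContentHandler


class BodyCollector(ContentHandler):
    """Hypothetical handler: gathers every text chunk inside <body>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.chunks = []
        self.body = ""

    def startElement(self, name, attrs):
        if name == "body":
            self.in_body = True

    def characters(self, content):
        # May be called several times for a single run of text.
        if self.in_body:
            self.chunks.append(content)

    def endElement(self, name):
        if name == "body":
            self.in_body = False
            self.body = "".join(self.chunks)


def summarize(text, limit=80):
    """Trim text to at most `limit` characters, marking the cut with '...'."""
    if len(text) <= limit:
        return text
    return text[:limit - 3] + "..."


collector = BodyCollector()
parser = make_parser()
parser.setContentHandler(collector)
# Feed the document in two pieces, as if it arrived over a network.
parser.feed(b"<body>Seattle, WA - Today an ")
parser.feed(b"anonymous individual announced that NASA has "
            b"completed building a Warp Drive.</body>")
parser.close()
```

Collecting everything and trimming afterwards is simpler to reason about, while the chapter’s approach of cutting off early avoids holding a long body in memory.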
Finally, look for the close of the body tag to indicate to the characters method that it no longer needs to pay attention to character data:
def endElement(self, name):
    if name == "body":
        self.inBody = 0
Beyond implementing these three methods, the class is also modified to initialize data members, and to provide an isMatch data member to indicate to the main application whether this handler has found something worth keeping. The complete class (replacing the earlier class of the same name) is shown in Example 3-3.
from xml.sax.handler import ContentHandler

class ArticleHandler(ContentHandler):
    """
    A handler to deal with articles in XML
    """
    inArticle = 0
    inBody = 0
    isMatch = 0
    title = ""
    body = ""

    def startElement(self, name, attrs):
        if name == "webArticle":
            subcat = attrs.get("subcategory", "")
            if subcat.find("tech") > -1:
                self.inArticle = 1
                self.isMatch = 1

        elif self.inArticle:
            if name == "header":
                self.title = attrs.get("title", "")
            if name == "body":
                self.inBody = 1

    def characters(self, characters):
        if self.inBody:
            if len(self.body) < 80:
                self.body += characters
            if len(self.body) > 80:
                self.body = self.body[:78] + "..."
                self.inBody = 0

    def endElement(self, name):
        if name == "body":
            self.inBody = 0
Now that the handler has been modified to collect more information and determine if the article is interesting, we can add a little more code to art.py so that when an interesting article is found, it prints a report for the user and ignores everything else. To do this, we need only append this code to the end of art.py, which was originally shown in Example 3-2:
if ch.isMatch:
    print("News Item!")
    print("Title:", ch.title)
    print("Body:", ch.body)
With article.xml as input, you should see the following output:
$> python art.py < article.xml
News Item!
Title: NASA Builds Warp Drive
Body: Seattle, WA - Today an anonymous individual
announced that NASA has completed building a...
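A news agent would normally scan many articles rather than one. A handler that keeps its flags between parses would carry a match from one document into the next, so one option is to reset the state in startDocument, which the parser calls at the beginning of every parse. The Python 3 sketch below assumes that approach; ResettableArticleHandler, matching_titles, and the two sample documents are hypothetical and simplified (the body is not truncated here).

```python
import xml.sax
from xml.sax.handler import ContentHandler


class ResettableArticleHandler(ContentHandler):
    """Sketch of an article handler whose state is reset per document."""

    def startDocument(self):
        # Runs at the start of each parse, so nothing leaks between documents.
        self.inArticle = False
        self.inBody = False
        self.isMatch = False
        self.title = ""
        self.body = ""

    def startElement(self, name, attrs):
        if name == "webArticle":
            if "tech" in attrs.get("subcategory", ""):
                self.inArticle = True
                self.isMatch = True
        elif self.inArticle:
            if name == "header":
                self.title = attrs.get("title", "")
            elif name == "body":
                self.inBody = True

    def characters(self, content):
        if self.inBody:
            self.body += content

    def endElement(self, name):
        if name == "body":
            self.inBody = False


def matching_titles(documents):
    """Return the titles of the technical articles among `documents`."""
    handler = ResettableArticleHandler()
    titles = []
    for doc in documents:
        xml.sax.parseString(doc, handler)
        if handler.isMatch:
            titles.append(handler.title)
    return titles


# Made-up sample documents: one technical article, one not.
DOCS = [
    b'<webArticle subcategory="technical">'
    b'<header title="Warp Drive"/><body>x</body></webArticle>',
    b'<webArticle subcategory="sports">'
    b'<header title="Big Game"/><body>y</body></webArticle>',
]
```

Calling matching_titles(DOCS) reuses one handler for both documents and reports only the technical article.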