The most commonly used type of XPath expression is the location path. A location path can be thought of as similar to a path for a file on a disk, but on steroids. Where a path for a filesystem contains only names of directories and a file, an XPath location path can specify much more. At each step along the path, it can perform selection based on complex tests of the nodes in a document, and the result may be several nodes. The tests, or predicates, for each step of the path can match based on element name, attribute presence or value, or textual content.
The full syntax of location paths is complex, but the specification is considerate enough to define abbreviated forms for the most commonly used tests; these are called abbreviated location paths. All of the location paths we describe in this chapter use the abbreviated syntax; for more information on the full syntax and selection capabilities of XPath, please refer to the specification.
Location paths are used within XSLT elements, but may also be used programmatically with an XPath API to return node sets from an XML document at runtime. The latter technique will come into greater focus as you read this chapter; the former is covered in Chapter 6.
Let’s start with an example document that represents data records. The records are all fairly similar, but of course the field values are different in each one. This is typical of the type of documents you might mine with XPath. In Example 5-1, we apply XPath expressions against an XML document representing starships from some popular science-fiction television series.
<?xml version="1.0" encoding="UTF-8"?>
<shiptypes name="United Federation of Planets">
  <ship name="USS Enterprise">
    <class>Sovereign</class>
    <captain>Jean-Luc Picard</captain>
    <registry-code>NCC-1701-E</registry-code>
  </ship>
  <ship name="USS Voyager">
    <class>Intrepid</class>
    <captain>Kathryn Janeway</captain>
    <registry-code>NCC-74656</registry-code>
  </ship>
  <ship name="USS Enterprise">
    <class>Galaxy</class>
    <captain>Jean-Luc Picard</captain>
    <registry-code>NCC-1701-D</registry-code>
  </ship>
  <ship name="USS Enterprise">
    <class>Constitution</class>
    <captain>James T. Kirk</captain>
    <registry-code>NCC-1701</registry-code>
  </ship>
  <ship name="USS Sao Paulo">
    <class>Defiant</class>
    <captain>Benjamin L. Sisko</captain>
    <registry-code>NCC-75633</registry-code>
  </ship>
</shiptypes>
The ships.xml file provides a good stretch of XML data to write paths against. Now you can write a small program to apply path expressions to the document and report on the nodes that are returned. In Example 5-2, we create a small script, xp.py, which invokes the xml.xpath.Evaluate function provided with 4Suite and more recent versions of PyXML.
"""
xp.py (requires xml doc on stdin)
"""
import sys
from xml.dom.ext.reader import PyExpat
from xml.xpath import Evaluate

path0 = "ship/captain"  # all captain elements

reader = PyExpat.Reader()
dom = reader.fromStream(sys.stdin)

captain_elements = Evaluate(path0, dom.documentElement)
for element in captain_elements:
    print "Element: ", element
To run this program, you need to supply the previously created ships.xml from Example 5-1 as input:
$ python xp.py < ships.xml
In Example 5-2, the path ship/captain is used to extract all captain elements from the ships.xml document. The result is a node list containing the following:
<captain>Jean-Luc Picard</captain>
<captain>Kathryn Janeway</captain>
<captain>Jean-Luc Picard</captain>
<captain>James T. Kirk</captain>
<captain>Benjamin L. Sisko</captain>
Of course, this is not a complete or standalone document, but rather a node list. These nodes are processed by the remaining code in the program:
captain_elements = Evaluate(path0, dom.documentElement)
for element in captain_elements:
    print "Element: ", element
The path ship/captain is a relative location path: unlike /shiptypes/ship/captain, it does not specify an exact location starting from the root of the document. The ship/captain expression returns captain elements that are children of a ship element, relative to the context node passed to Evaluate (here dom.documentElement, which is the shiptypes element).
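To make the role of the context node concrete, here is a minimal sketch. It assumes Python 3 and the standard library's xml.etree.ElementTree rather than the xml.xpath module used in this chapter; ElementTree implements only a subset of XPath, and the two-ship document is a trimmed copy of ships.xml inlined for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of ships.xml, inlined so the sketch is self-contained.
SHIPS_XML = """
<shiptypes name="United Federation of Planets">
  <ship name="USS Enterprise">
    <class>Sovereign</class>
    <captain>Jean-Luc Picard</captain>
  </ship>
  <ship name="USS Voyager">
    <class>Intrepid</class>
    <captain>Kathryn Janeway</captain>
  </ship>
</shiptypes>
"""

root = ET.fromstring(SHIPS_XML)  # the shiptypes element

# Relative to shiptypes, ship/captain reaches every captain element.
from_root = [c.text for c in root.findall("ship/captain")]

# Relative to shiptypes, a bare "captain" matches nothing,
# because captain elements are grandchildren, not children.
direct = root.findall("captain")

# Relative to a single ship, "captain" matches that ship's captain.
first_ship = root.find("ship")
from_ship = [c.text for c in first_ship.findall("captain")]

print(from_root)
print(direct)
print(from_ship)
```

The same path string gives different results depending on the element it is evaluated against, which is what makes it relative.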
You will often want to target text beneath an element. For example, you may want just the captain’s name rather than the element node. You can append the XPath text() node test to your expression:
path1 = "ship/captain/text()"
This addition to the path expression selects all text nodes that are children of each captain element. If you replace the original processing lines with the following code:
captainnodes = Evaluate(path1, dom.documentElement)
for captainnode in captainnodes:
    print "Starfleet Captain: ", captainnode.nodeValue
you see the following result:
$ python xp.py < ships.xml
Starfleet Captain: Jean-Luc Picard
Starfleet Captain: Kathryn Janeway
Starfleet Captain: Jean-Luc Picard
Starfleet Captain: James T. Kirk
Starfleet Captain: Benjamin L. Sisko
Often, when working with data, you become interested in the ordinal positions of elements within columns, rows, or arrays. XML is no different in this regard. XPath provides indexed elements with syntax similar to array indexes, but it is important to know that XPath indexes are one-based, while Python sequence indexes are zero-based. To target an element using an index, use brackets next to the element name:
path2 = "ship[2]/captain/text()"
In this case, ship[2] selects, for each parent that has ship children, the second of those ship elements; the rest of the path then selects the text nodes beneath that ship’s captain element. To see the output, change the processing code:
capnode = Evaluate(path2, dom.documentElement)
print "Captain of ship[2] is: ", capnode[0].nodeValue
Using path2, the output is:
$ python xp.py < ships.xml
Captain of ship[2] is: Kathryn Janeway
It is important not to allow the visual similarity between ship[2] and Python sequence indexing to confuse you; they are very different. The notation is actually shorthand for ship[position()=2], which indicates that the second ship child element of some other element will match. Consider the following XML fragment:
<fleet name="Atlantic">
  <ship id="id1"/>
  <lifeboat id="id2"/>
</fleet>
<fleet name="Pacific">
  <lifeboat id="id3"/>
  <ship id="id4"/>
  <ship id="id5"/>
</fleet>
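The XPath expression ship[2] matches only the ship element with an id attribute of id5. The Atlantic fleet has only one ship child, so nothing matches there; in the Pacific fleet, id4 is the second child element but only the first ship, which leaves id5 as the second ship child. This is not a trick, but it is an excellent reason to keep a copy of the XPath specification close by.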
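The difference between the positional predicate and Python indexing can be checked with a short sketch. It assumes Python 3 and the standard library's xml.etree.ElementTree (not the xml.xpath module used in this chapter), and it wraps the fragment in a hypothetical navy root element so that it parses as a single document.

```python
import xml.etree.ElementTree as ET

# The fragment from the text, wrapped in a hypothetical <navy> root
# element so that it forms one well-formed document.
FLEETS_XML = """
<navy>
  <fleet name="Atlantic">
    <ship id="id1"/>
    <lifeboat id="id2"/>
  </fleet>
  <fleet name="Pacific">
    <lifeboat id="id3"/>
    <ship id="id4"/>
    <ship id="id5"/>
  </fleet>
</navy>
"""

root = ET.fromstring(FLEETS_XML)

# XPath predicate: the second ship child within each fleet.
by_predicate = [s.get("id") for s in root.findall("fleet/ship[2]")]

# Python indexing: the second item of the combined result list
# (zero-based, so index 1).
all_ships = root.findall("fleet/ship")
by_python_index = all_ships[1].get("id")

print(by_predicate)
print(by_python_index)
```

The predicate is applied per parent before the results are combined, whereas the Python index is applied once to the combined list, so the two select different elements here.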
You may also want to query the text content beneath an element name. Say you have a structure of book chapters, each containing headings and paragraphs. You may want to search for text that appears underneath a certain heading. XPath provides a convenient way for you to check the character data of a text node that is the child of an element. If you are searching for a <ship> element with a <class> element beneath it whose text is Intrepid, you could use the following path:
path3 = 'ship[class="Intrepid"]'
This expression selects ship elements that have a child class element whose character data is Intrepid. You can further explore the returned node list with some processing code:
shipnodes = Evaluate(path3, dom.documentElement)
for shipnode in shipnodes:
    shipname = shipnode.getAttribute("name")
    captain = Evaluate("captain/text()", shipnode)
    print "------------ Intrepid Class Ship ------------"
    print "Name: ", shipname
    print "Captain: ", captain[0].nodeValue
In this code, we select all ship nodes that have a child class element indicating that they are Intrepid class ships. We can then process each node further to select ship names and captains, generating the following output:
$ python xp.py < ships.xml
------------ Intrepid Class Ship ------------
Name: USS Voyager
Captain: Kathryn Janeway
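The same child-value test can be sketched with the standard library. This assumes Python 3 and xml.etree.ElementTree in place of the chapter's xml.xpath module, with a trimmed copy of ships.xml inlined; the partial variable is only there to show that the predicate is an equality test.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of ships.xml, inlined so the sketch is self-contained.
SHIPS_XML = """
<shiptypes name="United Federation of Planets">
  <ship name="USS Enterprise">
    <class>Sovereign</class>
    <captain>Jean-Luc Picard</captain>
  </ship>
  <ship name="USS Voyager">
    <class>Intrepid</class>
    <captain>Kathryn Janeway</captain>
  </ship>
</shiptypes>
"""

root = ET.fromstring(SHIPS_XML)

# Keep only ship elements that have a class child whose text is "Intrepid".
intrepid = root.findall("ship[class='Intrepid']")

# Each match is a ship element, so further lookups are relative to it.
summary = [(s.get("name"), s.findtext("captain")) for s in intrepid]

# The comparison is an equality test, not a substring search.
partial = root.findall("ship[class='Intrep']")

print(summary)
print(partial)
```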
Instead of just checking that a child element contains the necessary information, as in path3, you can continue building the path expression to grab something specific beneath the matching element:
path4 = 'ship[class="Constitution"]/@name'
In this path, you drill down further. First, a ship element is selected only if its child class element contains the character data Constitution. The path is then extended to select the name attribute of that ship element (the @ symbol indicates that we’re interested in an attribute rather than a child element). Again, we change the processing code a little to use the new node list:
ship = Evaluate(path4, dom.documentElement)
print "Name of Constitution Class Ship: ", ship[0].nodeValue
The output follows:
$ python xp.py < ships.xml
Name of Constitution Class Ship: USS Enterprise
Of course, evaluating XML attributes and their contents involves a slightly different process than evaluating element names and text node character data. In XPath, the @ character is used to indicate an attribute. Brackets are also used to surround the node when it is being tested against character data. In order to test the character contents of an attribute, use a path such as the following:
path5 = 'ship[@name="USS Enterprise"]'
This expression selects all ship elements whose name attribute is exactly USS Enterprise. In your ships.xml file, there are three starships with that name, each with a slightly different registry code. You can mine the node list for more information:
ships = Evaluate(path5, dom.documentElement)
for shipnode in ships:
    registry = Evaluate("registry-code/text()", shipnode)
    captain = Evaluate("captain/text()", shipnode)
    print "Found Enterprise with registry: ", registry[0].nodeValue
    print "Captain: ", captain[0].nodeValue
These subsequent expressions are relative paths that select captain and registry-code text from the current element with each hop through the node list. Using the preceding code, the output appears as:
$ python xp.py < ships.xml
Found Enterprise with registry: NCC-1701-E
Captain: Jean-Luc Picard
Found Enterprise with registry: NCC-1701-D
Captain: Jean-Luc Picard
Found Enterprise with registry: NCC-1701
Captain: James T. Kirk
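As a cross-check of the attribute test, here is a minimal sketch that assumes Python 3 and the standard library's xml.etree.ElementTree instead of xml.xpath. The three-ship document is a trimmed copy of ships.xml, so only two of the Enterprises appear.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of ships.xml, inlined so the sketch is self-contained.
SHIPS_XML = """
<shiptypes name="United Federation of Planets">
  <ship name="USS Enterprise">
    <captain>Jean-Luc Picard</captain>
    <registry-code>NCC-1701-E</registry-code>
  </ship>
  <ship name="USS Voyager">
    <captain>Kathryn Janeway</captain>
    <registry-code>NCC-74656</registry-code>
  </ship>
  <ship name="USS Enterprise">
    <captain>James T. Kirk</captain>
    <registry-code>NCC-1701</registry-code>
  </ship>
</shiptypes>
"""

root = ET.fromstring(SHIPS_XML)

# Select ship elements whose name attribute equals "USS Enterprise".
enterprises = root.findall("ship[@name='USS Enterprise']")
registries = [s.findtext("registry-code") for s in enterprises]

# The attribute test compares the whole value, so a partial name fails.
no_match = root.findall("ship[@name='Enterprise']")

print(registries)
print(no_match)
```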
As with any ordered data set, you are usually interested in pulling one specific type of information out from the entire document. You may only be interested in the names of employees in a human resources database. Or you may have heavily nested data that you want to make sure you pull out with each occurrence of a given data type, regardless of its position in the document. With XPath, you can use the path expression // to indicate that all matching elements beneath the root should be selected:
path6 = "/shiptypes//captain"
This expression selects all captain elements beneath the root element, regardless of where they appear. Since you are working with elements, obtaining character data requires some of the work shown earlier, or a traversal of the node structure:
captains = Evaluate(path6, dom.documentElement)
for captain in captains:
    print "Captain: ", captain.firstChild.nodeValue
Running path6 generates the following output:
$ python xp.py < ships.xml
Captain: Jean-Luc Picard
Captain: Kathryn Janeway
Captain: Jean-Luc Picard
Captain: James T. Kirk
Captain: Benjamin L. Sisko
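In ships.xml every captain sits at the same depth, so the effect of // is easier to see on nested data. The sketch below assumes Python 3 and the standard library's xml.etree.ElementTree, where the descendant search is written .// relative to the element being searched; the squadron element is hypothetical and exists only to push one ship a level deeper.

```python
import xml.etree.ElementTree as ET

# A hypothetical variation of ships.xml in which one ship is nested
# inside an extra <squadron> element.
NESTED_XML = """
<shiptypes name="United Federation of Planets">
  <ship name="USS Voyager">
    <captain>Kathryn Janeway</captain>
  </ship>
  <squadron name="Hypothetical Squadron">
    <ship name="USS Sao Paulo">
      <captain>Benjamin L. Sisko</captain>
    </ship>
  </squadron>
</shiptypes>
"""

root = ET.fromstring(NESTED_XML)

# A fixed path only reaches captains at exactly that depth.
fixed_depth = [c.text for c in root.findall("ship/captain")]

# The descendant search finds captain elements at any depth.
any_depth = [c.text for c in root.findall(".//captain")]

print(fixed_depth)
print(any_depth)
```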
If you are familiar with filesystem paths on Windows or Unix, you may have seen the . and .. operators. The . operator indicates the current directory (or current element in XPath) while .. refers to the parent directory (or parent element in XPath). Using ships.xml, shown in Example 5-1, we can search for a specific ship’s name and then reference the parent element to see which organization the ship belongs to.
path7 = "ship[@name='USS Voyager']/../@name"
This expression searches for a ship element that has a name attribute of “USS Voyager.” The path then continues to select the name attribute of this ship element’s parent. In ships.xml, this is the name attribute of the shiptypes element. To generate output, change your processing code in xp.py:
org = Evaluate(path7, dom.documentElement)
print "USS Voyager is owned by", org[0].nodeValue
This time xp.py generates output attributing the Voyager to the United Federation of Planets:
$ python xp.py < ships.xml
USS Voyager is owned by United Federation of Planets
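The parent step can also be tried with the standard library. This sketch assumes Python 3 and xml.etree.ElementTree rather than xml.xpath; ElementTree's XPath subset cannot select an attribute node directly, so the path stops at the parent element and the attribute is read with get().

```python
import xml.etree.ElementTree as ET

# Trimmed copy of ships.xml, inlined so the sketch is self-contained.
SHIPS_XML = """
<shiptypes name="United Federation of Planets">
  <ship name="USS Enterprise">
    <captain>Jean-Luc Picard</captain>
  </ship>
  <ship name="USS Voyager">
    <captain>Kathryn Janeway</captain>
  </ship>
</shiptypes>
"""

root = ET.fromstring(SHIPS_XML)

# Find the Voyager, then step up to its parent element with "..".
parents = root.findall("ship[@name='USS Voyager']/..")

# Read the parent's name attribute through the element API.
owner = parents[0].get("name")

print("USS Voyager is owned by", owner)
```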