Chapter 17. XPath primer

Friendly Tutorial

  • Location paths

  • Addressing multiple objects

  • Children and descendants

  • Attributes

  • Predicates

XPath is a notation for addressing information within a document. That information could be:

  • An “executive summary” of a longer document.

  • A glossary of terms whose definitions are scattered throughout a manual.

  • The specific sequence of steps, buried in a large reference work, needed to solve a particular problem.

  • The customized subset of information that a particular customer subscribes to.

  • All the sections and subsections of a book that were written by a particular author or revised since a specific date.

  • For documents holding information from relational databases, all the typical queries made of relational databases: a particular patient’s medical records, the address of the customer with the most orders, the inventory items with low stock levels, and so on.

  • For documents that are containers for document collections, all the typical queries made in a library catalog or on a website: articles about Abyssinian cats, essays on the proper study of mankind, etc.

A programmer working with an XML-aware programming or scripting language could write code to search the document for the information that meets the specified criteria. The purpose of XPath is to automate this searching so that a non-programming user can address the information just by writing an expression that contains the criteria.

Location paths

In order to retrieve something, you need to know where to find it – in other words, its address.

An address doesn’t have to be an absolute location; you can address things relatively (“two doors down from 29 Jones St.”). It doesn’t have to be an explicit location at all: you can address things by name (“Lance”) or description (“world’s greatest athlete”).

You can address several things at once (“Monty and Westy”), even if you don’t know exactly what they are or whether they even exist (“inexpensive French restaurant downtown”).

All those forms of address can be used to locate things in XML documents, by means of an XPath expression.

The most important form of XPath expression is called a location path. You may already be familiar with location paths because they are used to address your computer’s files by specifying the path from the file system’s root to a specific subdirectory. For example, the path /home/bob/xml/samples identifies a particular one of the four samples subdirectories shown in Figure 17-1.

Figure 17-1. File system directory structure

Addressing multiple objects

A location path is capable of addressing multiple objects. For example, the expression in Example 17-1 addresses all caption elements within figure elements that are within chapter elements within book elements.

Example 17-1. Location path

/book/chapter/figure/caption

In the book whose structure is shown in Figure 17-2, the expression in Example 17-1 would address the first two caption elements, because they are children of figure elements. It would not address the third, which is the child of an example element.

Figure 17-2. Document structure

Example 17-2 shows the XML representation of the book.

Example 17-2. A short book in XML

<?xml version="1.0"?>
<book>
  <chapter>
    <par author="bd">First paragraph. <emph>Really.</emph></par>
    <par author="cg">Second paragraph.</par>
    <figure picfile="one.jpg">
      <caption>The first figure's caption</caption>
    </figure>
    <par>Third paragraph.</par>
    <figure picfile="two.jpg">
      <caption>The second figure's caption.</caption>
    </figure>
  </chapter>
  <chapter>
    <par author="pp">Chapter 2, first paragraph.</par>
    <example>
      <caption>The first example.</caption>
    </example>
    <par>Chapter 2, second paragraph.</par>
    <par author="bd">Chapter 2, third paragraph.</par>
  </chapter>
</book>
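
One way to try the location path from Example 17-1 against this document is Python's standard xml.etree.ElementTree module. What follows is a minimal sketch, not part of the original text. ElementTree implements only a subset of XPath and evaluates paths relative to an element, so the absolute path /book/chapter/figure/caption is written here as chapter/figure/caption, starting from the book root element. The names BOOK_XML and figure_captions are invented for illustration.

```python
import xml.etree.ElementTree as ET

# The book from Example 17-2.
BOOK_XML = """<?xml version="1.0"?>
<book>
  <chapter>
    <par author="bd">First paragraph. <emph>Really.</emph></par>
    <par author="cg">Second paragraph.</par>
    <figure picfile="one.jpg">
      <caption>The first figure's caption</caption>
    </figure>
    <par>Third paragraph.</par>
    <figure picfile="two.jpg">
      <caption>The second figure's caption.</caption>
    </figure>
  </chapter>
  <chapter>
    <par author="pp">Chapter 2, first paragraph.</par>
    <example>
      <caption>The first example.</caption>
    </example>
    <par>Chapter 2, second paragraph.</par>
    <par author="bd">Chapter 2, third paragraph.</par>
  </chapter>
</book>"""


def figure_captions(xml_text):
    """Return the text of each caption reached via book/chapter/figure/caption."""
    book = ET.fromstring(xml_text)  # the root element, <book>
    # Paths are relative to the element they are called on, so the
    # leading /book step is implied by starting at the root.
    return [caption.text for caption in book.findall("chapter/figure/caption")]


# The example element's caption is not included: it is not a child of a figure.
print(figure_captions(BOOK_XML))
```

The list contains the two figure captions only, matching the description above.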

Children and descendants

The /book/chapter/figure/caption expression addresses two elements with no children other than data. The expression /book/chapter/figure, however, addresses the figure elements with their caption children, as shown in Example 17-3.

Example 17-3. Figure elements with their caption children

<figure picfile="one.jpg">
  <caption>The first figure's caption</caption>
</figure>
<figure picfile="two.jpg">
  <caption>The second figure's caption.</caption>
</figure>
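
The difference between addressing a caption and addressing its parent figure can be checked with a short Python sketch using the standard xml.etree.ElementTree module. This is an illustration under assumptions: ElementTree supports only part of XPath and works relative to the root element, so /book/chapter/figure becomes chapter/figure. BOOK_XML and figures_with_children are hypothetical names.

```python
import xml.etree.ElementTree as ET

# The book from Example 17-2.
BOOK_XML = """<?xml version="1.0"?>
<book>
  <chapter>
    <par author="bd">First paragraph. <emph>Really.</emph></par>
    <par author="cg">Second paragraph.</par>
    <figure picfile="one.jpg">
      <caption>The first figure's caption</caption>
    </figure>
    <par>Third paragraph.</par>
    <figure picfile="two.jpg">
      <caption>The second figure's caption.</caption>
    </figure>
  </chapter>
  <chapter>
    <par author="pp">Chapter 2, first paragraph.</par>
    <example>
      <caption>The first example.</caption>
    </example>
    <par>Chapter 2, second paragraph.</par>
    <par author="bd">Chapter 2, third paragraph.</par>
  </chapter>
</book>"""


def figures_with_children(xml_text):
    """Return (picfile, child tag names) for each figure under book/chapter."""
    book = ET.fromstring(xml_text)
    return [
        (figure.get("picfile"), [child.tag for child in figure])
        for figure in book.findall("chapter/figure")
    ]


# Each addressed figure carries its caption child along with it.
print(figures_with_children(BOOK_XML))
```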

The location path /book addresses the book element itself, which contains the rest of the document.

In a location path, the slash character (/) means “child of.” A double slash (//) means “descendant of,” which is more flexible because it includes children, grandchildren, great-grandchildren, and so on. For example, /book//caption addresses any caption element descended from a book element. In the book shown in Figure 17-2 it would address the example element’s caption from the book’s second chapter along with the figure elements’ two caption elements:

Example 17-4. Caption elements descended from the book element

<caption>The first figure's caption</caption>
<caption>The second figure's caption.</caption>
<caption>The first example.</caption>
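
The descendant step can be tried in the same way. The sketch below assumes Python's standard xml.etree.ElementTree module, whose limited XPath support writes "any descendant of the current element" as .// and so expresses /book//caption as .//caption from the book root. BOOK_XML and all_captions are illustrative names, not part of the original example.

```python
import xml.etree.ElementTree as ET

# The book from Example 17-2.
BOOK_XML = """<?xml version="1.0"?>
<book>
  <chapter>
    <par author="bd">First paragraph. <emph>Really.</emph></par>
    <par author="cg">Second paragraph.</par>
    <figure picfile="one.jpg">
      <caption>The first figure's caption</caption>
    </figure>
    <par>Third paragraph.</par>
    <figure picfile="two.jpg">
      <caption>The second figure's caption.</caption>
    </figure>
  </chapter>
  <chapter>
    <par author="pp">Chapter 2, first paragraph.</par>
    <example>
      <caption>The first example.</caption>
    </example>
    <par>Chapter 2, second paragraph.</par>
    <par author="bd">Chapter 2, third paragraph.</par>
  </chapter>
</book>"""


def all_captions(xml_text):
    """Return the text of every caption descended from the book element."""
    book = ET.fromstring(xml_text)
    # ".//" searches all levels below the root, not just its children.
    return [caption.text for caption in book.findall(".//caption")]


# Three captions, in document order, including the one inside <example>.
print(all_captions(BOOK_XML))
```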

Attributes

An address in XML document navigation is not a storage address like a file system path, despite the similarity in syntax. An XPath expression locates objects by their position in a document’s structure and other properties, such as the values of attributes.

A diagram like Figure 17-2 should not present an attribute’s information as a child of the element exhibiting the attribute. To do so would be incorrect, because attributes are not siblings of subelements. For this reason, XPath uses /@ to show the element/attribute relationship.

For example, the expression in Example 17-5 addresses all the values of the par elements’ author attributes.

Example 17-5. Expression with an attribute

/book//par/@author

Example 17-6. Objects addressed by Example 17-5

bd
cg
pp
bd
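
The attribute values in Example 17-6 can be reproduced with a small Python sketch. It rests on an assumption worth stating: the standard xml.etree.ElementTree module cannot select attributes directly with a /@author step, so the sketch first selects the par elements that have an author attribute, using the predicate [@author], and then reads each value. BOOK_XML and par_authors are hypothetical names.

```python
import xml.etree.ElementTree as ET

# The book from Example 17-2.
BOOK_XML = """<?xml version="1.0"?>
<book>
  <chapter>
    <par author="bd">First paragraph. <emph>Really.</emph></par>
    <par author="cg">Second paragraph.</par>
    <figure picfile="one.jpg">
      <caption>The first figure's caption</caption>
    </figure>
    <par>Third paragraph.</par>
    <figure picfile="two.jpg">
      <caption>The second figure's caption.</caption>
    </figure>
  </chapter>
  <chapter>
    <par author="pp">Chapter 2, first paragraph.</par>
    <example>
      <caption>The first example.</caption>
    </example>
    <par>Chapter 2, second paragraph.</par>
    <par author="bd">Chapter 2, third paragraph.</par>
  </chapter>
</book>"""


def par_authors(xml_text):
    """Return the author attribute value of every par that has one."""
    book = ET.fromstring(xml_text)
    # Stand-in for /book//par/@author: pick pars carrying the attribute,
    # then read the attribute from each selected element.
    return [par.get("author") for par in book.findall(".//par[@author]")]


# par elements without an author attribute contribute nothing.
print(par_authors(BOOK_XML))
```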

Predicates

A predicate is an expression that narrows the group of objects addressed by the expression that precedes it. A predicate expression is delimited by square brackets and is evaluated for each of those objects, yielding either true or false. If it is true for an object, that object remains among those addressed; if it is false, the object is removed.

For example, the expression in Example 17-7 addresses all chapters that have a figure element in them.

Example 17-7. Addressing chapters with a figure element

/book/chapter[figure]

The predicate expression figure is true for any chapter that has a figure child. If true, that chapter is included among the addressed objects. Note that the figure itself is not among the objects addressed (although it is contained within the addressed chapter object).
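
A brief Python sketch can confirm which chapters the predicate keeps. It assumes the standard xml.etree.ElementTree module, which supports the [figure] child-element predicate but evaluates paths relative to the book root, so /book/chapter[figure] is written chapter[figure]. The names BOOK_XML, all_chapters, with_figure, and matching_positions are invented for this illustration.

```python
import xml.etree.ElementTree as ET

# The book from Example 17-2.
BOOK_XML = """<?xml version="1.0"?>
<book>
  <chapter>
    <par author="bd">First paragraph. <emph>Really.</emph></par>
    <par author="cg">Second paragraph.</par>
    <figure picfile="one.jpg">
      <caption>The first figure's caption</caption>
    </figure>
    <par>Third paragraph.</par>
    <figure picfile="two.jpg">
      <caption>The second figure's caption.</caption>
    </figure>
  </chapter>
  <chapter>
    <par author="pp">Chapter 2, first paragraph.</par>
    <example>
      <caption>The first example.</caption>
    </example>
    <par>Chapter 2, second paragraph.</par>
    <par author="bd">Chapter 2, third paragraph.</par>
  </chapter>
</book>"""

book = ET.fromstring(BOOK_XML)
all_chapters = book.findall("chapter")         # both chapters
with_figure = book.findall("chapter[figure]")  # only chapters with a figure child

# 1-based positions of the chapters that survive the predicate.
matching_positions = [
    position
    for position, chapter in enumerate(all_chapters, start=1)
    if chapter in with_figure
]

print(matching_positions)
# The addressed objects are chapter elements, not the figures inside them.
print([element.tag for element in with_figure])
```

Only the first chapter is kept; the second has an example but no figure.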

A predicate expression can be a comparison. For example, the form .= compares an element’s content data (the period stands for the element being tested) to a specific character string. Example 17-8 uses this technique to address all the par elements that have “Second paragraph.” as their content data.

Example 17-8. Addressing par elements with specific content data

/book//par[.="Second paragraph."]

Example 17-9. Objects addressed by Example 17-8

<par author="cg">Second paragraph.</par>
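
The comparison predicate can also be tried from Python. This sketch assumes the standard xml.etree.ElementTree module in Python 3.7 or later, which accepts the [.='text'] form; as elsewhere, the path is written relative to the book root, so /book//par becomes .//par. BOOK_XML and pars_with_text are hypothetical names.

```python
import xml.etree.ElementTree as ET

# The book from Example 17-2.
BOOK_XML = """<?xml version="1.0"?>
<book>
  <chapter>
    <par author="bd">First paragraph. <emph>Really.</emph></par>
    <par author="cg">Second paragraph.</par>
    <figure picfile="one.jpg">
      <caption>The first figure's caption</caption>
    </figure>
    <par>Third paragraph.</par>
    <figure picfile="two.jpg">
      <caption>The second figure's caption.</caption>
    </figure>
  </chapter>
  <chapter>
    <par author="pp">Chapter 2, first paragraph.</par>
    <example>
      <caption>The first example.</caption>
    </example>
    <par>Chapter 2, second paragraph.</par>
    <par author="bd">Chapter 2, third paragraph.</par>
  </chapter>
</book>"""


def pars_with_text(xml_text, wanted):
    """Return (author, text) for each par whose content equals `wanted`."""
    book = ET.fromstring(xml_text)
    # The predicate compares the complete text content of each par.
    # `wanted` must not contain a single quote in this simple sketch.
    matches = book.findall(".//par[.='{}']".format(wanted))
    return [(par.get("author"), par.text) for par in matches]


# "Chapter 2, second paragraph." does not match: the whole content must be equal.
print(pars_with_text(BOOK_XML, "Second paragraph."))
```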

The XPath data model

We have been talking informally about the “objects” that XPath addresses, but what are they?

One thing they certainly are not is the unparsed text of an XML document. That’s because there are several alternative text strings that could mean the same thing, so addressing them would be ambiguous.

For example, an apostrophe in data and the predefined entity reference &apos; are equivalent to an XML parser. XPath addresses the result of parsing.
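
The equivalence is easy to observe with any XML parser. The following sketch uses Python's standard xml.etree.ElementTree module as one such parser; the variable names literal and escaped are chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Two different unparsed texts for the same caption.
literal = ET.fromstring("<caption>The first figure's caption</caption>")
escaped = ET.fromstring("<caption>The first figure&apos;s caption</caption>")

# After parsing, the character data is identical.
print(literal.text == escaped.text)
```

Both source strings yield the same parsed content, which is why an expression evaluated against the parsed result does not need to care which spelling the author used.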

Computer scientists call the structure in Figure 17-2 a tree, even though it seems to be growing upside down! In fact, they usually speak of it as a family tree, as we have been doing, with ancestors, children, descendants, siblings, and so on.

The objects in a computer science tree are called nodes, and nodes are the objects that XPath addresses. There are nodes that represent elements, textual data, attributes, and other things found by the parser.

Tip

An advanced tutorial on XPath can be found in Chapter 24, “XML Path Language (XPath)”, on page 498.
