CHAPTER 14

XML

Extensible Markup Language (XML) is a very powerful tool for data storage and transfer. When a document is written in XML, it can be universally understood and exchanged. The utility of XML being a worldwide standard should not be underestimated. XML is used for modern word processor documents, SOAP and REST web services, RSS feeds, and XHTML documents.

In this chapter, we will primarily cover the PHP SimpleXML extension, which makes it very easy to manipulate XML documents. We will also touch on the Document Object Model (DOM) and XMLReader extensions. The DOM guarantees that the document is viewed the same no matter what computer language is using it. The main reasons that more than one library for parsing and writing XML exists are ease of use, depth of functionality, and the manner in which the XML is manipulated.

XML Primer

XML allows us to define documents that use any tag elements or attributes we want. When viewing an XML document in a text editor, you may notice that it resembles HTML. This is because, like HTML (HyperText Markup Language), XML is a markup language, containing a collection of tagged content in a hierarchical structure. The hierarchical structure is tree-like, having a single root element (tag) acting as the trunk, child elements branching off of the root, and further descendants branching off of their parent elements. You can also view an XML document in order as a series of discrete events. Notice that viewing the elements in order requires no knowledge of what the entire document represents, but also makes searching for elements more difficult.

A specific example of an XML “application” is XHTML. XHTML is similar to HTML in that the same tags are used. However, XHTML also adheres to XML standards and so is stricter. XHTML has the following additional requirements:

  • Tags are case sensitive. In XHTML, element names must always be lowercase.
  • Single elements, such as <br>, need to be closed off. In this case, we would use <br />.
  • The characters &, <, >, ', and " need to be escaped as &amp;, &lt;, &gt;, &apos;, and &quot; respectively.
  • Attributes need enclosing quotations. For example, <img src=dog.jpg /> is illegal while <img src="dog.jpg" /> is legal.

To parse XML, we can use a tree-based or event-driven model. Tree-based models like those used in SimpleXML and the DOM represent HTML and XML documents as a tree of elements and load the entire document into memory. Each element, except the root, has a parent element. Elements may contain attributes and values. Event-based models such as the Simple API for XML (SAX) read only part of the XML document at a time. For large documents, SAX is faster; for extremely large documents, it can be the only viable option. However, tree-based models are usually easier and more intuitive to work with, and some tasks require that the document be loaded all at once.
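The difference between the two models can be sketched in a few lines. The following is a minimal illustration, not a benchmark. It reads the same small document once with SimpleXML (tree-based) and once with XMLReader, which is a streaming pull parser rather than SAX proper but likewise looks at only one node at a time. The sample document and the variable names are our own.

```php
<?php
$xml = '<animal><type id="9">dog</type><name>snoopy</name></animal>';

// Tree-based: the whole document is parsed into memory, then navigated.
$tree = simplexml_load_string($xml);
$name_from_tree = (string) $tree->name;

// Stream-based: the reader advances node by node and never builds a full tree.
$reader = new XMLReader();
$reader->XML($xml);
$element_names = array();
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT) {
        $element_names[] = $reader->name;
    }
}
$reader->close();

print $name_from_tree . "<br/>";               // snoopy
print implode(", ", $element_names) . "<br/>"; // animal, type, name
```

With the tree we can jump straight to <name>; with the reader we only ever see the current node, which is what keeps memory use low on very large files.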

A basic XML document can look like the following:

<animal>
    <type id="9">dog</type>
    <name>snoopy</name>
</animal>

The root element is <animal> and there are two child elements, <type> and <name>. The value of the <type> element is “dog” and the value of the <name> element is “snoopy.” The <type> element has one attribute, id, which has a value of “9.” Furthermore, each opening tag has a matching closing tag and the attribute value is enclosed in quotations.
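Looking ahead to the SimpleXML extension covered later in this chapter, the short sketch below shows how each of the pieces just described (root element, element values, and attribute) could be read from PHP. The variable names are our own.

```php
<?php
$xml = <<<THE_XML
<animal>
    <type id="9">dog</type>
    <name>snoopy</name>
</animal>
THE_XML;

$animal = simplexml_load_string($xml);

$root_name = $animal->getName();           // name of the root element
$type      = (string) $animal->type;       // value of the <type> element
$type_id   = (string) $animal->type['id']; // value of the id attribute
$name      = (string) $animal->name;       // value of the <name> element

print "$root_name: type=$type (id $type_id), name=$name";
// animal: type=dog (id 9), name=snoopy
```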

Schemas

An XML schema provides additional constraints on an XML document. Examples of constraints are specific elements that are optional or must be included, the acceptable values and attributes of an element, and where elements can be placed.

Without a schema, there is nothing preventing us from having nonsensical data like what you see in Listing 14-1.

Listing 14-1. Example Showing the Need for a Stricter Schema

<animals>
  <color>black</color>  
  <dog>
    <name>snoopy</name>
    <breed>
      <cat>
        <color>brown</color>
        <breed>tabby</breed>
      </cat>
      beagle cross
    </breed>
 </dog>
 <name>teddy</name>  
</animals>

This document does not make much sense to humans. A cat cannot be part of a dog. color and name are not animals and should be enclosed within a dog or cat element. However, from a machine perspective, this is a perfectly valid document. We have to tell the machine why this document is not acceptable. A schema lets us tell the machine how the data must be laid out. This added rigidity improves the integrity of the data in the document. With a schema we can explicitly say that <cat> tags cannot go inside of <dog> tags. We can also say that <name> and <color> tags can only go directly inside of <cat> or <dog> tags.

The three most popular schema languages are the Document Type Definition (DTD), XML Schema, and RELAX NG (REgular LAnguage for XML Next Generation). As this book is focused on PHP, we will not go over creating a schema, but simply mention that you declare the schema at the start of your document. See Listing 14-2.

Listing 14-2. Code Snippet Showing the Declaration of Using the xhtml1-transitional Schema

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
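Although we do not cover schema authoring, a small sketch may help show what a schema buys us. The DTD below is a hypothetical one written only for this illustration (it is not part of any standard). It is declared inline at the start of the document, and we check the document with the validate method of the DOM extension, which is introduced later in this chapter. The helper function name conformsToDTD is also our own.

```php
<?php
// Hypothetical DTD: <animals> may contain only <dog> or <cat> elements, and
// each of those must contain <name>, <color>, and <breed>, in that order.
$dtd = <<<THE_DTD
<!DOCTYPE animals [
  <!ELEMENT animals (dog | cat)*>
  <!ELEMENT dog (name, color, breed)>
  <!ELEMENT cat (name, color, breed)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT color (#PCDATA)>
  <!ELEMENT breed (#PCDATA)>
]>
THE_DTD;

$sensible = $dtd . "<animals><dog><name>snoopy</name><color>brown</color>"
          . "<breed>beagle cross</breed></dog></animals>";
// Like Listing 14-1, this puts <color> and <name> directly under <animals>.
$nonsense = $dtd . "<animals><color>black</color><name>teddy</name></animals>";

function conformsToDTD($xml) {
    $dom = new DOMDocument();
    $dom->loadXML($xml);
    return $dom->validate(); // true only if the document obeys its DTD
}

libxml_use_internal_errors(true); // keep validation warnings out of the output
var_dump(conformsToDTD($sensible)); // bool(true)
var_dump(conformsToDTD($nonsense)); // bool(false)
```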

SimpleXML

SimpleXML makes it easy to store XML as a PHP object and vice versa. SimpleXML simplifies traversal of the XML structure and finding specific elements. The SimpleXML extension requires PHP 5 or higher and is enabled by default.

Parsing XML from a String

Let us dive right in to our first example. We will load XML that is in a string into a SimpleXMLElement object and traverse the structure. See Listing 14-3.

Listing 14-3. First Example: animal.php

<?php

error_reporting(E_ALL ^ E_NOTICE);

$xml = <<<THE_XML
  <animal>
    <type>dog</type>
    <name>snoopy</name>
  </animal>     
THE_XML;

//to load the XML string into a SimpleXMLElement object takes one line
$xml_object = simplexml_load_string($xml);

foreach ($xml_object as $element => $value) {
    print $element . ": " . $value . "<br/>";
}

?>

After the XML string is loaded in Listing 14-3, $xml_object is at the root element, <animal>. The document is represented as a SimpleXMLElement object, so we can iterate through the child elements using a foreach loop. The output of Listing 14-3 is the following:

  type: dog
  name: snoopy

Listing 14-4. A More Complex Example: animals.php

<?php
error_reporting(E_ALL ^ E_NOTICE);

$xml = <<<THE_XML
<animals>
  <dog>
    <name>snoopy</name>
    <color>brown</color>
    <breed>beagle cross</breed>
  </dog>
  <cat>
    <name>teddy</name>
    <color>brown</color>
    <breed>tabby</breed>
  </cat>
  <dog>
    <name>jade</name>
    <color>black</color>
    <breed>lab cross</breed>
  </dog>
</animals>
THE_XML;

$xml_object = simplexml_load_string($xml);

//output all of the dog names
foreach($xml_object->dog as $dog){
   print $dog->name."<br/>";
}
?>

The output of Listing 14-4 is the following:

snoopy
jade

Most of Listing 14-4 uses PHP heredoc syntax to load a string in a readable fashion. The actual code needed to find the element values was only a few lines. Simple indeed. SimpleXML is smart enough to iterate over all the <dog> tags, even with a <cat> tag between <dog> tags.
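As a small variation on Listing 14-4, the sketch below counts the <dog> elements and then walks over every child of the root, whatever its tag name, using the children() and getName() methods. The shortened sample data is our own.

```php
<?php
$xml = <<<THE_XML
<animals>
  <dog><name>snoopy</name></dog>
  <cat><name>teddy</name></cat>
  <dog><name>jade</name></dog>
</animals>
THE_XML;

$xml_object = simplexml_load_string($xml);

// count() works on the list of matching <dog> elements
$dog_count = count($xml_object->dog);
print "dogs: $dog_count<br/>"; // dogs: 2

// children() visits every child in document order, regardless of tag name
$lines = array();
foreach ($xml_object->children() as $animal) {
    $lines[] = $animal->name . " (" . $animal->getName() . ")";
}
print implode("<br/>", $lines);
// snoopy (dog)
// teddy (cat)
// jade (dog)
```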

Parsing XML from a File

When you are loading XML, if the document is invalid, PHP will complain with a helpful warning message. The message could inform you that you need to close a tag or escape an entity, and will indicate the line number of the error. See Listing 14-5.

Listing 14-5. Sample PHP Warning Message for Invalid XML

Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 1: parser
error : attributes construct error in E:\xampp\htdocs\xml\animals.php on line 29
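If we would rather handle such problems in code than have warnings printed, one option is to ask libxml to collect the errors for us. The following is a minimal sketch with a deliberately broken document of our own. The exact wording of the messages comes from libxml and may differ between versions, so we do not show it here.

```php
<?php
// Deliberately malformed: the <name> element is never closed.
$broken_xml = '<animal><name>snoopy</animal>';

libxml_use_internal_errors(true); // collect errors instead of emitting warnings
$xml_object = simplexml_load_string($broken_xml);

$messages = array();
if ($xml_object === false) {
    // each entry is a LibXMLError object with line and message properties
    foreach (libxml_get_errors() as $error) {
        $messages[] = "line " . $error->line . ": " . trim($error->message);
    }
    libxml_clear_errors();
}

print implode("<br/>", $messages);
```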

Our next two examples will load XML from the file shown in Listing 14-6. Some of the XML elements have attributes. We will show in Listing 14-7 how to find attribute values naively, by using repeated SimpleXML function calls. Then in Listing 14-10, we will show how to find the attribute values by using XPath, which is meant to simplify searches.

Listing 14-6. Our Sample XHTML File: template.xhtml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
    <body>
        <div id="header">
            header would be here
        </div>
        <div id="menu">
            menu would be here
        </div>
        <div id="main_content">
            <div id="main_left">
                left sidebar
            </div>
            <div id="main_center" class="foobar">
                main story
            </div>
            <div id="main_right">
                right sidebar
            </div>
        </div>
        <div id="footer">
            footer would be here
        </div>
    </body>
</html>

The first two lines of Listing 14-6 define the version of XML used and the DOCTYPE and are not part of the tree loaded into the SimpleXMLElement. So the root is the <html> element.

Listing 14-7 shows how to find the content of the <div> with id="main_center" using object-oriented SimpleXML methods.

Listing 14-7. Finding a Specific Value Based on an Attribute

<?php
error_reporting(E_ALL ^ E_NOTICE);

$xml = simplexml_load_file("template.xhtml");
findDivContentsByID($xml, "main_center");

function findDivContentsByID($xml, $id) {
  foreach ($xml->body->div as $divs) {
      if (!empty($divs->div)) {
          foreach ($divs->div as $inner_divs) {
              if (isElementWithID($inner_divs, $id)) {
                  break 2;
              }
          }
      } else {
          if (isElementWithID($divs, $id)) {
              break;
          }
      }
  }
}

function isElementWithID($element, $id) {
    $actual_id = (String) $element->attributes()->id;
    if ($actual_id == $id) {
        $value = trim((String) $element);
        print "value of #$id is: $value";
        return true;
    }
    return false;
}
?>

Listing 14-7 will find all of the <div> elements of the <body> element and also direct child <div> elements of those <div> elements. Then each matching <div> element has its id attribute compared with our id search value, "main_center." If they are equal, then we print out the value and break from the loop. The output of this script is as follows:

value of #main_center is: main story

We cannot simply output $element in our isElementWithID function because we would output the entire SimpleXMLElement object.

object(SimpleXMLElement)[9]
  public '@attributes' =>
    array
      'id' => string 'main_center' (length=11)
      'class' => string 'foobar' (length=6)
  string '
                main story
            ' (length=40)

So we need to cast the return value from an Object into a String. (Recall that casting explicitly converts a variable from one data type into another). Notice also that whitespace is captured in the element value, so we may need to use the PHP trim() function on our string.

To get the attributes of an element, SimpleXML has the attributes() function which returns an object of attributes.

var_dump($element->attributes());
object(SimpleXMLElement)[9]
  public '@attributes' =>
    array
      'id' => string 'main_center' (length=11)
      'class' => string 'foobar' (length=6)

We also need to cast the return value of $element->attributes()->id, or we will again get an entire SimpleXMLElement object back.

Listing 14-7 is not robust. If the structure of the document changes or is deeper than two levels, it will fail to find the id.
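One way to remove the two-level limit without leaving plain SimpleXML is to recurse through the children. The sketch below is our own illustration (the function name findElementByID and the trimmed-down sample document are assumptions, not part of the earlier listings). It searches to any depth and returns the matching element instead of printing it.

```php
<?php
$xml = <<<THE_XML
<html>
  <body>
    <div id="main_content">
      <div id="main_left">
        <div id="deep">three levels down</div>
      </div>
      <div id="main_center" class="foobar">
        main story
      </div>
    </div>
  </body>
</html>
THE_XML;

// Returns the first element whose id attribute equals $id, or null if none.
function findElementByID(SimpleXMLElement $element, $id) {
    if ((string) $element['id'] === $id) {
        return $element;
    }
    foreach ($element->children() as $child) {
        $found = findElementByID($child, $id);
        if ($found !== null) {
            return $found;
        }
    }
    return null;
}

$root = simplexml_load_string($xml);
$deep = findElementByID($root, "deep");
$center = findElementByID($root, "main_center");

print trim((string) $deep) . "<br/>"; // three levels down
print trim((string) $center);         // main story
```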

You may recognize that XHTML documents follow the familiar Document Object Model, or DOM, of HTML. Existing parsers and traversal utilities like XPath and XQuery make finding nested elements relatively easy. XPath is part of both the SimpleXML library and the PHP DOM library. With SimpleXML, you invoke XPath through a function call, $simple_xml_object->xpath(). In the DOM library you use XPath by creating an object, DOMXPath and then calling the object’s query method.

We will show how to find a specific id attribute with XPath in Listing 14-10. First we will show how to find the elements we retrieved in Listings 14-3 and 14-4 using XPath. See Listing 14-8.

Listing 14-8. Finding an Element Using XPath

<?php

error_reporting(E_ALL);

$xml = <<<THE_XML
  <animal>
    <type>dog</type>
    <name>snoopy</name>
  </animal>     
THE_XML;

$xml_object = simplexml_load_string($xml);

$type = $xml_object->xpath("type");
foreach($type as $t) {
    echo $t."<br/><br/>";
}

$xml_object = simplexml_load_string($xml);
$children = $xml_object->xpath("/animal/*");
foreach($children as $element) {
    echo $element->getName().": ".$element."<br/>";
}
?>

The output of Listing 14-8 is:

dog

type: dog
name: snoopy

In the first part of Listing 14-8, we select the <type> inner element of <animal> using the XPath selector "type". This returns an array of SimpleXMLElement objects that match the XPath query. The second part of the listing selects all child elements of <animal> using the XPath selector "/animal/*", where the asterisk is a wildcard. As SimpleXMLElement objects are returned from the xpath() call, we can also output the element names by using the getName() method.

Note The complete specification covering XPath selectors can be viewed at www.w3.org/TR/xpath/.

Listing 14-9 shows how to match a specific child element regardless of the parent type. It also demonstrates how to find the parent element of a SimpleXMLElement.

Listing 14-9. Matching Children and Parents Using XPath

<?php

error_reporting(E_ALL ^ E_NOTICE);

$xml = <<<THE_XML
<animals>
  <dog>
    <name>snoopy</name>
    <color>brown</color>
    <breed>beagle cross</breed>
  </dog>
  <cat>
    <name>teddy</name>
    <color>brown</color>
    <breed>tabby</breed>
  </cat>
  <dog>
    <name>jade</name>
    <color>black</color>
    <breed>lab cross</breed>
  </dog>
</animals>
THE_XML;

$xml_object = simplexml_load_string($xml);

$names = $xml_object->xpath("*/name");
foreach ($names as $element) {
    $parent = $element->xpath("..");
    $type = $parent[0]->getName();
    echo "$element ($type)<br/>";
}
?>

The output of Listing 14-9 will be this:

snoopy (dog)
teddy (cat)
jade (dog)

We have matched the <name> element, regardless of whether it is contained in a <dog> or <cat> element, with the XPath query "*/name". To get the parent of our current SimpleXMLElement, we used the query "..". We could have instead used the query "parent::*".
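A quick sketch confirms that the abbreviated and the full parent syntax select the same element. The shortened sample data is our own.

```php
<?php
$xml_object = simplexml_load_string(
    '<animals><dog><name>snoopy</name></dog><cat><name>teddy</name></cat></animals>'
);

$names = $xml_object->xpath("*/name");
$pairs = array();
foreach ($names as $element) {
    $short = $element->xpath("..");        // abbreviated syntax
    $long  = $element->xpath("parent::*"); // full axis syntax
    $pairs[] = $short[0]->getName() . "=" . $long[0]->getName();
}

print implode(", ", $pairs); // dog=dog, cat=cat
```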

Listing 14-10. Matching an Attribute Value Using XPath

<?php

error_reporting(E_ALL);

$xml = simplexml_load_file("template.xhtml");
$content = $xml->xpath("//*[@id='main_center']");
print (String)$content[0];

?>

In Listing 14-10, we used the query "//*[@id='main_center']" to find the element with attribute id equal to 'main_center'. To match an attribute with XPath, we use the @ sign. Compare the simplicity of Listing 14-10, which uses XPath, with the nested loops of Listing 14-7.

Namespaces

XML namespaces define what collection an element belongs to, preventing data ambiguity. Such ambiguity can occur if you have distinct node types containing elements with the same name. For example, you could define different namespaces for cat and dog to ensure that their inner elements have unique names, as demonstrated in Listing 14-11 and Listing 14-12.

For information on PHP namespaces, refer to Chapter 5 - Cutting Edge PHP.

The first step in using an XML namespace is declaring one with xmlns:your_namespace:

<animals xmlns:dog='http://foobar.com/dog' xmlns:cat='http://foobar.com/cat'>

You then prefix the namespace to an element. When you want to retrieve dog names, you could search for dog:name which would filter out only the dog names.

Listing 14-11 shows how to work with namespaces in an XML document.

Listing 14-11. Failing to Find Content in a Document with Unregistered Namespaces Using XPath

<?php

error_reporting(E_ALL ^ E_NOTICE);

$xml = <<<THE_XML
<animals xmlns:dog="http://foobar.com/dog" xmlns:cat="http://foobar.com/cat" >
  <dog:name>snoopy</dog:name>
  <dog:color>brown</dog:color>
  <dog:breed>beagle cross</dog:breed>
  <cat:name>teddy</cat:name>
  <cat:color>brown</cat:color>
  <cat:breed>tabby</cat:breed>
  <dog:name>jade</dog:name>
  <dog:color>black</dog:color>
  <dog:breed>lab cross</dog:breed>
</animals>
THE_XML;

$xml_object = simplexml_load_string($xml);
$names = $xml_object->xpath("name");

foreach ($names as $name) {
    print $name . "<br/>";
}
?>

Running Listing 14-11, which contains namespaces, outputs nothing. When running XPath, we need to register our namespace. See Listing 14-12.

Listing 14-12. Finding Content in a Document with Registered Namespaces Using XPath

<?php

error_reporting(E_ALL ^ E_NOTICE);

$xml = <<<THE_XML
<animals xmlns:dog="http://foobar.com/dog" xmlns:cat="http://foobar.com/cat" >
  <dog:name>snoopy</dog:name>
  <dog:color>brown</dog:color>
  <dog:breed>beagle cross</dog:breed>
  <cat:name>teddy</cat:name>
  <cat:color>brown</cat:color>
  <cat:breed>tabby</cat:breed>
  <dog:name>jade</dog:name>
  <dog:color>black</dog:color>
  <dog:breed>lab cross</dog:breed>
</animals>
THE_XML;

$xml_object = simplexml_load_string($xml);

$xml_object->registerXPathNamespace('cat', 'http://foobar.com/cat');
$xml_object->registerXPathNamespace('dog', 'http://foobar.com/dog');
$names = $xml_object->xpath("dog:name");

foreach ($names as $name) {
    print $name . "<br/>";
}
?>

The output is as follows:

snoopy
jade

In Listing 14-12, after registering the namespace with XPath, we need to prefix it to our query elements.

In Listing 14-13, we will use XPath to match an element by value. Then we will read an attribute value of this element.

Listing 14-13. Finding an Attribute Value of an Element with a Certain Value Using XPath

<?php

error_reporting(E_ALL);

$xml = <<<THE_XML
<animals>
  <dog>
    <name id="1">snoopy</name>
    <color>brown</color>
    <breed>beagle cross</breed>
  </dog>
  <cat>
    <name id="2">teddy</name>
    <color>brown</color>
    <breed>tabby</breed>
  </cat>
  <dog>
    <name id="3">jade</name>
    <color>black</color>
    <breed>lab cross</breed>
  </dog>
</animals>
THE_XML;

$xml_object = simplexml_load_string($xml);

$result = $xml_object->xpath("dog/name[contains(., 'jade')]");
print (String)$result[0]->attributes()->id;

?>

In Listing 14-13, we use the XPath function contains, which takes two parameters: the first is where to search ('.' stands for the current node), and the second is the search string. This function has a (haystack, needle) parameter format. We then receive a matching SimpleXMLElement and output its id attribute.
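Because contains performs a substring match, it can return more elements than expected. The sketch below, using shortened sample data of our own, contrasts it with an exact comparison and with a match on the attribute itself.

```php
<?php
$xml_object = simplexml_load_string(
    '<animals>'
  . '<dog><name id="1">snoopy</name></dog>'
  . '<cat><name id="2">teddy</name></cat>'
  . '<dog><name id="3">jade</name></dog>'
  . '</animals>'
);

// Substring match: every <name> whose text contains the letter "d"
$with_d = array_map('strval', $xml_object->xpath("*/name[contains(., 'd')]"));

// Exact match on the element value
$exact = $xml_object->xpath("*/name[. = 'teddy']");

// Match on the attribute value instead of the element value
$by_id = $xml_object->xpath("*/name[@id = '3']");

print implode(", ", $with_d) . "<br/>"; // teddy, jade
print (string) $exact[0]['id'] . "<br/>"; // 2
print (string) $by_id[0];                 // jade
```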

XPath is very powerful, and anyone familiar with a high-level JavaScript library like jQuery already knows much of the syntax. Learning XPath and the DOM will save you a lot of time and make your scripts more dependable.

RSS

Really Simple Syndication (RSS) provides an easy method to publish and subscribe to content as a feed.

Any RSS feed will do, but take as an example the feed from the magazine Wired. The feed is available at http://feeds.wired.com/wired/index?format=xml. The source of the feed looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?>
<?xml-stylesheet type="text/css" media="screen" href="http://feeds.wired.com
/~d/styles/itemcontent.css"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:feedburner=
"http://rssnamespace.org/feedburner/ext/1.0" version="2.0">
  <channel>
    <title>Wired Top Stories</title>
    <link>http://www.wired.com/rss/index.xml</link>
    <description>Top Stories&lt;img src="http://www.wired.com/rss_views
/index.gif"&gt;</description>
    <language>en-us</language>
    <copyright>Copyright 2007 CondeNet Inc. All rights reserved.</copyright>
    <pubDate>Sun, 27 Feb 2011 16:07:00 GMT</pubDate>
    <category />
    <dc:creator>Wired.com</dc:creator>
    <dc:subject />
    <dc:date>2011-02-27T16:07:00Z</dc:date>
    <dc:language>en-us</dc:language>
    <dc:rights>Copyright 2007 CondeNet Inc. All rights reserved.</dc:rights>
    <atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self"
 type="application/rss+xml" href="http://feeds.wired.com/wired/index" /><feedburner:info
 uri="wired/index" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub"
 href="http://pubsubhubbub.appspot.com/" />

<item>
      <title>Peers Or Not? Comcast And Level 3 Slug It Out At FCC's Doorstep</title>
      <link>http://feeds.wired.com/~r/wired/index/~3/QJQ4vgGV4qM/</link>
      <description>the first description</description>
      <pubDate>Sun, 27 Feb 2011 16:07:00 GMT</pubDate>
      <guid isPermaLink="false">http://www.wired.com/epicenter/2011/02
/comcast-level-fcc/</guid>
      <dc:creator>Matthew Lasar</dc:creator>
      <dc:date>2011-02-27T16:07:00Z</dc:date>
    <feedburner:origLink>http://www.wired.com/epicenter/2011/02
/comcast-level-fcc/</feedburner:origLink></item>

    <item>
      <title>360 Cams, AutoCAD and Miles of Fiber: Building an Oscars Broadcast</title>
      <link>http://feeds.wired.com/~r/wired/index/~3/vFb527zZQ0U/</link>
      <description>the second description</description>
      <pubDate>Sun, 27 Feb 2011 00:19:00 GMT</pubDate>
      <guid isPermaLink="false">http://www.wired.com/underwire/2011/02
/oscars-broadcast/</guid>

      <dc:creator>Terrence Russell</dc:creator>
      <dc:date>2011-02-27T00:19:00Z</dc:date>
    <feedburner:origLink>http://www.wired.com/underwire/2011/02
/oscars-broadcast/</feedburner:origLink></item>



</channel>
</rss>

For brevity, the descriptions have been replaced. You can see that an RSS document is just XML. Many libraries exist for parsing content from an XML feed and we show how to parse this feed using SimplePie in Chapter 10 - Libraries. However, with your knowledge of XML, you can easily parse the content yourself.

Listing 14-14 is an example that builds a table with just the essentials from the feed. It has the article title which links to the full article, the creator of the document and the publish date. Notice that in the XML, the creator element is under a namespace so we retrieve it with XPath. The output is shown in Figure 14-1.

Listing 14-14. Parsing the Wired RSS Feed: wired_rss.php

<table>
    <tr><th>Story</th><th>Date</th><th>Creator</th></tr>
<?php
error_reporting(E_ALL);
$xml = simplexml_load_file("http://feeds.wired.com/wired/index?format=xml");

foreach($xml->channel->item as $item){
    print "<tr><td><a href='".$item->link."'>".$item->title."</a></td>";
    print "<td>".$item->pubDate."</td>";
    $creator_by_xpath = $item->xpath("dc:creator");
    print "<td>".(String)$creator_by_xpath[0]."</td></tr>";

    //equivalent creator, using children function instead of xpath function
    //$creator_by_namespace = $item->children('http://purl.org/dc/elements/1.1/')->creator;
    //print "<td>".(String)$creator_by_namespace[0]."</td></tr>";
}
?>
</table>

Figure 14-1. Output of our RSS feed parser from Listing 14-14

In Listing 14-14, we used XPath to get the creator element which belongs to the namespace dc.

We could also have retrieved the children of our $item element that belong to a particular namespace. This is a two-step process. First we have to find what dc represents.

<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">

The second step is passing this namespace address as a parameter to the children function:

  //$creator_by_namespace = $item->children('http://purl.org/dc/elements/1.1/')->creator;
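Rather than copying the namespace address by hand, we can ask SimpleXML for the namespaces declared on the root element. The following sketch uses a cut-down feed of our own in place of the live Wired feed, so it runs without network access.

```php
<?php
$feed = <<<THE_XML
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0">
  <channel>
    <item>
      <title>First story</title>
      <dc:creator>Matthew Lasar</dc:creator>
    </item>
    <item>
      <title>Second story</title>
      <dc:creator>Terrence Russell</dc:creator>
    </item>
  </channel>
</rss>
THE_XML;

$rss = simplexml_load_string($feed);

// Step 1: find what the dc prefix represents
$namespaces = $rss->getDocNamespaces(); // array of prefix => address
$dc_uri = $namespaces['dc'];

// Step 2: pass the address to children()
$creators = array();
foreach ($rss->channel->item as $item) {
    $creators[] = (string) $item->children($dc_uri)->creator;
}

print $dc_uri . "<br/>";         // http://purl.org/dc/elements/1.1/
print implode(", ", $creators);  // Matthew Lasar, Terrence Russell
```

The children function also accepts the prefix itself when its second argument is true, as in $item->children('dc', true).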

Generating XML with SimpleXML

We have used SimpleXML exclusively to parse existing XML. However, we can also use it to generate an XML document from existing data. This data could be in the form of an array, an object, or a database.

To programmatically create an XML document, we need to create a new SimpleXMLElement that will point to our document root. Then we can add child elements to the root, and child elements of these children. See Listing 14-15.

Listing 14-15. Generating a Basic XML Document with SimpleXML

<?php

error_reporting(E_ALL ^ E_NOTICE);

//generate the xml, starting with the root
$animals = new SimpleXMLElement('<animals/>');
$animals->{0} = 'Hello World';

$animals->asXML('animals.xml');

//verify no errors with our newly created output file
var_dump(simplexml_load_file('animals.xml'));

?>

Outputs:

object(SimpleXMLElement)[2]
  string 'Hello World' (length=11)

And produces the file animals.xml with contents:

<?xml version="1.0"?>
<animals>Hello World</animals>

Listing 14-15 creates a root element, <animals>, assigns a value to it, and calls the method asXML to save to a file. To test that this worked, we then load the saved file and output the contents. Ensure that you have write access to the file location.

In Listing 14-16, which is the complement of Listing 14-4, we have animal data stored as arrays and want to create an XML document from the information.

Listing 14-16. Generating a Basic XML Document with SimpleXML

<?php

error_reporting(E_ALL ^ E_NOTICE);

//our data, stored in arrays
$dogs_array = array(
    array("name" => "snoopy",
        "color" => "brown",
        "breed" => "beagle cross"
    ),
    array("name" => "jade",
        "color" => "black",
        "breed" => "lab cross"
    ),
);

$cats_array = array(
    array("name" => "teddy",
        "color" => "brown",
        "breed" => "tabby"
    ),
);

//generate the xml, starting with the root
$animals = new SimpleXMLElement('<animals/>');

$cats_xml = $animals->addChild('cats');
$dogs_xml = $animals->addChild('dogs');

foreach ($cats_array as $c) {
    $cat = $cats_xml->addChild('cat');
    foreach ($c as $key => $value) {
        $tmp = $cat->addChild($key);
        $tmp->{0} = $value;
    }
}

foreach ($dogs_array as $d) {
    $dog = $dogs_xml->addChild('dog');
    foreach ($d as $key => $value) {
        $tmp = $dog->addChild($key);
        $tmp->{0} = $value;
    }
}

var_dump($animals);
$animals->asXML('animals.xml');

print '<br/><br/>';
//verify no errors with our newly created output file
var_dump(simplexml_load_file('animals.xml'));

?>

In Listing 14-16, we create a new SimpleXMLElement root with the call new SimpleXMLElement('<animals/>'). To populate our document from the top level elements downward, we create children by calling addChild and store a reference to the newly created element. Using the element reference, we can add child elements. By repeating this process, we can generate an entire tree of nodes.
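As a compact illustration of the same building blocks, the sketch below (with sample values of our own) adds a child, gives it an attribute, and sets a value directly through the second argument of addChild. Calling asXML with no filename returns the document as a string instead of writing a file.

```php
<?php
$animals = new SimpleXMLElement('<animals/>');

// addChild() returns the new element, so we can keep building on it
$dog = $animals->addChild('dog');
$dog->addAttribute('id', '1');
$dog->addChild('name', 'snoopy'); // the second argument sets the value

// with no filename, asXML() returns the XML as a string
$xml_string = $animals->asXML();
print htmlspecialchars($xml_string);
```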

Unfortunately, the output function asXML() does not format our output nicely at all. Everything appears as a single line. To get around this, we can use the DOMDocument class, which we will discuss later in this chapter, to output the XML nicely.

$animals_dom = new DOMDocument('1.0');
$animals_dom->preserveWhiteSpace = false;
$animals_dom->formatOutput = true;
//returns a DOMElement
$animals_dom_xml = dom_import_simplexml($animals);
$animals_dom_xml = $animals_dom->importNode($animals_dom_xml, true);
$animals_dom_xml = $animals_dom->appendChild($animals_dom_xml);
$animals_dom->save('animals_formatted.xml');

This code creates a new DOMDocument object and sets it to format the output. Then we import the SimpleXMLElement object into a new DOMElement object. We import the node recursively into our document and then save the formatted output to a file. Substituting the above code for the asXML call in Listing 14-16 results in clean, nested output:

<?xml version="1.0"?>
<animals>
  <cats>
    <cat>
      <name>teddy</name>
      <color>brown</color>
      <breed>tabby</breed>
    </cat>
  </cats>
  <dogs>
    <dog>
      <name>snoopy</name>
      <color>brown</color>
      <breed>beagle cross</breed>
    </dog>
    <dog>
      <name>jade</name>
      <color>black</color>
      <breed>lab cross</breed>
    </dog>
  </dogs>
</animals>

Note SimpleXML can also import DOM objects via the function simplexml_import_dom.

<?php

error_reporting(E_ALL);
//loadXML is called on an instance; calling it statically fails on current PHP versions
$dom_xml = new DOMDocument();
$dom_xml->loadXML("<root><name>Brian</name></root>");
$simple_xml = simplexml_import_dom($dom_xml);
print $simple_xml->name; // Brian

?>

In Listing 14-17, we will generate an RSS sample with namespaces and attributes. Our goal will be to output an XML document with the following structure:

<?xml version="1.0" ?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0">
<channel>
        <title>Brian’s RSS Feed</title>
        <description>Brian’s Latest Blog Entries</description>
        <link>http://www.briandanchilla.com/node/feed</link>
        <lastBuildDate>Fri, 04 Feb 2011 00:11:08 +0000</lastBuildDate>
        <pubDate>Fri, 04 Feb 2011 08:25:00 +0000</pubDate>
        <item>
                <title>Pretend Topic</title>
                <description>Pretend description</description>
                <link>http://www.briandanchilla.com/pretend-link/</link>
                <guid>unique generated string</guid>
                <dc:pubDate>Fri, 04 Feb 2011 08:25:00 +0000</dc:pubDate>
        </item>
</channel>
</rss>

Listing 14-17. Generating an RSS Document with SimpleXML

<?php

error_reporting(E_ALL);

$items = array(
    array(
        "title" => "a",
        "description" => "b",
        "link" => "c",
        "guid" => "d",
        "lastBuildDate" => "",
        "pubDate" => "e"),

    array(
        "title" => "a2",
        "description" => "b2",
        "link" => "c2",
        "guid" => "d2",
        "lastBuildDate" => "",
        "pubDate" => "e2"),
);

$rss_xml = new SimpleXMLElement('<rss xmlns:dc="http://purl.org/dc/elements/1.1/"/>');
$rss_xml->addAttribute('version', '2.0');
$channel = $rss_xml->addChild('channel');

foreach ($items as $item) {
    $item_tmp = $channel->addChild('item');

    foreach ($item as $key => $value) {
        if ($key == "pubDate") {
            $tmp = $item_tmp->addChild($key, $value, "http://purl.org/dc/elements/1.1/");
        } else if($key == "lastBuildDate") {
            //Format will be: Fri, 04 Feb 2011 00:11:08 +0000
            $tmp = $item_tmp->addChild($key, date('r', time()));
        } else {
            $tmp = $item_tmp->addChild($key, $value);
        }
    }
}

//for nicer formatting
$rss_dom = new DOMDocument('1.0');
$rss_dom->preserveWhiteSpace = false;
$rss_dom->formatOutput = true;
//returns a DOMElement
$rss_dom_xml = dom_import_simplexml($rss_xml);
$rss_dom_xml = $rss_dom->importNode($rss_dom_xml, true);
$rss_dom_xml = $rss_dom->appendChild($rss_dom_xml);
$rss_dom->save('rss_formatted.xml');
?>

The key steps in Listing 14-17 are declaring the namespace in the root element with $rss_xml = new SimpleXMLElement('<rss xmlns:dc="http://purl.org/dc/elements/1.1/"/>'), passing the namespace address to addChild when the item key is pubDate, and generating an RFC 2822 formatted date when the key is lastBuildDate.

The contents of the file after running Listing 14-17 will be similar to this:

<?xml version="1.0"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0">
  <channel>
    <item>
      <title>a</title>
      <description>b</description>
      <link>c</link>
      <guid>d</guid>
      <lastBuildDate>Fri, 27 May 2011 01:20:04 +0200</lastBuildDate>
      <dc:pubDate>e</dc:pubDate>
    </item>
    <item>
      <title>a2</title>
      <description>b2</description>
      <link>c2</link>
      <guid>d2</guid>
      <lastBuildDate>Fri, 27 May 2011 01:20:04 +0200</lastBuildDate>
      <dc:pubDate>e2</dc:pubDate>
    </item>
  </channel>
</rss>

Note More information on SimpleXML can be found at http://php.net/manual/en/book.simplexml.php.

To troubleshoot an XML document that is not validating, you can use the online validator at http://validator.w3.org/check.

DOMDocument

As mentioned at the start of the chapter, SimpleXML is by no means the only option available for XML manipulation in PHP. Another popular XML extension is the DOM. We have already seen one area, output formatting, where DOMDocument offers more than SimpleXML. DOMDocument is the more powerful of the two but, as you would expect, is not as straightforward to use.

The majority of the time you would probably choose to use SimpleXML over DOM. However, the DOM extension has these additional features:

  • Follows the W3C DOM API, so if you are familiar with JavaScript DOM this will be easy to adapt to.
  • Supports HTML parsing.
  • Distinct node types offer more control.
  • Can append raw XML to an existing XML document.
  • Makes it easier to modify an existing document by updating or removing nodes.
  • Provides better support for CDATA and comments.

With SimpleXML, all nodes are the same, so an element uses the same underlying object as an attribute. The DOM has separate node types, including XML_ELEMENT_NODE, XML_ATTRIBUTE_NODE, and XML_TEXT_NODE. Depending on the type, the corresponding object properties are tagName for elements, name and value for attributes, and nodeName and nodeValue for text.
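As a minimal sketch of how these node types surface in code, the following inspects one element, its attribute, and its text. The document fragment, the id attribute, and the variable names are assumptions made for illustration; they are not taken from a listing in this chapter.

```php
<?php
// Illustrative document; the id attribute is an assumption for this sketch.
$dom = new DOMDocument();
$dom->loadXML('<animals><dog id="1">snoopy</dog></animals>');

$dog = $dom->documentElement->firstChild;

// Elements expose tagName (as well as the generic nodeName).
$is_element = ($dog->nodeType === XML_ELEMENT_NODE);
$tag = $dog->tagName;

// Attributes are separate nodes with name and value properties.
$attr = $dog->attributes->getNamedItem('id');
$is_attribute = ($attr->nodeType === XML_ATTRIBUTE_NODE);

// The text inside <dog> is its own child node of type XML_TEXT_NODE.
$text = $dog->firstChild;
$is_text = ($text->nodeType === XML_TEXT_NODE);

echo "$tag: {$attr->name}={$attr->value}, text={$text->nodeValue}\n";
```

Checking nodeType before reading type-specific properties is a reasonable habit, since a child of an element may be text or a comment rather than another element.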

//creating a DOMDocument object
$dom_xml = new DOMDocument();

DOMDocument can load XML from a string, from a file, or by importing a SimpleXML object.

//from a string
$dom_xml->loadXML('the full xml string');

// from a file
$dom_xml->load('animals.xml');

// imported from a SimpleXML object
$dom_element = dom_import_simplexml($simplexml);
$dom_element = $dom_xml->importNode($dom_element, true);
$dom_element = $dom_xml->appendChild($dom_element);

To navigate a DOM object, you use object-oriented property accesses and method calls such as:

$dom_xml->childNodes->item(0)->firstChild->nodeValue;
$dom_xml->childNodes;
$dom_xml->parentNode;
$dom_xml->getElementsByTagName('div');
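The navigation calls above can be combined into a short runnable sketch. The two-animal document here is an assumption chosen for brevity. It is written without whitespace between tags, because with the default settings indentation would appear in the tree as extra text nodes.

```php
<?php
$dom_xml = new DOMDocument();
// No whitespace between tags, so firstChild is an element and not a text node.
$dom_xml->loadXML('<animals><dog><name>snoopy</name></dog><cat><name>teddy</name></cat></animals>');

$root = $dom_xml->documentElement;              // the <animals> element
$first = $root->childNodes->item(0);            // the <dog> element
$first_name = $first->firstChild->nodeValue;    // text content of <name>
$parent = $first->parentNode->nodeName;         // name of the parent element

// getElementsByTagName returns a DOMNodeList that can be iterated.
$names = array();
foreach ($dom_xml->getElementsByTagName('name') as $node) {
    $names[] = $node->nodeValue;
}

print "$first_name ($parent): " . implode(', ', $names) . "\n";
```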

There are several save functions available: save, saveHTML, saveHTMLFile, and saveXML.
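As a rough sketch of how the XML save functions differ, the following builds a tiny document and saves it three ways. The element names and the temporary file path are assumptions for illustration. saveHTML and saveHTMLFile are the HTML-serializing counterparts of saveXML and save and are not shown.

```php
<?php
$dom = new DOMDocument('1.0');
$root = $dom->appendChild($dom->createElement('animals'));
$root->appendChild($dom->createElement('dog', 'snoopy'));

// saveXML() returns the whole document as a string, XML declaration included.
$whole = $dom->saveXML();

// Passing a node returns only that subtree, without the declaration.
$fragment = $dom->saveXML($root);

// save() writes the document to a file and returns the number of bytes written.
$path = tempnam(sys_get_temp_dir(), 'dom'); // throwaway path for this sketch
$bytes = $dom->save($path);
unlink($path);

print $fragment . "\n";
```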

DOMDocument has a validate method that checks a document against its DTD. To use XPath with the DOM, you need to construct a new DOMXPath object.

$xpath = new DOMXPath($dom_xml);
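The validate method mentioned above can be tried with a small inline DTD. This is only a sketch: the DTD, the sample documents, and the use of libxml_use_internal_errors to keep validation messages from being printed as warnings are assumptions made for illustration.

```php
<?php
// Collect libxml errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Hypothetical DTD: <animals> may contain only <dog> elements.
$dtd = '<!DOCTYPE animals [
  <!ELEMENT animals (dog*)>
  <!ELEMENT dog (#PCDATA)>
]>';

$good = new DOMDocument();
$good->loadXML($dtd . '<animals><dog>snoopy</dog></animals>');
$good_ok = $good->validate();   // the document matches its DTD

$bad = new DOMDocument();
$bad->loadXML($dtd . '<animals><cat>teddy</cat></animals>');
$bad_ok = $bad->validate();     // <cat> is not declared in the DTD

libxml_clear_errors();
var_dump($good_ok, $bad_ok);
```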

To better illustrate the difference between the SimpleXML and DOM extensions, the next two examples using the DOM are equivalent to examples done earlier in the chapter using SimpleXML.

Listing 14-18 outputs all of the animal names, each followed by the type of animal in parentheses. It is equivalent to Listing 14-9, which used SimpleXML.

Listing 14-18. Finding Elements with DOM

<?php

error_reporting(E_ALL ^ E_NOTICE);

$xml = <<<THE_XML
<animals>
  <dog>
    <name>snoopy</name>
    <color>brown</color>
    <breed>beagle cross</breed>
  </dog>
  <cat>
    <name>teddy</name>
    <color>brown</color>
    <breed>tabby</breed>
  </cat>
  <dog>
    <name>jade</name>
    <color>black</color>
    <breed>lab cross</breed>
  </dog>
</animals>
THE_XML;

$xml_object = new DOMDocument();
$xml_object->loadXML($xml);
$xpath = new DOMXPath($xml_object);

$names = $xpath->query("*/name");
foreach ($names as $element) {
    $parent_type = $element->parentNode->nodeName;
    echo "$element->nodeValue ($parent_type)<br/>";
}
?>

Notice that in Listing 14-18 we need to construct a DOMXPath object and then call its query method. Unlike in Listing 14-9, we can directly access the parent. Finally, observe that we access node values and names as properties in Listing 14-18, and through method calls in Listing 14-9.

Listing 14-19 shows how to search for an element value and then find an attribute value of the element. It is the DOM equivalent of Listing 14-13.

Listing 14-19. Searching for Element and Attribute Values with DOM

<?php

error_reporting(E_ALL);

$xml = <<<THE_XML
<animals>
  <dog>
    <name id="1">snoopy</name>
    <color>brown</color>
    <breed>beagle cross</breed>
  </dog>
  <cat>
    <name id="2">teddy</name>
    <color>brown</color>
    <breed>tabby</breed>
  </cat>
  <dog>
    <name id="3">jade</name>
    <color>black</color>
    <breed>lab cross</breed>
  </dog>
</animals>
THE_XML;

$xml_object = new DOMDocument();
$xml_object->loadXML($xml);
$xpath = new DOMXPath($xml_object);

$results = $xpath->query("dog/name[contains(., 'jade')]");
foreach ($results as $element) {
    print $element->attributes->getNamedItem("id")->nodeValue;
}
?>

The main thing to note in Listing 14-19 is that with the DOM, we use attributes->getNamedItem("id")->nodeValue to read the value of the id attribute. With SimpleXML, in Listing 14-13, we used attributes()->id.
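DOM elements also offer a getAttribute method, which is a shorter way to read the same value. The sketch below is an alternative to the approach in Listing 14-19, not the listing's own method. The trimmed document and the absolute XPath expression are assumptions made to keep the example self-contained.

```php
<?php
$dom = new DOMDocument();
$dom->loadXML('<animals><dog><name id="3">jade</name></dog></animals>');
$xpath = new DOMXPath($dom);

$ids = array();
// An absolute path is used here so the query does not depend on a context node.
foreach ($xpath->query("/animals/dog/name[contains(., 'jade')]") as $element) {
    // getAttribute returns the attribute value, or an empty string if it is absent.
    $ids[] = $element->getAttribute('id');
}
print implode(',', $ids);
```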

XMLReader and XMLWriter

The XMLReader and XMLWriter extensions are often used together. They are more difficult to use than the SimpleXML or DOM extensions. For very large documents, however, they are a good choice (often the only choice), because the reader and writer are event based and do not require the entire document to be loaded into memory. Because the XML is not loaded all at once, you should know the exact schema of the XML beforehand.

You can obtain most values with XMLReader by repeatedly calling read(), looking up the nodeType and obtaining the value.

Listing 14-20 is the XMLReader equivalent of Listing 14-4, which uses SimpleXML.

Listing 14-20. Finding Elements with XMLReader

<?php

error_reporting(E_ALL ^ E_NOTICE);

$xml = <<<THE_XML
<animals>
  <dog>
    <name>snoopy</name>
    <color>brown</color>
    <breed>beagle cross</breed>
  </dog>
  <cat>
    <name>teddy</name>
    <color>brown</color>
    <breed>tabby</breed>
  </cat>
  <dog>
    <name>jade</name>
    <color>black</color>
    <breed>lab cross</breed>
  </dog>
</animals>
THE_XML;

$xml_object = new XMLReader();
$xml_object->XML($xml);
$dog_parent = false;
while ($xml_object->read()) {
    if ($xml_object->nodeType == XMLReader::ELEMENT) {
        if ($xml_object->name == "cat") {
            $dog_parent = false;
        } else if ($xml_object->name == "dog") {
            $dog_parent = true;
        } else if ($xml_object->name == "name" && $dog_parent) {
            $xml_object->read();
            if ($xml_object->nodeType == XMLReader::TEXT) {
                print $xml_object->value . "<br/>";
                $dog_parent = false;
            }
        }
    }
}
?>

Notice that Listing 14-20 is still complex, even though it involves no namespaces and no XPath.

A useful XMLReader function is expand(). It returns a copy of the current node as a DOMNode. This gives you access to DOM methods, such as searching the subtree by tag name.

$subtree = $xml_reader->expand();
$breeds = $subtree->getElementsByTagName('breed');
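To show expand() in context, here is a minimal sketch that collects the breed of every dog. The compact document and the variable names are assumptions. Importing the expanded node into a separate DOMDocument before querying it is a cautious choice for this sketch rather than a requirement stated in this chapter.

```php
<?php
$xml = '<animals>'
     . '<dog><name>snoopy</name><breed>beagle cross</breed></dog>'
     . '<cat><name>teddy</name><breed>tabby</breed></cat>'
     . '<dog><name>jade</name><breed>lab cross</breed></dog>'
     . '</animals>';

$xml_reader = new XMLReader();
$xml_reader->XML($xml);
$dom = new DOMDocument();
$dog_breeds = array();

while ($xml_reader->read()) {
    if ($xml_reader->nodeType == XMLReader::ELEMENT && $xml_reader->name == 'dog') {
        // Copy the expanded subtree into our own document before querying it.
        $subtree = $dom->importNode($xml_reader->expand(), true);
        foreach ($subtree->getElementsByTagName('breed') as $breed) {
            $dog_breeds[] = $breed->nodeValue;
        }
    }
}
$xml_reader->close();

print implode(', ', $dog_breeds);
```

Only one dog subtree is held as a DOM tree at a time, so memory use stays bounded by the largest subtree rather than by the whole document.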

Of course, you would only want to do this on subtrees that are not very large themselves. XMLReader and XMLWriter are much more complex than the tree-based extensions; XMLReader alone has around twenty node types. Because they are harder to use than SimpleXML and DOM, they are typically used only when necessary.
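The listings in this section use only XMLReader. As a minimal sketch of the writing side, the following uses XMLWriter to emit a small document one event at a time. The animal data and variable names are assumptions for illustration. The document is built in a memory buffer here, which would not suit a very large output.

```php
<?php
$writer = new XMLWriter();
$writer->openMemory();              // build the document in a memory buffer
$writer->setIndent(true);
$writer->startDocument('1.0', 'UTF-8');
$writer->startElement('animals');

// Hypothetical data: pairs of animal type and name.
$animals = array(array('dog', 'snoopy'), array('cat', 'teddy'));
foreach ($animals as $animal) {
    list($type, $name) = $animal;
    $writer->startElement($type);
    $writer->writeElement('name', $name);
    $writer->endElement();          // closes <dog> or <cat>
}

$writer->endElement();              // closes <animals>
$writer->endDocument();
$output = $writer->outputMemory();

print $output;
```

Each start call must be matched by an end call in the right order, which is part of why the event-based extensions take more care than SimpleXML or the DOM.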

Summary

XML is a very useful tool for communicating and storing data. The cross-language, platform-independent nature of XML makes it ideal for many applications. XML documents can range from being simple and straightforward to being very complex in nature, with elaborate schemas and multiple namespaces.

In this chapter we gave an overview of XML. Then we covered parsing and generating XML documents with the SimpleXML extension. SimpleXML makes working with XML very easy while still being powerful. We showed how to find element and attribute values and how to handle namespaces.

Though SimpleXML is the best solution for most documents, alternatives such as the DOM and XMLReader should be used when appropriate. For XHTML documents, DOM makes sense, and for very large documents, XMLReader and XMLWriter might be the only viable option. In any case, knowledge of multiple XML parsers is never bad.
