Extensible Markup Language (XML) is a very powerful tool for data storage and transfer. When a document is written in XML, it can be universally understood and exchanged. The utility of XML being a worldwide standard should not be underestimated. XML is used for modern word processor documents, SOAP and REST web services, RSS feeds, and XHTML documents.
In this chapter, we will primarily cover the PHP SimpleXML extension, which makes it very easy to manipulate XML documents. We will also touch on the Document Object Model (DOM) and XMLReader extensions. The DOM guarantees that the document is viewed the same no matter what computer language is using it. The main reasons that more than one library for parsing and writing XML exists are ease of use, depth of functionality, and the manner in which the XML is manipulated.
XML allows us to define documents that use any tag elements or attributes we want. When viewing an XML document in a text editor, you may notice that it resembles HTML. This is because, like HTML (HyperText Markup Language), XML is a markup language, containing a collection of tagged content in a hierarchical structure. The hierarchical structure is tree-like, having a single root element (tag) acting as the trunk, child elements branching off of the root, and further descendants branching off of their parent elements. You can also view an XML document in order as a series of discrete events. Notice that viewing the elements in order requires no knowledge of what the entire document represents, but also makes searching for elements more difficult.
A specific example of an XML “application” is XHTML. XHTML is similar to HTML in that it uses the same tags. However, XHTML also adheres to XML standards and so is stricter. XHTML has the following additional requirements:
Empty elements, such as <br>, need to be closed off. In this case, we would use <br />.
The characters &, <, >, ', and " need to be escaped as &amp;, &lt;, &gt;, &apos;, and &quot; respectively.
Attribute values need to be enclosed in quotes. For example, <img src=dog.jpg /> is illegal while <img src="dog.jpg" /> is legal.
To parse XML, we can use a tree-based or an event-driven model. Tree-based models like those used in SimpleXML and the DOM represent HTML and XML documents as a tree of elements and load the entire document into memory. Each element, except the root, has a parent element. Elements may contain attributes and values. Event-based models such as the Simple API for XML (SAX) read only part of the XML document at a time. For large documents, SAX is faster; for extremely large documents, it can be the only viable option. However, tree-based models are usually easier and more intuitive to work with, and some XML documents require that the document be loaded all at once.
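To make the contrast between the two models concrete, here is a minimal sketch that reads the same names twice. It uses SimpleXML for the tree-based pass and, as a stand-in for the event-driven approach, the XMLReader extension covered at the end of this chapter. The sample document and variable names are made up for illustration.

```php
<?php
$xml = "<animals><dog><name>snoopy</name></dog><cat><name>teddy</name></cat></animals>";

// Tree-based: the whole document is parsed into memory before we touch it.
$tree = simplexml_load_string($xml);
$tree_names = array();
foreach ($tree->children() as $animal) {
    $tree_names[] = (string) $animal->name;
}

// Stream-based: XMLReader moves through the document one node at a time.
$reader = new XMLReader();
$reader->XML($xml);
$stream_names = array();
while ($reader->read()) {
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == "name") {
        $stream_names[] = $reader->readString();
    }
}
$reader->close();

print implode(", ", $tree_names) . "\n";   // snoopy, teddy
print implode(", ", $stream_names) . "\n"; // snoopy, teddy
```

Both passes produce the same list. The difference is that the second never holds more than the current node, which is what matters for very large documents.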
A basic XML document can look like the following:
<animal>
<type id="9">dog</type>
<name>snoopy</name>
</animal>
The root element is <animal> and there are two child elements, <type> and <name>. The value of the <type> element is “dog” and the value of the <name> element is “snoopy.” The <type> element has one attribute, id, which has a value of “9.” Furthermore, each opening tag has a matching closing tag and the attribute value is enclosed in quotation marks.
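As a quick preview of the SimpleXML extension introduced later in the chapter, the following sketch reads each of these pieces back out of the document. The array-style attribute access is a SimpleXML feature; the variable names are ours.

```php
<?php
$animal = simplexml_load_string(
    '<animal><type id="9">dog</type><name>snoopy</name></animal>'
);

print $animal->getName() . "\n";   // animal (the root element)
print $animal->type . "\n";        // dog    (value of <type>)
print $animal->type['id'] . "\n";  // 9      (attribute of <type>)
print $animal->name . "\n";        // snoopy (value of <name>)
```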
An XML schema provides additional constraints on an XML document. Examples of constraints are specific elements that are optional or must be included, the acceptable values and attributes of an element, and where elements can be placed.
Without a schema, there is nothing preventing us from having nonsensical data like what you see in Listing 14-1.
Listing 14-1. Example Showing the Need for a Stricter Schema
<animals>
<color>black</color>
<dog>
<name>snoopy</name>
<breed>
<cat>
<color>brown</color>
<breed>tabby</breed>
</cat>
beagle cross
</breed>
</dog>
<name>teddy</name>
</animals>
This document does not make much sense to humans. A cat cannot be part of a dog. color and name are not animals and should be enclosed within a dog or cat element. However, from a machine perspective, this is a perfectly valid document. We have to tell the machine why this document is not acceptable. A schema allows us to tell the machine how the data must be laid out. This added rigidity ensures more integrity of data in the document. With a schema, we can explicitly say that <cat> tags cannot go inside of <dog> tags. We can also say that <name> and <color> tags can only go directly inside of <cat> or <dog> tags.
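Although we do not cover writing schemas, a small sketch shows what enforcement looks like from PHP. The DTD below is a hypothetical one written for this example (it is not part of the chapter's listings), embedded in the document itself and checked with the validate method of DOMDocument. The helper name isValid is also ours.

```php
<?php
// Hypothetical DTD: <animals> may hold only <dog> and <cat> elements,
// and each of those must contain name, color, and breed, in that order.
$dtd = <<<THE_DTD
<!DOCTYPE animals [
<!ELEMENT animals (dog|cat)*>
<!ELEMENT dog (name, color, breed)>
<!ELEMENT cat (name, color, breed)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT color (#PCDATA)>
<!ELEMENT breed (#PCDATA)>
]>
THE_DTD;

$good = $dtd . "<animals><dog><name>snoopy</name><color>brown</color>"
      . "<breed>beagle cross</breed></dog></animals>";
$bad  = $dtd . "<animals><color>black</color><dog><name>snoopy</name></dog></animals>";

// collect validation problems internally instead of emitting PHP warnings
libxml_use_internal_errors(true);

function isValid($xml) {
    $dom = new DOMDocument();
    $dom->loadXML($xml);
    return $dom->validate(); // checks the document against its DTD
}

var_dump(isValid($good)); // bool(true)
var_dump(isValid($bad));  // bool(false)
```

Both strings are well-formed XML, so a plain parser accepts either one. Only the validation step rejects the second.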
The three most popular schema languages are the Document Type Definition (DTD), XML Schema, and RELAX NG (REgular LAnguage for XML Next Generation). As this book is focused on PHP, we will not go over creating a schema, but simply mention that you declare the schema at the start of your document. See Listing 14-2.
Listing 14-2. Code Snippet Showing the Declaration of Using the xhtml1-transitional Schema
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
SimpleXML makes it easy to store XML as a PHP object and vice versa. SimpleXML simplifies traversal of the XML structure and finding specific elements. The SimpleXML extension requires PHP 5 or higher and is enabled by default.
Let us dive right into our first example. We will load XML that is in a string into a SimpleXMLElement object and traverse the structure. See Listing 14-3.
Listing 14-3. First Example: animal.php
<?php
error_reporting(E_ALL ^ E_NOTICE);
$xml = <<<THE_XML
<animal>
<type>dog</type>
<name>snoopy</name>
</animal>
THE_XML;
//to load the XML string into a SimpleXMLElement object takes one line
$xml_object = simplexml_load_string($xml);
foreach ($xml_object as $element => $value) {
print $element . ": " . $value . "<br/>";
}
?>
After the XML string is loaded in Listing 14-3, $xml_object is at the root element, <animal>. The document is represented as a SimpleXMLElement object, so we can iterate through the child elements using a foreach loop. The output of Listing 14-3 is the following:
type: dog
name: snoopy
Listing 14-4. A More Complex Example: animals.php
<?php
error_reporting(E_ALL ^ E_NOTICE);
$xml = <<<THE_XML
<animals>
<dog>
<name>snoopy</name>
<color>brown</color>
<breed>beagle cross</breed>
</dog>
<cat>
<name>teddy</name>
<color>brown</color>
<breed>tabby</breed>
</cat>
<dog>
<name>jade</name>
<color>black</color>
<breed>lab cross</breed>
</dog>
</animals>
THE_XML;
$xml_object = simplexml_load_string($xml);
//output all of the dog names
foreach($xml_object->dog as $dog){
print $dog->name."<br/>";
}
?>
The output of Listing 14-4 is the following:
snoopy
jade
Most of Listing 14-4 uses PHP heredoc syntax to load a string in a readable fashion. The actual code needed to find the element values was only a few lines. Simple indeed. SimpleXML is smart enough to iterate over all the <dog> tags, even with a <cat> tag between the <dog> tags.
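Because repeated child elements behave like a list, we can also count them or pick one by position. The following is a small sketch using a trimmed-down version of the same animals document.

```php
<?php
$xml_object = simplexml_load_string(
    "<animals><dog><name>snoopy</name></dog><cat><name>teddy</name></cat>"
    . "<dog><name>jade</name></dog></animals>"
);

print count($xml_object->dog) . "\n";    // 2
print $xml_object->dog[1]->name . "\n";  // jade (the second <dog>)
print $xml_object->cat->name . "\n";     // teddy
```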
When you are loading XML, if the document is invalid then PHP will complain with a helpful warning message. The message could inform that you need to close a tag or escape an entity, and will indicate the line number of the error. See Listing 14-5.
Listing 14-5. Sample PHP Warning Message for Invalid XML
Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 1: parser
error : attributes construct error in E:\xampp\htdocs\xml\animals.php on line 29
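If you would rather handle such problems in code than have warnings printed, the libxml functions can collect the errors for you. The following is a minimal sketch with a deliberately broken document; the exact wording of the messages depends on the libxml version, so none is shown here.

```php
<?php
// keep parser errors in an internal buffer instead of emitting warnings
libxml_use_internal_errors(true);

// the <name> tag is never closed
$broken = "<animal><type>dog</type><name>snoopy</animal>";
$xml_object = simplexml_load_string($broken);

if ($xml_object === false) {
    foreach (libxml_get_errors() as $error) {
        print "line " . $error->line . ": " . trim($error->message) . "\n";
    }
    libxml_clear_errors();
}
```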
Our next two examples will load XML from the file shown in Listing 14-6. Some of the XML elements have attributes. We will show in Listing 14-7 how to find attribute values naively, by using repeated SimpleXML function calls. Then in Listing 14-10, we will show how to find the attribute values by using XPath, which is meant to simplify searches.
Listing 14-6. Our Sample XHTML File: template.xhtml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<body>
<div id="header">
header would be here
</div>
<div id="menu">
menu would be here
</div>
<div id="main_content">
<div id="main_left">
left sidebar
</div>
<div id="main_center" class="foobar">
main story
</div>
<div id="main_right">
right sidebar
</div>
</div>
<div id="footer">
footer would be here
</div>
</body>
</html>
The first two lines of Listing 14-6 define the version of XML used and the DOCTYPE, and are not part of the tree loaded into the SimpleXMLElement. So the root is the <html> element.
Listing 14-7 shows how to find the content of the <div> with id="main_center" using object-oriented SimpleXML methods.
Listing 14-7. Finding a Specific Value Based on an Attribute
<?php
error_reporting(E_ALL ^ E_NOTICE);
$xml = simplexml_load_file("template.xhtml");
findDivContentsByID($xml, "main_center");
function findDivContentsByID($xml, $id) {
foreach ($xml->body->div as $divs) {
if (!empty($divs->div)) {
foreach ($divs->div as $inner_divs) {
if (isElementWithID($inner_divs, $id)) {
break 2;
}
}
} else {
if (isElementWithID($divs, $id)) {
break;
}
}
}
}
function isElementWithID($element, $id) {
$actual_id = (String) $element->attributes()->id;
if ($actual_id == $id) {
$value = trim((String) $element);
print "value of #$id is: $value";
return true;
}
return false;
}
?>
Listing 14-7 will find all of the <div> elements of the <body> element and also the direct child <div> elements of those <div> elements. Then each matching <div> element has its id attribute compared with our id search value, "main_center". If they are equal, then we print out the value and break from the loop. The output of this script is as follows:
value of #main_center is: main story
We cannot simply output $element in our isElementWithID function, because that would output the entire SimpleXMLElement object:
object(SimpleXMLElement)[9]
public '@attributes' =>
array
'id' => string 'main_center' (length=11)
'class' => string 'foobar' (length=6)
string '
main story
' (length=40)
So we need to cast the return value from an Object into a String. (Recall that casting explicitly converts a variable from one data type into another.) Notice also that whitespace is captured in the element value, so we may need to use the PHP trim() function on our string.
To get the attributes of an element, SimpleXML has the attributes() function, which returns an object of attributes.
var_dump($element->attributes());
object(SimpleXMLElement)[9]
public '@attributes' =>
array
'id' => string 'main_center' (length=11)
'class' => string 'foobar' (length=6)
We also need to cast the return value of $element->attributes()->id, or we will again get an entire SimpleXMLElement object back.
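The effect of the cast is easiest to see side by side. This sketch uses a one-element stand-in for the main_center <div> and compares the attribute to a string with and without casting.

```php
<?php
$div = simplexml_load_string('<div id="main_center" class="foobar"> main story </div>');

// attributes() can be iterated as name => value pairs
foreach ($div->attributes() as $name => $value) {
    print $name . " = " . (string) $value . "\n";
}

$id = (string) $div->attributes()->id;
var_dump($id === "main_center");                     // bool(true)
var_dump($div->attributes()->id === "main_center");  // bool(false), an object is not a string
var_dump(trim((string) $div));                       // string(10) "main story"
```

The loose comparison (==) used in Listing 14-7 would also work on the uncast object, but casting first keeps the intent explicit and makes strict comparisons safe.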
Listing 14-7 is not robust. If the structure of the document changes or is deeper than two levels, it will fail to find the id.
You may recognize that XHTML documents follow the familiar Document Object Model, or DOM, of HTML. Existing parsers and traversal utilities like XPath and XQuery make finding nested elements relatively easy. XPath is available in both the SimpleXML library and the PHP DOM library. With SimpleXML, you invoke XPath through a method call, $simple_xml_object->xpath(). In the DOM library, you use XPath by creating a DOMXPath object and then calling the object’s query method.
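Here is a minimal sketch of the two invocation styles, running the same query against the same small document. The document and variable names are only for illustration.

```php
<?php
$xml = "<animal><type>dog</type><name>snoopy</name></animal>";

// SimpleXML: xpath() is a method on the element and returns an array
$simple = simplexml_load_string($xml);
$matches = $simple->xpath("/animal/name");
print $matches[0] . "\n"; // snoopy

// DOM: build a DOMXPath object around the document, then call query()
$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query("/animal/name");
print $nodes->item(0)->nodeValue . "\n"; // snoopy
```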
We will show how to find a specific id attribute with XPath in Listing 14-10. First we will show how to find the elements we retrieved in Listings 14-3 and 14-4 using XPath. See Listing 14-8.
Listing 14-8. Finding an Element Using XPath
<?php
error_reporting(E_ALL);
$xml = <<<THE_XML
<animal>
<type>dog</type>
<name>snoopy</name>
</animal>
THE_XML;
$xml_object = simplexml_load_string($xml);
$type = $xml_object->xpath("type");
foreach($type as $t) {
echo $t."<br/><br/>";
}
$xml_object = simplexml_load_string($xml);
$children = $xml_object->xpath("/animal/*");
foreach($children as $element) {
echo $element->getName().": ".$element."<br/>";
}
?>
The output of Listing 14-8 is:
dog
type: dog
name: snoopy
In the first part of Listing 14-8, we select the <type> inner element of <animal> using the XPath selector "type". This returns an array of SimpleXMLElement objects that match the XPath query. The second part of the listing selects all child elements of <animal> using the XPath selector "/animal/*", where the asterisk is a wildcard. As SimpleXMLElement objects are returned from the xpath() call, we can also output the element names by using the getName() method.
Note The complete specification covering XPath selectors can be viewed at www.w3.org/TR/xpath/.
Listing 14-9 shows how to match a specific child element regardless of the parent type. It also demonstrates how to find the parent element of a SimpleXMLElement.
Listing 14-9. Matching Children and Parents Using XPath
<?php
error_reporting(E_ALL ^ E_NOTICE);
$xml = <<<THE_XML
<animals>
<dog>
<name>snoopy</name>
<color>brown</color>
<breed>beagle cross</breed>
</dog>
<cat>
<name>teddy</name>
<color>brown</color>
<breed>tabby</breed>
</cat>
<dog>
<name>jade</name>
<color>black</color>
<breed>lab cross</breed>
</dog>
</animals>
THE_XML;
$xml_object = simplexml_load_string($xml);
$names = $xml_object->xpath("*/name");
foreach ($names as $element) {
$parent = $element->xpath("..");
$type = $parent[0]->getName();
echo "$element ($type)<br/>";
}
?>
The output of Listing 14-9 will be this:
snoopy (dog)
teddy (cat)
jade (dog)
We have matched the <name> element, regardless of whether it is contained in a <dog> or <cat> element, with the XPath query "*/name". To get the parent of our current SimpleXMLElement, we used the query "..". We could have used the query "parent::*" instead.
Listing 14-10. Matching an Attribute Value Using XPath
<?php
error_reporting(E_ALL);
$xml = simplexml_load_file("template.xhtml");
$content = $xml->xpath("//*[@id='main_center']");
print (String)$content[0];
?>
In Listing 14-10, we used the query "//*[@id='main_center']" to find the element with attribute id equal to 'main_center'. To match an attribute with XPath, we use the @ sign. Compare the simplicity of Listing 14-10, which uses XPath, with that of Listing 14-7.
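One caveat with reading $content[0] directly is that xpath() returns an empty array when nothing matches. The sketch below wraps the lookup in a small helper that guards against that case. The function name firstMatch and the inline sample document are our own, used here instead of template.xhtml so the example stands alone.

```php
<?php
// Hypothetical helper: returns the trimmed text of the first match, or null.
function firstMatch(SimpleXMLElement $xml, $query) {
    $matches = $xml->xpath($query);
    if (empty($matches)) {   // covers both "no match" and a failed query
        return null;
    }
    return trim((string) $matches[0]);
}

$page = simplexml_load_string(
    '<body><div id="menu">menu</div><div id="main_center"> main story </div></body>'
);

var_dump(firstMatch($page, "//*[@id='main_center']")); // string(10) "main story"
var_dump(firstMatch($page, "//*[@id='sidebar']"));     // NULL
```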
XML namespaces define what collection an element belongs to, preventing data ambiguity. Ambiguity can otherwise occur if you have distinct node types containing elements with the same name. For example, you could define different namespaces for cat and dog to ensure that their inner elements have unique names, as demonstrated in Listing 14-11 and Listing 14-12.
For information on PHP namespaces, refer to Chapter 5 - Cutting Edge PHP.
The first step in using an XML namespace is declaring one with xmlns:your_namespace:
<animals xmlns:dog="http://foobar.com/dog" xmlns:cat="http://foobar.com/cat">
You then prefix the namespace to an element. When you want to retrieve dog names, you can search for dog:name, which returns only the dog names.
Listing 14-11 shows how to work with namespaces in an XML document.
Listing 14-11. Failing to Find Content in a Document with Unregistered Namespaces Using XPath
<?php
error_reporting(E_ALL ^ E_NOTICE);
$xml = <<<THE_XML
<animals xmlns:dog="http://foobar.com/dog" xmlns:cat="http://foobar.com/cat" >
<dog:name>snoopy</dog:name>
<dog:color>brown</dog:color>
<dog:breed>beagle cross</dog:breed>
<cat:name>teddy</cat:name>
<cat:color>brown</cat:color>
<cat:breed>tabby</cat:breed>
<dog:name>jade</dog:name>
<dog:color>black</dog:color>
<dog:breed>lab cross</dog:breed>
</animals>
THE_XML;
$xml_object = simplexml_load_string($xml);
$names = $xml_object->xpath("name");
foreach ($names as $name) {
print $name . "<br/>";
}
?>
Running Listing 14-11, which contains namespaces, outputs nothing. When running XPath, we need to register our namespace. See Listing 14-12.
Listing 14-12. Finding Content in a Document with Registered Namespaces Using XPath
<?php
error_reporting(E_ALL ^ E_NOTICE);
$xml = <<<THE_XML
<animals xmlns:dog="http://foobar.com/dog" xmlns:cat="http://foobar.com/cat" >
<dog:name>snoopy</dog:name>
<dog:color>brown</dog:color>
<dog:breed>beagle cross</dog:breed>
<cat:name>teddy</cat:name>
<cat:color>brown</cat:color>
<cat:breed>tabby</cat:breed>
<dog:name>jade</dog:name>
<dog:color>black</dog:color>
<dog:breed>lab cross</dog:breed>
</animals>
THE_XML;
$xml_object = simplexml_load_string($xml);
$xml_object->registerXPathNamespace('cat', 'http://foobar.com/cat');
$xml_object->registerXPathNamespace('dog', 'http://foobar.com/dog');
$names = $xml_object->xpath("dog:name");
foreach ($names as $name) {
print $name . "<br/>";
}
?>
The output of Listing 14-12 is the following:
snoopy
jade
In Listing 14-12, after registering the namespace with XPath, we need to prefix it to our query elements.
In Listing 14-13, we will use XPath to match an element by value. Then we will read an attribute value of this element.
Listing 14-13. Finding an Attribute Value of an Element with a Certain Value Using XPath
<?php
error_reporting(E_ALL);
$xml = <<<THE_XML
<animals>
<dog>
<name id="1">snoopy</name>
<color>brown</color>
<breed>beagle cross</breed>
</dog>
<cat>
<name id="2">teddy</name>
<color>brown</color>
<breed>tabby</breed>
</cat>
<dog>
<name id="3">jade</name>
<color>black</color>
<breed>lab cross</breed>
</dog>
</animals>
THE_XML;
$xml_object = simplexml_load_string($xml);
$result = $xml_object->xpath("dog/name[contains(., 'jade')]");
print (String)$result[0]->attributes()->id;
?>
In Listing 14-13, we use the XPath function contains, which takes two parameters: the first is where to search ('.' stands for the current node), and the second is the search string. This function has a (haystack, needle) parameter format. We then receive a matching SimpleXMLElement and output its id attribute.
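The contains function matches substrings, which is not always what we want. As a sketch of a few alternatives from the XPath 1.0 function set, the following uses an exact comparison, the starts-with function, and a numeric test on the attribute itself, all against a shortened copy of the same document.

```php
<?php
$xml_object = simplexml_load_string(
    '<animals><dog><name id="1">snoopy</name></dog>'
    . '<cat><name id="2">teddy</name></cat>'
    . '<dog><name id="3">jade</name></dog></animals>'
);

// exact match on the element value
$exact = $xml_object->xpath("dog/name[. = 'jade']");
print (string) $exact[0]['id'] . "\n";   // 3

// value begins with a given prefix, for any parent element
$prefix = $xml_object->xpath("*/name[starts-with(., 'te')]");
print (string) $prefix[0]['id'] . "\n";  // 2

// numeric comparison on the attribute
$by_attr = $xml_object->xpath("*/name[@id > 1]");
print count($by_attr) . "\n";            // 2
```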
XPath is very powerful, and anyone familiar with a high-level JavaScript library such as jQuery will find much of the selector syntax familiar. Learning XPath and the DOM will save you a lot of time and make your scripts more dependable.
Really Simple Syndication (RSS) provides an easy method to publish and subscribe to content as a feed.
Any RSS feed will do, but take as an example the feed from the magazine Wired. The feed is available at http://feeds.wired.com/wired/index?format=xml.
The source of the feed looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?>
<?xml-stylesheet type="text/css" media="screen" href="http://feeds.wired.com
/~d/styles/itemcontent.css"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:feedburner=
"http://rssnamespace.org/feedburner/ext/1.0" version="2.0">
<channel>
<title>Wired Top Stories</title>
<link>http://www.wired.com/rss/index.xml</link>
<description>Top Stories&lt;img src="http://www.wired.com/rss_views
/index.gif"&gt;</description>
<language>en-us</language>
<copyright>Copyright 2007 CondeNet Inc. All rights reserved.</copyright>
<pubDate>Sun, 27 Feb 2011 16:07:00 GMT</pubDate>
<category />
<dc:creator>Wired.com</dc:creator>
<dc:subject />
<dc:date>2011-02-27T16:07:00Z</dc:date>
<dc:language>en-us</dc:language>
<dc:rights>Copyright 2007 CondeNet Inc. All rights reserved.</dc:rights>
<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self"
type="application/rss+xml" href="http://feeds.wired.com/wired/index" /><feedburner:info
uri="wired/index" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub"
href="http://pubsubhubbub.appspot.com/" />
<item>
<title>Peers Or Not? Comcast And Level 3 Slug It Out At FCC's Doorstep</title>
<link>http://feeds.wired.com/~r/wired/index/~3/QJQ4vgGV4qM/</link>
<description>the first description</description>
<pubDate>Sun, 27 Feb 2011 16:07:00 GMT</pubDate>
<guid isPermaLink="false">http://www.wired.com/epicenter/2011/02
/comcast-level-fcc/</guid>
<dc:creator>Matthew Lasar</dc:creator>
<dc:date>2011-02-27T16:07:00Z</dc:date>
<feedburner:origLink>http://www.wired.com/epicenter/2011/02
/comcast-level-fcc/</feedburner:origLink></item>
<item>
<title>360 Cams, AutoCAD and Miles of Fiber: Building an Oscars Broadcast</title>
<link>http://feeds.wired.com/~r/wired/index/~3/vFb527zZQ0U/</link>
<description>the second description</description>
<pubDate>Sun, 27 Feb 2011 00:19:00 GMT</pubDate>
<guid isPermaLink="false">http://www.wired.com/underwire/2011/02
/oscars-broadcast/</guid>
<dc:creator>Terrence Russell</dc:creator>
<dc:date>2011-02-27T00:19:00Z</dc:date>
<feedburner:origLink>http://www.wired.com/underwire/2011/02
/oscars-broadcast/</feedburner:origLink></item>
…
…
…
</channel>
</rss>
For brevity, the descriptions have been replaced. You can see that an RSS document is just XML. Many libraries exist for parsing content from an XML feed and we show how to parse this feed using SimplePie in Chapter 10 - Libraries. However, with your knowledge of XML, you can easily parse the content yourself.
Listing 14-14 is an example that builds a table with just the essentials from the feed: the article title, which links to the full article; the creator of the document; and the publish date. Notice that in the XML, the creator element is under a namespace, so we retrieve it with XPath. The output is shown in Figure 14-1.
Listing 14-14. Parsing the Wired RSS Feed: wired_rss.php
<table>
<tr><th>Story</th><th>Date</th><th>Creator</th></tr>
<?php
error_reporting(E_ALL);
$xml = simplexml_load_file("http://feeds.wired.com/wired/index?format=xml");
foreach($xml->channel->item as $item){
print "<tr><td><a href='".$item->link."'>".$item->title."</a></td>";
print "<td>".$item->pubDate."</td>";
$creator_by_xpath = $item->xpath("dc:creator");
print "<td>".(String)$creator_by_xpath[0]."</td></tr>";
//equivalent creator, using children function instead of xpath function
//$creator_by_namespace = $item->children('http://purl.org/dc/elements/1.1/')->creator;
//print "<td>".(String)$creator_by_namespace[0]."</td></tr>";
}
?>
</table>
Figure 14-1. Output of our RSS feed parser from Listing 14-14
In Listing 14-14, we used XPath to get the creator element, which belongs to the namespace dc.
We could also have retrieved the children of our $item element that belong to a particular namespace. This is a two-step process. First, we have to find what dc represents:
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">
The second step is passing this namespace address as a parameter to the children function:
$creator_by_namespace = $item->children('http://purl.org/dc/elements/1.1/')->creator;
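The two steps can be tried without a network connection. The sketch below uses a cut-down, made-up feed with the same dc namespace, and looks the namespace address up with getDocNamespaces() rather than hard-coding it.

```php
<?php
$feed = <<<THE_XML
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0">
<channel>
<item>
<title>First story</title>
<dc:creator>Matthew Lasar</dc:creator>
</item>
<item>
<title>Second story</title>
<dc:creator>Terrence Russell</dc:creator>
</item>
</channel>
</rss>
THE_XML;

$xml = simplexml_load_string($feed);

// step 1: find what the dc prefix represents
$namespaces = $xml->getDocNamespaces();
$dc_uri = $namespaces['dc'];

// step 2: ask each item for its children in that namespace
$creators = array();
foreach ($xml->channel->item as $item) {
    $dc = $item->children($dc_uri);
    $creators[] = (string) $dc->creator;
    print $item->title . " by " . $dc->creator . "\n";
}
```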
So far, we have used SimpleXML exclusively to parse existing XML. However, we can also use it to generate an XML document from existing data. This data could be in the form of an array, an object, or a database.
To programmatically create an XML document, we need to create a new SimpleXMLElement that will point to our document root. Then we can add child elements to the root, and child elements of these children. See Listing 14-15.
Listing 14-15. Generating a Basic XML Document with SimpleXML
<?php
error_reporting(E_ALL ^ E_NOTICE);
//generate the xml, starting with the root
$animals = new SimpleXMLElement('<animals/>');
$animals->{0} = 'Hello World';
$animals->asXML('animals.xml');
//verify no errors with our newly created output file
var_dump(simplexml_load_file('animals.xml'));
?>
Outputs:
object(SimpleXMLElement)[2]
string 'Hello World' (length=11)
And produces the file animals.xml with contents:
<?xml version="1.0"?>
<animals>Hello World</animals>
Listing 14-15 creates a root element, <animals>, assigns a value to it, and calls the method asXML to save to a file. To test that this worked, we then load the saved file and output the contents. Ensure that you have write access to the file location.
In Listing 14-16, which is the complement of Listing 14-4, we have animal data stored as arrays and want to create an XML document from the information.
Listing 14-16. Generating a Basic XML Document with SimpleXML
<?php
error_reporting(E_ALL ^ E_NOTICE);
//our data, stored in arrays
$dogs_array = array(
array("name" => "snoopy",
"color" => "brown",
"breed" => "beagle cross"
),
array("name" => "jade",
"color" => "black",
"breed" => "lab cross"
),
);
$cats_array = array(
array("name" => "teddy",
"color" => "brown",
"breed" => "tabby"
),
);
//generate the xml, starting with the root
$animals = new SimpleXMLElement('<animals/>');
$cats_xml = $animals->addChild('cats');
$dogs_xml = $animals->addChild('dogs');
foreach ($cats_array as $c) {
$cat = $cats_xml->addChild('cat');
foreach ($c as $key => $value) {
$tmp = $cat->addChild($key);
$tmp->{0} = $value;
}
}
foreach ($dogs_array as $d) {
$dog = $dogs_xml->addChild('dog');
foreach ($d as $key => $value) {
$tmp = $dog->addChild($key);
$tmp->{0} = $value;
}
}
var_dump($animals);
$animals->asXML('animals.xml');
print '<br/><br/>';
//verify no errors with our newly created output file
var_dump(simplexml_load_file('animals.xml'));
?>
In Listing 14-16, we create a new SimpleXMLElement root with the call new SimpleXMLElement('<animals/>'). To populate our document from the top-level elements downward, we create children by calling addChild and store a reference to the newly created element. Using the element reference, we can add child elements. By repeating this process, we can generate an entire tree of nodes.
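Since the cat and dog loops differ only in their names, the pattern can be folded into one function. The following is a sketch under our own assumptions: the helper name addRows is hypothetical, and it passes the value directly to addChild, escaping it with htmlspecialchars first because addChild expects characters such as & to be escaped already.

```php
<?php
// Hypothetical helper: adds one <$tag> element per row,
// with one child element per key/value pair in the row.
function addRows(SimpleXMLElement $parent, $tag, array $rows) {
    foreach ($rows as $row) {
        $child = $parent->addChild($tag);
        foreach ($row as $key => $value) {
            // escape &, < and > so addChild() receives well-formed content
            $child->addChild($key, htmlspecialchars($value));
        }
    }
}

$animals = new SimpleXMLElement('<animals/>');
addRows($animals->addChild('cats'), 'cat', array(
    array("name" => "teddy", "breed" => "tabby"),
));
addRows($animals->addChild('dogs'), 'dog', array(
    array("name" => "salt & pepper", "breed" => "lab cross"),
));

print $animals->asXML();
```

The property assignment used in Listing 14-16 ($tmp->{0} = $value) avoids the escaping question altogether; which style to prefer is largely a matter of taste.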
Unfortunately, the output function asXML() does not format our output nicely at all. Everything appears as a single line. To get around this, we can use the DOMDocument class, which we will discuss later in this chapter, to output the XML nicely.
$animals_dom = new DOMDocument('1.0');
$animals_dom->preserveWhiteSpace = false;
$animals_dom->formatOutput = true;
//returns a DOMElement
$animals_dom_xml = dom_import_simplexml($animals);
$animals_dom_xml = $animals_dom->importNode($animals_dom_xml, true);
$animals_dom_xml = $animals_dom->appendChild($animals_dom_xml);
$animals_dom->save('animals_formatted.xml');
This code creates a new DOMDocument object and sets it to format the output. Then we import the SimpleXMLElement object into a new DOMElement object. We import the node recursively into our document and then save the formatted output to a file. Replacing the asXML call in Listing 14-16 with the above code results in clean, nested output:
<?xml version="1.0"?>
<animals>
<cats>
<cat>
<name>teddy</name>
<color>brown</color>
<breed>tabby</breed>
</cat>
</cats>
<dogs>
<dog>
<name>snoopy</name>
<color>brown</color>
<breed>beagle cross</breed>
</dog>
<dog>
<name>jade</name>
<color>black</color>
<breed>lab cross</breed>
</dog>
</dogs>
</animals>
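A shorter variant is sketched below. Instead of importing the node into a second document, it asks the imported DOMElement for the document it already belongs to and turns formatting on there. This assumes the tree was built in code, as in Listing 14-16, so that it contains no whitespace-only text nodes that would interfere with the indentation.

```php
<?php
$animals = new SimpleXMLElement(
    '<animals><cats><cat><name>teddy</name></cat></cats></animals>'
);

// the DOMElement shares its underlying document with the SimpleXMLElement
$dom = dom_import_simplexml($animals)->ownerDocument;
$dom->formatOutput = true;

print $dom->saveXML(); // or $dom->save('animals_formatted.xml') to write a file
```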
Note SimpleXML can also import DOM objects via the function simplexml_import_dom.
<?php
error_reporting(E_ALL);
$dom_xml = new DOMDocument();
$dom_xml->loadXML("<root><name>Brian</name></root>");
$simple_xml = simplexml_import_dom($dom_xml);
print $simple_xml->name; // Brian
?>
In Listing 14-17, we will generate an RSS sample with namespaces and attributes. Our goal will be to output an XML document with the following structure:
<?xml version="1.0" ?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0">
<channel>
<title>Brian’s RSS Feed</title>
<description>Brian’s Latest Blog Entries</description>
<link>http://www.briandanchilla.com/node/feed</link>
<lastBuildDate>Fri, 04 Feb 2011 00:11:08 +0000</lastBuildDate>
<pubDate>Fri, 04 Feb 2011 08:25:00 +0000</pubDate>
<item>
<title>Pretend Topic</title>
<description>Pretend description</description>
<link>http://www.briandanchilla.com/pretend-link/</link>
<guid>unique generated string</guid>
<dc:pubDate>Fri, 04 Feb 2011 08:25:00 +0000</dc:pubDate>
</item>
</channel>
</rss>
Listing 14-17. Generating a RSS Document with SimpleXML
<?php
error_reporting(E_ALL);
$items = array(
array(
"title" => "a",
"description" => "b",
"link" => "c",
"guid" => "d",
"lastBuildDate" => "",
"pubDate" => "e"),
array(
"title" => "a2",
"description" => "b2",
"link" => "c2",
"guid" => "d2",
"lastBuildDate" => "",
"pubDate" => "e2"),
);
$rss_xml = new SimpleXMLElement('<rss xmlns:dc="http://purl.org/dc/elements/1.1/"/>');
$rss_xml->addAttribute('version', '2.0');
$channel = $rss_xml->addChild('channel');
foreach ($items as $item) {
$item_tmp = $channel->addChild('item');
foreach ($item as $key => $value) {
if ($key == "pubDate") {
$tmp = $item_tmp->addChild($key, $value, "http://purl.org/dc/elements/1.1/");
} else if($key == "lastBuildDate") {
//Format will be: Fri, 04 Feb 2011 00:11:08 +0000
$tmp = $item_tmp->addChild($key, date('r', time()));
} else {
$tmp = $item_tmp->addChild($key, $value);
}
}
}
//for nicer formatting
$rss_dom = new DOMDocument('1.0');
$rss_dom->preserveWhiteSpace = false;
$rss_dom->formatOutput = true;
//returns a DOMElement
$rss_dom_xml = dom_import_simplexml($rss_xml);
$rss_dom_xml = $rss_dom->importNode($rss_dom_xml, true);
$rss_dom_xml = $rss_dom->appendChild($rss_dom_xml);
$rss_dom->save('rss_formatted.xml');
?>
The key lines in Listing 14-17 are setting the namespace in the root element, $rss_xml = new SimpleXMLElement('<rss xmlns:dc="http://purl.org/dc/elements/1.1/"/>'); passing the namespace address to addChild when the item key is pubDate; and generating an RFC 2822 formatted date when the key is lastBuildDate.
The contents of the file after running Listing 14-17 will be similar to this:
<?xml version="1.0"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0">
<channel>
<item>
<title>a</title>
<description>b</description>
<link>c</link>
<guid>d</guid>
<lastBuildDate>Fri, 27 May 2011 01:20:04 +0200</lastBuildDate>
<dc:pubDate>e</dc:pubDate>
</item>
<item>
<title>a2</title>
<description>b2</description>
<link>c2</link>
<guid>d2</guid>
<lastBuildDate>Fri, 27 May 2011 01:20:04 +0200</lastBuildDate>
<dc:pubDate>e2</dc:pubDate>
</item>
</channel>
</rss>
Note More information on SimpleXML can be found at http://php.net/manual/en/book.simplexml.php.
To troubleshoot an XML document that is not validating, you can use the online validator at http://validator.w3.org/check.
As mentioned at the start of the chapter, SimpleXML is by no means the only option available for XML manipulation in PHP. Another popular XML extension is the DOM. We have already seen, in its output formatting, that DOMDocument has features that SimpleXML lacks. DOMDocument is more powerful than SimpleXML but, as you would expect, is not as straightforward to use.
The majority of the time you would probably choose SimpleXML over the DOM. However, the DOM extension offers additional features, including distinct node types.
With SimpleXML, all nodes are the same, so an element uses the same underlying object as an attribute. The DOM has separate node types, including XML_ELEMENT_NODE, XML_ATTRIBUTE_NODE, and XML_TEXT_NODE. Depending on the type, the corresponding object properties are tagName for elements, name and value for attributes, and nodeName and nodeValue for text.
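A short sketch makes the three node types and their properties visible. It walks from an element to its attribute and its text node in a one-line document made up for the purpose.

```php
<?php
$dom = new DOMDocument();
$dom->loadXML('<animal><type id="9">dog</type></animal>');

// element node
$type = $dom->documentElement->firstChild;
var_dump($type->nodeType === XML_ELEMENT_NODE);  // bool(true)
print $type->tagName . "\n";                     // type

// attribute node
$id = $type->attributes->getNamedItem("id");
var_dump($id->nodeType === XML_ATTRIBUTE_NODE);  // bool(true)
print $id->name . " = " . $id->value . "\n";     // id = 9

// text node
$text = $type->firstChild;
var_dump($text->nodeType === XML_TEXT_NODE);     // bool(true)
print $text->nodeName . ": " . $text->nodeValue . "\n"; // #text: dog
```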
//creating a DOMDocument object
$dom_xml = new DOMDocument();
DOMDocument can load XML from a string or a file, or import it from a SimpleXML object.
//from a string
$dom_xml->loadXML('the full xml string');
// from a file
$dom_xml->load('animals.xml');
// imported from a SimpleXML object
$dom_element = dom_import_simplexml($simplexml);
$dom_element = $dom_xml->importNode($dom_element, true);
$dom_element = $dom_xml->appendChild($dom_element);
To navigate a DOM object, you use object-oriented property accesses and method calls like these:
$dom_xml->item(0)->firstChild->nodeValue
$dom_xml->childNodes
$dom_xml->parentNode
$dom_xml->getElementsByTagName('div');
There are several save functions available: save, saveHTML, saveHTMLFile, and saveXML.
DOMDocument has a validate function to check whether a document is legal. To use XPath with the DOM, you need to construct a new DOMXPath object.
$xpath = new DOMXPath($dom_xml);
To better illustrate the difference between the SimpleXML and DOM extensions, the next two examples using the DOM are equivalent to examples done earlier in the chapter using SimpleXML.
Listing 14-18 outputs all of the animal names and the type of animal in brackets. It is equivalent to Listing 14-9 which used SimpleXML.
Listing 14-18. Finding Elements with DOM
<?php
error_reporting(E_ALL ^ E_NOTICE);
$xml = <<<THE_XML
<animals>
<dog>
<name>snoopy</name>
<color>brown</color>
<breed>beagle cross</breed>
</dog>
<cat>
<name>teddy</name>
<color>brown</color>
<breed>tabby</breed>
</cat>
<dog>
<name>jade</name>
<color>black</color>
<breed>lab cross</breed>
</dog>
</animals>
THE_XML;
$xml_object = new DOMDocument();
$xml_object->loadXML($xml);
$xpath = new DOMXPath($xml_object);
$names = $xpath->query("*/name");
foreach ($names as $element) {
    $parent_type = $element->parentNode->nodeName;
    echo "$element->nodeValue ($parent_type)<br/>";
}
?>
Notice that in Listing 14-18 we need to construct a DOMXPath object and then call its query method. Unlike in Listing 14-9, we can directly access the parent. Finally, observe that we access node values and names as properties in the previous listing, and through method calls in Listing 14-9.
Listing 14-19 shows how to search for an element value and then find an attribute value of the element. It is the DOM equivalent of Listing 14-13.
Listing 14-19. Searching for Element and Attribute Values with DOM
<?php
error_reporting(E_ALL);
$xml = <<<THE_XML
<animals>
<dog>
<name id="1">snoopy</name>
<color>brown</color>
<breed>beagle cross</breed>
</dog>
<cat>
<name id="2">teddy</name>
<color>brown</color>
<breed>tabby</breed>
</cat>
<dog>
<name id="3">jade</name>
<color>black</color>
<breed>lab cross</breed>
</dog>
</animals>
THE_XML;
$xml_object = new DOMDocument();
$xml_object->loadXML($xml);
$xpath = new DOMXPath($xml_object);
$results = $xpath->query("dog/name[contains(., 'jade')]");
foreach ($results as $element) {
    print $element->attributes->getNamedItem("id")->nodeValue;
}
?>
The main thing to note in Listing 14-19 is that with the DOM, we use attributes->getNamedItem("id")->nodeValue to read the value of the id attribute. With SimpleXML, in Listing 14-13, we used attributes()->id.
The XMLReader and XMLWriter extensions are often used together. They are more difficult to use than the SimpleXML or DOM extensions. However, for very large documents, XMLReader and XMLWriter are a good choice (often the only choice), because the reader and writer are event based and do not require the entire document to be loaded into memory. The trade-off is that, because the XML is never loaded all at once, you should know the exact schema of the XML beforehand.
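On the writing side, a minimal XMLWriter sketch might look like the following. The $animals array is illustrative data assumed for this example; the document is built in memory here, whereas a very large document would more likely be written straight to a file with openUri.

```php
<?php
// Illustrative data (assumed for this sketch).
$animals = [
    ['type' => 'dog', 'name' => 'snoopy'],
    ['type' => 'cat', 'name' => 'teddy'],
];

$writer = new XMLWriter();
$writer->openMemory();                  // buffer output in memory
$writer->startDocument('1.0', 'UTF-8');
$writer->startElement('animals');

foreach ($animals as $animal) {
    $writer->startElement($animal['type']);         // opens <dog> or <cat>
    $writer->writeElement('name', $animal['name']); // <name>...</name>
    $writer->endElement();                          // closes <dog> or <cat>
}

$writer->endElement();                  // closes <animals>
$writer->endDocument();

$output = $writer->outputMemory();      // fetch the buffered XML
echo $output;
```

Each element is emitted as soon as it is written, so the writer never needs to hold a tree of the whole document.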
You can obtain most values with XMLReader by repeatedly calling read(), looking up the nodeType, and obtaining the value.
Listing 14-20 is the XMLReader equivalent of Listing 14-4, which uses SimpleXML.
Listing 14-20. Finding Elements with XMLReader
<?php
error_reporting(E_ALL ^ E_NOTICE);
$xml = <<<THE_XML
<animals>
<dog>
<name>snoopy</name>
<color>brown</color>
<breed>beagle cross</breed>
</dog>
<cat>
<name>teddy</name>
<color>brown</color>
<breed>tabby</breed>
</cat>
<dog>
<name>jade</name>
<color>black</color>
<breed>lab cross</breed>
</dog>
</animals>
THE_XML;
$xml_object = new XMLReader();
$xml_object->XML($xml);
$dog_parent = false;
while ($xml_object->read()) {
    if ($xml_object->nodeType == XMLReader::ELEMENT) {
        if ($xml_object->name == "cat") {
            $dog_parent = false;
        } elseif ($xml_object->name == "dog") {
            $dog_parent = true;
        } elseif ($xml_object->name == "name" && $dog_parent) {
            $xml_object->read();
            if ($xml_object->nodeType == XMLReader::TEXT) {
                print $xml_object->value . "<br/>";
                $dog_parent = false;
            }
        }
    }
}
?>
Notice that Listing 14-20 contains no namespace elements or usage of XPath and is still complex.
A useful XMLReader method is expand(). It returns a copy of the current node as a DOMNode, which means that you can then search the subtree by tag name.
$subtree = $xml_reader->expand();
$breeds = $subtree->getElementsByTagName('breed');
Of course, you would only want to do this on subtrees that are not very large themselves. XMLReader and XMLWriter are much more complex than the tree-based extensions; XMLReader alone has around twenty node types. Because they are harder to use than SimpleXML and the DOM, XMLReader and XMLWriter are generally used only when necessary.
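A complete sketch of the expand() technique, using a shortened animals document, could look like this. Importing the expanded node into a fresh DOMDocument is a choice made for this example, so that the copy belongs to a document before it is searched; it is not something the chapter requires.

```php
<?php
$xml = '<animals>'
     . '<dog><name>snoopy</name><breed>beagle cross</breed></dog>'
     . '<cat><name>teddy</name><breed>tabby</breed></cat>'
     . '</animals>';

$xml_reader = new XMLReader();
$xml_reader->XML($xml);

$dom    = new DOMDocument(); // holder document for the expanded copies
$breeds = [];

while ($xml_reader->read()) {
    // Only expand the small <dog> subtrees, not the whole document.
    if ($xml_reader->nodeType == XMLReader::ELEMENT && $xml_reader->name == 'dog') {
        $subtree = $dom->importNode($xml_reader->expand(), true);
        foreach ($subtree->getElementsByTagName('breed') as $breed) {
            $breeds[] = $breed->nodeValue;
        }
    }
}
$xml_reader->close();

print_r($breeds);
```

The reader still streams through the document; only one dog subtree at a time is turned into DOM nodes.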
XML is a very useful tool for communicating and storing data. The cross-language, platform-independent nature of XML makes it ideal for many applications. XML documents can range from simple and straightforward to very complex, with elaborate schemas and multiple namespaces.
In this chapter, we gave an overview of XML. Then we covered parsing and generating XML documents with the SimpleXML extension. SimpleXML makes working with XML very easy while still being powerful. We showed how to find element and attribute values and how to handle namespaces.
Though SimpleXML is the best solution for most documents, alternatives such as the DOM and XMLReader should be used when appropriate. For XHTML documents, the DOM makes sense, and for very large documents, XMLReader and XMLWriter might be the only viable option. In any case, it pays to know more than one XML parser.