Processing XML with the DOM, SAX, .NET, XSLT, and XPath

So you have XML as a syntax for defining the format of business documents. However, that is not enough; you need other technologies to work with the business documents created using XML. These technologies are the DOM, SAX, .NET, XSLT, and XPath.

The Document Object Model (DOM)

You might want to use the DOM (Document Object Model) to place an XML document in memory and manipulate it programmatically. Using the DOM, you can load an XML purchase order into memory and use a program to establish what the purchase order is for, who it needs to go to, where it should be billed, and so on.
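
As a rough illustration of that idea, the sketch below loads a small purchase order into memory and reads a few values back out. It uses Python's standard xml.dom.minidom module rather than MSXML, the element names (ShipTo, BillTo, Item) are borrowed from the listings later in this section, and the element_text and summarize_order helpers are hypothetical names chosen for this example.

```python
from xml.dom import minidom

# Sample purchase order; the structure mirrors the listings in this section.
PURCHASE_ORDER = (
    "<PurchaseOrder>"
    "<ShipTo>Wild Meadow Tea Company, 555 Wellington Way</ShipTo>"
    "<BillTo>Wild Meadow Tea Company, 555 Wellington Way</BillTo>"
    "<Items><Item name='1 Case Chai Tea'/></Items>"
    "</PurchaseOrder>"
)


def element_text(document, tag_name):
    """Return the text of the first element with the given tag name."""
    element = document.getElementsByTagName(tag_name)[0]
    return "".join(
        child.data for child in element.childNodes
        if child.nodeType == child.TEXT_NODE
    )


def summarize_order(xml_text):
    """Load the whole document into memory, then query it (hypothetical helper)."""
    document = minidom.parseString(xml_text)
    return {
        "ship_to": element_text(document, "ShipTo"),
        "bill_to": element_text(document, "BillTo"),
        "items": [
            item.getAttribute("name")
            for item in document.getElementsByTagName("Item")
        ],
    }


summary = summarize_order(PURCHASE_ORDER)
print(summary["ship_to"])
print(summary["items"])
```

The whole document is parsed before any query runs, which is the defining trait of the DOM approach.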

The importance of the DOM makes it necessary to look at an example. In this case, we will build an XML document using the DOM and pass it to an ASP (Active Server Pages) page using the XMLHTTP object available via the MSXML3 parser (see Listing 2.2). This is just a sample of the JavaScript used to manipulate the DOM. To see a working example of this in an actual HTML document, see Listing 2.3.

Note

First, Listing 2.2 is a client-side application that only works with IE 5.0 (and above) with the MSXML parser installed.

Second, Listing 2.2 uses the 3.0 version of the MSXML parser. If you are using an earlier version and choose not to upgrade, simply replace every instance of the ("Msxml2.DOMDocument.3.0") and ("Msxml2.XMLHTTP") object references with the ("Msxml.DOMDocument") and ("Msxml.XMLHTTP") objects, respectively. The latest MSXML parser is free, and it is simple to install. It is available at http://msdn.microsoft.com/xml/general/xmlparser.asp.


Listing 2.2. Parsing XML using the DOM
function buildXML() {
    var oXmlDocument = new ActiveXObject("Msxml2.DOMDocument.3.0");
    var oXmlRoot = oXmlDocument.createElement("PurchaseOrder");
    oXmlDocument.appendChild(oXmlRoot);
    var oShipTo = oXmlDocument.createElement("ShipTo");
    oXmlRoot.appendChild(oShipTo);
    oShipTo.text = "'Wild Meadow Tea Company,' 555 Wellington Way, New City, VA, 55555";
    var oBillTo = oXmlDocument.createElement("BillTo");
    oXmlRoot.appendChild(oBillTo);
    oBillTo.text = "'Wild Meadow Tea Company,' 555 Wellington Way, New City, VA, 55555";
    var oItems = oXmlDocument.createElement("Items");
    oXmlRoot.appendChild(oItems);
    var oItem = oXmlDocument.createElement("Item");
    oItems.appendChild(oItem);
    oItem.setAttribute("name", "1 Case Chai Tea");
    var oHTTP = new ActiveXObject("Msxml2.XMLHTTP");
    oHTTP.open ("POST", "http://myServer/application/processXML.asp", false);
    oHTTP.send (oXmlDocument.xml);
}

In the second line of Listing 2.2, we instantiate the ActiveX object. Two threading models are available for the parser: the rental-threaded model (which we are using in this code sample) and the free-threaded model. The rental-threaded model is optimal when you don't expect more than one thread to access the document at any one time. The free-threaded model is best when multiple threads, such as those serving several users at once, are expected to access the XML-enabled application simultaneously.

When you find yourself loading an XML document into memory, you may want to set the async property. This property, when set to false, declares that the processing should be done synchronously, as opposed to asynchronously. It would be set directly under the second line of Listing 2.2 and look like this:

oXmlDocument.async = false; 

The result is that the script will not continue processing until the entire XML document has been loaded into memory. Because we are creating a document in memory rather than loading one, it is not necessary to set the property. Just remember that errors will occur if processing continues before the XML instance has been completely downloaded, simply because you won't have a complete document to process. Once the document is completely downloaded, parsing can continue.

Note

You can use the readyState property to manage asynchronous downloads, but that is beyond the scope of this sample. You can read more about that in the MSXML SDK available from Microsoft.


In the code snippet below, the createElement method is used to generate an element in the DOM. Once generated, the element needs to be appended to the correct parent node. You do this by referencing the parent node object and using the appendChild method. The element is then populated (that is, text is added) using the text property of the newly created element object.

var oShipTo = oXmlDocument.createElement("ShipTo"); 
oXmlRoot.appendChild(oShipTo);
oShipTo.text = "'Wild Meadow Tea Company,' 555 Wellington Way, New City, VA, 55555";

Another useful method is the setAttribute method. Whenever you need to add an attribute to an element, you can use this method. The following code from Listing 2.2 shows this example.

oItem.setAttribute("name", "1 Case Chai Tea"); 

Other things that you might want to do with the DOM are deleting parts of the XML document, or manipulating them and saving them back to the server. Should you decide to save the XML, the save method is available. You can even save the document to a URL by using the following syntax:

oXmlDocument.save("http://myServer/application/folder/myFile.xml"); 

This saves the DOM document named myFile.xml to a folder named folder.
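
Deleting or changing part of a document before it is saved can be sketched as follows. This is a minimal illustration in Python's xml.dom.minidom, not MSXML; the sample document and the replacement value are made up for the example, and the serialized string stands in for whatever would be written by a save call.

```python
from xml.dom import minidom

document = minidom.parseString(
    "<PurchaseOrder>"
    "<ShipTo>Wild Meadow Tea Company</ShipTo>"
    "<BillTo>Wild Meadow Tea Company</BillTo>"
    "</PurchaseOrder>"
)
root = document.documentElement

# Delete the BillTo element by asking its parent to remove it.
bill_to = root.getElementsByTagName("BillTo")[0]
root.removeChild(bill_to)

# Change an existing value in place by editing the element's text node.
ship_to = root.getElementsByTagName("ShipTo")[0]
ship_to.firstChild.data = "ACME Tea Company"

# Serialize the modified tree; this string is what would be written out.
serialized = root.toxml()
print(serialized)
```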

However, if you want to process the XML document after it is on the server, instead of just saving it to a local directory, you probably will want to use the XMLHTTP object. In the following code snippet (from Listing 2.2), the first line instantiates the XMLHTTP object on the client and provides a way to send an XML document to the server to be processed by an ASP page.

var oHTTP = new ActiveXObject("Msxml2.XMLHTTP");
oHTTP.open ("POST", "http://myServer/application/processXML.asp", false);
oHTTP.send (oXmlDocument.xml);

The open method enables you to define whether this is an HTTP POST or GET to the server. The method requires the HTTP method, a URL to the application, and a true or false parameter to dictate whether this is an asynchronous transmission. The send method is used to initiate the transmission and deliver the XML to the application.
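
The same open-then-send pattern can be sketched outside the browser. The Python below uses the standard urllib.request module rather than XMLHTTP and only builds the POST request. The URL is the placeholder from Listing 2.2, the Content-Type header is an assumption, and nothing is actually transmitted.

```python
import urllib.request

xml_payload = "<PurchaseOrder><ShipTo>Wild Meadow Tea Company</ShipTo></PurchaseOrder>"

# Equivalent of open(): choose the HTTP method and the target URL.
request = urllib.request.Request(
    "http://myServer/application/processXML.asp",  # placeholder URL from Listing 2.2
    data=xml_payload.encode("utf-8"),              # the XML body to deliver
    headers={"Content-Type": "text/xml"},          # assumed content type
    method="POST",
)

# Equivalent of send(): urllib.request.urlopen(request) would block until the
# server responds, much like passing false as the third argument to open().
print(request.get_method(), request.full_url)
```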

Look at the same code now, with a few more methods in place, to see how the whole process might work (see Listing 2.3). In this case, Wild Meadow Tea Company is requesting a few items from ACME Tea Company. The idea is that a procurement officer at Wild Meadow will submit the order online and receive a response from ACME regarding the receipt of the order.

Listing 2.3. Generating XML with the DOM
<html>
<head>
<script>
function buildXML() {
    var oXmlDocument = new ActiveXObject("Msxml2.DOMDocument.3.0");
    oXmlDocument.async = false;
    var oXmlRoot = oXmlDocument.createElement("PurchaseOrder");
    oXmlDocument.appendChild(oXmlRoot);
    var oShipTo = oXmlDocument.createElement("ShipTo");
    oXmlRoot.appendChild(oShipTo);
    oShipTo.text = "'Wild Meadow Tea Company,' 555 Wellington " +
        "Way, New City, VA, 55555";
    var oBillTo = oXmlDocument.createElement("BillTo");
    oXmlRoot.appendChild(oBillTo);
    oBillTo.text = "'Wild Meadow Tea Company,' 555 Wellington " +
        "Way, New City, VA, 55555";
    var oItems = oXmlDocument.createElement("Items");
    oXmlRoot.appendChild(oItems);
    var oItem = oXmlDocument.createElement("Item");
    oItems.appendChild(oItem);
    oItem.setAttribute("name", document.all("item").value);
    var oHTTP = new ActiveXObject("Msxml2.XMLHTTP");
    oHTTP.open ("POST", "http://localhost/listing2_4.asp", false);
    oHTTP.send (oXmlDocument.xml);
    alert (oHTTP.responseText);
}
</script>
</head>
<body>
<h1>Online Tea Order</h1>
<input type="text" name="item" />
<input type="button" value="Submit Order" onclick="buildXML();" />
</body>
</html>

Notice in Listing 2.3 that the purchase order is being built dynamically using the DOM. The structure of the XML is created using methods such as createElement, setAttribute, and appendChild. Then text is added to those elements by way of the text property. However, in the case of attributes, notice that the value of the attribute is added as the second parameter in the setAttribute method. The first parameter in this method represents the name of the attribute, whereas the second—as stated before—represents the value of that attribute.

When this file is submitted, using the buildXML() function and the XMLHTTP object, the following ASP page (shown in Listing 2.4) processes the request and responds with an acknowledgment message:

Listing 2.4. Server-Side Response to XML POST
<%@ Language = "JavaScript" %>
<%
var oXmlDocument = Server.CreateObject("Msxml2.FreeThreadedDOMDocument.3.0");
oXmlDocument.async = false;
oXmlDocument.load(Request);
var err = oXmlDocument.parseError;
if (err.errorCode != 0) {
    Response.Write ("There was an error processing your order. Please try again!");
} else {
    Response.Write ("Your order has been received.  Thank you!");
}
%>

Notice that the XML request is being captured by the Request object in the ASP page. To load the document, all you need to do—when receiving a POST—is use the load method, as shown here:

oXmlDocument.load(Request); 

From there, you can process the document as required by the business rules. Using BizTalk Server, these rules will be enforced by the components integrated with BizTalk and defined by the visual tools that come as part of the BizTalk package.
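
The load-then-check logic of Listing 2.4 can be sketched in a few lines. This is a Python analogue, not ASP: the acknowledge function is a hypothetical name, and a parsing exception plays the role that a nonzero parseError.errorCode plays in MSXML.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def acknowledge(request_body):
    """Return an acknowledgment message for a posted XML document."""
    try:
        # Parsing fails with ExpatError when the body is not well-formed XML.
        minidom.parseString(request_body)
    except ExpatError:
        return "There was an error processing your order. Please try again!"
    return "Your order has been received.  Thank you!"


print(acknowledge("<PurchaseOrder><Items/></PurchaseOrder>"))
print(acknowledge("<PurchaseOrder>"))
```

Only well-formedness is checked here; any business rules would be applied after the document loads successfully.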

This has given you a good feel for the DOM and how to process XML in a simple but efficient manner. The following section describes another XML processor called SAX, the Simple API for XML. However, we will not go into much detail with regard to SAX because it appears that Microsoft may be developing an improved API to be used within the .NET framework.

SAX

Another technology that you might want to use to access particular pieces of an XML business document is SAX (Simple API for XML). SAX is a great way to parse—or process—an XML document without having to load it into memory.

Note

The DOM requires an XML instance to be loaded into memory. When working with large XML documents, this can overburden the application and cause slow results at best. SAX is an event-driven parser that does not require the XML document to be loaded into memory, and thus is optimal for bigger jobs requiring the parsing of larger XML documents.


SAX is an event-driven parser that looks for particular pieces of the XML document—that is, nodes. When it finds what it is looking for, SAX processes it. However, SAX is not capable of manipulating a document in the same fashion as the DOM, so people typically use the DOM if they want to manipulate an XML document, and SAX if they want to simply find what they are looking for in the XML and pass it off to another process.
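
Because this section contains no SAX listing, here is a minimal sketch of the event-driven style. It uses Python's standard xml.sax module, which follows the SAX design; the ItemNameHandler class and the sample orders are assumptions made for this illustration.

```python
import xml.sax


class ItemNameHandler(xml.sax.ContentHandler):
    """Collects the name attribute of every Item element as it streams past."""

    def __init__(self):
        super().__init__()
        self.item_names = []

    def startElement(self, name, attrs):
        # The parser calls this method for each opening tag it encounters.
        if name == "Item":
            self.item_names.append(attrs.getValue("name"))


ORDERS = (
    b"<PurchaseOrders>"
    b"<PurchaseOrder number='2023'><Items><Item name='1 Box Chai Tea'/></Items></PurchaseOrder>"
    b"<PurchaseOrder number='2024'><Items><Item name='4 Boxes Ginkgo Tea'/></Items></PurchaseOrder>"
    b"</PurchaseOrders>"
)

handler = ItemNameHandler()
xml.sax.parseString(ORDERS, handler)
print(handler.item_names)
```

The handler keeps only the values it is interested in; no tree of the document is ever built.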

Note

If you are going to be working with Microsoft's .NET technology, you're going to encounter a few alternative parsing technologies. The DOM will remain, but SAX may become obsolete given that Microsoft has developed another event-driven parser that uses a “pull” model rather than the SAX-related “push” model. Because of this fact, we will leave SAX for now and discuss the .NET model in greater depth.


XML Parsing in .NET

It comes as no surprise for most of you to learn that XML is a core technology in the Microsoft .NET framework, but just how this technology is implemented might still be a mystery. The good thing is that what you just learned about the DOM was not a waste of time. The .NET framework uses the DOM to provide access to data in XML documents and additional classes to read, write, and navigate through those documents. The DOM is also used to map XML to relational data in ADO.NET. We will look at an example of how XML is processed using the System.Xml namespace; but before we do, it is worth understanding one key aspect of .NET XML parsing.

As mentioned previously, the DOM is great for manipulating smaller XML instances, but when it comes to larger data sets, an event-driven parser that doesn't require you to load XML into memory is an absolute necessity. So, building on the innovative ideas that SAX brought us, .NET introduces a new set of classes called XmlReader and XmlWriter.

These new classes are both easy to learn and functional. But don't take my word for it; let's take a look at some code. Listing 2.5 is a C# code sample that searches for a Purchase Order number and alerts you when it finds the correct one.

Listing 2.5. Using XmlTextReader
using System;
using System.IO;
using System.Xml;

public class Test2 {
    public static void Main() {
        Test2 tester = new Test2();
    }

    public Test2() {
        StringReader stream = null;
        XmlTextReader reader = null;
        try {
            String sXmlDoc = "<?xml version='1.0'?>" +
                "<PurchaseOrders>" +
                "<PurchaseOrder number='2023'>" +
                "<Items><Item name='1 Box Chai Tea' /></Items>" +
                "</PurchaseOrder>" +
                "<PurchaseOrder number='2024'>" +
                "<Items><Item name='4 Boxes Ginkgo Tea' /></Items>" +
                "</PurchaseOrder>" +
                "<PurchaseOrder number='2025'>" +
                "<Items><Item name='10 Boxes Ginseng Tea' /></Items>" +
                "</PurchaseOrder>" +
                "<PurchaseOrder number='2026'>" +
                "<Items><Item name='1 Box Green Tea' /></Items>" +
                "</PurchaseOrder>" +
                "<PurchaseOrder number='2027'>" +
                "<Items><Item name='1 Box Black Tea' /></Items>" +
                "</PurchaseOrder>" +
                "</PurchaseOrders>";
            stream = new StringReader(sXmlDoc);
            reader = new XmlTextReader(stream);
            Console.WriteLine("Parsing . . .");
            while (reader.Read()) {
                String qs = reader.GetAttribute("number");
                if (qs != null) {
                    if (qs == "2026") {
                        Console.WriteLine("=====================================\n");
                        Console.WriteLine("Your purchase order number {0} has been found:\n", qs);
                        reader.MoveToElement();
                        Console.WriteLine("Current PO Data Goes Here...\n");
                        Console.WriteLine("=====================================");
                    } else {
                        Console.WriteLine("Purchase Order Number: {0}", qs);
                    }
                }
            }
        }
        catch (Exception e) {
            Console.WriteLine("###Exception: {0}", e.ToString());
        }
    }
}

The code in Listing 2.5 is simple really, and even if you don't know C#, you can at least get a feel for how you can parse XML in .NET. Realize that this is not the only way we could have done this, but just one of the many possible ways to use .NET to parse an XML document. Let's get into the code.

For the purposes of this section, ignore the lines from using System through try {. At the line that begins String sXmlDoc =, notice that we are generating some XML syntax as a string and then wrapping it in a StringReader object called stream in stream = new StringReader(sXmlDoc);. The XmlTextReader parses this object and looks for a Purchase Order number that corresponds with the search criteria. This query string could easily be submitted through some type of form to get the matching Purchase Order back, but for this sample we just hard-code it.

To compile the code, use the following statement at the command line. Remember to run it from the same location as your code:

csc /r:System.Xml.dll listing2_5.cs 

When the parser stumbles on the correct Purchase Order number, it alerts us and passes the entire Purchase Order node off to our processing application. This passing off of the node is not shown but would not be difficult to implement using either the DOM or the XmlTextWriter class. The output of this program is shown in Listing 2.6.

Listing 2.6. Result of the XmlTextReader Example
Parsing . . .
Purchase Order Number: 2023
Purchase Order Number: 2024
Purchase Order Number: 2025
===============================================

Your purchase order number 2026 has been found:

Current PO Data Goes Here...

===============================================
Purchase Order Number: 2027

The great thing about this type of processing is that each independent XML node is parsed and forgotten. There is no need to persist the data in memory because the data is no longer required by the application.
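
The same parse-and-forget pattern can be sketched with Python's standard xml.dom.pulldom module, which also lets the caller pull events one at a time. This is an analogue of Listing 2.5 rather than a port of it: the find_order function is a hypothetical helper, and expanding the matching node into a small subtree is one possible way to hand the order off to another process.

```python
from xml.dom import pulldom

ORDERS = (
    "<PurchaseOrders>"
    "<PurchaseOrder number='2025'><Items><Item name='10 Boxes Ginseng Tea'/></Items></PurchaseOrder>"
    "<PurchaseOrder number='2026'><Items><Item name='1 Box Green Tea'/></Items></PurchaseOrder>"
    "<PurchaseOrder number='2027'><Items><Item name='1 Box Black Tea'/></Items></PurchaseOrder>"
    "</PurchaseOrders>"
)


def find_order(xml_text, wanted_number):
    """Pull events one at a time and return the matching order as XML text."""
    events = pulldom.parseString(xml_text)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "PurchaseOrder":
            if node.getAttribute("number") == wanted_number:
                # Only the matching order is expanded into a small DOM subtree.
                events.expandNode(node)
                return node.toxml()
    return None


match = find_order(ORDERS, "2026")
print(match)
```

Orders that do not match are skipped without their child elements ever being assembled into a tree.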

You have looked at three different models for parsing XML, but what if you needed to format the XML data into an open format such as HTML or Wireless Markup Language (WML), or even a proprietary format such as PDF? XSLT may be your answer, and the next section shows you why.

XSLT and XPath

Apart from parsing an XML document, there is also the need to format and query XML data. You may have heard XSLT and XPath mentioned in this context before. XSLT (eXtensible Stylesheet Language for Transformations) is a language that can be used to format and transform XML.

For example, say that you have an XML file that needs to be viewed by two types of devices, a standard PC using a Web browser and a wireless device such as a cell phone or PDA. To accomplish this, you could write two XSLT style sheets—one to transform the XML into HTML and the other to transform it into WML (Wireless Markup Language).

Typically, XSLT is used in conjunction with CSS (cascading style sheets). In such cases, XSLT is used to define the structure of the HTML, and CSS takes care of the actual formatting: for example, font sizes, coloration, and so on.

XPath can be used within an XSLT file to extract precise pieces of XML data required by the application. XPath can be thought of as SQL for XML. It is a query language that enables developers to pinpoint the exact path to data in an XML document that an application requires for processing.
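
To give a feel for path-style queries, the sketch below runs a few of them from Python. It uses the standard xml.etree.ElementTree module, which supports only a limited subset of XPath (attribute values, for instance, are read with get rather than selected with an @ step). The shortened news document is an assumption modeled on the sample used later in this section.

```python
import xml.etree.ElementTree as ET

NEWS = (
    "<TodaysNews date='20010313'>"
    "<Story id='001' title='Techs Rally'><div><p>First story.</p></div></Story>"
    "<Story id='002' title='Smithsonian Has Donor'><div><p>Second story.</p></div></Story>"
    "</TodaysNews>"
)

root = ET.fromstring(NEWS)

# Attribute of the root element; full XPath would express this as /TodaysNews/@date.
news_date = root.get("date")

# Path expression with a predicate: the Story whose id attribute is 002.
story = root.find("Story[@id='002']")

# Descendant search: every p element anywhere below the root.
paragraphs = [p.text for p in root.findall(".//p")]

print(news_date, story.get("title"), paragraphs)
```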

Here we look at a sample of how XSLT and XPath can be used to transform XML into HTML. Listing 2.7 is an example of how a news agency might store its data as XML.

Listing 2.7. XML News Document
<?xml version="1.0"?>
<TodaysNews date="20010313">
<Story id="001" title="Techs Rally, But Still An Uncertain Market">
<div>
<p>Technology stocks rebounded in late morning trading on Tuesday as investors
prowled the market in search of bargains, but blue chips slipped as tension
ran high on Wall Street, a day after a gut-wrenching selloff.</p>
<p>"The underlying force in this market is the economic slowdown: How long and
how deep?" said Peter Coolidge, managing director of equity trading at Brean
Murray &amp; Co. "It's going to be tough going until the market gets a handle
on that. There will be bumps up along the way, but that's all there will be in
an otherwise down-trending market."</p>
</div>
</Story>
<Story id="002" title="Smithsonian Has Donor for George Washington Portrait">
<div>
<p>The Smithsonian Institution said Tuesday a single donor had come forward to
enable it to buy a famous portrait of President George Washington whose British
owner had demanded $20 million for the painting.</p>
<p>The portrait, painted by Gilbert Stuart in 1796, has been on loan to the
Smithsonian's National Portrait Gallery in Washington from flamboyant British
aristocrat Lord Dalmeny.</p>
</div>
</Story>
</TodaysNews>

Now we will build an XSLT file that can be used to transform this document into HTML. Let's look at the entire XSLT file first (see Listing 2.8) and then break it apart and investigate the style sheet line-by-line.

Listing 2.8. XSLT: XML to HTML
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/">
<html>
<head>
<title>Today's News</title>
</head>
<body>
<h3>Today's News</h3>
<h4>
<script language="JavaScript"><![CDATA[
var sDate = "]]><xsl:value-of select="TodaysNews/@date" /><![CDATA[";
var sYear = sDate.substring(0,4);
var sMonth = sDate.substring(4,6);
var sDay = sDate.substring(6,8);
document.write(sMonth + "/" + sDay + "/" + sYear);
]]></script>
</h4>
<xsl:for-each select="TodaysNews/Story">
<h5><xsl:value-of select="@title" /></h5>
<xsl:copy-of select="div" />
</xsl:for-each>
</body>
</html>
</xsl:template>
</xsl:stylesheet>

If you are new to XSLT, have no fear. Doing basic XML manipulation using the syntax is simple. Don't get me wrong; XSLT can get complex, but we'll just deal with the basic syntax for now.

The first thing to realize is that XSLT is an XML-based language. The style sheet declaration (<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">) tells the processing application that this particular XSLT conforms to the 1999 specification of XSL for Transformations. You can think of this as a form of versioning for XSLT documents.

The line <xsl:template match="/"> defines an XSLT template. A template defines a reusable set of rules for generating output in whatever format you choose. You only see one template in this document, but you can create as many as you want, depending on the kinds of nodes you need to handle. In the case of this document, the template corresponds to the root node of the XML document. This means that when the XSLT processor encounters the root node (there is only one in every XML document; remember that the root node is the top element in an XML document), the instructions between the <xsl:template match="/"> and the </xsl:template> tags are processed.

Within the template are a number of HTML elements and attributes, but you will also find certain XSL elements interspersed throughout the tags. To account for this, XML uses namespaces to differentiate between different XML vocabularies (see the special section on namespaces later in the chapter). In this case, the xsl: prefix denotes that the tag is an XSLT element and should be processed as such by the XSLT processor. In some cases, an XSLT processor and an XML parser will be separate components; however, the MSXML3 parser does both XML and XSLT processing.

An interesting thing occurs in the line <script language="JavaScript"><![CDATA[. You should notice something called a CDATA section, which is delimited by the <![CDATA[ and ]]> structures. The CDATA structure tells the parser to treat these “Character Data” sections as literal text and not process them as markup. That is why the CDATA section is closed in the line var sDate = "]]><xsl:value-of select="TodaysNews/@date" /><![CDATA[";, allowing the value-of tag to be processed and the value inserted into the JavaScript. As soon as the XSLT tag is parsed, we begin the CDATA section again until the script section is complete. The only thing the <xsl:value-of /> tag does is extract the data from a particular node. You define the node by inserting an XPath statement into the select attribute. For instance:

<xsl:value-of select="TodaysNews/@date" /> 

This statement extracts the value of the date attribute in the TodaysNews element. In the preceding sample, the value of the date attribute is 20010313. We extract that value using the XPath statement and use the JavaScript substring method to break the date apart into a more recognizable format—for example, 03/13/2001.

The first line in the following code snippet (from Listing 2.8) introduces the for-each element in XSLT. The purpose of this element is to loop through a particular element list. In this case, we are looping through every instance of the Story node. Every time the processor runs across another <Story> tag, the processor executes the following rules:

<xsl:for-each select="TodaysNews/Story"> 
 <h5><xsl:value-of select="@title" /></h5>
 <xsl:copy-of select="div" />
</xsl:for-each>

The value of the title attribute is inserted into an <h5> element and the entire contents of the XML <div> tag are copied straight to the HTML output. This direct copy is done via the copy-of XSLT element. In this case, not only is the text in the XML <div> tag copied to the HTML output, but the XML <p> tags are also copied. This way, there is no need to reformat the contents under the <div> tag because they can be transferred directly to the HTML output in a clean, concise manner.

It's worth mentioning, though, that this only works for an HTML transformation. Imagine, for example, that you are transforming the XML into WML for display on a wireless device. The WML specification does not allow for <div> tags. Therefore, an XSLT document that is going to be used to transform this XML into WML would need to do something else with the XML <div> tag. For instance, you might change the XML <div> to a WML <p> tag, using the following XSLT:

<xsl:template match="div">
 <p><xsl:value-of select="." /></p>
</xsl:template>

You would need to call this template from somewhere within the root XSLT template, using the <xsl:apply-templates> element. After you do, this code extracts the text of the XML <div> tag, including the text of its child nodes, and places it in a WML <p> element. If all this seems a bit confusing (especially if you have not worked with WML), don't worry. After some practice, it will all become much clearer to you. Remember, XSLT is only in version 1.0, so this is a great time to get started with it before it becomes much more complicated.

With these six parts of the XML family (XML 1.0, DOM, SAX, .NET, XSLT, and XPath), you can accomplish almost any task required of an XML application. Using your programming language of choice (for example, JavaScript, Perl, Java, C++, Visual Basic, C#, Python, and so on), all you need to do is find the right parser for the job. The MSXML3 parser is an excellent processor, and Microsoft has done an outstanding job at making this parser conform to W3C standards; however, other excellent ones are available that have been hand-crafted by the XML community to work with every language you can think of. The Java community has perhaps the best arsenal of XML processors on the market, but if you are a Microsoft developer, like me, I highly recommend MSXML 3.0.
