Chapter 20. XML

WHAT'S IN THIS CHAPTER?

  • Making F# work with XML

  • Using LINQ-to-XML with F#

  • Understanding how F# and DOM work together

  • Applying active patterns for XML parsing

XML is a fact of life for most software developers, regardless of language. It is found everywhere, from configuration files holding nothing but a small number of settings, to the giant stores of weather data provided in XML format by the National Oceanic and Atmospheric Administration of the United States. Although history will judge whether having all this XML was a good idea, the fact remains that it's there. Love it or hate it, developers have to deal with it.

In this chapter, you learn how to use F# with XML processing tools such as those found in the System.Xml namespace in the .NET framework. This chapter covers techniques that simplify most XML processing tasks. Special attention is provided to demonstrate how you can use things such as XPath and active patterns to simplify the task of processing and transforming XML.

OVERVIEW

Boiled down to its essence, most things that deal with XML do the following:

  • Read XML, typically to store it in some sort of language representation

  • Query the XML, often again to extract parts of it into some sort of language representation

  • Process the XML, sometimes to produce more XML or to generate some side effect

  • Persist the XML, so that other programs can join in the fun of working with XML

For more details on the various W3C standards that make up the XML grammar and suite of specifications, we recommend the W3C website.

F# AND LINQ-TO-XML

Of the various means to read and process XML, LINQ-to-XML, introduced in .NET 3.5, is likely the simplest way to query and process XML in F#.

Reading

Imagine writing an application that reads a weather forecast from the Internet and presents an average expected temperature to the user. When using the weather service in the Yahoo developer API to read the weather report, you might request the response from the Yahoo API URL http://weather.yahooapis.com/forecastrss?w=2484280. The response that results from that call reads as follows:

<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<rss version="2.0" xmlns:yweather="http://xml.weather.yahoo.com/ns/rss/1.0"
xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#">
<channel>

<title>Yahoo! Weather - Romeoville, IL</title>
<link>http://us.rd.yahoo.com/dailynews/rss/weather/Romeoville__IL/
*http://weather.yahoo.com/forecast/USIL1019_f.html</link>
<description>Yahoo! Weather for Romeoville, IL</description>
<language>en-us</language>
<lastBuildDate>Sun, 20 Dec 2009 4:44 pm CST</lastBuildDate>
<ttl>60</ttl>
<yweather:location city="Romeoville" region="IL"   country="United States"/>

<yweather:units temperature="F" distance="mi" pressure="in" speed="mph"/>
<yweather:wind chill="27"   direction="0"   speed="0" />
<yweather:atmosphere humidity="83"  visibility="5"  pressure="30.07"  rising="1" />
<yweather:astronomy sunrise="7:15 am"   sunset="4:24 pm"/>
<image>
<title>Yahoo! Weather</title>
<width>142</width>
<height>18</height>
<link>http://weather.yahoo.com</link>
<url>http://l.yimg.com/a/i/us/nws/th/main_142b.gif</url>
</image>
<item>

<title>Conditions for Romeoville, IL at 4:44 pm CST</title>
<geo:lat>41.64</geo:lat>
<geo:long>-88.08</geo:long>
<link>http://us.rd.yahoo.com/dailynews/rss/weather/Romeoville__IL/
*http://weather.yahoo.com/forecast/USIL1019_f.html</link>
<pubDate>Sun, 20 Dec 2009 4:44 pm CST</pubDate>
<yweather:condition  text="Cloudy"  code="26"  temp="27"  date="Sun, 20 Dec 2009
4:44 pm CST" />
<description><![CDATA[
<img src="http://l.yimg.com/a/i/us/we/52/26.gif"/><br />
<b>Current Conditions:</b><br />
Cloudy, 27 F<BR />
<BR /><b>Forecast:</b><BR />
Sun - Light Snow Early. High: 29 Low: 23<br />
Mon - Cloudy. High: 28 Low: 25<br />
<br />
<a href="http://us.rd.yahoo.com/dailynews/rss/weather/Romeoville__IL/
*http://weather.yahoo.com/forecast/USIL1019_f.html">Full Forecast at Yahoo!
Weather</a><BR/><BR/>
(provided by <a href="http://www.weather.com" >The Weather Channel</a>)<br/>
]]></description>
<yweather:forecast day="Sun" date="20 Dec 2009" low="23" high="29"
text="Light Snow Early" code="14" />
<yweather:forecast day="Mon" date="21 Dec 2009" low="25" high="28"
text="Cloudy" code="26" />
<guid isPermaLink="false">USIL1019_2009_12_20_16_44_CST</guid>
</item>

</channel>
</rss><!-- api6.weather.ac4.yahoo.com compressed/chunked Sun Dec 20 15:02:47 PST
2009 -->

Although reading this whole response manually and parsing it might be an interesting exercise, LINQ-to-XML with F# provides us a much simpler means to perform this task. Start off by reading the result of the URL into an XDocument, from LINQ-to-XML:

let weatherXml =
    "http://weather.yahooapis.com/forecastrss?w=2484280" |>
    XDocument.Load

The simplest way to start reading XML when you have a valid XDocument reference is to simply inspect the elements and do something with them:

let readTheElements =
  weatherXml.Elements() |>
  Seq.iter ( fun(e) -> do printfn "%s" e.Value )

The preceding expression takes the elements at the root of the document (in this case, the single rss element at the top level) and prints the Value of each to standard output (in this case, the console). Because the Value property of an XElement concatenates the text content of all of its descendants, the output contains text from every level of the document, not just the top level. The sequence of elements is passed to Seq.iter, which applies the function to each XElement in the sequence. The function being applied takes the element, prints its value to the screen, and returns unit.
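To make the difference between Elements(), Descendants(), and Value concrete, here is a minimal sketch. It uses a small inline document (hypothetical sample data, not the real feed) so that it runs without network access:

```fsharp
open System.Xml.Linq

// A tiny inline document (hypothetical sample data) standing in for the feed.
let sampleDoc =
    XDocument.Parse("<rss><channel><title>Weather</title><ttl>60</ttl></channel></rss>")

// Elements() on the document yields only the root element.
let topLevelNames =
    sampleDoc.Elements()
    |> Seq.map (fun e -> e.Name.LocalName)
    |> List.ofSeq

// Descendants() walks the whole tree in document order.
let allNames =
    sampleDoc.Descendants()
    |> Seq.map (fun e -> e.Name.LocalName)
    |> List.ofSeq

// Value concatenates the text of every descendant text node.
let rootValue = sampleDoc.Root.Value

printfn "%A" topLevelNames
printfn "%A" allNames
printfn "%s" rootValue
```

Only one element comes back from Elements(), yet printing its Value still shows text from deeper levels, which is why the earlier output appears to cover the whole document.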

Of course, the question frequently comes up about what the order of the elements will be when doing queries against XML in this manner. LINQ-to-XML returns elements in document order, and in the preceding example order is not relevant to the task, so no explicit ordering is specified. That said, because the order within the XML is usually controlled by implementation-specific factors at the site that produces it, it is usually a bad idea to depend on the implicit order of the document. Should we want to control the order of processing, the preferred approach is to take the sequence that results from Elements() and process it through Seq.sortBy with the appropriate explicit key function.
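As a sketch of explicit ordering, the following sorts elements by a numeric attribute instead of relying on the order in which the producer emitted them. The inline document and the names forecastDoc and daysByHigh are illustrative assumptions:

```fsharp
open System
open System.Xml.Linq

// Hypothetical sample data, deliberately not sorted by temperature.
let forecastDoc =
    XDocument.Parse("""<forecasts>
  <forecast day="Mon" high="28" />
  <forecast day="Sun" high="29" />
  <forecast day="Tue" high="25" />
</forecasts>""")

// Sort by the numeric value of the "high" attribute, then project the day names.
let daysByHigh =
    forecastDoc.Root.Elements()
    |> Seq.sortBy (fun e -> Int32.Parse(e.Attribute(XName.Get("high")).Value))
    |> Seq.map (fun e -> e.Attribute(XName.Get("day")).Value)
    |> List.ofSeq

printfn "%A" daysByHigh
```

Parsing the attribute before sorting matters here: sorting the raw strings would order them lexically rather than numerically.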

Although we can certainly do a lot of interesting work simply reading XML and processing it somehow, the real power of LINQ to XML is in the query capability. In fact, if you are simply reading XML and doing your own queries using for loops and if statements, you are not really leveraging the power of F# or LINQ to XML to make your code more readable and efficient.

Querying

LINQ-to-XML was built to make code that queries XML documents more readable. Recall the previous short snippet of F# that accesses the URL that contains an upcoming weather forecast and then parses it into an XML document for further use:

let weatherXml =
    "http://weather.yahooapis.com/forecastrss?w=2484280" |>
    XDocument.Load

The preceding statement binds weatherXml to an XDocument representing a weather forecast for Romeoville, Illinois. XDocument is the root of most LINQ-to-XML operations and has factory methods to create an XDocument instance from a variety of different sources:

METHOD            PURPOSE
XDocument.Load    Create an XDocument based on a URI, TextReader, XmlReader, or stream.
XDocument.Parse   Create an XDocument based on a string of XML content.

The resulting XDocument provides a base from which further queries can be performed. For example, to produce a sequence of all elements that have forecast as their local name (that is, the element name without its namespace), we apply a filter to the descendant XElements of the XDocument, using the Seq.filter higher-order function:

let weatherXml =
    "http://weather.yahooapis.com/forecastrss?w=2484280" |>
    XDocument.Load
let forecastElements =
    weatherXml.Descendants()
    |> Seq.filter( fun(e) -> e.Name.LocalName = "forecast" )
                                                                    

The Descendants() method on the XDocument bound to weatherXml returns a sequence of all the XElements below that given node. When called on an XDocument, the sequence represents all the XElements in that XDocument.
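Filtering on LocalName ignores namespaces entirely. When the namespace matters, Descendants also accepts a fully qualified XName. The following sketch contrasts the two on a small inline document; the sample XML and the value names are assumptions made for illustration:

```fsharp
open System.Xml.Linq

// Hypothetical sample: two namespaced forecast elements and one unqualified one.
let feed =
    XDocument.Parse("""<rss xmlns:yweather="http://xml.weather.yahoo.com/ns/rss/1.0">
  <channel>
    <item>
      <yweather:forecast day="Sun" low="23" high="29" />
      <yweather:forecast day="Mon" low="25" high="28" />
      <forecast day="other" />
    </item>
  </channel>
</rss>""")

let yweather = XNamespace.Get("http://xml.weather.yahoo.com/ns/rss/1.0")

// XNamespace + string produces a fully qualified XName.
let qualifiedForecasts =
    feed.Descendants(yweather + "forecast") |> List.ofSeq

// Filtering on LocalName also matches the unqualified element.
let localNameForecasts =
    feed.Descendants()
    |> Seq.filter (fun e -> e.Name.LocalName = "forecast")
    |> List.ofSeq

printfn "%d qualified, %d by local name" qualifiedForecasts.Length localNameForecasts.Length
```

For the weather feed the two approaches select the same elements, so the LocalName filter used in this chapter is a reasonable shortcut.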

If we then assume that we want to find all the attributes of all the forecast elements, this is done easily by pipelining one more line of code:

let weatherXml =
    "http://weather.yahooapis.com/forecastrss?w=2484280" |>
    XDocument.Load
let allForecastAttributes  =
    weatherXml.Descendants()
    |> Seq.filter( fun(e) -> e.Name.LocalName = "forecast" )
    |> Seq.collect ( fun(e) -> e.Attributes() )
                                                                    

Seq.collect gathers the attributes from each element and flattens them into one sequence. A good way to think about it is to imagine a series of piles of attributes, each pile relating to the element from which it came; mapping each element to its attributes alone (as Seq.map would) yields a sequence of sequences, one per pile. Seq.collect performs that mapping and then the equivalent of Seq.concat in a single step, stacking all the separate "piles" into a single stack of attributes.

From here, we may be interested only in attributes that relate to a low or high temperature in a forecast. To gather those attributes, we add the following:

let weatherXml =
    "http://weather.yahooapis.com/forecastrss?w=2484280" |>
    XDocument.Load
let allHighAndLowAttributes =
    weatherXml.Descendants()
    |> Seq.filter( fun(e) -> e.Name.LocalName = "forecast" )
    |> Seq.collect ( fun(e) -> e.Attributes() )
    |> Seq.filter( fun(a) -> a.Name.LocalName = "high" ||
                             a.Name.LocalName = "low" )
                                                                       

The value allHighAndLowAttributes is a sequence of XAttribute objects, namely, the attributes named "high" and "low", which are what we need for computing the expected average temperature over all the forecasts provided by the Yahoo weather service. From here, if we want to compute the average, we simply need to convert the values from string to double and then compute the average:

let weatherXml =
    "http://weather.yahooapis.com/forecastrss?w=2484280" |>
    XDocument.Load
let averageForecastTemp =
    weatherXml.Descendants()
    |> Seq.filter( fun(e) -> e.Name.LocalName = "forecast" )
    |> Seq.collect ( fun(e) -> e.Attributes() )
    |> Seq.filter( fun(a) -> a.Name.LocalName = "high" ||
                             a.Name.LocalName = "low" )
    |> Seq.map ( fun(a) -> a.Value |> Double.Parse )
    |> Seq.average
                                                                       

Two statements are added here. First, the Seq.map statement takes the Value property of each attribute and parses it into a double. Note that there is an assumption here that the attribute value is numeric for the sake of clarity; we could also use Double.TryParse to filter out any attributes with a non-numeric value.
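A minimal sketch of the Double.TryParse alternative follows. The helper name tryParseDouble and the sample values are assumptions for illustration, and the invariant culture is chosen here so that parsing does not depend on the machine's regional settings:

```fsharp
open System
open System.Globalization

// Hypothetical helper: Some value when the text parses as a number, None otherwise.
let tryParseDouble (text: string) =
    match Double.TryParse(text, NumberStyles.Float, CultureInfo.InvariantCulture) with
    | true, value -> Some value
    | _ -> None

// Sample attribute values, one of them deliberately non-numeric.
let rawValues = [ "29"; "23"; "n/a"; "28"; "25" ]

// Seq.choose keeps only the values that parsed successfully.
let average =
    rawValues
    |> Seq.choose tryParseDouble
    |> Seq.average

printfn "%g" average
```

Seq.average still raises an exception on an empty sequence, so a feed with no numeric values at all would need separate handling.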

The second statement added takes the resulting doubles and produces the average using Seq.average.

Of course, the readability and reusability of this leaves much to be desired, so a quick refactoring where the functions are more properly named and where the magic number in the URI is removed reads as follows:

let getYahooWeather id =
  let weatherXml =
    sprintf "http://weather.yahooapis.com/forecastrss?w=%d" id |>
    XDocument.Load
  let byForecastElements (e:XElement) =
    e.Name.LocalName = "forecast"
  let elementToAttributes (e:XElement) =
    e.Attributes()
  let byHighAndLowAttributes (a:XAttribute) =
    a.Name.LocalName = "high" || a.Name.LocalName = "low"
  let attributeValueToDouble (a:XAttribute) =
    a.Value |> Double.Parse

  let averageForecastTemp =
    weatherXml.Descendants()
    |> Seq.filter byForecastElements
    |> Seq.collect elementToAttributes
    |> Seq.filter byHighAndLowAttributes
    |> Seq.map attributeValueToDouble
    |> Seq.average
  averageForecastTemp

let avgRomeovllTmp = 2484280 |> getYahooWeather
printfn "Avg Forecast Temp for Romeoville, IL is %g" avgRomeovllTmp
                                                                       

Although this last step does not add much functionality to the query, it does make it simpler to understand what the query is doing. It also provides a means to perhaps compose other queries out of the parts that were used to build this one.

Processing

Of course, simply taking a forecast from a source and republishing it is not terribly useful. Many applications that deal with XML want to process the XML in some way that adds value to the original document.

In this use case, the goal is to take the weather information from Romeoville, IL, and add some useful information about such a "vacation destination." Let's start again by writing a function that creates the weather XDocument based on the Yahoo "Where on Earth ID" of a locality:

let getYahooWeather id =
    sprintf "http://weather.yahooapis.com/forecastrss?w=%d" id |>
    XDocument.Load

To add the information, we need to create an XElement that represents information about a community:

let makeCommunityInfoElement (s:string) =
    new XElement(XName.Get("communityinfo"),s)

One notable thing about this statement is that if s is not constrained to string, the F# compiler cannot determine which XElement constructor overload to call. This is because there are two constructors that take an XName as the first parameter combined with a second parameter.

When we have these tools, all that is needed is a function that will insert an XElement in some location in the document that makes sense. If what is wanted is an element to be inserted after all the individual forecast elements, write the following:

let insertCommunityInfo (doc:XDocument) (commInfo:XElement) =
  let last sequence =
    sequence
    |> Seq.skip( (sequence |> Seq.length) - 1 )
    |> Seq.head
  let lastForecast =
     doc.Descendants()
     |> Seq.filter (fun(e) -> e.Name.LocalName = "forecast")
     |> last
  do commInfo |> lastForecast.AddAfterSelf
                                                                   

One thing that is quickly found is that there is no built-in way here to take the last item of a seq<'a>, so one is written by hand. (F# 4.0 and later do provide Seq.last in the core library.) Doing so is a matter of skipping all but the final item using Seq.skip, passing it a count of Seq.length - 1, and then taking the Seq.head of the resulting one-item sequence. Note that this enumerates the sequence twice and fails on an empty sequence.

When a workable implementation of last is written, we can much more easily get the last forecast by taking all elements via doc.Descendants(), filtering by forecast, and taking the last item using last.

With lastForecast, the next step is to take the commInfo XElement and pass it to the AddAfterSelf method. In this case, the do syntax is used to more clearly specify that what follows will cause a side effect. AddAfterSelf returns unit, so the line would compile without do, but using the do syntax more explicitly signals to the reader of the code that we are doing something that causes a side effect, namely, mutating the XML structure by adding an element.
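On F# 4.0 or later, the hand-written last helper can be replaced by the core library. The following sketch is a hypothetical variant (tryInsertAfterLastForecast is not from the original listing) that uses Seq.tryLast, so a document with no forecast elements is reported rather than raising an exception:

```fsharp
open System.Xml.Linq

// Hypothetical sample document with two forecasts followed by another element.
let doc =
    XDocument.Parse("""<item>
  <forecast day="Sun" />
  <forecast day="Mon" />
  <guid>abc</guid>
</item>""")

// Returns true when the element was inserted, false when no forecast exists.
let tryInsertAfterLastForecast (doc: XDocument) (newElement: XElement) =
    let lastForecast =
        doc.Descendants()
        |> Seq.filter (fun e -> e.Name.LocalName = "forecast")
        |> Seq.tryLast
    match lastForecast with
    | Some forecast ->
        forecast.AddAfterSelf(newElement)
        true
    | None -> false

let inserted =
    tryInsertAfterLastForecast doc (XElement(XName.Get("communityinfo"), "Two Sushi Bars"))

// The new element should now sit between the last forecast and guid.
let childNames =
    doc.Root.Elements() |> Seq.map (fun e -> e.Name.LocalName) |> List.ofSeq

printfn "%b %A" inserted childNames
```

The query is fully enumerated by Seq.tryLast before the tree is mutated, which avoids modifying the document while a lazy sequence over it is still being read.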

Next, put this all together using the following:

let townDoc = 2484280 |> getYahooWeather
let townInfo =
    "Fine Bedroom Community with Two Sushi Bars" |>
    makeCommunityInfoElement

do insertCommunityInfo townDoc townInfo

Here the code is simply getting an XDocument to manipulate, getting an XElement to add to the tree, and then affixing the new XElement to the XDocument using insertCommunityInfo. Again, the do statement is used to indicate that we expect the line to cause a side effect.

This is just one example of many things you can do to process XML. The following methods can also be called to provide various means of processing the XML tree.

METHOD               PURPOSE
Add                  Adds the specified content as a child
AddAfterSelf         Adds the specified content immediately after this node
AddAnnotation        Adds an object to the annotation list
AddBeforeSelf        Adds the specified content immediately before this node
AddFirst             Adds the specified content as the first child
Remove               Removes this node from its parent
RemoveAll            Removes all nodes and attributes
RemoveAnnotations    Removes all annotations
RemoveAttributes     Removes all attributes
RemoveNodes          Removes all nodes
ReplaceAll           Replaces the child nodes and the attributes of this element with the specified content
ReplaceAttributes    Replaces the attributes of this element with the specified content
ReplaceNodes         Replaces the children nodes with the specified content
ReplaceWith          Replaces this node with the specified content
SetAttributeValue    Sets the value of an attribute, adds an attribute, or removes an attribute
SetElementValue      Sets the value of a child element, adds a child element, or removes a child element
SetValue             Sets the value of this element

Reference: http://msdn.microsoft.com/

Of course, the standard warning about programming with side effects applies to any code that mutates the XML document using these methods. Using these methods with async workflows or any other technology, such as PLINQ, that may perform the operations in a different order can result in bugs that are hard to re-create.

Writing

In the .NET framework, writing XML uses similar techniques, nearly all of which involve having a method that does XML serialization using the XmlWriter class. The simplest way to write out an XML file would be to simply do the following, given an XDocument named townDoc:

let writeToXml (doc:XDocument) (file:string) =
  use xmlWriter = file |> XmlWriter.Create
  do doc.WriteTo xmlWriter

do writeToXml townDoc "yourOutput.xml"

Running this code will serialize the contents of townDoc to a file named yourOutput.xml in the current directory, overwriting the file if it already exists. One particularly important step here is to bind xmlWriter with use so that the writer is disposed when writeToXml completes, thereby flushing the underlying stream buffer and closing the file.

Note

Note that none of these examples worry about handling exceptions. Because these examples have nothing they can do in response to any thrown exception (such as a file being locked that we want to write to), it is assumed that the exceptions will be handled by something further up the call stack.

Writing XML to Memory or Other Stream-Based Resources

Frequently, the need arises to serialize to something other than a file. The following code will serialize a document to a MemoryStream:

let writeToMemory (doc:XDocument) =
  use stream = new MemoryStream()
  use xmlWriter = stream |> XmlWriter.Create
  do doc.WriteTo xmlWriter
  // flush the writer so that all buffered XML reaches the stream
  do xmlWriter.Flush()
  //do something with the memory stream

do writeToMemory townDoc
                                                        

In the preceding case where we are writing to memory, we can create a stream (note the use of use binding again), create an XmlWriter using the stream, and use the same WriteTo method with the resulting XmlWriter. The only difference here when compared to writing XML to a file is that we explicitly create the kind of stream we want to write to. Most other types of XML output will work in a similar fashion.
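As a sketch of what "doing something" with the stream might look like, the following serializes a document to a byte array and loads it back. The helper name toBytes and the sample document are assumptions; the explicit Flush is there because the writer buffers output until it is flushed or disposed:

```fsharp
open System.IO
open System.Xml
open System.Xml.Linq

// Hypothetical helper: serializes a document to a byte array.
let toBytes (doc: XDocument) =
    use stream = new MemoryStream()
    use xmlWriter = XmlWriter.Create(stream)
    doc.WriteTo xmlWriter
    // Flush so that buffered output reaches the stream before it is read.
    xmlWriter.Flush()
    stream.ToArray()

let original =
    XDocument.Parse("<town><communityinfo>Two Sushi Bars</communityinfo></town>")

let bytes = toBytes original

// Load the bytes back to confirm the round trip.
let reloaded =
    use input = new MemoryStream(bytes)
    XDocument.Load(input)

printfn "%s: %s" reloaded.Root.Name.LocalName reloaded.Root.Value
```

Reading the stream before the writer has been flushed is a common source of truncated output, which is why the order of the last two lines in toBytes matters.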

F# AND XML DOM

F# just as easily supports XML DOM, and for that matter, any other .NET-based API for working with XML content. Although LINQ-to-XML is certainly convenient, many developers (or their managers!) prefer to stick with DOM because of the status of XML DOM as a W3C standard.

Note

Note that, technically, Microsoft's implementation of DOM isn't 100% compliant with the W3C DOM specification, because it takes a few of the DOM APIs and transforms them into something more C#/.NET-appropriate. Having said that, conceptually, they are identical.

Reading

Reading XML using XML DOM in F# is somewhat similar to the way you would do it using LINQ-to-XML:

let getYahooWeatherDOM id =
  use xmlReader =
    sprintf "http://weather.yahooapis.com/forecastrss?w=%d" id |>
    XmlReader.Create
  let xmlDoc = new XmlDocument()
  do xmlReader |> xmlDoc.Load
  xmlDoc
                                                         

A key difference between the APIs is that the XmlDocument object itself is explicitly created, rather than using a factory method like we did with LINQ-to-XML. The XmlReader object is also created separately. The Load method of the XmlDocument then takes the XmlReader as a parameter, populating the XmlDocument.

Of course, there are other means of loading an XmlDocument object, such as the following:

METHOD                PURPOSE
XmlDocument.Load      Populate an XmlDocument from a file or URL, TextReader, XmlReader, or stream.
XmlDocument.LoadXml   Populate an XmlDocument from a string of XML content. Note that LoadXml is not part of the DOM standard.

To start simply reading from the document, run the following:

let weatherDom = 2484280 |> getYahooWeatherDOM
// InnerText is used because Value is null for element nodes
do weatherDom.SelectNodes("*")
   |> Seq.cast<XmlNode>
   |> Seq.iter ( fun(e) -> do printfn "%s" e.InnerText )

There are a couple of complexities that emerge when reading via XML DOM. The first is that, rather than simply reading, working with XML DOM means working with XPath, a standard query language for XML. The SelectNodes method of an XmlDocument object takes an XPath query as a parameter. The query for selecting all elements at the top level of a document is a simple wildcard ("*"), which is used above with SelectNodes.

Another complexity when dealing with DOM is that the results of XPath queries do not support the IEnumerable<T> interface, which is required to use most Seq methods to interact with the document. Should we want to use Seq methods, the result of any SelectNodes query has to be passed to Seq.cast<XmlNode>, which takes an IEnumerable and produces an IEnumerable<T>. Note that this will fail if any objects in the source collection do not derive from XmlNode; thankfully, everything that returns from SelectNodes does in fact derive from XmlNode, making this not an issue in this case.
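The cast can be tried in isolation on a small inline document. In this sketch the sample XML and value names are assumptions; it also shows that Value is null for element nodes, while InnerText returns their text:

```fsharp
open System.Xml

let dom = XmlDocument()
dom.LoadXml("<channel><title>Weather</title><ttl>60</ttl></channel>")

// SelectNodes returns an XmlNodeList, which is only a non-generic IEnumerable,
// so Seq.cast is needed before other Seq functions can be applied.
let children =
    dom.DocumentElement.SelectNodes("*")
    |> Seq.cast<XmlNode>
    |> List.ofSeq

let childNames = children |> List.map (fun n -> n.Name)
let innerTexts = children |> List.map (fun n -> n.InnerText)

// For element nodes, Value is null; the text lives in child text nodes.
let elementValueIsNull = isNull dom.DocumentElement.Value

printfn "%A %A %b" childNames innerTexts elementValueIsNull
```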

Querying

Going further with querying using XML DOM involves getting more familiar with XPath syntax. In the prior querying example with LINQ-to-XML, solving the problem required retrieval of all the attributes named high and low in the document. Below are some queries that demonstrate different means of querying for the attributes of interest, including the combined high-or-low query that is useful for solving the problem:

let weatherDom = 2484280 |> getYahooWeatherDOM
let allAttributesInTheDocument = weatherDom.SelectNodes("//@*")
let allLowAttributes = weatherDom.SelectNodes("//@low")
//the query we really want...
let allLowOrHighAttributes =
    weatherDom.SelectNodes("//@low | //@high")

Here are three different queries, all of which demonstrate, with increasing precision, ways to reach the attributes we are looking for. The "//" string tells XPath to search through the entire document hierarchy. The "@" symbol then tells XPath to look for attributes, and the "*" wildcard in the first query matches an attribute of any name. In the second query, we instead specify the name of the attributes we are looking for (low, in this case). The "//@low | //@high" query uses the union operator to select both the low and high attributes.
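To see how the three kinds of query narrow the result, this sketch counts the matches on a small inline document. The sample XML and the count helper are assumptions for illustration:

```fsharp
open System.Xml

let dom = XmlDocument()
dom.LoadXml("""<item>
  <forecast day="Sun" low="23" high="29" />
  <forecast day="Mon" low="25" high="28" />
</item>""")

// Hypothetical helper: number of nodes selected by an XPath expression.
let count (xpath: string) = dom.SelectNodes(xpath).Count

let allAttributes = count "//@*"              // every attribute in the document
let lowOnly = count "//@low"                  // only attributes named low
let lowOrHigh = count "//@low | //@high"      // union of low and high

printfn "%d %d %d" allAttributes lowOnly lowOrHigh
```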

Recall that the problem criteria require retrieval of the average temperature. To get the average temperature again, do the following:

let weatherDom = 2484280 |> getYahooWeatherDOM
let allLowOrHighAttributes =
  weatherDom.SelectNodes("//@low | //@high")
let nodeValueToDouble (n:XmlNode) =
  n.Value |> Double.Parse
let averageTemp =
  allLowOrHighAttributes
  |> Seq.cast<XmlNode>
  |> Seq.map nodeValueToDouble
  |> Seq.average
                                                  

As before, when the solution was simply to read elements, this solution requires casting the result of the XPath query using Seq.cast<XmlNode>, so that further Seq operations can be applied to it. Once the nodes are cast to XmlNode, the next step is to parse the attribute strings that represent the temperatures into doubles (using the nodeValueToDouble function). Once they are converted into a sequence of doubles, the average can be computed using Seq.average.

Processing

XML DOM provides plenty of means to manipulate XML documents. In the previous example for LINQ-to-XML, information was added to the weather forecast. The process is similar in DOM, which starts by creating the XmlNode that should be added to the DOM:

let communityInfo = weatherDom.CreateElement("communityinfo")
do communityInfo.InnerText <- "Fine Community with Two Sushi Bars"

A key difference in DOM is that creation of elements in DOM happens via the parent document you want to create the element in, which is what is done in the previous weatherDom.CreateElement("communityinfo"). Setting the content is a separate line of code where the InnerText property is mutated to contain the content that we want.

let insertCommunityInfoDom (doc:XmlDocument) (commInfo:XmlNode) =
  let last sequence =
    sequence
    |> Seq.skip( (sequence |> Seq.length) - 1 )
    |> Seq.head
  let lastForecast =
    doc.SelectNodes("//*[local-name()='forecast']")
    |> Seq.cast<XmlNode>
    |> last
  do lastForecast.ParentNode.InsertAfter(commInfo,lastForecast)
     |> ignore
do insertCommunityInfoDom weatherDom communityInfo
                                                             

The insertion routine works a bit differently as well. The implementation of last from the LINQ-to-XML example is borrowed (see the prior section on LINQ-to-XML). That is where the similarity ends though. Getting the last forecast element is a bit more involved, as there is a need to start by passing an XPath query into the XmlDocument that specifies forecast elements.

Note that the forecast elements are actually in the yweather namespace. In the LINQ-to-XML example, queries are easily based on element.Name.LocalName, because LocalName is the element name without its namespace. When using XPath, we have to use the XPath local-name() function to achieve a similar result. We could also attach a namespace to the SelectNodes query; however, it would be quite a bit more work to do so, and is probably unnecessary here.
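For comparison, the namespace-aware route looks roughly like the following sketch. The prefix "yw" is an arbitrary choice made here (it only needs to be mapped to the right namespace URI), and the inline document is hypothetical sample data:

```fsharp
open System.Xml

let dom = XmlDocument()
dom.LoadXml("""<rss xmlns:yweather="http://xml.weather.yahoo.com/ns/rss/1.0">
  <yweather:forecast day="Sun" />
  <yweather:forecast day="Mon" />
  <forecast day="unqualified" />
</rss>""")

// Map a prefix of our choosing to the namespace URI used by the feed.
let namespaces = XmlNamespaceManager(dom.NameTable)
namespaces.AddNamespace("yw", "http://xml.weather.yahoo.com/ns/rss/1.0")

// Namespace-qualified query: only elements in the yweather namespace match.
let qualified = dom.SelectNodes("//yw:forecast", namespaces).Count

// local-name() query: the namespace is ignored, so the unqualified element matches too.
let byLocalName = dom.SelectNodes("//*[local-name()='forecast']").Count

printfn "%d %d" qualified byLocalName
```

The extra setup buys precision: if a feed ever contained a forecast element from another vocabulary, only the qualified query would exclude it.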

When the forecast nodes have been retrieved, the next step is the familiar cast to a sequence of XmlNode using Seq.cast and then grabbing the last element. When the last element is found, call InsertAfter on its ParentNode, whatever that might be, and pass it the new XmlNode, as well as the lastForecast node, which is the reference node after which the new node is inserted. Although this call returns the inserted XmlNode, nothing further is needed from it, so it can be safely passed to ignore.

Writing

Writing XML out to a file, or a stream, using XML DOM is as simple as, if not simpler than, doing so with LINQ-to-XML. To save to a file, call the Save method, passing it a filename or a fully qualified path:

weatherDom.Save("yourOutput.xml")

Should you want to save to a stream, be it a MemoryStream or any other object that inherits from Stream, pass the Stream object as well:

weatherDom.Save(someStream)

It is notable that writing XML out largely builds on the .NET IO capabilities, nicely making sure that XML processing concerns are separated from concerns related to how such XML is persisted. This general feature helps make sure that any code you write to deal with XML is very much focused on XML processing and not IO concerns.

F#, XML, AND ACTIVE PATTERNS

If we remove the fascination with angle-brackets and structure and look carefully at an XML document, a curious thing emerges: XML looks very much like name-value pairs, name-value-attribute triplets (where the attributes are a list of name-value pairs themselves), name-value-children triplets (where children are another list), or name-value-attribute-children quads. In other words, from a certain point of view, XML documents are basically lists of tuples, nestled in a hierarchical relationship.

Given this perspective, and the fact that active patterns (described in detail in Chapter 6) are used in F# to do "data decomposition," it seems reasonable to expect that active patterns can help break XML trees down into more palatable data structures that F# can process more easily. And as it turns out, they can, but doing so requires a slightly different approach to querying and processing than what the imperative programmer is used to.

Let's work with a slightly different XML model, one that's a bit simpler than the XML returned by the weather service (and, arguably, more like the XML that flies around inside a corporate intranet), that describes various famous (and/or infamous) characters:

<data>
    <item>
        <person gender="male">
            <name>Ted Neward</name>
            <age>38</age>
            <languages>
                <language>English</language>
                <language>French</language>
                <language>C#</language>
                <language>Java</language>
                <language>F#</language>
                <language>Scala</language>
            </languages>
        </person>
    </item>
    <item>
        <person gender="male">
            <name>Han Solo</name>
            <age>35</age>
            <languages>
                <language>Imperial Standard</language>
                <language>Wookiee</language>
            </languages>
        </person>
    </item>
    <item>
        <person gender="male">
            <name>Gaius Baltar</name>
            <age>35</age>
            <languages>
                <language>Colonial English</language>
                <language>Cylon</language>
            </languages>
        </person>
    </item>
</data>
                                                  

The goal here is to use active patterns to break this document down into the person structure that appears repeatedly and transform it into a form more easily used within an F# program.

Recall from the discussion on active patterns that three basic forms of active patterns are available: the single-case active pattern, which converts data from one form to another; the partial-case active pattern, which helps to match when data conversion failures are possible or likely; and the multi-case active pattern, which can take the input data and break it down into one of several different data groupings. Although only one structure appears in the preceding example (the person structure), it remains a reasonable assumption to imagine that other data structures can, will, or do appear in the document later. This implies that either the partial-case or the multi-case active pattern will be best suited for extracting the data out of the document; the decision between the two will rest on whether the F# programmer believes they know the full set of data types that the document contains.

This is not a casually discarded decision — XML documents are often used where a certain amount of ambiguity in the data is expected or desired. Yet, much of the XML sent back and forth between organizations is intended to be a closed-set of data types nestled in between angle-brackets, with unrecoverable errors thrown when an unknown XML document is received. If ambiguity is expected or desired, then the partial-case should be considered, and if not, then the multi-case active pattern becomes the weapon of choice.

Just for pedagogical purposes, both approaches are considered.

Multi-case Active Patterns

The multi-case active pattern requires a single function, written in "banana clips" style, which contains all the possible atoms that the XML document can be decomposed into. For an easy start, consider an active pattern that breaks the document down into Node and Leaf elements, showing the basic tree structure of a document:

let (|Node|Leaf|) (node : #System.Xml.XmlNode) =
    if node.HasChildNodes then
        Node (node.Name, seq { for x in node.ChildNodes -> x })
    else
        Leaf (node.InnerText)

Because the parameter to this active pattern can be either an XmlNode or any of its subtypes, the type descriptor is prefixed with a "#" to indicate subtype availability (as described in Chapter 9).

Using this pattern-match rule in a pattern-match statement becomes relatively trivial, allowing us to print the contents of any XML document in nicely indented form:

let printXml node =
    let rec printXml indent node =
        match node with
        | Leaf (text) ->
            printfn "%s%s" indent text
        | Node (name, nodes) ->
            printfn "%s%s:" indent name
            nodes |> Seq.iter (printXml (indent+"  "))
    printXml "" node
                                                                
As might well be predicted, because the tree structure responds so well to a recursive-descent traversal through the nodes of the tree, the outer printXml function is made up of an inner recursively aware function to do the actual work, threading an "indent" string (made up of nothing but whitespace) through the descent to give nicely formatted text printed to the console.
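
To see what the traversal produces, here is a minimal, self-contained sketch run against a tiny document. It assumes a hypothetical renderXml function that returns the indented lines rather than printing them, and it uses Seq.cast to make the XmlNode element type explicit; neither choice comes from the listing above:

```fsharp
open System.Xml

// Same decomposition as the Node/Leaf active pattern, restated so the
// sketch stands alone.
let (|Node|Leaf|) (node : XmlNode) =
    if node.HasChildNodes then
        Node (node.Name, Seq.cast<XmlNode> node.ChildNodes)
    else
        Leaf (node.InnerText)

// Hypothetical variant of printXml: collects the indented lines instead
// of writing them to the console.
let rec renderXml (indent : string) (node : XmlNode) : string list =
    match node with
    | Leaf (text) -> [ indent + text ]
    | Node (name, children) ->
        let nested =
            children
            |> Seq.collect (renderXml (indent + "  "))
            |> List.ofSeq
        (indent + name + ":") :: nested

let doc = XmlDocument()
doc.LoadXml("<people><person><name>Ted</name></person></people>")

let lines = renderXml "" (doc :> XmlNode)
lines |> List.iter (printfn "%s")
```

The first line printed is "#document:", because the XmlDocument itself is the root node; each element then appears one indentation level deeper, with the text "Ted" as the innermost leaf.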

Of course, a breakdown of leaves and nodes isn't itself useful; more useful would be to extract the <person> elements and their children into an easy-to-use structure in F#. Given the relatively simple structure of the <person> element, it's easiest to imagine the data extracted as a tuple, specifically a string * string * int * seq<string> tuple type, representing the person's name, gender, age, and the list of languages they speak. Extracting this via an active pattern would thus look like:

let (|Node|Leaf|Person|) (node : #System.Xml.XmlNode) =
    if node.Name = "person" then
        let pGender = node.Attributes.ItemOf("gender").Value
        let pName = node.Item("name").InnerText
        let pAge = Int32.Parse(node.Item("age").InnerText)
        let pLangNode = node.Item("languages")
        let pLangs =
            seq{ for l in pLangNode.ChildNodes -> l.InnerText }
        Person (pName, pGender, pAge, pLangs)
    else if node.HasChildNodes then
        Node (node.Name, seq { for x in node.ChildNodes -> x })
    else
        Leaf (node.InnerText)
                                                         
This can then be used to get the various "parts" out of the XML via traditional pattern-match construct:

let printXml node =
    let rec printXml indent node =
        match node with
        | Person (n, g, a, ls) ->
            printfn "%sPerson: %s, %s, %d, speaks %d langs"
                indent n g a (Seq.length ls)
        | Leaf (text) ->
            printfn "%s%s" indent text
        | Node (name, nodes) ->
            printfn "%s%s:" indent name
            nodes |> Seq.iter (printXml (indent+"  "))
    printXml "" node
printXml xmlDoc
                                           
The Leaf clause from the pattern-match and active pattern rule can be removed if needed, but the Node clause is going to have to stay, unless specific rules to match on the DocumentElement that forms the root XmlNode of an XmlDocument are written. In general, it seems prudent to keep the Node clause around, with a match that forces the recursive descent further into the tree:

let (|Node|Person|) (node : #System.Xml.XmlNode) =
    if node.Name = "person" then
        let pGender = node.Attributes.ItemOf("gender").Value
        let pName = node.Item("name").InnerText
        let pAge = Int32.Parse(node.Item("age").InnerText)
        let pLangNode = node.Item("languages")
        let pLangs =
            seq{ for l in pLangNode.ChildNodes -> l.InnerText }
        Person (pName, pGender, pAge, pLangs)
    else if (node.HasChildNodes) then
        Node (seq { for x in node.ChildNodes -> x})
    else
        failwith ("Unexpected data: " + node.ToString())

let printXml node =
    let rec printXml node =
        match node with
        | Node (nodes) ->
            nodes |> Seq.iter printXml
        | Person (n, g, a, ls) ->
            // printfn needs one literal format string; it cannot be
            // assembled with + at the call site.
            printfn "Person: %s is %s, %d, and speaks %d languages"
                n g a (Seq.length ls)
    printXml node
printXml xmlDoc
                                                   
Typically, printing to the console is only done during development and debugging — most of the time, it is more useful to pull the data out of the XML document as a sequence of tuples or other strongly typed objects. This means rewriting the pattern-match itself to return a sequence of tuples:

let extract node =
    let rec extract node =
        match node with
        | Person (n, g, a, ls) ->
            Seq.singleton (n, g, a, ls)
        | Node (nodes) ->
            Seq.collect (fun (n) -> extract n) nodes
    extract node
let results = extract xmlDoc
for r in results do
    Console.WriteLine("Result: {0}", r)
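
For readers who want to try the tuple extraction in isolation, the following is a self-contained sketch under a few stated assumptions: it reads child elements with SelectSingleNode and the attribute with GetNamedItem (rather than the Item and ItemOf indexers used above), it materializes the languages as a list, and the inline sample document is invented for the example:

```fsharp
open System
open System.Xml

// Restated Node/Person pattern so the sketch stands alone.
let (|Node|Person|) (node : XmlNode) =
    if node.Name = "person" then
        let gender = node.Attributes.GetNamedItem("gender").Value
        let name = node.SelectSingleNode("name").InnerText
        let age = Int32.Parse(node.SelectSingleNode("age").InnerText)
        let langs =
            node.SelectSingleNode("languages").ChildNodes
            |> Seq.cast<XmlNode>
            |> Seq.map (fun l -> l.InnerText)
            |> List.ofSeq
        Person (name, gender, age, langs)
    elif node.HasChildNodes then
        Node (Seq.cast<XmlNode> node.ChildNodes)
    else
        failwith ("Unexpected data: " + node.OuterXml)

// Recursive descent: a Person yields one tuple, a Node yields whatever
// its children yield.
let rec extract (node : XmlNode) : seq<string * string * int * string list> =
    match node with
    | Person (n, g, a, ls) -> Seq.singleton (n, g, a, ls)
    | Node (children) -> Seq.collect extract children

// Invented sample data for the sketch.
let xml =
    "<data><item><person gender='male'>" +
    "<name>Han Solo</name><age>35</age>" +
    "<languages><language>Imperial Standard</language>" +
    "<language>Wookiee</language></languages>" +
    "</person></item></data>"

let doc = XmlDocument()
doc.LoadXml(xml)

let people = extract (doc :> XmlNode) |> List.ofSeq
for (name, gender, age, langs) in people do
    printfn "%s (%s, %d) speaks %s" name gender age (String.concat ", " langs)
```

Converting the result to a list forces the traversal once, which is convenient when the results are going to be enumerated more than one time.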
                                                          
Of course, after a certain point, tuples may want to become fully fledged domain objects:

type Person(name : string, gender : string,
            age : int, langs : seq<string>) =
    member p.Name with get() = name
    member p.Gender with get() = gender
    member p.Age with get() = age
    member p.Languages with get() = langs
    override p.ToString() =
        String.Format("[Person: {0} is {1}, {2}, " +
            "and speaks {3}]",
            name, gender, age.ToString(),
                (Seq.reduce (fun (l) (s) ->
                    l + ", and " + s) langs))
                                                   
When that happens, the active-pattern rule takes that into account, returning a Person object instead of a tuple:

let (|Node|Person|) (node : #System.Xml.XmlNode) =
    if node.Name = "person" then
        let pGender = node.Attributes.ItemOf("gender").Value
        let pName = node.Item("name").InnerText
        let pAge = Int32.Parse(node.Item("age").InnerText)
        let pLangNode = node.Item("languages")
        let pLangs =
            seq{ for l in pLangNode.ChildNodes -> l.InnerText }
        Person (new Person(pName, pGender, pAge, pLangs))
    else if (node.HasChildNodes) then
        Node (seq { for x in node.ChildNodes -> x})
    else
        failwith ("Unexpected data: " + node.ToString())
                                                               
which in turn makes the transformation from XML to a sequence of domain objects just a bit different:

let extract node =
    let rec extract node =
        match node with
        | Person (p) ->
            Seq.singleton p
        | Node (nodes) ->
            Seq.collect (fun (n) -> extract n) nodes
    extract node
let results = extract xmlDoc
for r in results do
    Console.WriteLine("Result: {0}", r)
                                                        
The results of the extract function will be a seq<Person>, which is about as straightforward an extraction result as the F# programmer could want. Things get a tad more interesting (thanks to F#'s type-inference) when more than one domain object can appear in the XML; the F# type-inferencer right now assumes that extract produces a sequence of Person objects out of the XML document. If a new domain type is introduced into the system, such as:

type Ship(name : string, jumpCapable : bool) =
    member s.Name with get() = name
    member s.Jump with get() = jumpCapable
    override s.ToString() =
        String.Format("[Ship: {0}, jump={1}]",
            name, jumpCapable.ToString())
                                               
then extracting it from the XML is ridiculously simple, as we'd hope:

let (|Node|Person|Ship|) (node : #System.Xml.XmlNode) =
    if node.Name = "person" then
        let pGender = node.Attributes.ItemOf("gender").Value
        let pName = node.Item("name").InnerText
        let pAge = Int32.Parse(node.Item("age").InnerText)
        let pLangNode = node.Item("languages")
        let pLangs =
            seq{ for l in pLangNode.ChildNodes -> l.InnerText }
        Person (new Person(pName, pGender, pAge, pLangs))
    else if (node.Name = "ship") then
        let sName = node.Item("name").InnerText
        let sJump = node.Attributes.ItemOf("jump").Value
        Ship (new Ship(sName, (sJump = "true")))
    else if (node.HasChildNodes) then
        Node (seq { for x in node.ChildNodes -> x})
    else
        failwith ("Unexpected data: " + node.ToString())
                                                             
But the pattern-match rule has to change slightly; if the Ship clause is simply inserted into the pattern-match, the compiler complains:

let extract node =
    let rec extract node =
        match node with
        | Person (p) ->
            Seq.singleton p
        | Ship (s) ->
            Seq.singleton s
        | Node (nodes) ->
            Seq.collect (fun (n) -> extract n) nodes
    extract node
let results = extract xmlDoc
for r in results do
    Console.WriteLine("Result: {0}", r)
                                                      
specifically, that the Ship clause doesn't return a Person object. This is because the type-inferencer in F# has assumed that the extract function wants to take in an XmlNode and return a sequence of Person objects, which obviously the Ship object isn't. If Ship inherits from Person, then obviously the compiler will be OK with this, but as written right now, Ship doesn't.

Fortunately, Ship and Person do both inherit from a common base class, System.Object, so it's simply a matter of telling the F# compiler this, doing the upcast from the domain object to System.Object during the pattern-match, and asking the compiler to see the result as a sequence of Object rather than a sequence of Person:

let extract node : seq<obj> =
    let rec extract node : seq<obj> =
        match node with
        | Ship (s) ->
            Seq.singleton (s :> obj)
        | Person (p) ->
            Seq.singleton (p :> obj)
        | Node (nodes) ->
            Seq.collect (fun (n) -> extract n) nodes
    extract node
let results = extract xmlDoc
for r in results do
    Console.WriteLine("Result: {0}", r)
                                                  
And now, any number of domain types can be added to the active pattern rule and returned from the extract function.
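
Upcasting to obj is not the only way to return mixed results. One alternative, which the text above does not use, is to wrap the domain types in a discriminated union so that the compiler keeps checking the cases for us. The following is a minimal sketch of that idea; the PersonInfo and ShipInfo records, the Entity union, the childText helper, and the inline sample document are all hypothetical stand-ins invented for the example:

```fsharp
open System.Xml

// Hypothetical, simplified stand-ins for the Person and Ship classes.
type PersonInfo = { PersonName : string; Age : int }
type ShipInfo = { ShipName : string; Jump : bool }

// One closed type that covers everything extract can return.
type Entity =
    | PersonEntity of PersonInfo
    | ShipEntity of ShipInfo

// Hypothetical helper: text of the named child element.
let childText (node : XmlNode) (name : string) =
    node.SelectSingleNode(name).InnerText

let (|Node|Person|Ship|) (node : XmlNode) =
    if node.Name = "person" then
        let person = { PersonName = childText node "name"; Age = int (childText node "age") }
        Person person
    elif node.Name = "ship" then
        let jump = node.Attributes.GetNamedItem("jump").Value
        let ship = { ShipName = childText node "name"; Jump = (jump = "true") }
        Ship ship
    elif node.HasChildNodes then
        Node (Seq.cast<XmlNode> node.ChildNodes)
    else
        failwith ("Unexpected data: " + node.OuterXml)

let rec extract (node : XmlNode) : seq<Entity> =
    match node with
    | Person (p) -> Seq.singleton (PersonEntity p)
    | Ship (s) -> Seq.singleton (ShipEntity s)
    | Node (children) -> Seq.collect extract children

// Invented sample data for the sketch.
let xml =
    "<data><person gender='male'><name>Han Solo</name><age>35</age></person>" +
    "<ship jump='true'><name>Galactica</name></ship></data>"

let doc = XmlDocument()
doc.LoadXml(xml)

let entities = extract (doc :> XmlNode) |> List.ofSeq
for e in entities do
    match e with
    | PersonEntity p -> printfn "Person %s, age %d" p.PersonName p.Age
    | ShipEntity s -> printfn "Ship %s, jump=%b" s.ShipName s.Jump
```

The trade-off is that every new domain type means a new union case, where the seq<obj> approach needs no such change; in exchange, consumers get exhaustiveness checking instead of runtime type tests.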

Unfortunately, the drawback to the multi-case solution comes when the upstream source of the XML document throws something "new" into the XML stream. Not that clients would actually ever do that, of course, but still, a more robust solution would allow for a certain amount of forgiveness.

Partial-Case Active Patterns

Operating on a slightly different XML example from before, we can introduce some "unknown" structure into the XML document that is to be parsed and extracted into domain objects:

<data>
    <item>
        <person gender="male">
            <name>Ted Neward</name>
            <age>38</age>
            <languages>
                <language>English</language>
                <language>French</language>
                <language>C#</language>
                <language>Java</language>
                <language>F#</language>
                <language>Scala</language>
            </languages>
        </person>
    </item>
    <item>
        <person gender="male">
            <name>Han Solo</name>
            <age>35</age>
            <languages>
                <language>Imperial Standard</language>
                <language>Wookiee</language>
            </languages>
        </person>
    </item>
    <ship jump="true">
        <name>Millennium Falcon</name>
    </ship>
    <fairyTalePrincess>
        <name>Sleeping Beauty</name>
        <ending>Happy</ending>
    </fairyTalePrincess>
    <fairyTalePrincess>
        <name>Cinderella</name>
        <ending>Happy</ending>
    </fairyTalePrincess>
    <item>
        <person gender="male">
            <name>Gaius Baltar</name>
            <age>35</age>
            <languages>
                <language>English</language>
                <language>Cylon</language>
            </languages>
        </person>
    </item>
    <ship jump="true">
        <name>Galactica</name>
    </ship>
</data>
                                                 
Where'd those fairyTalePrincess elements come from? Clearly, as the children's television show implied, they are the "one of these things that doesn't belong," but what can we do? Clients sometimes don't send the data that is expected.

The partial-case active pattern requires a function for each "thing" that the XML might be extracted into, with a wildcard (_) as the final case in the pattern name to indicate that the match might not always succeed. This is where the partial match pays off: because it doesn't assume that it has all the possible cases the source (the XML node) can extract into, it will neatly and efficiently bypass any source that it doesn't understand.

To start, create the partial-match active pattern rules for the two types we do know about, Person and Ship:

let (|Person|_|) (node : #System.Xml.XmlNode) =
    if node.Name = "person" then
        let pGender = node.Attributes.ItemOf("gender").Value
        let pName = node.Item("name").InnerText
        let pAge = Int32.Parse(node.Item("age").InnerText)
        let pLangNode = node.Item("languages")
        let pLangs =
            seq{ for l in pLangNode.ChildNodes -> l.InnerText }
        Some(new Person(pName, pGender, pAge, pLangs))
    else
        None

let (|Ship|_|) (node : #System.Xml.XmlNode) =
    if (node.Name = "ship") then
        let sName = node.Item("name").InnerText
        let sJump = node.Attributes.ItemOf("jump").Value
        Some (new Ship(sName, (sJump = "true")))
    else
        None
                                                                   
Bear in mind, again, that each partial-match rule must yield an Option value, either Some(value) or None. Other than that, the partial-match rules for extracting the domain objects out of the XML document are remarkably similar to the ones used for the multi-case match. This is actually comforting: it means that refactoring from one style to the other will be relatively trivial.

Still present is the problem of the nodes that the code will hit before the person or ship elements and the unrecognized elements like fairyTalePrincess. These are covered in the pattern-match itself:

let extract node : seq<obj> =
    let rec extract node : seq<obj> =
        match node with
        | Ship (s) ->
            Seq.singleton (s :> obj)
        | Person (p) ->
            Seq.singleton (p :> obj)
        | node when node.HasChildNodes ->
            let children = seq{ for n in node.ChildNodes -> n }
            Seq.collect (fun (n) -> extract n) children
        | _ ->
            Seq.empty
    extract node
                                                
Again, we have to help the F# compiler along a little by declaring the returned sequence to be a sequence of Objects. Rather than creating an explicit partial-match rule for Node objects, which really isn't a data type we're trying to work with, it's easier in this case to use a pattern guard to determine whether the node has any child nodes and, if so, to walk through each of them and recursively call extract. And, the stunning coup de grace: if the node matches none of these three conditions, an empty sequence is returned.
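
To watch the unknown elements being skipped, here is a small self-contained sketch of the same shape. It assumes hypothetical, simplified PersonInfo and ShipInfo records that carry only a name, a hypothetical nameOf helper, and an invented inline document:

```fsharp
open System.Xml

// Hypothetical, simplified stand-ins for the Person and Ship classes.
type PersonInfo = { PersonName : string }
type ShipInfo = { ShipName : string }

// Hypothetical helper: text of the <name> child element.
let nameOf (node : XmlNode) = node.SelectSingleNode("name").InnerText

let (|Person|_|) (node : XmlNode) =
    if node.Name = "person" then Some { PersonName = nameOf node } else None

let (|Ship|_|) (node : XmlNode) =
    if node.Name = "ship" then Some { ShipName = nameOf node } else None

let rec extract (node : XmlNode) : seq<obj> =
    match node with
    | Ship (s) -> Seq.singleton (box s)
    | Person (p) -> Seq.singleton (box p)
    | n when n.HasChildNodes ->
        // Not a known type: keep descending.
        Seq.collect extract (Seq.cast<XmlNode> n.ChildNodes)
    | _ ->
        // Unknown leaf: contribute nothing.
        Seq.empty

// Invented sample data, including an element the patterns don't know.
let xml =
    "<data>" +
    "<item><person gender='male'><name>Han Solo</name></person></item>" +
    "<fairyTalePrincess><name>Cinderella</name></fairyTalePrincess>" +
    "<ship jump='true'><name>Galactica</name></ship>" +
    "</data>"

let doc = XmlDocument()
doc.LoadXml(xml)

let results = extract (doc :> XmlNode) |> List.ofSeq
for r in results do
    printfn "Result: %A" r
```

The fairyTalePrincess element is still traversed here, since it has children, but nothing inside it matches Person or Ship, so it adds nothing to the results.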

Later, if an element is known to be ignorable (that is, one for which it can be stated with certainty that it holds nothing of interest to us), the parser can recognize that element and use it as a signal to prune the XML hierarchy that is being parsed. The guard for that element has to appear before the more general HasChildNodes guard, because the first matching clause wins and a fairyTalePrincess element does have children:

let extract node : seq<obj> =
    let rec extract node : seq<obj> =
        match node with
        | Ship (s) ->
            Seq.singleton (s :> obj)
        | Person (p) ->
            Seq.singleton (p :> obj)
        // Checked before HasChildNodes so the subtree is never entered.
        | node when node.Name = "fairyTalePrincess" ->
            Seq.empty
        | node when node.HasChildNodes ->
            let children = seq{ for n in node.ChildNodes -> n }
            Seq.collect (fun (n) -> extract n) children
        | _ ->
            Seq.empty
    extract node
                                                    
This will prevent the traversal of the nodes underneath the fairyTalePrincess element and save a few matches and recursive calls. For a small element like fairyTalePrincess, it won't make a huge difference; in a multi-megabyte XML document whose elements each hold hundreds of child elements, it will.
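
The saving can be made concrete by counting how many nodes a traversal touches with and without pruning. The sketch below assumes a hypothetical countVisited helper and an invented sample document; it also shows that the pruning guard only works when it precedes the HasChildNodes guard, because clauses are tried in order:

```fsharp
open System.Xml

// Hypothetical helper: counts the nodes a traversal touches. `skip` names
// an element whose subtree should be pruned; None means nothing is pruned.
let rec countVisited (skip : string option) (node : XmlNode) : int =
    match node with
    // The specific guard comes first; placed after the HasChildNodes guard
    // it would never fire for an element that has children.
    | n when skip = Some (n.Name) -> 1
    | n when n.HasChildNodes ->
        1 + (Seq.cast<XmlNode> n.ChildNodes |> Seq.sumBy (countVisited skip))
    | _ -> 1

// Invented sample data for the sketch.
let xml =
    "<data>" +
    "<fairyTalePrincess><name>Cinderella</name><ending>Happy</ending></fairyTalePrincess>" +
    "<ship jump='true'><name>Galactica</name></ship>" +
    "</data>"

let doc = XmlDocument()
doc.LoadXml(xml)
let root = doc.DocumentElement :> XmlNode

let unpruned = countVisited None root
let pruned = countVisited (Some "fairyTalePrincess") root
printfn "visited without pruning: %d, with pruning: %d" unpruned pruned
```

For this tiny document the traversal drops from nine visited nodes (elements plus their text nodes) to five; the gap grows with the size of the pruned subtrees.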

Regardless of whether the partial-case or multi-case approach is used, the net result is a relatively easy, scalable way to parse XML documents and extract the data into strongly typed domain objects for further processing:

let results = extract xmlDoc
for r in results do
    Console.WriteLine("Result: {0}", r)

And because the results of the extracting are a sequence of strongly typed objects, we could use pattern-matching again to walk through the sequence and do something more meaningful with the objects contained therein:

let results = extract xmlDoc
for r in results do
    Console.WriteLine("Result: {0}", r)
    match r with
    | :? Ship as s ->
        Console.WriteLine("The ship {0} {1}",
            s.Name,
            (if s.Jump then "is jump-capable"
             else "is slower-than-light"))
    | :? Person as p ->
        Console.WriteLine("Found {0}", p.Name)
    | _ ->
        ()
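
The same type-test pattern can also split the mixed sequence back into typed collections. A minimal sketch, assuming hypothetical PersonInfo and ShipInfo records in place of the Person and Ship classes and a hand-built sample list in place of the results of extract:

```fsharp
// Hypothetical, simplified stand-ins for the Person and Ship classes.
type PersonInfo = { PersonName : string }
type ShipInfo = { ShipName : string; Jump : bool }

// Keeps only the ships, recovering their static type along the way.
let shipsOf (results : seq<obj>) : ShipInfo list =
    results
    |> Seq.choose (fun r ->
        match r with
        | :? ShipInfo as s -> Some s
        | _ -> None)
    |> List.ofSeq

// Hand-built sample standing in for the results of extract.
let sample : obj list =
    [ box { PersonName = "Han Solo" }
      box { ShipName = "Galactica"; Jump = true }
      box "something unexpected" ]

let ships = shipsOf sample
for s in ships do
    printfn "%s (jump-capable: %b)" s.ShipName s.Jump
```

Anything that is not a ShipInfo, including values of entirely unexpected types, is silently dropped, which mirrors the forgiving behavior of the partial-case patterns.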
                                                        
Regardless of what work needs to be done, the active patterns feature of F# allows for some easily read and easily maintained code.

SUMMARY

In this chapter, we have covered how to work with XML in F# using two of the methods that F# programmers will most commonly use, LINQ-to-XML and the XML DOM. LINQ-to-XML approaches that eschew XPath can certainly work; however, XPath tends to produce more concise queries, especially if the people you work with already understand it. Which one you use is a choice most commonly determined by the group you are working with or by any organizational standards you might have or, lacking any of those constraints, by personal preference.
