Making F# work with XML
Using Linq to XML with F#
Understanding how F# and DOM work together
Applying active patterns for XML parsing
XML is a fact of life for most software developers, regardless of language. It is found everywhere, from configuration files holding nothing but a small number of settings, to the giant stores of weather data provided in XML format by the National Oceanic and Atmospheric Administration of the United States. Although history will judge whether having all this XML was a good idea, the fact remains that it's there. Love it or hate it, developers have to deal with it.
In this chapter, you learn how to use F# with XML processing tools such as those found in the System.Xml namespace in the .NET framework. This chapter covers techniques that simplify most XML processing tasks. Special attention is provided to demonstrate how you can use things such as XPath and active patterns to simplify the task of processing and transforming XML.
Boiled down to its essence, most things that deal with XML do the following:
Read XML, typically to store it in some sort of language representation
Query the XML, often again to extract parts of it into some sort of language representation
Process the XML, sometimes to produce more XML or to generate some side effect
Persist the XML, so that other programs can join in the fun of working with XML
For more details on the various W3C standards that make up the XML grammar and suite of specifications, we recommend the W3C website.
Of the various means to read and process XML, LINQ-to-XML, introduced in .NET 3.5, is likely the simplest way to query and process XML in F#.
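The read, query, process, and persist steps listed above can be sketched in a few lines of LINQ to XML. This is only a minimal illustration: the inline settings document and the xn helper are hypothetical stand-ins, not part of the weather example that follows.

```fsharp
open System.Xml.Linq

// Hypothetical helper that shortens XName creation.
let xn (name: string) = XName.Get name

// Hypothetical inline document standing in for a real data source.
let settingsXml = """<settings><setting name="timeout" value="30" /></settings>"""

// Read: parse the text into an in-memory tree.
let doc = XDocument.Parse settingsXml

// Query: pull one attribute value out as an int.
let timeout =
    doc.Descendants(xn "setting")
    |> Seq.map (fun e -> e.Attribute(xn "value").Value |> int)
    |> Seq.head

// Process: mutate the tree by adding a second setting.
doc.Root.Add(
    XElement(xn "setting",
             XAttribute(xn "name", "retries"),
             XAttribute(xn "value", "3")))

// Persist: serialize the tree back to text.
let persisted = doc.ToString(SaveOptions.DisableFormatting)
```

Each step maps onto one of the four activities; the rest of the chapter expands on them one at a time.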
Imagine writing an application that reads a weather forecast from the Internet and presents an average expected temperature to the user. When using the weather service in the Yahoo developer API to read the weather report, you might request the forecast from the Yahoo API URL http://weather.yahooapis.com/forecastrss?w=2484280. The response that results from that call reads as follows:
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<rss version="2.0"
     xmlns:yweather="http://xml.weather.yahoo.com/ns/rss/1.0"
     xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#">
  <channel>
    <title>Yahoo! Weather - Romeoville, IL</title>
    <link>http://us.rd.yahoo.com/dailynews/rss/weather/Romeoville__IL/*http://weather.yahoo.com/forecast/USIL1019_f.html</link>
    <description>Yahoo! Weather for Romeoville, IL</description>
    <language>en-us</language>
    <lastBuildDate>Sun, 20 Dec 2009 4:44 pm CST</lastBuildDate>
    <ttl>60</ttl>
    <yweather:location city="Romeoville" region="IL" country="United States"/>
    <yweather:units temperature="F" distance="mi" pressure="in" speed="mph"/>
    <yweather:wind chill="27" direction="0" speed="0" />
    <yweather:atmosphere humidity="83" visibility="5" pressure="30.07" rising="1" />
    <yweather:astronomy sunrise="7:15 am" sunset="4:24 pm"/>
    <image>
      <title>Yahoo! Weather</title>
      <width>142</width>
      <height>18</height>
      <link>http://weather.yahoo.com</link>
      <url>http://l.yimg.com/a/i/us/nws/th/main_142b.gif</url>
    </image>
    <item>
      <title>Conditions for Romeoville, IL at 4:44 pm CST</title>
      <geo:lat>41.64</geo:lat>
      <geo:long>-88.08</geo:long>
      <link>http://us.rd.yahoo.com/dailynews/rss/weather/Romeoville__IL/*http://weather.yahoo.com/forecast/USIL1019_f.html</link>
      <pubDate>Sun, 20 Dec 2009 4:44 pm CST</pubDate>
      <yweather:condition text="Cloudy" code="26" temp="27" date="Sun, 20 Dec 2009 4:44 pm CST" />
      <description><![CDATA[
<img src="http://l.yimg.com/a/i/us/we/52/26.gif"/><br />
<b>Current Conditions:</b><br />
Cloudy, 27 F<BR />
<BR /><b>Forecast:</b><BR />
Sun - Light Snow Early. High: 29 Low: 23<br />
Mon - Cloudy. High: 28 Low: 25<br />
<br />
<a href="http://us.rd.yahoo.com/dailynews/rss/weather/Romeoville__IL/*http://weather.yahoo.com/forecast/USIL1019_f.html">Full Forecast at Yahoo! Weather</a><BR/><BR/>
(provided by <a href="http://www.weather.com" >The Weather Channel</a>)<br/>
]]></description>
      <yweather:forecast day="Sun" date="20 Dec 2009" low="23" high="29" text="Light Snow Early" code="14" />
      <yweather:forecast day="Mon" date="21 Dec 2009" low="25" high="28" text="Cloudy" code="26" />
      <guid isPermaLink="false">USIL1019_2009_12_20_16_44_CST</guid>
    </item>
  </channel>
</rss>
<!-- api6.weather.ac4.yahoo.com compressed/chunked Sun Dec 20 15:02:47 PST 2009 -->
Although reading this whole response manually and parsing it might be an interesting exercise, LINQ to XML with F# provides a much simpler means to perform this task. Start off by reading the result of the URL into an XDocument, from LINQ to XML:
open System.Xml.Linq

let weatherXml = "http://weather.yahooapis.com/forecastrss?w=2484280" |> XDocument.Load
The simplest way to start reading XML when you have a valid XDocument reference is to simply inspect the elements and do something with them:
let readTheElements =
    weatherXml.Elements()
    |> Seq.iter (fun e -> do printfn "%s" e.Value)
The preceding expression takes all the elements at the root of the document (in this case, the single rss element at the top level of the feed) and prints the Value of each to standard output (in this case, the console). Because the Value of an XElement is the concatenated text of all of its descendants, the output includes the text of every element in the document, regardless of level, not just the text at the top level. The sequence of elements is passed to Seq.iter, which applies the function to each XElement in the sequence. The function being applied takes the element, prints its value to the screen, and returns unit.
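The difference between what Elements() returns and what Value contains can be seen on a small document. The following is a minimal sketch; the tinyFeed document is a hypothetical stand-in for the weather response.

```fsharp
open System.Xml.Linq

// Hypothetical miniature feed, standing in for the weather response.
let tinyFeed =
    XDocument.Parse "<rss><channel><title>Weather</title><ttl>60</ttl></channel></rss>"

// Elements() on the document yields only the root element...
let topLevelNames =
    tinyFeed.Elements()
    |> Seq.map (fun e -> e.Name.LocalName)
    |> List.ofSeq            // ["rss"]

// ...but the Value of that element concatenates the text of every descendant.
let rootText =
    tinyFeed.Elements()
    |> Seq.map (fun e -> e.Value)
    |> Seq.head              // "Weather60"
```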
A question that frequently comes up is what order the elements arrive in when querying XML in this manner. Elements() returns elements in document order; in the preceding example, order is not relevant to the task, so nothing more is specified. That said, because the order of elements in a document is usually controlled by implementation-specific factors at the site that produces the XML, it is usually a bad idea to depend on it. To control the order of processing, the preferred approach is to take the sequence that results from Elements() and pass it through Seq.sortBy with the appropriate explicit key function.
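As a rough illustration of explicit ordering, the sketch below sorts forecast-like elements by a numeric attribute rather than relying on their position in the document. The inline fragment is hypothetical and uses no namespaces, unlike the real feed.

```fsharp
open System.Xml.Linq

// Hypothetical fragment with forecast-like elements in arbitrary order.
let forecasts =
    XDocument.Parse """<item>
  <forecast day="Mon" high="28" />
  <forecast day="Sun" high="29" />
  <forecast day="Tue" high="25" />
</item>"""

// Sort explicitly by the numeric value of the "high" attribute.
let daysByHigh =
    forecasts.Descendants(XName.Get("forecast"))
    |> Seq.sortBy (fun e -> e.Attribute(XName.Get("high")).Value |> int)
    |> Seq.map (fun e -> e.Attribute(XName.Get("day")).Value)
    |> List.ofSeq            // ["Tue"; "Mon"; "Sun"]
```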
Although we can certainly do a lot of interesting work simply reading XML and processing it somehow, the real power of LINQ to XML is in the query capability. In fact, if you are simply reading XML and doing your own queries using for loops and if statements, you are not really leveraging the power of F# or LINQ to XML to make your code more readable and efficient.
LINQ-to-XML was built to make code that queries XML documents more readable. Recall the previous short snippet of F# that accesses the URL that contains an upcoming weather forecast and then parses it into an XML document for further use:
let weatherXml = "http://weather.yahooapis.com/forecastrss?w=2484280" |> XDocument.Load
The preceding statement binds weatherXml to an XDocument representing a weather forecast for Romeoville, Illinois. XDocument is the root of most LINQ-to-XML operations and has factory methods to create an XDocument instance from a variety of different sources:
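METHOD | PURPOSE |
---|---|
XDocument.Load | Create an XDocument from a file name or URI, a Stream, a TextReader, or an XmlReader |
XDocument.Parse | Create an XDocument from a string that contains XML |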
The resulting XDocument provides a base from which further queries can be performed. For example, to produce a sequence of all elements whose local name (the name without its namespace) is forecast, we apply a filter to the descendant XElements of the XDocument, using the Seq.filter higher-order function:
let weatherXml = "http://weather.yahooapis.com/forecastrss?w=2484280" |> XDocument.Load
let forecastElements =
    weatherXml.Descendants()
    |> Seq.filter (fun e -> e.Name.LocalName = "forecast")
The Descendants() method of weatherXml returns a sequence of all the XElements below the node it is called on. When called on an XDocument, the sequence contains every XElement in that XDocument.
If we then want all the attributes of all the forecast elements, one more stage in the pipeline does the job:
let weatherXml = "http://weather.yahooapis.com/forecastrss?w=2484280" |> XDocument.Load
let allForecastAttributes =
    weatherXml.Descendants()
    |> Seq.filter (fun e -> e.Name.LocalName = "forecast")
    |> Seq.collect (fun e -> e.Attributes())
Seq.collect does two things at once. It first gathers the attributes from each element, which on its own (as with Seq.map) would yield a sequence of sequences: imagine a series of piles of attributes, each pile relating to the element from which it came. It then concatenates those sequences (as Seq.concat would), stacking all the separate "piles" into a single sequence of attributes.
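To make the relationship between Seq.map, Seq.concat, and Seq.collect concrete, here is a small sketch showing that the two-step and one-step forms produce the same flattened result. The sample document is a hypothetical, namespace-free cut of the forecast data.

```fsharp
open System.Xml.Linq

// Hypothetical fragment resembling the forecast elements of the feed.
let sample =
    XDocument.Parse """<item>
  <forecast day="Sun" low="23" high="29" />
  <forecast day="Mon" low="25" high="28" />
</item>"""

let sampleForecasts = sample.Descendants(XName.Get("forecast"))

// Two steps: a sequence of per-element "piles", then flattened.
let viaMapAndConcat =
    sampleForecasts
    |> Seq.map (fun e -> e.Attributes())
    |> Seq.concat
    |> Seq.map (fun a -> a.Name.LocalName)
    |> List.ofSeq

// One step: Seq.collect maps and flattens at once.
let viaCollect =
    sampleForecasts
    |> Seq.collect (fun e -> e.Attributes())
    |> Seq.map (fun a -> a.Name.LocalName)
    |> List.ofSeq

// Both are ["day"; "low"; "high"; "day"; "low"; "high"]
```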
From here, we may be interested only in attributes that relate to a low or high temperature in a forecast. To gather those attributes, we add the following:
let weatherXml = "http://weather.yahooapis.com/forecastrss?w=2484280" |> XDocument.Load
let allHighAndLowAttributes =
    weatherXml.Descendants()
    |> Seq.filter (fun e -> e.Name.LocalName = "forecast")
    |> Seq.collect (fun e -> e.Attributes())
    |> Seq.filter (fun a -> a.Name.LocalName = "high" || a.Name.LocalName = "low")
The value allHighAndLowAttributes is a sequence of XAttribute objects, namely the attributes named "high" and "low", which are what we need to compute the expected average temperature over all the forecasts provided by the Yahoo weather service. From here, computing the average is a matter of converting each value from string to double and averaging the results:
open System

let weatherXml = "http://weather.yahooapis.com/forecastrss?w=2484280" |> XDocument.Load
let averageForecastTemp =
    weatherXml.Descendants()
    |> Seq.filter (fun e -> e.Name.LocalName = "forecast")
    |> Seq.collect (fun e -> e.Attributes())
    |> Seq.filter (fun a -> a.Name.LocalName = "high" || a.Name.LocalName = "low")
    |> Seq.map (fun a -> a.Value |> Double.Parse)
    |> Seq.average
Two stages are added to the pipeline here. First, Seq.map takes the Value property of each attribute and parses it into a double. For the sake of clarity, this assumes that every attribute value is numeric; we could also use Double.TryParse to filter out any attributes with a non-numeric value.
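One way the Double.TryParse variation might look is sketched below. The tryParseDouble helper and the sample values are hypothetical; the idea is simply to turn a failed parse into None and let Seq.choose drop it.

```fsharp
open System
open System.Globalization

// Hypothetical helper: Some value when the text is numeric, None otherwise.
let tryParseDouble (text: string) =
    match Double.TryParse(text, NumberStyles.Float, CultureInfo.InvariantCulture) with
    | true, value -> Some value
    | false, _ -> None

// Non-numeric entries such as "n/a" are silently skipped.
let averageOfNumeric =
    [ "29"; "23"; "n/a"; "28"; "25" ]
    |> Seq.choose tryParseDouble
    |> Seq.average           // 26.25
```

Note that Seq.average still throws if no value at all survives the filter, so an empty result needs its own handling.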
The second stage takes the resulting doubles and produces the average using Seq.average.
Of course, the readability and reusability of this leaves much to be desired, so a quick refactoring where the functions are more properly named and where the magic number in the URI is removed reads as follows:
let getYahooWeather id =
    let weatherXml =
        sprintf "http://weather.yahooapis.com/forecastrss?w=%d" id
        |> XDocument.Load
    let byForecastElements (e:XElement) = e.Name.LocalName = "forecast"
    let elementToAttributes (e:XElement) = e.Attributes()
    let byHighAndLowAttributes (a:XAttribute) =
        a.Name.LocalName = "high" || a.Name.LocalName = "low"
    let attributeValueToDouble (a:XAttribute) = a.Value |> Double.Parse
    let averageForecastTemp =
        weatherXml.Descendants()
        |> Seq.filter byForecastElements
        |> Seq.collect elementToAttributes
        |> Seq.filter byHighAndLowAttributes
        |> Seq.map attributeValueToDouble
        |> Seq.average
    averageForecastTemp

let avgRomeovllTmp = 2484280 |> getYahooWeather
printfn "Avg Forecast Temp for Romeoville, IL is %g" avgRomeovllTmp
Although this last step does not add much functionality to the query, it does make it simpler to understand what the query is doing. It also provides a means to perhaps compose other queries out of the parts that were used to build this one.
Of course, simply taking a forecast from a source and republishing it is not terribly useful. Many applications that deal with XML want to process the XML in some way that adds value to the original document.
In this use case, the goal is to take the weather information from Romeoville, IL, and add some useful information about such a "vacation destination." Let's start again by writing a function that creates the weather XDocument based on the Yahoo "Where on Earth ID" of a locality:
let getYahooWeather id =
    sprintf "http://weather.yahooapis.com/forecastrss?w=%d" id
    |> XDocument.Load
To add the information, we need to create an XElement that represents information about a community:
let makeCommunityInfoElement (s:string) = new XElement(XName.Get("communityinfo"),s)
One notable thing about this statement is that if s is not constrained to string, the compiler cannot determine which XElement constructor overload to call. This is because there are two constructors that take an XName as a first parameter combined with a second parameter.
When we have these tools, all that is needed is a function that will insert an XElement in some location in the document that makes sense. If what is wanted is an element to be inserted after all the individual forecast elements, write the following:
let insertCommunityInfo (doc:XDocument) (commInfo:XElement) =
    let last sequence =
        sequence |> Seq.skip ((sequence |> Seq.length) - 1) |> Seq.head
    let lastForecast =
        doc.Descendants()
        |> Seq.filter (fun e -> e.Name.LocalName = "forecast")
        |> last
    do commInfo |> lastForecast.AddAfterSelf
One thing that is quickly found is that the Seq module used here has no last function for seq<'a>, so one needs to be written. Doing so is a matter of skipping every item except the last using Seq.skip, passing it a count of Seq.length - 1, and then taking the Seq.head of the resulting one-item sequence.
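The skip-based approach walks the sequence twice (once for the length, once for the skip). As an alternative sketch, the hypothetical lastOnePass helper below makes a single pass by keeping only the most recent item; F# 4.0 and later also ship a built-in Seq.last that serves the same purpose.

```fsharp
// Hypothetical helper: one pass over the sequence, keeping the latest item.
// Like Seq.head, it throws on an empty sequence.
let lastOnePass (items: seq<'a>) =
    items |> Seq.reduce (fun _ next -> next)

let viaReduce = lastOnePass [ 1; 2; 3 ]        // 3

// Built-in equivalent, available in F# 4.0 and later.
let viaBuiltIn = Seq.last [ "a"; "b"; "c" ]    // "c"
```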
When a workable implementation of last is written, we can much more easily get the last forecast by taking all elements via doc.Descendants(), filtering by forecast, and taking the last item using last.
With lastForecast in hand, the next step is to take the commInfo XElement and pass it to the AddAfterSelf method. In this case, the do syntax is used to more clearly specify that what follows will cause a side effect. Although we could achieve the same result by passing the result of AddAfterSelf to ignore, using the do syntax more explicitly signals to the reader of the code that we are doing something that causes a side effect, namely, mutating the XML structure by adding an element.
Next, put this all together using the following:
let townDoc = 2484280 |> getYahooWeather
let townInfo = "Fine Bedroom Community with Two Sushi Bars" |> makeCommunityInfoElement
do insertCommunityInfo townDoc townInfo
Here the code is simply getting an XDocument to manipulate, getting an XElement to add to the tree, and then affixing the new XElement to the XDocument using insertCommunityInfo. Again, the do statement is used to indicate that we expect the line to cause a side effect.
This is just one example of many things you can do to process XML. The following methods can also be called to provide various means of processing the XML tree.
METHOD | PURPOSE |
---|---|
Add | Adds the specified content as a child |
AddAfterSelf | Adds the specified content immediately after this node |
AddAnnotation | Adds an object to the annotation list |
AddBeforeSelf | Adds the specified content immediately before this node |
AddFirst | Adds the specified content as the first child |
Remove | Removes this node from its parent |
RemoveAll | Removes all nodes and attributes |
RemoveAnnotations | Removes all annotations |
RemoveAttributes | Removes all attributes |
RemoveNodes | Removes all nodes |
ReplaceAll | Replaces the child nodes and the attributes of this element with the specified content |
ReplaceAttributes | Replaces the attributes of this element with the specified content |
ReplaceNodes | Replaces the children nodes with the specified content |
ReplaceWith | Replaces this node with the specified content |
SetAttributeValue | Sets the value of an attribute, adds an attribute, or removes an attribute |
SetElementValue | Sets the value of a child element, adds a child element, or removes a child element |
SetValue | Sets the value of this element |
Of course, the standard warning about programming with side effects applies to any code that mutates the XML document using these methods. Using these methods with async workflows, or with any other technology such as PLINQ that may perform the operations in a different order, can result in bugs that are hard to re-create.
In the .NET framework, writing XML uses similar techniques, nearly all of which involve a method that does XML serialization using the XmlWriter class. The simplest way to write out an XML file, given an XDocument named townDoc, is the following:
let writeToXml (doc:XDocument) (file:string) =
    use xmlWriter = file |> XmlWriter.Create
    do doc.WriteTo xmlWriter

do writeToXml townDoc "yourOutput.xml"
Running this code will serialize the contents of townDoc to yourOutput.xml in the current directory, overwriting the file if it already exists. One particularly important step here is to bind the xmlWriter with use, so that the xmlWriter object is closed when writeToXml completes, thereby flushing the underlying stream buffer.
Note that none of these examples worry about handling exceptions. Because these examples have nothing they can do in response to any thrown exception (such as a file being locked that we want to write to), it is assumed that the exceptions will be handled by something further up the call stack.
Frequently, the need arises to serialize to something other than a file. The following code will serialize a document to a MemoryStream:
let writeToMemory (doc:XDocument) =
    use stream = new MemoryStream()
    use xmlWriter = stream |> XmlWriter.Create
    do doc.WriteTo xmlWriter
    // flush the writer so that the stream holds the complete document
    do xmlWriter.Flush()
    //do something with the memory stream

do writeToMemory townDoc
In the preceding case where we are writing to memory, we create a stream (note the use of the use binding again), create an XmlWriter using the stream, and use the same WriteTo method with the resulting XmlWriter. The only difference here when compared to writing XML to a file is that we explicitly create the kind of stream we want to write to. Most other types of XML output will work in a similar fashion.
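As one illustration of "doing something" with the stream, the sketch below writes a document to a MemoryStream and reads the bytes back as a string. The xmlToUtf8String name and the choice of UTF-8 without a byte order mark are assumptions made for the example.

```fsharp
open System.IO
open System.Text
open System.Xml
open System.Xml.Linq

// Hypothetical helper: serialize a document to memory and return it as text.
let xmlToUtf8String (doc: XDocument) =
    use stream = new MemoryStream()
    // UTF-8 without a byte order mark, so the string starts with the XML itself.
    let settings = XmlWriterSettings(Encoding = (UTF8Encoding(false) :> Encoding))
    use xmlWriter = XmlWriter.Create(stream, settings)
    doc.WriteTo xmlWriter
    // Flush before reading; otherwise buffered output may be missing.
    xmlWriter.Flush()
    Encoding.UTF8.GetString(stream.ToArray())

let serialized = xmlToUtf8String (XDocument.Parse "<town>Romeoville</town>")
```

The explicit Flush matters here because the stream is read before the writer is disposed.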
F# just as easily supports XML DOM, and for that matter, any other .NET-based API for working with XML content. Although LINQ-to-XML is certainly convenient, many developers (or their managers!) prefer to stick with DOM because of the status of XML DOM as a W3C standard.
Note that, technically, Microsoft's implementation of DOM isn't 100% compliant with the W3C DOM specification, because it takes a few of the DOM APIs and transforms them into something more C#/.NET-appropriate. Having said that, conceptually, they are identical.
Reading XML using XML DOM in F# is somewhat similar to the way you would do it using LINQ-to-XML:
let getYahooWeatherDOM id =
    use xmlReader =
        sprintf "http://weather.yahooapis.com/forecastrss?w=%d" id
        |> XmlReader.Create
    let xmlDoc = new XmlDocument()
    do xmlReader |> xmlDoc.Load
    xmlDoc
A key difference between the APIs is that the XmlDocument object itself is explicitly created, rather than using a factory method like we did with LINQ to XML. The XmlReader object is also created separately. The Load method of the XmlDocument then takes an XmlReader as a parameter, populating the XmlDocument.
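Of course, there are other means of loading an XmlDocument object, such as the following:
METHOD | PURPOSE |
---|---|
Load | Create an XmlDocument from a file name or URL, a Stream, a TextReader, or an XmlReader |
LoadXml | Create an XmlDocument from a string that contains XML |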
To start simply reading from the document, run the following:
let weatherDom = 2484280 |> getYahooWeatherDOM
do
    weatherDom.SelectNodes("*")
    |> Seq.cast<XmlNode>
    |> Seq.iter (fun e -> do printfn "%s" e.InnerText)
A couple of complexities emerge when reading via XML DOM. The first is that querying a DOM document, rather than simply reading it, means working with XPath, a standard query language for XML. The SelectNodes method of an XmlDocument object takes an XPath query as a parameter. The query for all elements at the top level of a document is a simple wildcard ("*"), which is used above with SelectNodes. Note also that the example prints InnerText rather than Value, because the Value of an element node in DOM is null.
Another complexity when dealing with DOM is that the results of XPath queries do not support the IEnumerable<T> interface, which is required to use most Seq functions to interact with the document. Should we want to use Seq functions, the result of any SelectNodes query has to be passed to Seq.cast<XmlNode>, which takes an IEnumerable and produces an IEnumerable<T>. Note that this will fail if any objects in the source collection do not derive from XmlNode; thankfully, everything that SelectNodes returns does in fact derive from XmlNode, so this is not an issue in this case.
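The cast can be tried out without a network call. In this minimal sketch, the channelDom document is a hypothetical inline stand-in, and the node list comes from ChildNodes, which has the same XmlNodeList type that SelectNodes returns.

```fsharp
open System.Xml

// Hypothetical inline document standing in for the weather feed.
let channelDom = XmlDocument()
channelDom.LoadXml("<channel><title>Weather</title><ttl>60</ttl></channel>")

// XmlNodeList is a non-generic IEnumerable; Seq.cast gives it an element type
// so that the rest of the Seq module can be used.
let childNames =
    channelDom.DocumentElement.ChildNodes
    |> Seq.cast<XmlNode>
    |> Seq.map (fun n -> n.Name)
    |> List.ofSeq            // ["title"; "ttl"]
```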
Going further with querying using XML DOM involves getting more familiar with XPath syntax. In the prior querying example with LINQ to XML, solving the problem required retrieval of all the attributes with names equal to high and low in the document. Below are some queries that demonstrate different means of querying for the attributes of interest, including the all high or low query that is useful for solving the problem:
let weatherDom = 2484280 |> getYahooWeatherDOM
let allAttributesInTheDocument = weatherDom.SelectNodes("//@*")
let allLowAttributes = weatherDom.SelectNodes("//@low")
//the query we really want...
let allLowOrHighAttributes = weatherDom.SelectNodes("//@low | //@high")
Here are three different queries, all of which demonstrate, with increasing precision, ways to reach the attributes we are looking for. The "//" string tells XPath to recurse through the entire document hierarchy. The "@" symbol then tells XPath to look for attributes, and must be followed by a name test; in the first query, the wildcard "*" matches attributes of any name. In the second query, we further specify the name of the attributes we are looking for (low, in this case). The "//@low | //@high" query specifies both high and low attributes.
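To see how the attribute queries narrow the result, the sketch below runs three such queries against a small hypothetical document and counts the matches. The first query uses "//@*", with a wildcard name test after the "@". The count helper is an illustrative name, not part of the DOM API.

```fsharp
open System.Xml

// Hypothetical fragment resembling the forecast elements of the feed.
let forecastDom = XmlDocument()
forecastDom.LoadXml("""<item>
  <forecast day="Sun" low="23" high="29" />
  <forecast day="Mon" low="25" high="28" />
</item>""")

// Hypothetical helper: number of nodes an XPath query selects.
let count (xpath: string) = forecastDom.SelectNodes(xpath).Count

let allAttributes = count "//@*"               // 6: every attribute in the document
let lowOnly = count "//@low"                   // 2: only the low attributes
let lowOrHigh = count "//@low | //@high"       // 4: the union of low and high
```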
Recall that the problem criteria require retrieval of the average temperature. To get the average temperature again, do the following:
let weatherDom = 2484280 |> getYahooWeatherDOM
let allLowOrHighAttributes = weatherDom.SelectNodes("//@low | //@high")
let nodeValueToDouble (n:XmlNode) = n.Value |> Double.Parse
let averageTemp =
    allLowOrHighAttributes
    |> Seq.cast<XmlNode>
    |> Seq.map nodeValueToDouble
    |> Seq.average
As before, when the solution was to simply read elements, this solution requires casting the result of the XPath query using Seq.cast<XmlNode>, so you can do further Seq operations on it. Once the nodes are cast to XmlNode, the next step is to parse the attribute strings that represent the temperatures into double (using the nodeValueToDouble function). Once they are converted into a sequence of doubles, the average can be computed using Seq.average.
XML DOM provides plenty of means to manipulate XML documents. In the previous example for LINQ-to-XML, information was added to the weather forecast. The process is similar in DOM, which starts by creating the XmlNode that should be added to the DOM:
let communityInfo = weatherDom.CreateElement("communityinfo")
do communityInfo.InnerText <- "Fine Community with Two Sushi Bars"
A key difference in DOM is that elements are created via the parent document in which the element will live, which is what weatherDom.CreateElement("communityinfo") does above. Setting the content is a separate line of code where the InnerText property is mutated to contain the content that we want.
let insertCommunityInfoDom (doc:XmlDocument) (commInfo:XmlNode) =
    let last sequence =
        sequence |> Seq.skip ((sequence |> Seq.length) - 1) |> Seq.head
    let lastForecast =
        doc.SelectNodes("//*[local-name()='forecast']")
        |> Seq.cast<XmlNode>
        |> last
    do lastForecast.ParentNode.InsertAfter(commInfo, lastForecast) |> ignore

do insertCommunityInfoDom weatherDom communityInfo
The insertion routine works a bit differently as well. The implementation of last from the LINQ-to-XML example is borrowed (see the prior section on LINQ-to-XML). That is where the similarity ends, though. Getting the last forecast element is a bit more involved, as we need to start by passing an XPath query that specifies forecast elements to the XmlDocument.
Note that the forecast elements are actually in the yweather namespace. In the LINQ-to-XML example, this did not matter, because the queries compare element.Name.LocalName, which ignores the namespace. When using XPath, we have to use the XPath local-name() function to achieve a similar result. We could also attach a namespace to the SelectNodes query; however, it would be quite a bit more work to do so, and is probably unnecessary here.
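For comparison, a namespace-aware query might look like the following sketch, which uses an XmlNamespaceManager to bind a prefix to the yweather namespace URI. The inline document is a hypothetical cut of the feed, and the yw prefix is arbitrary; only the URI has to match.

```fsharp
open System.Xml

// Hypothetical cut-down version of the weather feed.
let namespacedDom = XmlDocument()
namespacedDom.LoadXml("""<rss xmlns:yweather="http://xml.weather.yahoo.com/ns/rss/1.0">
  <yweather:forecast day="Sun" low="23" high="29" />
  <yweather:forecast day="Mon" low="25" high="28" />
</rss>""")

// Map a prefix of our choosing to the namespace URI used in the document.
let namespaces = XmlNamespaceManager(namespacedDom.NameTable)
namespaces.AddNamespace("yw", "http://xml.weather.yahoo.com/ns/rss/1.0")

// The prefix in the query is resolved through the namespace manager.
let forecastCount =
    namespacedDom.SelectNodes("//yw:forecast", namespaces).Count   // 2
```

This is more precise than local-name(), since it will not match a forecast element from some other namespace, at the cost of the extra setup.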
When the forecast nodes have been retrieved, the next step is a familiar cast to a sequence of XmlNode using Seq.cast, followed by grabbing the last element. When the last element is found, call InsertAfter on its ParentNode, whatever that might be, and pass it the new XmlNode as well as the lastForecast node, which is the reference node after which the new node is inserted. Although this call returns the inserted XmlNode, nothing further is needed from it, so it can be safely passed to ignore.
Writing XML out to a file, or a stream, using XML DOM is as simple as, if not simpler than, doing so with LINQ-to-XML. To save to a file, call the Save method, passing it a filename or a fully qualified path:
weatherDom.Save("yourOutput.xml")
Should you want to save to a stream, be it a MemoryStream or any other object that inherits from Stream, pass the Stream object instead:
weatherDom.Save(someStream)
It is notable that writing XML out largely builds on the .NET IO capabilities, nicely making sure that XML processing concerns are separated from concerns related to how such XML is persisted. This general feature helps make sure that any code you write to deal with XML is very much focused on XML processing and not IO concerns.
If we remove the fascination with angle-brackets and structure and look carefully at an XML document, a curious thing emerges: XML looks very much like name-value pairs, name-value-attribute triplets (where the attributes are a list of name-value pairs themselves), name-value-children triplets (where children are another list), or name-value-attribute-children quads. In other words, from a certain point of view, XML documents are basically lists of tuples, nestled in a hierarchical relationship.
Given this perspective, and the fact that active patterns (described in detail in Chapter 6) are used in F# to do "data decomposition," it seems reasonable to expect that active patterns can help break XML trees down into more palatable data structures that F# can process more easily. As it turns out, they can, but doing so requires a slightly different approach to querying and processing than the imperative programmer is used to.
Let's work with a slightly different XML model, one that's a bit simpler than the XML returned by the weather service (and, arguably, more like the XML that flies around inside a corporate intranet), that describes various famous (and/or infamous) characters:
<data>
  <item>
    <person gender="male">
      <name>Ted Neward</name>
      <age>38</age>
      <languages>
        <language>English</language>
        <language>French</language>
        <language>C#</language>
        <language>Java</language>
        <language>F#</language>
        <language>Scala</language>
      </languages>
    </person>
  </item>
  <item>
    <person gender="male">
      <name>Han Solo</name>
      <age>35</age>
      <languages>
        <language>Imperial Standard</language>
        <language>Wookiee</language>
      </languages>
    </person>
  </item>
  <item>
    <person gender="male">
      <name>Gaius Baltar</name>
      <age>35</age>
      <languages>
        <language>Colonial English</language>
        <language>Cylon</language>
      </languages>
    </person>
  </item>
</data>
The goal here is to use active patterns to break this document down into the person structure that appears repeatedly and transform it into a form more easily used within an F# program.
Recall from the discussion on active patterns that three basic forms of active patterns are available: the single-case active pattern, which converts data from one form to another; the partial-case active pattern, which helps to match when data conversion failures are possible or likely; and the multi-case active pattern, which can take the input data and break it down into one of several different data groupings. Although only one structure appears in the preceding example (the person structure), it remains a reasonable assumption to imagine that other data structures can, will, or do appear in the document later. This implies that either the partial-case or the multi-case active pattern will be best suited for extracting the data out of the document; the decision between the two will rest on whether the F# programmer believes they know the full set of data types that the document contains.
This is not a casually discarded decision: XML documents are often used where a certain amount of ambiguity in the data is expected or desired. Yet, much of the XML sent back and forth between organizations is intended to be a closed set of data types nestled in between angle-brackets, with unrecoverable errors thrown when an unknown XML document is received. If ambiguity is expected or desired, then the partial-case should be considered, and if not, then the multi-case active pattern becomes the weapon of choice.
Just for pedagogical purposes, both approaches are considered.
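As a first taste of the partial-case style, here is a minimal sketch in which anything that is not a person element simply fails to match. The PersonElement pattern, the describe function, and the droid element are hypothetical names chosen for illustration, and only the name and age are extracted.

```fsharp
open System.Xml

// Partial pattern: Some payload for <person> elements, None for anything else.
let (|PersonElement|_|) (node: XmlNode) =
    if node.Name = "person" then
        Some (node.["name"].InnerText, int (node.["age"].InnerText))
    else
        None

// Unknown elements fall through to the catch-all case instead of failing.
let describe (node: XmlNode) =
    match node with
    | PersonElement (name, age) -> sprintf "%s is %d" name age
    | other -> sprintf "unrecognized <%s>" other.Name

let mixedDom = XmlDocument()
mixedDom.LoadXml("<data><person><name>Han Solo</name><age>35</age></person><droid /></data>")

let descriptions =
    mixedDom.DocumentElement.ChildNodes
    |> Seq.cast<XmlNode>
    |> Seq.map describe
    |> List.ofSeq   // ["Han Solo is 35"; "unrecognized <droid>"]
```

Because the pattern is partial, the compiler requires the match to handle the case where it does not apply, which is exactly the tolerance for ambiguity described above.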
The multi-case active pattern requires a single function, written in "banana clips" style, which contains all the possible atoms that the XML document can be decomposed into. For an easy start, consider an active pattern that breaks the document down into Node and Leaf elements, showing the basic tree structure of a document:
let (|Node|Leaf|) (node : #System.Xml.XmlNode) =
    if node.HasChildNodes then
        Node (node.Name, seq { for x in node.ChildNodes -> x })
    else
        Leaf (node.InnerText)
Because the parameter to this active pattern can be either an XmlNode or any of its subtypes, the type descriptor is prefixed with a "#" to indicate subtype availability (as described in Chapter 9).
Using this pattern-match rule in a pattern-match statement becomes relatively trivial, allowing us to print the contents of any XML document in nicely indented form:
let printXml node =
    let rec printXml indent node =
        match node with
        | Leaf (text) -> printfn "%s%s" indent text
        | Node (name, nodes) ->
            printfn "%s%s:" indent name
            nodes |> Seq.iter (printXml (indent + " "))
    printXml "" node
As might well be predicted, because the tree structure responds so well to a recursive-descent traversal through the nodes of the tree, the outer printXml function is made up of an inner recursive function that does the actual work, threading an "indent" string (made up of nothing but whitespace) through the descent to give nicely formatted text printed to the console.
Of course, a breakdown of leafs and nodes isn't itself useful; more useful would be to extract the <person> elements and their children into an easy-to-use structure in F#. Given the relatively simple structure of the <person> element, it's easiest to imagine the data extracted as a tuple, specifically a string * string * int * seq<string> tuple type, representing the person's name, gender, age, and the list of languages they speak. Extracting this via an active pattern would thus look like:
let (|Node|Leaf|Person|) (node : #System.Xml.XmlNode) =
    if node.Name = "person" then
        let pGender = node.Attributes.ItemOf("gender").Value
        let pName = node.Item("name").InnerText
        let pAge = Int32.Parse(node.Item("age").InnerText)
        let pLangNode = node.Item("languages")
        let pLangs = seq { for l in pLangNode.ChildNodes -> l.InnerText }
        Person (pName, pGender, pAge, pLangs)
    else if node.HasChildNodes then
        Node (node.Name, seq { for x in node.ChildNodes -> x })
    else
        Leaf (node.InnerText)
This can then be used to get the various "parts" out of the XML via traditional pattern-match construct:
let printXml node =
    let rec printXml indent node =
        match node with
        | Person (n, g, a, ls) ->
            printfn "%sPerson: %s, %s, %d, speaks %d langs" indent n g a (Seq.length ls)
        | Leaf (text) -> printfn "%s%s" indent text
        | Node (name, nodes) ->
            printfn "%s%s:" indent name
            nodes |> Seq.iter (printXml (indent + " "))
    printXml "" node

printXml xmlDoc
The Leaf clause can be removed from both the active pattern rule and the pattern match if needed, but the Node clause has to stay unless specific rules are written to match on the DocumentElement that forms the root XmlNode of an XmlDocument. In general, it seems prudent to keep the Node clause around, with a match that forces the recursive descent further into the tree:
let (|Node|Person|) (node : #System.Xml.XmlNode) =
    if node.Name = "person" then
        let pGender = node.Attributes.ItemOf("gender").Value
        let pName = node.Item("name").InnerText
        let pAge = Int32.Parse(node.Item("age").InnerText)
        let pLangNode = node.Item("languages")
        let pLangs = seq { for l in pLangNode.ChildNodes -> l.InnerText }
        Person (pName, pGender, pAge, pLangs)
    else if (node.HasChildNodes) then
        Node (seq { for x in node.ChildNodes -> x })
    else
        failwith ("Unexpected data: " + node.ToString())

let printXml node =
    let rec printXml node =
        match node with
        | Node (nodes) -> nodes |> Seq.iter printXml
        | Person (n, g, a, ls) ->
            // printfn needs a single literal format string; it cannot be built with +
            printfn "Person: %s is %s, %d, and speaks %d languages" n g a (Seq.length ls)
    printXml node

printXml xmlDoc
Typically, printing to the console is only done during development and debugging; most of the time, it is more useful to pull the data out of the XML document as a sequence of tuples or other strongly typed objects. This means rewriting the pattern-match itself to return a sequence of tuples:
let extract node =
    let rec extract node =
        match node with
        | Person (n, g, a, ls) -> Seq.singleton (n, g, a, ls)
        | Node (nodes) -> Seq.collect (fun (n) -> extract n) nodes
    extract node

let results = extract xmlDoc
for r in results do
    Console.WriteLine("Result: {0}", r)
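To try the tuple extraction in isolation, here is a minimal sketch that loads a small document from a string instead of relying on the xmlDoc defined earlier. The names sampleXml, sampleDoc, extractTuples, and people are made up for illustration, and the sketch uses Seq.cast to give the child-node sequences an explicit element type; it is one way to wire the pieces together, not the only one.

```fsharp
open System
open System.Xml

// Hypothetical sample data, kept in a string so the sketch is self-contained.
let sampleXml =
    "<data><item><person gender='male'>" +
    "<name>Han Solo</name><age>35</age>" +
    "<languages><language>Imperial Standard</language>" +
    "<language>Wookiee</language></languages>" +
    "</person></item></data>"

let sampleDoc = XmlDocument()
sampleDoc.LoadXml(sampleXml)

// Complete (multi-case) active pattern: a node is either a Person or a Node.
let (|Node|Person|) (node : XmlNode) =
    if node.Name = "person" then
        let pGender = node.Attributes.GetNamedItem("gender").Value
        let pName = node.["name"].InnerText
        let pAge = Int32.Parse(node.["age"].InnerText)
        let pLangs =
            node.["languages"].ChildNodes
            |> Seq.cast<XmlNode>
            |> Seq.map (fun l -> l.InnerText)
            |> List.ofSeq
        Person (pName, pGender, pAge, pLangs)
    elif node.HasChildNodes then
        Node (Seq.cast<XmlNode> node.ChildNodes)
    else
        failwith ("Unexpected data: " + node.OuterXml)

// Recursive descent: a Person yields one tuple, a Node flattens its children.
let rec extractTuples (node : XmlNode) =
    match node with
    | Person (n, g, a, ls) -> Seq.singleton (n, g, a, ls)
    | Node nodes -> Seq.collect extractTuples nodes

let people = extractTuples (sampleDoc :> XmlNode) |> List.ofSeq
```

With this sample document, people holds a single tuple for Han Solo. The languages are materialized into a list here only so that the result is easy to inspect and compare.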
Of course, after a certain point, tuples may want to become fully fledged domain objects:
type Person(name : string, gender : string, age : int, langs : seq<string>) =
    member p.Name with get() = name
    member p.Gender with get() = gender
    member p.Age with get() = age
    member p.Languages with get() = langs
    override p.ToString() =
        String.Format("[Person: {0} is {1}, {2}, and speaks {3}]",
                      name, gender, age.ToString(),
                      (Seq.reduce (fun (l) (s) -> l + ", and " + s) langs))
When that happens, the active-pattern rule takes that into account, returning a Person object instead of a tuple:
let (|Node|Person|) (node : #System.Xml.XmlNode) =
    if node.Name = "person" then
        let pGender = node.Attributes.ItemOf("gender").Value
        let pName = node.Item("name").InnerText
        let pAge = Int32.Parse(node.Item("age").InnerText)
        let pLangNode = node.Item("languages")
        let pLangs = seq { for l in pLangNode.ChildNodes -> l.InnerText }
        Person (new Person(pName, pGender, pAge, pLangs))
    else if (node.HasChildNodes) then
        Node (seq { for x in node.ChildNodes -> x })
    else
        failwith ("Unexpected data: " + node.ToString())
which in turn makes the transformation from XML to a sequence of domain objects just a bit different:
let extract node =
    let rec extract node =
        match node with
        | Person (p) -> Seq.singleton p
        | Node (nodes) -> Seq.collect (fun (n) -> extract n) nodes
    extract node

let results = extract xmlDoc
for r in results do
    Console.WriteLine("Result: {0}", r)
The result of the extract function is a seq<Person>, which is about as straightforward an extraction result as the F# programmer could want. Things get a tad more interesting (thanks to F#'s type inference) when more than one domain object can appear in the XML; right now, the F# type inferencer assumes that extract produces a sequence of Person objects out of the XML document. If a new domain type is introduced into the system, such as:
type Ship(name : string, jumpCapable : bool) =
    member s.Name with get() = name
    member s.Jump with get() = jumpCapable
    override s.ToString() =
        String.Format("[Ship: {0}, jump={1}]", name, jumpCapable.ToString())
then extracting it from the XML is ridiculously simple, as we'd hope:
let (|Node|Person|Ship|) (node : #System.Xml.XmlNode) =
    if node.Name = "person" then
        let pGender = node.Attributes.ItemOf("gender").Value
        let pName = node.Item("name").InnerText
        let pAge = Int32.Parse(node.Item("age").InnerText)
        let pLangNode = node.Item("languages")
        let pLangs = seq { for l in pLangNode.ChildNodes -> l.InnerText }
        Person (new Person(pName, pGender, pAge, pLangs))
    else if (node.Name = "ship") then
        let sName = node.Item("name").InnerText
        let sJump = node.Attributes.ItemOf("jump").Value
        Ship (new Ship(sName, (sJump = "true")))
    else if (node.HasChildNodes) then
        Node (seq { for x in node.ChildNodes -> x })
    else
        failwith ("Unexpected data: " + node.ToString())
But the pattern-match rule has to change slightly; if the Ship clause is simply inserted into the pattern-match, the compiler complains:
let extract node =
    let rec extract node =
        match node with
        | Person (p) -> Seq.singleton p
        // does not compile: this branch yields seq<Ship>, not seq<Person>
        | Ship (s) -> Seq.singleton s
        | Node (nodes) -> Seq.collect (fun (n) -> extract n) nodes
    extract node

let results = extract xmlDoc
for r in results do
    Console.WriteLine("Result: {0}", r)
specifically, that the Ship clause doesn't return a Person object. This is because the type inferencer in F# has assumed that the extract function wants to take in an XmlNode and return a sequence of Person objects, which the Ship object obviously isn't. If Ship inherited from Person, an upcast to Person would satisfy the compiler, but as written right now, Ship doesn't.
Fortunately, Ship and Person do both inherit from a common base class, System.Object, so it's simply a matter of telling the F# compiler this: upcast each domain object to System.Object during the pattern-match, and ask the compiler to see the result as a sequence of Object rather than a sequence of Person:
let extract node : seq<obj> =
    let rec extract node : seq<obj> =
        match node with
        | Ship (s) -> Seq.singleton (s :> obj)
        | Person (p) -> Seq.singleton (p :> obj)
        | Node (nodes) -> Seq.collect (fun (n) -> extract n) nodes
    extract node

let results = extract xmlDoc
for r in results do
    Console.WriteLine("Result: {0}", r)
And now, any number of domain types can be added to the active pattern rule and returned from the extract function.
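The upcast idea can be seen without any XML at all. The following is a small sketch using two hypothetical stand-in types, Pilot and Freighter (they are not part of the chapter's domain model), to show why the upcasts are needed and how a type-test pattern recovers the concrete type afterward.

```fsharp
// Two unrelated, hypothetical domain types (stand-ins for Person and Ship).
type Pilot(name : string) =
    member p.Name = name

type Freighter(name : string, jumpCapable : bool) =
    member f.Name = name
    member f.Jump = jumpCapable

// Without the upcasts, the first element would fix the element type as Pilot
// and the second element would be rejected by the compiler.
let crewAndShips =
    [ Pilot("Han Solo") :> obj
      Freighter("Falcon", true) :> obj ]

// A type-test pattern recovers the concrete type on the way back out.
let describe (o : obj) =
    match o with
    | :? Pilot as p -> "pilot " + p.Name
    | :? Freighter as f -> "freighter " + f.Name
    | _ -> "unknown"

let descriptions = crewAndShips |> List.map describe
```

The cost of this design is that the compiler no longer knows which types can appear in the sequence, so every consumer needs a catch-all case like the last one in describe.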
Unfortunately, the drawback to the multi-case solution comes when the upstream source of the XML document throws something "new" into the XML stream. Not that clients would actually ever do that, of course, but still, a more robust solution would allow for a certain amount of forgiveness.
Operating on a slightly different XML example from before, we can introduce some "unknown" structure into the XML document that is to be parsed and extracted into domain objects:
<data>
  <item>
    <person gender="male">
      <name>Ted Neward</name>
      <age>38</age>
      <languages>
        <language>English</language>
        <language>French</language>
        <language>C#</language>
        <language>Java</language>
        <language>F#</language>
        <language>Scala</language>
      </languages>
    </person>
  </item>
  <item>
    <person gender="male">
      <name>Han Solo</name>
      <age>35</age>
      <languages>
        <language>Imperial Standard</language>
        <language>Wookiee</language>
      </languages>
    </person>
  </item>
  <ship jump="true">
    <name>Millenium Falcon</name>
  </ship>
  <fairyTalePrincess>
    <name>Sleeping Beauty</name>
    <ending>Happy</ending>
  </fairyTalePrincess>
  <fairyTalePrincess>
    <name>Cinderella</name>
    <ending>Happy</ending>
  </fairyTalePrincess>
  <item>
    <person gender="male">
      <name>Gaius Baltar</name>
      <age>35</age>
      <languages>
        <language>English</language>
        <language>Cylon</language>
      </languages>
    </person>
  </item>
  <ship jump="true">
    <name>Galactica</name>
  </ship>
</data>
Where'd those fairyTalePrincess elements come from? Clearly, as the children's television show implied, they are the "one of these things that doesn't belong," but what can we do? Clients sometimes don't send the data that is expected.
The partial-case active pattern requires a function for each "thing" that the XML might be extracted into, with a wildcard (_) as the last case in the pattern name to indicate that the match might not always succeed. This is where the partial match is of greater benefit: because the partial-match pattern doesn't assume that it covers all the possible cases the source (the XML node) can be extracted into, it neatly and efficiently bypasses any source that it doesn't understand.
To start, create the partial-match active pattern rules for the two types we do know about, Person and Ship:
let (|Person|_|) (node : #System.Xml.XmlNode) =
    if node.Name = "person" then
        let pGender = node.Attributes.ItemOf("gender").Value
        let pName = node.Item("name").InnerText
        let pAge = Int32.Parse(node.Item("age").InnerText)
        let pLangNode = node.Item("languages")
        let pLangs = seq { for l in pLangNode.ChildNodes -> l.InnerText }
        Some (new Person(pName, pGender, pAge, pLangs))
    else
        None

let (|Ship|_|) (node : #System.Xml.XmlNode) =
    if (node.Name = "ship") then
        let sName = node.Item("name").InnerText
        let sJump = node.Attributes.ItemOf("jump").Value
        Some (new Ship(sName, (sJump = "true")))
    else
        None
Bear in mind, again, that each partial-match rule must yield an Option value, either Some (wrapping the extracted value) or None. Other than that, the partial-match rules for extracting the domain objects out of the XML document are remarkably similar to the ones used for the multi-case match. This is actually comforting: it means that refactoring from one style to the other will be relatively trivial.
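As a minimal, self-contained sketch of the Option-returning shape, the following partial pattern recognizes only ship elements and lets everything else fall through to a wildcard. The names ShipName, partialDoc, classify, and labels are made up for illustration, and the document is loaded from an inline string rather than from the chapter's xmlDoc.

```fsharp
open System.Xml

// Partial active pattern: Some name for <ship> elements, None for anything else.
let (|ShipName|_|) (node : XmlNode) =
    if node.Name = "ship" then Some (node.["name"].InnerText) else None

let partialDoc = XmlDocument()
partialDoc.LoadXml(
    "<data><ship jump='true'><name>Galactica</name></ship>" +
    "<fairyTalePrincess><name>Cinderella</name></fairyTalePrincess></data>")

// Unrecognized elements simply fail the ShipName match and reach the wildcard.
let classify (node : XmlNode) =
    match node with
    | ShipName name -> "ship: " + name
    | _ -> "ignored: " + node.Name

let labels =
    partialDoc.DocumentElement.ChildNodes
    |> Seq.cast<XmlNode>
    |> Seq.map classify
    |> List.ofSeq
```

Here labels contains one entry for the ship and one "ignored" entry for the unrecognized element; nothing throws when the unknown element is encountered.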
Still present is the problem of the nodes that the code will hit before the person or ship elements and the unrecognized elements like fairyTalePrincess. These are covered in the pattern-match itself:
let extract node : seq<obj> =
    let rec extract node : seq<obj> =
        match node with
        | Ship (s) -> Seq.singleton (s :> obj)
        | Person (p) -> Seq.singleton (p :> obj)
        | node when node.HasChildNodes ->
            let children = seq { for n in node.ChildNodes -> n }
            Seq.collect (fun (n) -> extract n) children
        | _ -> Seq.empty
    extract node
Again, we have to help the F# compiler along a little bit by declaring the returned sequence to be a sequence of Objects. And rather than creating an explicit partial-match rule for Node objects, which really isn't a data type we're trying to work with, it's easier in this case to use a pattern guard to determine whether the node has any child nodes and, if so, to walk through each of those and recursively call extract on them. And, as the stunning coup de grace, if the node doesn't match any of these three conditions, an empty sequence is returned.
Later, if there is an element that is known to be ignorable (that is, one for which it can be stated with certainty that it has nothing of interest to us), the parser can recognize that element and use it as a signal to prune the XML hierarchy that is being parsed:
let extract node : seq<obj> =
    let rec extract node : seq<obj> =
        match node with
        | Ship (s) -> Seq.singleton (s :> obj)
        | Person (p) -> Seq.singleton (p :> obj)
        // must come before the HasChildNodes clause, or it would never be reached
        | node when node.Name = "fairyTalePrincess" -> Seq.empty
        | node when node.HasChildNodes ->
            let children = seq { for n in node.ChildNodes -> n }
            Seq.collect (fun (n) -> extract n) children
        | _ -> Seq.empty
    extract node
This will prevent the traversal of the nodes underneath the fairyTalePrincess element and save a few matches and recursive calls. For a small element like fairyTalePrincess, it won't make a huge difference; in a multi-megabyte XML document whose elements each have hundreds of child elements, it will.
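Match clauses are tried from top to bottom, so a pruning clause only saves work if it is reached before the general clause that walks child nodes. The sketch below makes the saving visible by counting visited nodes instead of extracting objects; countVisited, pruneDoc, and the skip parameter are hypothetical names introduced just for this illustration.

```fsharp
open System.Xml

let pruneDoc = XmlDocument()
pruneDoc.LoadXml(
    "<data><fairyTalePrincess><name>Cinderella</name><ending>Happy</ending>" +
    "</fairyTalePrincess><ship jump='true'><name>Galactica</name></ship></data>")

// Counts the nodes visited during the descent. `skip` names an element whose
// subtree is not traversed; an empty string matches no node, so nothing is pruned.
let rec countVisited (skip : string) (node : XmlNode) : int =
    match node with
    | n when n.Name = skip -> 1   // pruned: the children are never visited
    | n when n.HasChildNodes ->
        1 + (n.ChildNodes |> Seq.cast<XmlNode> |> Seq.sumBy (countVisited skip))
    | _ -> 1

let root = pruneDoc.DocumentElement :> XmlNode
let visitedWithoutPruning = countVisited "" root
let visitedWithPruning = countVisited "fairyTalePrincess" root
```

For this tiny document the full traversal visits nine nodes (elements plus their text nodes) and the pruned traversal visits five. If the two guarded clauses were swapped, both counts would be nine, because the child-walking clause would match fairyTalePrincess first.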
Regardless of whether the partial-case or multi-case approach is used, the net result is a relatively easy, scalable way to parse XML documents and extract the data into strongly typed domain objects for further processing:
let results = extract xmlDoc
for r in results do
    Console.WriteLine("Result: {0}", r)
And because the results of the extracting are a sequence of strongly typed objects, we could use pattern-matching again to walk through the sequence and do something more meaningful with the objects contained therein:
let results = extract xmlDoc
for r in results do
    Console.WriteLine("Result: {0}", r)
    match r with
    | :? Ship as s ->
        Console.WriteLine("The ship {0} {1}", s.Name,
                          (if s.Jump then "is jump-capable" else "is slower-than-light"))
    | :? Person as p -> Console.WriteLine("Found {0}", p.Name)
    | _ -> ()
Regardless of what work needs to be done, the active patterns feature of F# allows for some easily read and easily maintained code.
In this chapter, we have covered how to deal with XML in F# using two of the most common methods that F# programmers will use, LINQ-to-XML and XML DOM. LINQ-to-XML approaches that eschew XPath can certainly work; however, XPath tends to produce more concise queries, especially if the people you are working with already understand it. Which you use is a matter of choice, most commonly made by the group you are working with or by any organizational standards you might have or, lacking any of those constraints, by personal preference.