If the only new thing we have to offer is an improved version of the past, then today can only be inferior to yesterday. Hypnotized by images of the past, we risk losing all capacity for creative change.
Robert Hewison
XSLT 2.0 has numerous additions and enhancements that make doing hard things in XSLT easier. This chapter will help the XSLT 1.0 veteran make the transition to 2.0 and also help the XSLT newbie understand how to better approach stylesheet design in 2.0.
XSLT 2.0 derives much of its improved functionality from XPath 2.0, so if you have skipped Chapter 1, then you should consider reading that first.
As with most progress in software technology, version 2.0 is an improved version of the old rather than a complete rethinking of stylesheet design. It also falls short in some features that could have made it a much better language (e.g., introspection and direct support for higher-order functions). However, XSLT 2.0 goes a long way toward alleviating the drudgery of developing complex stylesheet logic. The key features in 2.0 are XPath functions, grouping, enhanced modes, cleaner idioms for reusable code, a richer type system, and enhanced support for text processing. All of these features are used to improve the 2.0 recipes that appear in this edition, but this chapter provides a one-stop reference for the new features themselves.
XSLT 1.0 did not support writing XPath functions in XSLT, and named templates are an awkward substitute.
Prefer XSLT 2.0 functions over named templates when the purpose is solely to compute a result rather than create serialized content. Below I show a potpourri of examples where functions are much more convenient than named templates:
<!-- Mathematical computations -->
<xsl:function name="ckbk:factorial" as="xs:decimal">
  <xsl:param name="n" as="xs:integer"/>
  <xsl:sequence select="if ($n eq 0) then 1 else $n * ckbk:factorial($n - 1)"/>
</xsl:function>

<!-- Simple mappings -->
<xsl:function name="ckbk:decodeColor" as="xs:string">
  <xsl:param name="colorCode" as="xs:integer"/>
  <xsl:variable name="colorLookup"
      select="('black','red','orange','yellow',
               'green','blue','indigo','violet','white')"/>
  <!-- Color codes start at 0, but XPath sequences are indexed from 1 -->
  <xsl:sequence select="if ($colorCode ge 0 and $colorCode lt count($colorLookup))
                        then $colorLookup[$colorCode + 1]
                        else 'no color'"/>
</xsl:function>

<!-- String manipulations -->
<xsl:function name="ckbk:reverse" as="xs:string">
  <xsl:param name="input" as="xs:string"/>
  <xsl:sequence select="codepoints-to-string(reverse(string-to-codepoints($input)))"/>
</xsl:function>
Recall that named templates are an alternative to templates that are invoked strictly by matching. Named templates act much like procedures in traditional languages because an XSLT programmer explicitly transfers control to a named template via xsl:call-template, rather than relying on the more declarative semantics of template matching. A nice feature of XSLT (both 1.0 and 2.0) is that you can mix these styles by giving a template both a pattern and a name.
User-defined XSLT 2.0 functions are not a substitute for named templates. The key question to ask yourself when choosing one over the other is: are you simply computing a result or are you creating a reusable content producer? The former is better expressed as a function and the latter as a template. In XSLT 2.0 Programmer’s Reference, Michael Kay recommends using functions in cases where you are simply selecting nodes and templates when you are creating new ones, even though XSLT will allow you to use functions for the latter.
This function is selecting nodes:
<xsl:function name="ckbk:getParts" as="item()*">
  <!-- A function body has no context item, so the document to search is passed in -->
  <xsl:param name="doc" as="node()"/>
  <xsl:param name="startPartId" as="xs:string"/>
  <xsl:param name="endPartId" as="xs:string"/>
  <xsl:sequence select="$doc//Parts/part[@partId ge $startPartId and @partId le $endPartId]"/>
</xsl:function>
This function is creating new nodes, but perhaps a template would make more sense:
<xsl:function name="ckbk:getPartsElem" as="item()">
  <!-- A function body has no context item, so the document to search is passed in -->
  <xsl:param name="doc" as="node()"/>
  <xsl:param name="startPartId" as="xs:string"/>
  <xsl:param name="endPartId" as="xs:string"/>
  <Parts>
    <xsl:copy-of select="$doc//Parts/part[@partId ge $startPartId and @partId le $endPartId]"/>
  </Parts>
</xsl:function>
XSLT 1.0 did not have explicit support for grouping so indirect and potentially confusing techniques had to be invented.
Take advantage of the powerful xsl:for-each-group instruction for all your grouping needs. This instruction has a mandatory select attribute where you provide an expression that defines the population of nodes you wish to group. You then use one of four grouping attributes to define the criteria for dividing the population into groups. These are explained next. As each group is processed, you can use the function current-group() to access all nodes in the current group. You use the function current-grouping-key() to access the value of the key that defines the group being processed when grouping by value or adjacent nodes. The current-grouping-key() function returns an empty sequence when grouping by starting or ending node.
You can also sort groups by inserting one or more xsl:sort instructions to define the sorting criteria, just as you do when using xsl:for-each.
A classic grouping problem arises quite often when processing data into reports. Consider sales data. Product managers will often want data grouped by sales region, product type, or salesperson, depending on what problem they are trying to solve. You use the group-by attribute to define an expression that determines the value or values that cause nodes in the population to group together. For example, group-by="@dept" would cause nodes that have the same dept value to group together:
<xsl:template match="Employees">
  <EmployeesByDept>
    <xsl:for-each-group select="employee" group-by="@dept">
      <dept name="{current-grouping-key()}">
        <xsl:copy-of select="current-group()"/>
      </dept>
    </xsl:for-each-group>
  </EmployeesByDept>
</xsl:template>
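As a hedged sketch of how sorting combines with value-based grouping, the variation below orders the departments by their grouping key and records the size of each group. The count attribute on dept and the name attribute on employee are assumptions made purely for illustration; they do not come from the original example.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="Employees">
    <EmployeesByDept>
      <xsl:for-each-group select="employee" group-by="@dept">
        <!-- Sort the groups themselves by their grouping key -->
        <xsl:sort select="current-grouping-key()"/>
        <!-- count is an illustrative attribute, not part of the original -->
        <dept name="{current-grouping-key()}" count="{count(current-group())}">
          <!-- Sort the members of each group by a hypothetical name attribute -->
          <xsl:for-each select="current-group()">
            <xsl:sort select="@name"/>
            <xsl:copy-of select="."/>
          </xsl:for-each>
        </dept>
      </xsl:for-each-group>
    </EmployeesByDept>
  </xsl:template>

</xsl:stylesheet>
```

The xsl:sort that is a direct child of xsl:for-each-group orders the groups, while the one inside the nested xsl:for-each orders the items within each group.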
In some contexts, such as document processing, you want to consider nodes that share a common value provided they are also adjacent to each other. As with group-by, group-adjacent defines an expression used to determine the value used to perform the grouping, but two nodes that have such a value will only be in the same group if they are adjacent in the population. The value of group-adjacent must be a singleton, as an empty sequence or a multi-valued sequence will cause an error.
Consider a document consisting of para elements interspersed with heading elements. You would like to extract only the para elements without losing track of the fact that some sequences of para elements belong together as part of the same topic:
<xsl:template match="doc">
  <xsl:copy>
    <xsl:for-each-group select="*" group-adjacent="name()">
      <xsl:if test="self::para">
        <topic>
          <xsl:copy-of select="current-group()"/>
        </topic>
      </xsl:if>
    </xsl:for-each-group>
  </xsl:copy>
</xsl:template>
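The grouping expression need not be a name. As a sketch, assuming a hypothetical flat document in which runs of adjacent item elements should be wrapped in a list element (both names are invented for this illustration), a Boolean grouping key keeps every element and wraps only the adjacent runs:

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="doc">
    <xsl:copy>
      <!-- The key is true for item elements and false for everything else -->
      <xsl:for-each-group select="*" group-adjacent="boolean(self::item)">
        <xsl:choose>
          <xsl:when test="current-grouping-key()">
            <!-- A run of adjacent item elements -->
            <list>
              <xsl:copy-of select="current-group()"/>
            </list>
          </xsl:when>
          <xsl:otherwise>
            <!-- A run of other elements, copied through unchanged -->
            <xsl:copy-of select="current-group()"/>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:for-each-group>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Because boolean() always returns exactly one value, this key satisfies the singleton requirement of group-adjacent.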
Frequently, especially in document processing, a group of related nodes is demarcated by a particular node such as a title or subtitle, or other type of heading node. Grouping by starting node makes it easy to process these loosely structured documents. The group-starting-with attribute defines a pattern that matches nodes in the population that are the starting nodes of the group. This is similar to the patterns you use with the match attribute in xsl:template instructions. When the pattern matches a node in the population, all subsequent nodes are part of the group until another match is made. The first node in the population defines a group whether it matches or not. This implies that a population will have at least one group, the entire population, even if the pattern is never matched.
A classic example involves reconstituting structure from an unstructured document. XHTML is a good example of a loosely structured markup language, especially in regard to the use of heading elements (h1, h2, etc.). The following transformation will add some structure by nesting each group, designated by a starting h1 element, in a div element:
<xsl:template match="body">
  <xsl:copy>
    <xsl:for-each-group select="*" group-starting-with="h1">
      <div>
        <xsl:apply-templates select="current-group()"/>
      </div>
    </xsl:for-each-group>
  </xsl:copy>
</xsl:template>
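Grouping instructions can be nested to recover more than one level of structure. The following is a minimal sketch, assuming the heading elements are in no namespace (real XHTML input would need a namespace prefix in the patterns) and using class names that are invented for the example:

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="body">
    <xsl:copy>
      <!-- Outer grouping: one div per h1 -->
      <xsl:for-each-group select="*" group-starting-with="h1">
        <div class="section">
          <!-- Inner grouping: regroup this section's members by h2 -->
          <xsl:for-each-group select="current-group()" group-starting-with="h2">
            <div class="subsection">
              <xsl:copy-of select="current-group()"/>
            </div>
          </xsl:for-each-group>
        </div>
      </xsl:for-each-group>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Note that the h1 itself, together with anything that precedes the first h2, forms the first inner group, so it also ends up in a subsection div. Whether that is desirable depends on the output you need.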
This form of grouping is similar to group-starting-with but uses the group-ending-with pattern to define the last node that will be in the current group. The first node in the population starts a new group, so there is always at least one group even if the pattern does not match any nodes.
Of all the grouping methods, grouping by ending node will typically find the least application. This is because documents designed for human consumption use leading elements, such as headings, to signal new groups, rather than trailing ones. In XSLT 2.0 Programmer’s Reference, Michael Kay provides an example of a series of documents having been broken into chunks for the purpose of transmission. In this example, the document boundaries are marked by the absence of an attribute continued='yes'. A slightly more probable example is one where you want to add structure to a flat document by chunking elements based on some criteria that designate the end of a chunk. For example, you can group paragraphs into sections of five paragraphs with the following code:
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>

  <xsl:template match="doc">
    <xsl:copy>
      <xsl:for-each-group select="para"
          group-ending-with="para[position() mod 5 eq 0]">
        <section>
          <xsl:for-each select="current-group()">
            <xsl:copy>
              <xsl:apply-templates select="@*|node()"/>
            </xsl:copy>
          </xsl:for-each>
        </section>
      </xsl:for-each-group>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
The Muenchian Method, named after Steve Muench of Oracle, was an innovative way to group data in XSLT 1.0. It took advantage of XSLT’s ability to index documents using a key. The trick involves using the index to efficiently figure out the set of unique grouping keys and then using this set to process all nodes in the group:
<xsl:key name="products-by-category" match="product" use="@category"/>

<xsl:template match="/">
  <xsl:for-each select="//product[count(. | key('products-by-category', @category)[1]) = 1]">
    <xsl:variable name="current-grouping-key" select="@category"/>
    <xsl:variable name="current-group"
        select="key('products-by-category', $current-grouping-key)"/>
    <xsl:for-each select="$current-group">
      <!-- processing for elements in group -->
      <!-- you can use xsl:sort here also, if necessary -->
    </xsl:for-each>
  </xsl:for-each>
</xsl:template>
Although the Muenchian method will continue to work in 2.0, you should prefer xsl:for-each-group because it is likely to be at least as efficient, and probably more so. Just as important, it will make your code more comprehensible, especially to XSLT novices. Further, you use the same basic instruction to get access to the four distinct grouping capabilities. The Muenchian method can only be used for value-based grouping. Backward compatibility to XSLT 1.0 is probably the only compelling reason to continue to use Muenchian grouping in XSLT 2.0.
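For comparison, here is a sketch of the same value-based grouping of product elements by their category attribute, written with xsl:for-each-group. The categories and category wrapper elements are assumptions added only so that the example produces visible output:

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/">
    <!-- categories and category are illustrative wrapper names -->
    <categories>
      <xsl:for-each-group select="//product" group-by="@category">
        <category name="{current-grouping-key()}">
          <xsl:for-each select="current-group()">
            <!-- processing for elements in group -->
            <xsl:copy-of select="."/>
          </xsl:for-each>
        </category>
      </xsl:for-each-group>
    </categories>
  </xsl:template>

</xsl:stylesheet>
```

No key declaration and no count() or generate-id() idiom are needed; the grouping key and the group members are available directly through current-grouping-key() and current-group().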
Use the new capabilities of XSLT 2.0’s mode attribute to eliminate code duplication. Consider a simple example of a stylesheet that uses two different modes to process a document in two passes. In each pass, you would like to ignore text nodes, by default. In XSLT 1.0, you would have to write something like the following:
<xsl:template match="text()" mode="mode1"/>
<xsl:template match="text()" mode="mode2"/>
However, in 2.0, you can remove the redundancy:
<xsl:template match="text()" mode="mode1 mode2"/>
Or if the intention is to match in all modes:
<xsl:template match="text()" mode="#all"/>
Granted, this is a small improvement, but it has a large payback for stylesheets that are more complex, use a large number of modes, share a lot of code between modes, or are under frequent maintenance.
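The list of modes may also include the token #default, which stands for the unnamed mode. The sketch below assumes a hypothetical document vocabulary (draft-note and heading elements) and a hypothetical toc mode; it is meant only to show the syntax:

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="/">
    <out>
      <!-- First pass: table of contents -->
      <toc><xsl:apply-templates mode="toc"/></toc>
      <!-- Second pass: unnamed (default) mode -->
      <body><xsl:apply-templates/></body>
    </out>
  </xsl:template>

  <!-- Suppressed in the unnamed mode and in toc mode with one rule -->
  <xsl:template match="draft-note" mode="#default toc"/>

  <!-- In toc mode, keep only the headings -->
  <xsl:template match="heading" mode="toc">
    <entry><xsl:value-of select="."/></entry>
  </xsl:template>

  <xsl:template match="text()" mode="toc"/>

</xsl:stylesheet>
```

Unlike #all, an explicit list leaves any other modes you add later unaffected by the rule.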
A rule of thumb that I adhere to when using modes in 2.0 is to always use #current rather than an explicitly named mode if my intention is to continue processing in the present mode:
<xsl:template match="author" mode="index">
  <div class="author">
    <xsl:apply-templates mode="#current"/>
  </div>
</xsl:template>
This has two immediately beneficial consequences. First, if you later decide you picked a bad name for the mode and want to change it, you will not need to change any of the calls to xsl:apply-templates. Second, if you add new modes to the template, it will continue to work without further change:
<xsl:template match="author" mode="index body">
  <div class="author">
    <xsl:apply-templates mode="#current"/>
  </div>
</xsl:template>
Use XSLT 2.0’s extended type system to create precise and type-safe functions and templates.
The as attribute is used on elements that hold or return data. These elements include xsl:function, xsl:param, xsl:template, xsl:variable, and xsl:with-param.
The type attribute is used on elements that create data. These elements include xsl:attribute, xsl:copy, xsl:copy-of, xsl:document, xsl:element, and xsl:result-document.
All conforming XSLT 2.0 processors allow you to use the simple data types such as xs:integer or xs:string to describe variables or parameters. Further, these types can be used with the symbols *, +, and ? to describe sequences of these simple types:
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- x is a sequence of zero or more strings -->
  <xsl:variable name="x" select="para" as="xs:string*"/>

  <!-- y is a sequence of one or more strings. We code the select in a way
       that guarantees this, although if you knew there must be at least one
       para element, you could omit the if expression -->
  <xsl:variable name="y" select="if (para) then para else ''" as="xs:string+"/>

  <!-- z is a sequence of zero or one string -->
  <xsl:variable name="z" select="para[1]" as="xs:string?"/>

</xsl:stylesheet>
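The same occurrence indicators apply to function parameters and return types. As a minimal sketch (the function ckbk:mean, the ckbk namespace URI, and the price/@value input are all hypothetical), the declarations below let the processor reject a call that passes anything other than a sequence of numbers:

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:ckbk="http://www.example.com/ns/ckbk"
    exclude-result-prefixes="xs ckbk">

  <!-- Accepts zero or more doubles and returns zero or one double -->
  <xsl:function name="ckbk:mean" as="xs:double?">
    <xsl:param name="values" as="xs:double*"/>
    <xsl:sequence select="if (empty($values))
                          then ()
                          else sum($values) div count($values)"/>
  </xsl:function>

  <xsl:template match="/">
    <result>
      <!-- Untyped attribute values are atomized and cast to xs:double -->
      <xsl:value-of select="ckbk:mean(//price/@value)"/>
    </result>
  </xsl:template>

</xsl:stylesheet>
```

The built-in avg() function already does this job; the point here is only to show how the as attributes document and enforce the contract of a user-defined function.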
With a schema-aware processor, you can go even further and refer to both simple and complex types from a user-defined schema. A schema-aware processor must be made aware of user-defined types via the top-level instruction xsl:import-schema:
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="http://www.mydomain.com/ns/my">

  <!-- The namespace attribute names the target namespace of the imported schema -->
  <xsl:import-schema namespace="http://www.mydomain.com/ns/my"
      schema-location="my-schema.xsd"/>

  <xsl:template match="/">
    <!-- Validate that the resulting element conforms to my:BookType -->
    <xsl:element name="Book" type="my:BookType">
      <xsl:apply-templates/>
    </xsl:element>
  </xsl:template>

  <!-- ... -->

</xsl:stylesheet>
You should not use xsl:import-schema if you do not have access to a schema-aware processor. If you use a schema-aware processor but wish to make your stylesheets compatible with non-schema-aware processors, then you should use the attribute use-when="system-property('xsl:is-schema-aware') = 'yes'" on all elements that require a schema-aware processor.
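A sketch of how this might look in practice follows. It assumes the hypothetical my-schema.xsd declares the namespace bound to the my prefix, and it pairs the validating rule with a non-validating fallback so that a basic processor still has a template to run:

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:my="http://www.mydomain.com/ns/my"
    exclude-result-prefixes="my">

  <!-- Only seen by a schema-aware processor -->
  <xsl:import-schema namespace="http://www.mydomain.com/ns/my"
      schema-location="my-schema.xsd"
      use-when="system-property('xsl:is-schema-aware') = 'yes'"/>

  <!-- Validating version of the rule -->
  <xsl:template match="/"
      use-when="system-property('xsl:is-schema-aware') = 'yes'">
    <xsl:element name="Book" type="my:BookType">
      <xsl:apply-templates/>
    </xsl:element>
  </xsl:template>

  <!-- Fallback for basic processors: same element, no validation -->
  <xsl:template match="/"
      use-when="system-property('xsl:is-schema-aware') != 'yes'">
    <xsl:element name="Book">
      <xsl:apply-templates/>
    </xsl:element>
  </xsl:template>

</xsl:stylesheet>
```

Because use-when is evaluated statically, the excluded elements are discarded before the stylesheet is compiled, so only one of the two templates ever exists in a given run.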
If you need to port 1.0 stylesheets to 2.0, you will want to watch out for several gotchas. Some of these problems can be eliminated by using XSLT 1.0 compatibility mode; that is, by using version="1.0" in the stylesheet element, <xsl:stylesheet version="1.0">. However, if you want to begin migrating old stylesheets to 2.0, there are other solutions to these incompatibilities, as explained next.
XSLT 2.0 processors are not obligated to support backward compatibility mode, although most probably will. If it is not supported, the processor will signal an error.
Consider the fragment <xsl:value-of select="substring-before(Name, ' ')"/>. What happens if this is evaluated in a context that includes more than one Name element? In XSLT 1.0, only the first one would be used, and the rest would be ignored. However, XSLT 2.0 is stricter and signals a type error because the first argument of substring-before can only be a sequence of zero or one strings.
To remedy this, you should get in the habit of writing <xsl:value-of select="substring-before(Name[1], ' ')"/>. On the other hand, you may want to know about these errors because they might signal a bug in the way the stylesheet or its input documents are written. An alternative fix, which may be applicable in some circumstances, is to combine multiple nodes into a single value before presenting the sequence to a function expecting only one. For example, <xsl:value-of select="substring-before(string-join(Name, ''), ' ')"/> would not generate an error (in XPath 2.0, string-join() requires the separator as its second argument).
XSLT 1.0 was very lax when it came to type conversions. If a function expected a number and you provided a string, it would do its best to convert the string to a number and vice versa. The same applied to conversions among Boolean and integer or Boolean and string. The old behavior can be preserved by using 1.0 compatibility mode. However, you can also explicitly convert values:
<xsl:variable name="a" select=" '10' "/>
<xsl:value-of select="sum($a, 1)"/>                      <!-- Error -->
<xsl:value-of select="sum(number($a), 1)"/>              <!-- OK -->
<xsl:value-of select="string-length(10)"/>               <!-- Error -->
<xsl:value-of select="string-length(string(10))"/>       <!-- OK -->
<xsl:value-of select="string-length(string(true()))"/>   <!-- OK, equals 4 -->
<xsl:value-of select="1 + true()"/>                      <!-- Error -->
<xsl:value-of select="1 + number(true())"/>              <!-- OK, equals 2 -->
In XSLT 1.0, if you invoked a template with xsl:call-template and passed parameters that the template did not define, the extra parameters were silently ignored. In 2.0, this is an error. You can disable this error by using 1.0 compatibility mode. There is no other workaround, except removing the extra parameters or declaring them, with defaults, in the existing template.
Some attribute combinations that XSLT 1.0 tolerated are errors in 2.0. Examples of this can be seen in both the xsl:number and xsl:template instructions. If you use level and value attributes together in xsl:number, the level attribute is ignored in 1.0, but this is an error in 2.0. Similarly, with xsl:template, you cannot specify priority or mode in 2.0 if there is no match attribute defined.
Use of backward-compatibility mode to correct errors in 1.0 stylesheets has other consequences. In particular, it means that some things will behave differently. In 1.0 compatibility mode:
xsl:value-of will output only the first item of a sequence rather than all items separated by spaces.
An attribute value template (e.g., <foo values="{foo}"/>) will output only the first item of a sequence rather than all items separated by spaces.
The first number in a sequence will be output, rather than all numbers separated by spaces, when using xsl:number.
For these reasons, it would be wise not to rely on backward-compatibility mode for new stylesheet development intended to target a 2.0-compliant processor.
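To make the difference concrete, here is a small sketch of the native 2.0 behavior of xsl:value-of, assuming an input document that contains several Name elements. The separator attribute is new in XSLT 2.0:

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- 2.0 default: every Name value, separated by single spaces -->
    <xsl:value-of select="//Name"/>
    <xsl:text>&#xA;</xsl:text>
    <!-- An explicit separator between the items -->
    <xsl:value-of select="//Name" separator=", "/>
    <xsl:text>&#xA;</xsl:text>
    <!-- The 1.0 result, requested explicitly instead of relying on
         backward-compatibility mode -->
    <xsl:value-of select="(//Name)[1]"/>
  </xsl:template>

</xsl:stylesheet>
```

Selecting the first item explicitly keeps the stylesheet's intent visible and makes it behave the same way whether or not compatibility mode is in effect.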
You would like to graduate from cut and paste XSLT reuse to creating libraries of reusable XSLT code.
Clearly XSLT 2.0 is not an object-oriented programming language. However, object orientation is as much about how to engineer generic reusable code as it is about the creation of classes, objects, inheritance, encapsulation, and the like.
There are two features of XSLT 2.0 that facilitate an object-oriented style. The first is the instruction xsl:next-match and the second is tunnel parameters. The xsl:next-match instruction is a generalization of XSLT 1.0’s xsl:apply-imports. Recall that in XSLT 1.0, you use xsl:apply-imports to invoke templates of lower import precedence. The xsl:next-match instruction generalizes this behavior by allowing you to invoke matching templates of lower priority within the same stylesheet as well as those in imported stylesheets. This is akin to calling a base class method in OO programming:
<xsl:template match="author | title | subtitle | deck" priority="1">
  <a name="{generate-id()}">
    <span class="{name()}">
      <xsl:apply-templates/>
    </span>
  </a>
</xsl:template>

<xsl:template match="author" priority="2">
  <div>
    <span class="by">By </span>
    <xsl:next-match/>
  </div>
</xsl:template>

<xsl:template match="title" priority="2">
  <h1 class="title"><xsl:next-match/></h1>
</xsl:template>

<xsl:template match="deck" priority="2">
  <h2 class="deck"><xsl:next-match/></h2>
</xsl:template>

<xsl:template match="subtitle" priority="2">
  <h2 class="subtitle"><xsl:next-match/></h2>
</xsl:template>
A further enhancement in 2.0 is the ability to pass parameters to both xsl:next-match and xsl:apply-imports:
<xsl:next-match>
  <xsl:with-param name="indent" select="2"/>
</xsl:next-match>
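For the parameter to have any effect, the template that xsl:next-match reaches must declare it. The sketch below shows both sides of such a call; the para element, its role attribute, and the use of the indent value as a CSS margin are assumptions made for illustration:

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <!-- Generic rule: declares the parameter and gives it a default -->
  <xsl:template match="para" priority="1">
    <xsl:param name="indent" select="0" as="xs:integer"/>
    <p style="margin-left: {$indent}em">
      <xsl:apply-templates/>
    </p>
  </xsl:template>

  <!-- Specialized rule: delegates to the generic one, overriding the default -->
  <xsl:template match="para[@role = 'quote']" priority="2">
    <blockquote>
      <xsl:next-match>
        <xsl:with-param name="indent" select="2"/>
      </xsl:next-match>
    </blockquote>
  </xsl:template>

</xsl:stylesheet>
```

Ordinary para elements are rendered with the default indent of 0, while quoted ones reuse the same rendering logic with an indent of 2.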
A further capability in XSLT 2.0 templates is tunnel parameters, a form of dynamic scoping that is popular in functional programming. Tunnel parameters allow calls to xsl:apply-templates to pass parameters that are not necessarily known to the immediately matching templates. However, these parameters are transparently carried over from call to call until they arrive at a template that declares them. Note that the attribute tunnel="yes" must be used both at the point of call and the point where the parameter is accepted:
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Standard processing rules for doc -->
  <xsl:import href="doc.xslt"/>

  <!-- A custom parameter not envisioned by the author of doc.xslt -->
  <xsl:param name="customParam"/>

  <xsl:template match="/">
    <!-- Invoke templates from doc.xslt that have no knowledge of customParam -->
    <xsl:apply-templates>
      <xsl:with-param name="customParam" select="$customParam" tunnel="yes"/>
    </xsl:apply-templates>
  </xsl:template>

  <xsl:template match="heading1">
    <!-- Do something special with heading1 elements based on customParam -->
    <xsl:param name="customParam" tunnel="yes"/>
    <!-- ... -->
  </xsl:template>

</xsl:stylesheet>
This is an extremely important enhancement to XSLT, because it allows you to decouple application-specific templates from generic ones without introducing global parameters or variables.
In object-oriented development, the notion of design patterns is quite popular. These are tried and true techniques that have broad application in a variety of problems. The patterns facilitate communication between developers by providing semi-standard names for the techniques described by the pattern, as well as the applicable context.
The facilities described in this recipe and Recipe 6.3 can be used to implement some of the standard patterns from an XSLT perspective. Next we adapt some standard patterns to the domain of XSLT.
Define the skeleton of a stylesheet in an operation, deferring some steps to templates implemented by importing stylesheets. Template Method lets others redefine certain steps of a transformation without changing the transformation’s structure.
Consider a stylesheet that defines the standard by which your company renders an XML document to the web:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xsl:output method="xhtml" indent="yes"/>

  <xsl:param name="titlePrefix" select=" '' " as="xs:string"/>

  <xsl:template match="/">
    <html>
      <head>
        <title><xsl:value-of select="concat($titlePrefix, /*/title)"/></title>
      </head>
      <body>
        <xsl:call-template name="standard-processing-sequence"/>
      </body>
    </html>
  </xsl:template>

  <xsl:template name="standard-processing-sequence">
    <xsl:apply-templates mode="front-matter">
      <xsl:with-param name="mode" select=" 'front-matter' " tunnel="yes" as="xs:string"/>
    </xsl:apply-templates>
    <xsl:apply-templates mode="toc">
      <xsl:with-param name="mode" select=" 'toc' " tunnel="yes" as="xs:string"/>
    </xsl:apply-templates>
    <xsl:apply-templates mode="body">
      <xsl:with-param name="mode" select=" 'body' " tunnel="yes" as="xs:string"/>
    </xsl:apply-templates>
    <xsl:apply-templates mode="appendicies">
      <xsl:with-param name="mode" select=" 'appendicies' " tunnel="yes" as="xs:string"/>
    </xsl:apply-templates>
  </xsl:template>

  <xsl:template match="/*" mode="#all">
    <xsl:param name="mode" tunnel="yes" as="xs:string"/>
    <div class="{$mode}">
      <xsl:apply-templates mode="#current"/>
    </div>
  </xsl:template>

  <!-- Default templates for various modes go here - these can be
       overridden in importing stylesheets -->

</xsl:stylesheet>
Here you use modes to identify each major stage of processing. However, you also pass the current mode’s name in a tunnel parameter. This has two benefits. First, it is useful for debugging templates that match in multiple modes. Second, it allows similar multi-mode templates whose behavior varies by a small amount to implement this variation as a function of the mode parameter, without necessarily having knowledge of the specific modes. For example, if the template attaches CSS styles to the output elements, those styles can be prefixed with the mode name or use some other general mapping (e.g., table lookup) to go from mode to CSS style.
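As a sketch of the second benefit, an importing stylesheet could define one multi-mode rule whose output varies only through the tunneled mode name. The file name standard.xslt, the section element, and the class naming scheme are assumptions for this example:

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <!-- Hypothetical file name for the company-standard stylesheet -->
  <xsl:import href="standard.xslt"/>

  <!-- One rule serves every stage; only the CSS class varies -->
  <xsl:template match="section" mode="#all">
    <!-- Receives the stage name tunneled down from the standard sequence -->
    <xsl:param name="mode" tunnel="yes" as="xs:string"/>
    <div class="{$mode}-section">
      <xsl:apply-templates mode="#current"/>
    </div>
  </xsl:template>

</xsl:stylesheet>
```

The rule never names a specific mode, so it keeps working if the standard stylesheet later adds or renames a processing stage.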
Avoid coupling the initiator of a transformation to the templates that handle it by giving more than one template a chance to handle the request. Rely on template matching rather than conditional logic to determine the appropriate template.
Priorities are key to making this pattern portable because some XSLT processors may not handle templates with ambiguous template precedence. This example is adapted from a dynamic web site project I worked on that also used Cocoon. This project used templatized HTML where the class attributes dictated how dynamic content from an XML database would be rendered into the templated HTML. I omit the details of each template because they are less important than the overall structure. In this example, only one template will match in any given xhtm:td node, but in the general case, you can use xsl:next-match to combine the effects of multiple matching templates:
<xsl:template match="xhtm:td[matches(@class, '^keep-(\w+)')]"
    mode="template" priority="2.1">
  <!-- details omitted -->
</xsl:template>

<xsl:template match="xhtm:td[matches(@class, '^(flow|list)-(\w+)')]"
    mode="template" priority="2.2">
  <!-- details omitted -->
</xsl:template>

<xsl:template match="xhtm:td[matches(@class, '^repeat-(\w+)')]"
    mode="template" priority="2.3">
  <!-- details omitted -->
</xsl:template>

<xsl:template match="xhtm:td[matches(@class, '^download-(\w+)')]"
    mode="template" priority="2.4">
  <!-- details omitted -->
</xsl:template>
You need to transform XML documents that contain chunks of unstructured text that must be marked up into a proper document.
There are three XPath 2.0 functions for working with regular expressions: matches(), replace(), and tokenize(). We covered these in Chapter 1. There is also a new XSLT instruction, xsl:analyze-string, which allows you to do even more advanced text processing.
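As a quick reminder of the three functions, here is a minimal sketch that splits, validates, and normalizes a list of telephone numbers. The phone element and its content format are hypothetical:

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <!-- Hypothetical input: <phone>555-123-4567, 555 987 6543</phone> -->
  <xsl:template match="phone">
    <phones>
      <!-- tokenize() splits the text on commas with optional whitespace -->
      <xsl:for-each select="tokenize(normalize-space(.), '\s*,\s*')">
        <!-- matches() tests each token against an anchored pattern -->
        <phone valid="{matches(., '^\d\d\d[- ]\d\d\d[- ]\d\d\d\d$')}">
          <!-- replace() removes every character that is not a digit -->
          <xsl:value-of select="replace(., '\D', '')"/>
        </phone>
      </xsl:for-each>
    </phones>
  </xsl:template>

</xsl:stylesheet>
```

The pattern deliberately avoids the {n} quantifier so that no curly braces appear inside the attribute value template.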
The xsl:analyze-string instruction takes a select attribute for specifying the string to be processed, a regex attribute for specifying the regular expression to apply to the string, and an optional flags attribute to modify the action of the regex engine. The standard flags are:
i
Case-insensitive mode.
m
Multi-line mode makes metacharacters ^ and $ match the beginning and ends of lines rather than the beginning and end of the entire string (the default).
s
Causes the metacharacter . to match newlines (the character &#xA;). The default is not to match newlines. This mode is sometimes called single-line mode, but from its definition, it should be clear that it is not the opposite of multi-line mode. Indeed, one can use both the s and m flags together.
x
Allows whitespace to be used in a regular expression as a separator rather than a significant character.
The child element xsl:matching-substring is used to process the substrings that match the regex, and xsl:non-matching-substring is used to process the substrings that do not match the regex. Either may be omitted. It is also possible to refer to captured groups (parts of a regex surrounded by parentheses) using the regex-group() function within xsl:matching-substring:
<xsl:template match="date">
  <xsl:copy>
    <xsl:analyze-string select="normalize-space(.)"
        regex="(\d\d\d\d) ( / | - ) (\d\d) ( / | - ) (\d\d)"
        flags="x">
      <xsl:matching-substring>
        <year><xsl:value-of select="regex-group(1)"/></year>
        <month><xsl:value-of select="regex-group(3)"/></month>
        <day><xsl:value-of select="regex-group(5)"/></day>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <error><xsl:value-of select="."/></error>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:copy>
</xsl:template>
A nice complement to xsl:analyze-string is the XSLT function unparsed-text(). This function allows you to read the contents of a text file as a string. Thus, as the name suggests, the file is not parsed and therefore need not be XML. In fact, except in the most unusual of circumstances, you would not normally use unparsed-text() on XML content.
The following stylesheet will convert a simple comma-delimited file (one with no quoted strings) to XML:
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:fn="http://www.w3.org/2005/02/xpath-functions"
    xmlns:xdt="http://www.w3.org/2005/02/xpath-datatypes">

  <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>

  <xsl:param name="csv-file" select=" 'test.csv' "/>

  <xsl:template match="/">
    <converted-csv filename="{$csv-file}">
      <!-- Split the file into lines at each newline character -->
      <xsl:for-each select="tokenize(unparsed-text($csv-file, 'UTF-8'), '&#xA;')">
        <xsl:if test="normalize-space(.)">
          <row>
            <xsl:analyze-string select="." regex="," flags="x">
              <xsl:non-matching-substring>
                <col><xsl:value-of select="normalize-space(.)"/></col>
              </xsl:non-matching-substring>
            </xsl:analyze-string>
          </row>
        </xsl:if>
      </xsl:for-each>
    </converted-csv>
  </xsl:template>

</xsl:stylesheet>
The regex capabilities of XSLT 2.0 along with unparsed-text() open up whole new processing possibilities to XSLT that were next to impossible in XSLT 1.0. Still, XSLT would not be my first choice for non-XML processing unless I was working in a context where a multi-language solution (e.g., Java and XSLT or Perl and XSLT) was not practical. Of course, if XSLT is the only language you want to master, the new capabilities certainly open up new vistas for you to explore.
Part of my motivation for jumping the XSLT ship when entering the domain of unstructured text processing is the “missing features” of xsl:analyze-string. It would be nice if the position() and last() functions worked within xsl:matching-substring to tell you that this is match number position() of last() possible matches. I sometimes use xsl:for-each over a tokenize() instead of xsl:analyze-string, but that is also deficient because it only returns the non-matching portions. Further, you often feel compelled to use xsl:analyze-string for a complex parsing problem involving many possible regex matches in a regex using alternation (|). However, there is no direct way to tell which regex matched without re-matching using the matches() function, which is a tad redundant and wasteful for my taste because surely the regex engine knows what part it just matched:
<xsl:template match="text()">
  <xsl:analyze-string select="."
      regex='[-+]?\d\.\d+\s*[eE][-+]?\d+ | [-+]?\d+\.\d+ | [-+]?\d+ | "[^"]*?" '
      flags="x">
    <xsl:matching-substring>
      <xsl:choose>
        <xsl:when test="matches(., '[-+]?\d\.\d+\s*[eE][-+]?\d+')">
          <scientific><xsl:value-of select="."/></scientific>
        </xsl:when>
        <xsl:when test="matches(., '[-+]?\d+\.\d+')">
          <decimal><xsl:value-of select="."/></decimal>
        </xsl:when>
        <xsl:when test="matches(., '[-+]?\d+')">
          <integer><xsl:value-of select="."/></integer>
        </xsl:when>
        <xsl:when test='matches(., " "" [^""]*? "" ", "x")'>
          <string><xsl:value-of select="."/></string>
        </xsl:when>
      </xsl:choose>
    </xsl:matching-substring>
  </xsl:analyze-string>
</xsl:template>
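One possible workaround, offered here as a sketch rather than as the approach used in this recipe, is to wrap each alternative in its own capturing group and test which group is non-empty. It relies on regex-group() returning a zero-length string for a group that took no part in the match, and it only works when no alternative can itself match the empty string:

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="text()">
    <!-- Groups: 1 = scientific, 2 = decimal, 3 = integer, 4 = quoted string -->
    <xsl:analyze-string select="."
        regex='([-+]?\d\.\d+\s*[eE][-+]?\d+) | ([-+]?\d+\.\d+) | ([-+]?\d+) | ("[^"]*?")'
        flags="x">
      <xsl:matching-substring>
        <xsl:choose>
          <!-- Only the group for the alternative that matched is non-empty -->
          <xsl:when test="regex-group(1) ne ''">
            <scientific><xsl:value-of select="."/></scientific>
          </xsl:when>
          <xsl:when test="regex-group(2) ne ''">
            <decimal><xsl:value-of select="."/></decimal>
          </xsl:when>
          <xsl:when test="regex-group(3) ne ''">
            <integer><xsl:value-of select="."/></integer>
          </xsl:when>
          <xsl:when test="regex-group(4) ne ''">
            <string><xsl:value-of select="."/></string>
          </xsl:when>
        </xsl:choose>
      </xsl:matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

This avoids running the regex engine a second time, at the cost of having to keep the group numbers in step with the order of the alternatives.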
Now, hindsight is always 20/20, and there are, of course, all sorts of implementation issues and tradeoffs that one needs to overcome when enhancing a language; so, with all due respect to the XSLT 2.0 committee, it would have been sweeter if xsl:analyze-string worked as follows:
<!-- NOT VALID XSLT 2.0 - Author's wishful thinking -->
<xsl:template match="text()">
  <xsl:analyze-string select="." flags="x">
    <xsl:matching-substring regex="[-+]?\d\.\d+\s*[eE][-+]?\d+">
      <scientific><xsl:value-of select="."/></scientific>
    </xsl:matching-substring>
    <xsl:matching-substring regex="[-+]?\d+\.\d+">
      <decimal><xsl:value-of select="."/></decimal>
    </xsl:matching-substring>
    <xsl:matching-substring regex="[-+]?\d+">
      <integer><xsl:value-of select="."/></integer>
    </xsl:matching-substring>
    <xsl:matching-substring regex=' "[^"]*?" '>
      <string><xsl:value-of select="."/></string>
    </xsl:matching-substring>
    <xsl:non-matching-substring>
      <other><xsl:value-of select="."/></other>
    </xsl:non-matching-substring>
  </xsl:analyze-string>
</xsl:template>
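Until something like that exists, one partial workaround (an editorial sketch, not part of the original recipe) is to wrap each alternative in its own capturing group and test regex-group() inside xsl:matching-substring. A group that did not take part in the match yields a zero-length string, so the test identifies the alternative without calling matches() again. The group numbers below simply follow the order of the alternatives:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="text()">
    <!-- Each alternative sits in its own capturing group, numbered 1 to 4 -->
    <xsl:analyze-string select="."
        regex='([-+]?\d\.\d+\s*[eE][-+]?\d+) |
               ([-+]?\d+\.\d+) |
               ([-+]?\d+) |
               ("[^"]*?")'
        flags="x">
      <xsl:matching-substring>
        <xsl:choose>
          <!-- regex-group(n) is a zero-length string, and therefore false,
               when group n did not take part in the match -->
          <xsl:when test="regex-group(1)">
            <scientific><xsl:value-of select="."/></scientific>
          </xsl:when>
          <xsl:when test="regex-group(2)">
            <decimal><xsl:value-of select="."/></decimal>
          </xsl:when>
          <xsl:when test="regex-group(3)">
            <integer><xsl:value-of select="."/></integer>
          </xsl:when>
          <xsl:when test="regex-group(4)">
            <string><xsl:value-of select="."/></string>
          </xsl:when>
        </xsl:choose>
      </xsl:matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

This only works when no alternative can match the empty string; otherwise an empty regex-group() result would be ambiguous.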
You need precise control of the serialization of your document, and
XSLT 1.0’s disable-output-escaping feature is too limited for your
needs.
XSLT 2.0 provides a new facility called a character map that provides
precise control of serialization. A character map is designed to be
used with the xsl:output
instruction.
The xsl:character-map
instruction takes the
following attributes:
name
Defines the name of the character map.
use-character-maps
A list of other character maps to incorporate into this one.
The content of an xsl:character-map
is a sequence
of xsl:output-character
elements. These elements
define a mapping between a single Unicode character and a string to
be output in place of the character when that character is
serialized. The following map can be used to output various special
space characters as entities:
<xsl:character-map name="spaces">
  <xsl:output-character character="&#xA0;" string="&amp;nbsp;"/>
  <xsl:output-character character="&#x2003;" string="&amp;emsp;"/>
  <xsl:output-character character="&#x2007;" string="&amp;numsp;"/>
  <xsl:output-character character="&#x2008;" string="&amp;puncsp;"/>
  <xsl:output-character character="&#x2009;" string="&amp;thinsp;"/>
  <xsl:output-character character="&#x200A;" string="&amp;hairsp;"/>
</xsl:character-map>
Another more subtle application of character maps is to output non-standard documents that would be difficult to create because they violate the rules of XML or XSLT. Michael Kay gives an example of outputting elements that are commented out in his XSLT Programmer’s Reference, 3rd Edition. Here is a variation on his example. The idea is to generate a copy of the input document but with the content of certain elements commented out with XML comments:
<?xml version="1.0"?>
<!-- Define custom entities using Unicode private use characters.
     The two code points used here are arbitrary; any private use
     characters that do not occur in your data will do. -->
<!DOCTYPE xsl:stylesheet [
  <!ENTITY start-comment "&#xE000;">
  <!ENTITY end-comment "&#xE001;">
]>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Import the identity transform -->
  <xsl:import href="copy.xslt"/>

  <!-- Tell the serializer to use our character map, defined below -->
  <xsl:output use-character-maps="comment-delimiters"/>

  <!-- Define a key that will be used to identify elements that should be
       commented out. -->
  <xsl:key name="omit-key" match="omit" use="@id"/>

  <!-- Map our custom entities to strings that form the syntax of XML
       start and end comments -->
  <xsl:character-map name="comment-delimiters">
    <xsl:output-character character="&start-comment;" string="&lt;!--"/>
    <xsl:output-character character="&end-comment;" string="--&gt;"/>
  </xsl:character-map>

  <!-- Comment out those elements that have an id attribute that matches
       the id of an omit element from an external document, omit.xml. -->
  <xsl:template match="*[key('omit-key', @id, doc('omit.xml'))]">
    <xsl:text>&start-comment;</xsl:text>
    <xsl:copy>
      <xsl:apply-templates select="@* | *"/>
    </xsl:copy>
    <xsl:text>&end-comment;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
Evan Lenz developed an XML-to-string converter that provides an alternative means for dealing with tough serialization problems. See http://xmlportfolio.com/xml-to-string/.
Although most XSLT 1.0 implementations had extensions for producing
multiple output documents, they differed quite a bit. XSLT 2.0
provides a standard instruction for this, xsl:result-document.
The xsl:result-document
instruction takes the
following attributes:
format
Defines the name of the output format as declared by a named
xsl:output
instruction.
href
Determines the destination where the output document will be serialized.
validation
Determines the validation to be applied to the result tree.
type
Determines the type that should be used to validate the result tree.
Here is an example that splits an XML document into several documents
based on the groups extracted by an xsl:for-each-group. Each output
document is named using the grouping key as a suffix:
<xsl:template match="products">
  <xsl:for-each-group select="product" group-by="@type">
    <xsl:result-document href="prod-{current-grouping-key()}.xml">
      <xsl:copy-of select="current-group()"/>
    </xsl:result-document>
  </xsl:for-each-group>
</xsl:template>
Sometimes you want a stylesheet that outputs more than one format. For example, the default output format may be XML, but the output you send to an alternate destination may be HTML. For this, you need to take advantage of XSLT 2.0’s ability to specify multiple output formats:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Default output format is XML -->
  <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>

  <!-- Another named output format for HTML -->
  <xsl:output method="html" encoding="UTF-8" indent="yes" name="html-out"/>

  <xsl:template match="/">
    <xsl:apply-templates/>
    <xsl:result-document href="result.html" format="html-out">
      <xsl:apply-templates mode="html"/>
    </xsl:result-document>
  </xsl:template>

  <!-- Other templates here -->

</xsl:stylesheet>
Although the use of xsl:result-document
is
straightforward, there is a potential for confusion with another new
XSLT 2.0 instruction, xsl:document
. This is
because the XSLT 1.1 specification (now defunct) also had an
instruction called xsl:document
that had similar
behavior to 2.0’s
xsl:result-document
.
In 2.0, the xsl:document
instruction plays a more limited
role. Its purpose is to construct a document node, presumably because
you want to perform document-level validation without actually
serializing the result. Typically, you will capture the result of
xsl:document
in a variable:
<xsl:variable name="tempDoc" as="document-node(element(*, my:document))">
  <!-- The type and validation attributes are mutually exclusive,
       so only type is specified here -->
  <xsl:document type="my:document">
    <xsl:apply-templates select="/*"/>
  </xsl:document>
</xsl:variable>
If in later processing you want to output the document, you can use
xsl:result-document
:
<xsl:result-document href="doc.xml">
  <xsl:copy-of select="$tempDoc"/>
</xsl:result-document>
See Recipe 8.6 for information on XSLT 1.0 extensions to support multiple output documents.
String literals containing quote characters are difficult to deal with in XSLT 1.0 because there is no escape character.
This problem is alleviated by an enhancement that allows a quote
character to be escaped by repeating the character. Here we are
trying to match either double quote delimited strings or single quote
delimited strings. We use single quotes for the test attribute, so we
must use double quotes for the string literal regex. This forces us to
escape all literal double quotes by repeating them in the regex. The
rules of XML force us to use the entity &apos; instead of ', but that
is simply to emphasize that XML escaping is a separate issue that by
itself does not provide a solution. In other words, if you replaced ""
by &quot;, you would make the XML parser happy, but the XSLT parser
would still choke:
<xsl:if test=' matches(., " "" [^""]* "" | &apos; [^&apos;]* &apos; ", "x") '>
  <!-- ... -->
</xsl:if>
An equivalent solution is as follows:
<xsl:if test=" matches(., ' &quot; [^&quot;]* &quot; | '' [^'']* '' ', 'x') ">
  <!-- ... -->
</xsl:if>
The lack of an escape character in XSLT 1.0 was frustrating but one could always work around it by using variables and concatenation:
<xsl:variable name="d-quote" select="'&quot;'"/>
<xsl:variable name="s-quote" select='"&apos;"'/>
<xsl:value-of select="concat('He said, ', $d-quote, 'John', $s-quote,
                             's dog turned green.', $d-quote)"/>
There are numerous little enhancements in XSLT 2.0, and it is difficult to get a quick handle on all of them.
Many of the capabilities in 2.0 are delivered as enhancements to existing 1.0 instructions and functions. These are not as obvious as those that are packaged as completely new instructions or functions. This section provides a one-stop overview of these enhancements.
The mode
attribute can take the value
#current
to signify that the processing should
continue with the current mode.
You can take advantage of support for sequences by writing a
comma-separated list (e.g., <xsl:apply-templates select="title, heading, para"/>)
when you want to process title elements first, then heading elements,
and then para elements. In 1.0, you had to write three separate
xsl:apply-templates instructions.
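The two xsl:apply-templates enhancements can be combined, as in the following minimal sketch. The element names (doc, section, entry) and the mode names (toc, body) are hypothetical and chosen only for illustration. One template serves both modes and uses #current to stay in whichever mode invoked it, while the comma-separated select fixes the processing order:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="/doc">
    <out>
      <toc><xsl:apply-templates select="section" mode="toc"/></toc>
      <body><xsl:apply-templates select="section" mode="body"/></body>
    </out>
  </xsl:template>

  <!-- One template for both modes; #current keeps the active mode -->
  <xsl:template match="section" mode="toc body">
    <section>
      <!-- Titles first, then headings, then paragraphs -->
      <xsl:apply-templates select="title, heading, para" mode="#current"/>
    </section>
  </xsl:template>

  <!-- In toc mode, list titles and headings and drop paragraphs -->
  <xsl:template match="title | heading" mode="toc">
    <entry><xsl:value-of select="."/></entry>
  </xsl:template>
  <xsl:template match="para" mode="toc"/>

  <!-- In body mode, copy everything through -->
  <xsl:template match="title | heading | para" mode="body">
    <xsl:copy-of select="."/>
  </xsl:template>

</xsl:stylesheet>
```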
A big pet peeve of many XSLT 1.0 developers was the inability to
write <xsl:attribute name="foo" select="10"/> since the select
attribute was not supported. Now it is supported, and you should
prefer it when defining simple attributes, although the sequence
constructor syntax is still available.
The attributes
type
and validation
were added
in support of schema-aware processors. Type is used to specify native
or user-defined types defined by W3C schema. Validation is used to
specify how the attribute should be validated.
A new attribute, copy-namespaces="yes" | "no", is available to specify
whether the namespace nodes of an element should be copied. The
default is yes, which is consistent with 1.0 behavior.
The attribute type
was added to specify native or
user-defined types defined by W3C Schema.
The attribute validation
was added to specify how
the result should be validated or whether existing type annotations
should be preserved.
A collation can be provided to specify when two key values match. The available collations are implementation defined.
The terminate
attribute
can now be an attribute value template. This greatly simplifies
global changes to the termination behavior.
A select
attribute is now supported in addition to
the sequence constructor form:
<xsl:param name="terminate" select=" 'no' "/>

<xsl:template match="employee">
  <xsl:if test="not(@type)">
    <xsl:message terminate="{$terminate}"
                 select=" 'Missing type attribute for employee' "/>
  </xsl:if>
</xsl:template>
A select
attribute has been
added
to allow nodes other than the context node to be numbered.
Formatting options have been enhanced to allow output as a word such as “one”, “two”, “three” according to the chosen language:
<!-- This outputs 'Ten' -->
<xsl:number value="10" format="Ww"/>
<!-- This outputs 'ten' -->
<xsl:number value="10" format="w"/>
<!-- This outputs 'TEN' -->
<xsl:number value="10" format="W"/>
Can be given a name
so that it can be used with the new
xsl:result-document
instruction. See Recipe 6.10 for details.
A new method, xhtml, is supported.
A new attribute, escape-uri-attributes
, determines
whether URI attributes in HTML and XHTML output should be escaped.
A new attribute, include-content-type
, determines
if a <meta>
element should be added to the
output to indicate the content type and encoding.
A new attribute, normalize-unicode
, determines if
Unicode characters should be normalized. See http://www.w3.org/TR/2002/WD-charmod-20020430/
for further details on normalization.
A new attribute, undeclare-prefixes, determines if namespaces in
XML 1.1 should be undeclared when they go out of scope. A namespace
is undeclared by xmlns:pre="", where pre is some prefix.
A new attribute, use-character-maps
, allows you to
provide a list of names of character maps. See Recipe 6.8.
An as
attribute can
be
used to specify the type of the parameter.
A required
attribute can be used to specify
whether the parameter is mandatory or optional.
A tunnel
attribute is used to indicate whether
tunneling is supported for this parameter. See Recipe 6.6.
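As a minimal sketch of the three xsl:param attributes working together, consider the following stylesheet. The names label and depth and the elements list, item, and entry are hypothetical. The stylesheet parameter is typed and has a default, while the template parameter is typed, mandatory, and received through tunneling:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <!-- Typed stylesheet parameter with a default value -->
  <xsl:param name="label" as="xs:string" select="'Item'"/>

  <xsl:template match="/">
    <list>
      <xsl:apply-templates>
        <!-- A tunnel parameter passes through intermediate templates
             that do not declare it -->
        <xsl:with-param name="depth" select="1" tunnel="yes"/>
      </xsl:apply-templates>
    </list>
  </xsl:template>

  <xsl:template match="item">
    <!-- Typed, mandatory, and supplied via tunneling -->
    <xsl:param name="depth" as="xs:integer" required="yes" tunnel="yes"/>
    <entry depth="{$depth}">
      <xsl:value-of select="concat($label, ': ', .)"/>
    </entry>
  </xsl:template>

</xsl:stylesheet>
```

A parameter declared with required="yes" cannot also carry a default, which is why depth has no select attribute.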
A new
default-validation
attribute determines the default validation to use when new element
and attribute nodes are created and the instruction that creates them
lacks a validation
attribute.
A new xpath-default-namespace
attribute determines
the namespace used for unprefixed element names in XPath expressions.
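The effect of xpath-default-namespace is easiest to see with namespaced input such as XHTML. In this sketch, which assumes an XHTML source document, the unprefixed names html, head, and title in the select expression are resolved against the XHTML namespace, so no prefix has to be declared for them:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xpath-default-namespace="http://www.w3.org/1999/xhtml">

  <xsl:output method="text"/>

  <!-- Without xpath-default-namespace this path would select nothing,
       because the source elements are in the XHTML namespace -->
  <xsl:template match="/">
    <xsl:value-of select="html/head/title"/>
  </xsl:template>

</xsl:stylesheet>
```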
An as
attribute can be used
to
specify the type of the parameter.
A tunnel
attribute is used to indicate whether
tunneling is supported for this parameter. See Recipe 6.6.
Can now be used with xsl:apply-imports
and the new
xsl:next-match
instruction. See Recipe 6.6.
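The following sketch shows xsl:with-param inside xsl:next-match. The para element, its role attribute, and the class parameter are hypothetical names used only for illustration. The higher-priority rule decides on a value and then hands the actual rendering to the next best matching rule:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Specific rule: chooses the class, then defers to the general rule -->
  <xsl:template match="para[@role = 'note']" priority="2">
    <xsl:next-match>
      <xsl:with-param name="class" select="'note'"/>
    </xsl:next-match>
  </xsl:template>

  <!-- General rule: renders every para, using a default class
       when no parameter is passed -->
  <xsl:template match="para" priority="1">
    <xsl:param name="class" select="'plain'"/>
    <p class="{$class}"><xsl:apply-templates/></p>
  </xsl:template>

</xsl:stylesheet>
```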
The function can return generalized items (such as a string) in addition to nodes.
Can now be used within a match pattern to refer to the element that matched.
<!-- Match elements that have a descendant element with an attribute
     whose value equals the local name of the matched element -->
<xsl:template match="*[descendant::*/@* = local-name(current())]">
  <!-- ... -->
</xsl:template>
An optional third argument is used to specify the document that should be searched.
This eliminates the need to introduce xsl:for-each
instructions whose only purpose is to switch context to a new
document so that key() can be used relative to that document:
<!-- Code like this is no longer necessary -->
<xsl:for-each select="doc('other.xml')">
  <xsl:if test="key('some-key', $val)">
    <!-- ... -->
  </xsl:if>
</xsl:for-each>

<!-- Write this instead -->
<xsl:if test="key('some-key', $val, doc('other.xml'))">
  <!-- ... -->
</xsl:if>
An xsl:product-name property is defined to return the name of
the XSLT processor (e.g., Saxon).
An xsl:product-version
property is defined to
return the version of the XSLT processor (e.g.,
8.1
).
An xsl:is-schema-aware
property is defined to
return yes
or no
to indicate if
the processor is schema aware.
An xsl:supports-serialization
property is defined
to return yes
or no
to indicate
if the processor supports serialization.
An xsl:supports-backwards-compatibility property is defined to return
yes or no to indicate if the processor supports backward-compatibility
mode (e.g., version="1.0").
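A short sketch that reports these properties for whatever processor runs it is shown below. It uses the property names as they appear in the XSLT 2.0 Recommendation, and the labels in the output are arbitrary:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- One line per property; separator puts a newline between items -->
    <xsl:value-of separator="&#10;" select="
        concat('product: ', system-property('xsl:product-name')),
        concat('version: ', system-property('xsl:product-version')),
        concat('schema-aware: ', system-property('xsl:is-schema-aware')),
        concat('serialization: ',
               system-property('xsl:supports-serialization')),
        concat('backwards-compatible: ',
               system-property('xsl:supports-backwards-compatibility'))"/>
  </xsl:template>

</xsl:stylesheet>
```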
The main themes that govern the new capabilities of old XSLT
instructions are type support and consistency. Here consistency
largely means support for a select
attribute when
only a sequence constructor was supported in the past or vice versa.
One quirk of the XSLT creators is that they continue to introduce two
distinct names for constructs where, to my way of thinking, one would
have sufficed. In particular, consider the attributes
as
and type
. They are never
used together, so why not just type
? A similar
argument could have been made for eliminating
xsl:with-param
in favor of just
xsl:param
when 1.0 was specified.