Text processing has made it possible to right-justify any idea, even one which cannot be justified on any other grounds.
In the age of the Internet, formats such as HTML, XHTML, XML, and PDF clearly dominate the application of XSL and XSLT on the output side. However, plain old text will never become obsolete because it is the lowest common denominator in both human- and machine-readable formats. XML is often converted to text for import into another application that does not know how to read XML or does not interpret it the way you prefer. Text output is also used when the result will be sent to a terminal or post-processed in, for example, a Unix pipeline.
Many examples in this section focus on XSLT techniques that create generic XML-to-text converters. Here, generic means that the transformation can be customized easily to work on many different XML inputs or produce a variety of outputs, or both. The techniques employed in these examples have application beyond the specifics of a given recipe and often beyond the domain of text processing. In particular, you may want to look at Recipe 5.2 through Recipe 5.5, even if they do not address a present need.
Of all the output formats supported by xsl:output, text is the one for which managing whitespace is the most crucial. For this reason, this chapter addresses the issue separately in Recipe 5.1. Developers inexperienced in XML and XSLT are often vexed by what seems to be fickle treatment of whitespace. However, once you understand the rules and the techniques for exploiting them, it is easier to create output that is formatted correctly.
Source-code generation from XML is arguably in the domain of XML-to-text transformation. However, code generation involves issues that transcend mere transformation and formatting. Chapter 10 will deal with code generation as a subject unto itself.
Consider the following annotated XML sample. Symbols for newline, tab (→), and space mark the whitespace-only text nodes that are often overlooked but are subject to being copied to the output:
<review>
→<author>
→
→ <name>Sal Mangano</name>
→
→<email>[email protected]</email>
→</author>
<title>XSLT Cookbook</title>
<reviewer>
→
→<name><annonymous/></name>
→
→<email>[email protected]</email>
→
→<comment>
Totally awesome. <b>Worth every cent!</b><i>Must buy because I
know the author personally and he can sure use the money.</i>
→
→</comment>
→</reviewer>
</review>
Use xsl:strip-space to get rid of whitespace-only nodes.
This top-level element has a single attribute, elements, which is assigned a whitespace-separated list of the names of elements that you want stripped of extra whitespace. Here, extra whitespace means whitespace-only text nodes. This means, for example, that the whitespace separating words in the previous comment element is significant because the text nodes that contain it are not whitespace only. On the other hand, the whitespace designated by the special symbols is whitespace only.
A common idiom uses <xsl:strip-space elements="*"/> to strip whitespace by default and xsl:preserve-space (see later) to override the default for specific elements.
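As a minimal sketch of this idiom, the following stylesheet strips whitespace-only nodes everywhere except inside comment elements and then reports how many whitespace-only text nodes remain. The choice of comment as the preserved element and the counting template are illustrative assumptions, not part of the recipe.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Drop whitespace-only text nodes everywhere by default... -->
  <xsl:strip-space elements="*"/>
  <!-- ...but keep them inside comment, where the spacing between inline
       elements such as b and i may matter. -->
  <xsl:preserve-space elements="comment"/>

  <xsl:template match="/">
    <!-- Count the whitespace-only text nodes that survived stripping. -->
    <xsl:value-of select="count(//text()[not(normalize-space())])"/>
    <xsl:text>&#xA;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

For an input that carries no xml:space attributes, removing the xsl:preserve-space declaration should bring the count to zero, because every whitespace-only text node is then stripped.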
Use normalize-space to get rid of extra whitespace.
A common mistake is to assume that xsl:strip-space takes care of “extra” whitespace like that used to align text in the previous comment element. This is not the case. Whitespace in an element’s text that is mixed with non-whitespace characters is always considered significant. To remove this extra space, use normalize-space, as in <xsl:value-of select="normalize-space(comment)"/>.
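A complete stylesheet built around that expression might look like the following sketch. The root template that visits only comment elements is an assumption made to keep the output small.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Visit only the comment elements; skip everything else. -->
  <xsl:template match="/">
    <xsl:apply-templates select="//comment"/>
  </xsl:template>

  <xsl:template match="comment">
    <!-- normalize-space works on the string value of the element, so the
         text of child elements such as b and i is included. -->
    <xsl:value-of select="normalize-space(.)"/>
    <xsl:text>&#xA;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

Note that normalize-space never inserts whitespace. Because the b and i elements in the sample are adjacent, the last word of one and the first word of the other run together in the result.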
Use translate to get rid of all whitespace.
Another common mistake is to assume normalize-space strips all whitespace. This is not the case. Instead, it strips only leading and trailing whitespace and converts each run of internal whitespace characters to a single space. If you need to strip all whitespace, use translate(something, '&#x20;&#xA;&#xD;&#x9;', '').
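The sketch below applies this technique to email elements, which is an arbitrary choice made for illustration. The whitespace characters are written as character references because literal tabs and line breaks typed inside an attribute value are normalized to spaces by the XML parser before XSLT ever sees them.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Visit only the email elements; skip everything else. -->
  <xsl:template match="/">
    <xsl:apply-templates select="//email"/>
  </xsl:template>

  <xsl:template match="email">
    <!-- Delete every space, newline, carriage return, and tab by mapping
         them to an empty replacement string. -->
    <xsl:value-of select="translate(., '&#x20;&#xA;&#xD;&#x9;', '')"/>
    <xsl:text>&#xA;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```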
Use an empty xsl:text element to prevent terminating whitespace in the stylesheet from being considered relevant.
xsl:text is normally considered a way to preserve whitespace. However, a strategically placed empty xsl:text element can prevent trailing whitespace in the stylesheet from being interpreted as significant.
Consider the results of the two modes in the following document and stylesheet, shown in Example 5-1 to Example 5-3.
Example 5-1. Input
<numbers>
  <number>10</number>
  <number>3.5</number>
  <number>4.44</number>
  <number>77.7777</number>
</numbers>
Example 5-2. Processing numbers with and without an empty xsl:text element
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="text"/>

<xsl:strip-space elements="*"/>

<xsl:template match="numbers">
Without empty text element:
<xsl:apply-templates mode="without"/>
With empty text element:
<xsl:apply-templates mode="with"/>
</xsl:template>

<xsl:template match="number" mode="without">
  <xsl:value-of select="."/>,
</xsl:template>

<xsl:template match="number" mode="with">
  <xsl:value-of select="."/>,<xsl:text/>
</xsl:template>

</xsl:stylesheet>
Example 5-3. Output
Without empty text element:
10,
3.5,
4.44,
77.7777,

With empty text element:
10,3.5,4.44,77.7777,
Note that there is nothing magical about xsl:text when it is used this way. It works just as well if you replace <xsl:text/> with <xsl:if test="0"/> (but don’t do so unless you enjoy confusing others). The effect is to place an element node between the comma and the trailing newline, which turns that newline into a whitespace-only text node that is stripped.
Use xsl:preserve-space to override xsl:strip-space for specific elements.
There is not much point in using xsl:preserve-space unless you also use xsl:strip-space. This is because the default behavior preserves space in the input document and in documents loaded with the document() function.
Use xsl:text to precisely specify text-output spacing.
All whitespace inside an xsl:text element is preserved. This preservation allows precise control over whitespace placement. Sometimes you can use xsl:text to simply introduce line breaks:
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="text"/>

<xsl:strip-space elements="*"/>

<xsl:template match="number">
  <xsl:value-of select="."/>
  <xsl:text>&#xA;</xsl:text>
</xsl:template>

</xsl:stylesheet>
However, the problem with outputting newline characters directly is that some platforms (e.g., Microsoft’s) expect a line break to be represented as carriage return plus newline. Because XML parsers are required to convert carriage return plus newline into a single newline, there is no way to create a platform-independent stylesheet. Fortunately, most Windows-based editors and the Windows command prompt handle single newlines correctly. The one exception is the Notepad editor that comes free with Windows.
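One possible workaround, sketched here on the assumption that your processor lets the caller set top-level parameters, is to keep the line terminator in a parameter and let whoever runs the transformation choose it. The parameter name eol is hypothetical. Character references such as &#xD; are not subject to the parser's line-end normalization, so a carriage return supplied that way survives, although how a given processor finally writes line ends to a file or console may still vary.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>
  <xsl:strip-space elements="*"/>

  <!-- Hypothetical parameter: defaults to a single newline. A caller could
       pass carriage return plus newline instead. -->
  <xsl:param name="eol" select="'&#xA;'"/>

  <xsl:template match="number">
    <xsl:value-of select="."/>
    <!-- Emit whichever terminator the caller selected. -->
    <xsl:value-of select="$eol"/>
  </xsl:template>

</xsl:stylesheet>
```

This does not make the stylesheet itself platform aware. It only moves the platform decision out of the stylesheet and into the invocation.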
Use nonbreaking space characters.
XSLT does not treat the character &#xA0; (nonbreaking space) as normal whitespace. In particular, xsl:strip-space and normalize-space() both leave this character alone. If you need to strip whitespace most of the time but have specific instances when it should remain in place, you might try to use this character in the XML input. Nonbreaking space is particularly useful for HTML output, but may be of lesser value in other contexts (depending on how the renderer handles it).
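The following sketch illustrates the difference with a made-up string held in a variable named padded (both the name and the value are assumptions for the example). It works with any input document because it matches only the root.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- Two nonbreaking spaces, the digits 42, then two ordinary spaces. -->
    <xsl:variable name="padded" select="'&#xA0;&#xA0;42&#x20;&#x20;'"/>

    <!-- Only the ordinary spaces are trimmed, leaving 4 characters. -->
    <xsl:value-of select="string-length(normalize-space($padded))"/>
    <xsl:text>&#xA;</xsl:text>

    <!-- Mapping nonbreaking spaces to ordinary ones first leaves 2. -->
    <xsl:value-of
        select="string-length(normalize-space(translate($padded, '&#xA0;', ' ')))"/>
    <xsl:text>&#xA;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

The translate call in the second expression is also a convenient way to turn nonbreaking spaces back into ordinary ones when producing plain text.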
The solution section lists techniques for managing whitespace. However, knowing the XSLT rules that underlie the techniques is also useful.
The most important rule to know applies to both the stylesheet and the input document(s):
A text node is never stripped unless it contains only whitespace characters (#x20, #x9, #xD, or #xA).
Although they are not all that common, you should also understand the effect of xml:space attributes in both the stylesheet and the input document(s).
If a text node’s ancestor element has an xml:space attribute with a value of preserve, and no closer ancestor element has xml:space with a value of default, then that whitespace-only text node is not stripped.
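As an illustration of the xml:space rule in an input document, consider the following made-up fragment, processed by a stylesheet that declares <xsl:strip-space elements="*"/>. The element content is invented for the example.

```xml
<review>
  <!-- The single space between </b> and <i> is a whitespace-only text
       node. Because of xml:space="preserve", it should be kept. -->
  <comment xml:space="preserve"><b>Worth every cent!</b> <i>Must buy.</i></comment>
  <!-- Here the same node is subject to stripping. -->
  <comment><b>Worth every cent!</b> <i>Must buy.</i></comment>
</review>
```

Under these rules, the string value of the first comment keeps the space between the two sentences, while the second loses it and the sentences run together.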
The chapter now looks at the rules for stylesheets and source documents separately. For stylesheets, your options are simple.
The only stylesheet element for which whitespace-only nodes are preserved by default is xsl:text. Here, “by default” means unless otherwise specified using xml:space="preserve", as stated in the xml:space rule earlier. See Example 5-4 and Example 5-5.
Example 5-4. Stylesheet demonstrating the effect of xsl:text and xml:space=preserve
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="text"/>

<xsl:strip-space elements="*"/>

<xsl:template match="numbers">
Without xml:space="preserve":
<xsl:apply-templates mode="without-preserve"/>
With xml:space="preserve":
<xsl:apply-templates mode="with-preserve"/>
</xsl:template>

<xsl:template match="number" mode="without-preserve">
  <xsl:value-of select="."/><xsl:text> </xsl:text>
</xsl:template>

<xsl:template match="number" mode="with-preserve" xml:space="preserve">
  <xsl:value-of select="."/><xsl:text> </xsl:text>
</xsl:template>

</xsl:stylesheet>
Example 5-5. Output
Without xml:space="preserve":
10 3.5 4.44 77.7777
With xml:space="preserve":

  10

  3.5

  4.44

  77.7777
The only whitespace introduced by the first number match is the single space contained in the xsl:text element. However, when you use xml:space="preserve" in the second number match template, you pick up all the whitespace contained in the element, including the two line breaks (the first is after the <xsl:template ...> and the second is after the </xsl:text>).
For source documents, the rules are as follows:
Initially, the list of elements in which whitespace is preserved includes all elements in the document.
If an element name matches a NameTest in an xsl:strip-space element, then it is removed from the list of whitespace-preserving element names.
If an element name matches a NameTest in an xsl:preserve-space element, then it is added to the list of whitespace-preserving element names.
A NameTest is either a simple name (e.g., doc), a name with a namespace prefix (e.g., my:doc), a wildcard (e.g., *), or a wildcard with a namespace prefix (e.g., my:*).
The default priority and import precedence rules apply when conflicts exist between xsl:strip-space and xsl:preserve-space:
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:my="http://www.ora.com/XSLTCookbook/ns/my">

  <!-- Strip whitespace in all elements -->
  <xsl:strip-space elements="*"/>
  <!-- except those in the "my" namespace -->
  <xsl:preserve-space elements="my:*"/>
  <!-- and those named foo -->
  <xsl:preserve-space elements="foo"/>