Chapter 5. XML to Text

Text processing has made it possible to right-justify any idea, even one which cannot be justified on any other grounds.

J. Finegan

In the age of the Internet, formats such as HTML, XHTML, XML, and PDF clearly dominate the application of XSL and XSLT on the output side. However, plain old text will never become obsolete because it is the lowest common denominator in both human- and machine-readable formats. XML is often converted to text for import into another application that does not know how to read XML or does not interpret it the way you prefer. Text output is also used when the result will be sent to a terminal or post-processed in, for example, a Unix pipeline.

Many examples in this section focus on XSLT techniques that create generic XML-to-text converters. Here, generic means that the transformation can be customized easily to work on many different XML inputs or produce a variety of outputs, or both. The techniques employed in these examples have application beyond the specifics of a given recipe and often beyond the domain of text processing. In particular, you may want to look at Recipe 5.2 through Recipe 5.5, even if they do not address a present need.

Of all the output formats supported by xsl:output, text is the one for which managing whitespace is the most crucial. For this reason, this chapter addresses the issue separately in Recipe 5.1. Developers inexperienced in XML and XSLT are often vexed by what seems to be fickle treatment of whitespace. However, once you understand the rules and the techniques for exploiting them, it is easier to create output that is formatted correctly.

Source-code generation from XML is arguably in the domain of XML-to-text transformation. However, code generation involves issues that transcend mere transformation and formatting. Chapter 10 will deal with code generation as a subject unto itself.

Dealing with Whitespace

Problem

You need to convert XML into formatted text, but whitespace issues are ruining the results.

Solution

Consider the following annotated XML sample. The symbols ↵ (newline), → (tab), and ␣ (space) mark whitespace-only text nodes that are often overlooked but subject to being copied to the output:

<review>↵
→<author>↵
→→<name>Sal Mangano</name>↵
→→<email>[email protected]</email>↵
→</author>↵
→<title>XSLT Cookbook</title>↵
→<reviewer>↵
→→<name>␣<annonymous/>␣</name>↵
→→<email>[email protected]</email>↵
→→<comment>↵
Totally awesome. <b>Worth every cent!</b>␣<i>Must buy because I ↵
know the author personally and he can sure use the money.</i>↵
→→</comment>↵
→</reviewer>↵
</review>↵

Too much whitespace

  1. Use xsl:strip-space to get rid of whitespace-only nodes.

    This top-level element has a single attribute, elements, which takes a whitespace-separated list of the element names whose extra whitespace you want stripped. Here, extra whitespace means whitespace-only text nodes. The whitespace separating words in the previous comment element, for example, is significant because it belongs to text nodes that are not whitespace only. The whitespace designated by the special symbols, on the other hand, is whitespace only.

    A common idiom uses <xsl:strip-space elements="*"/> to strip whitespace by default and xsl:preserve-space (see later) to override specific elements.

  2. Use normalize-space to get rid of extra whitespace.

    A common mistake is to assume that xsl:strip-space takes care of “extra” whitespace like that used to align text in the previous comment element. This is not the case. Whitespace that is mixed with nonwhitespace inside an element’s text is always considered significant. To remove this extra space, use normalize-space, as in <xsl:value-of select="normalize-space(comment)"/>.

  3. Use translate to get rid of all whitespace.

    Another common mistake is to assume normalize-space strips all whitespace. This is not the case. Instead, it strips only leading and trailing whitespace and converts each run of internal whitespace characters to a single space. If you need to strip all whitespace, use translate(something, '&#x20;&#xa;&#xd;&#x9;', '').

  4. Use an empty xsl:text element to prevent terminating whitespace in the stylesheet from being considered relevant.

    xsl:text is normally considered a way to preserve whitespace. However, a strategically placed empty xsl:text element can prevent trailing whitespace in the stylesheet from being interpreted as significant.

Consider the results of the two modes in the following document and stylesheet, shown in Example 5-1 to Example 5-3.

Example 5-1. Input

<numbers>
  <number>10</number>
  <number>3.5</number>
  <number>4.44</number>
  <number>77.7777</number>
</numbers>

Example 5-2. Processing numbers with and without an empty xsl:text element

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
   
<xsl:output method="text"/>
<xsl:strip-space elements="*"/>
   
<xsl:template match="numbers">
Without empty text element:
<xsl:apply-templates mode="without"/>
With empty text element:
<xsl:apply-templates mode="with"/>
</xsl:template>   
   
<xsl:template match="number" mode="without">
  <xsl:value-of select="."/>,
</xsl:template>
   
<xsl:template match="number" mode="with">
  <xsl:value-of select="."/>,<xsl:text/>
</xsl:template>
   
</xsl:stylesheet>

Example 5-3. Output

Without empty text element:
10,
3.5,
4.44,
77.7777,
   
With empty text element:
10,3.5,4.44,77.7777,

Note that there is nothing magical about xsl:text when it is used this way. It works just as well if you replace <xsl:text/> with <xsl:if test="0"/> (but don’t do so unless you enjoy confusing others). Either element separates the comma from the trailing newline, so the newline becomes a whitespace-only text node of its own, and whitespace-only text nodes in a stylesheet are ignored.
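
Items 2 and 3 are easy to confuse, so a side-by-side comparison may help. The following is a minimal sketch that is not part of the original recipe. It assumes the review sample shown earlier as input, and the labels and square brackets are there only to make the edges of each result visible.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="text"/>

<xsl:template match="/">
  <!-- normalize-space: trim both ends, collapse internal runs to one space -->
  <xsl:text>normalized: [</xsl:text>
  <xsl:value-of select="normalize-space(review/reviewer/comment)"/>
  <xsl:text>]&#xa;</xsl:text>

  <!-- translate: delete every space, tab, carriage return, and newline -->
  <xsl:text>translated: [</xsl:text>
  <xsl:value-of select="translate(review/reviewer/comment,
                                  '&#x20;&#x9;&#xd;&#xa;', '')"/>
  <xsl:text>]&#xa;</xsl:text>
</xsl:template>

</xsl:stylesheet>
```

The first line of output should show the comment as a single line with one space between words. The second should show the same characters with no whitespace at all, which is rarely what you want for prose but can be useful for codes or identifiers. The whitespace characters in the translate call are written as character references because literal tabs and newlines in an attribute value are normalized to spaces by the XML parser.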

Too little whitespace

  1. Use xsl:preserve-space to override xsl:strip-space for specific elements.

    There is not much point in using xsl:preserve-space unless you also use xsl:strip-space. This is because the default behavior preserves space in the input document and documents loaded with the document( ) function.

    Warning

    Remember, MSXML strips whitespace-only text nodes by default. In this case, you can use xsl:preserve-space to counteract this nonconformance.

  2. Use xsl:text to precisely specify text-output spacing.

    All whitespace inside an xsl:text element is preserved. This preservation allows precise control over whitespace placement. Sometimes you can use xsl:text to simply introduce line breaks:

    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
       
    <xsl:output method="text"/>
    <xsl:strip-space elements="*"/>
       
    <xsl:template match="number">
      <xsl:value-of select="."/>
      <xsl:text>&#xa;</xsl:text>
    </xsl:template>
    </xsl:stylesheet>

    However, the problem with outputting newline characters directly is that some platforms (e.g., Microsoft’s) expect a line break to be represented as carriage return plus newline. Because XML parsers are required to convert carriage return plus newline into a single newline, there is no way to create a platform-independent stylesheet. Fortunately, most Windows-based editors and the Windows command prompt handle single newlines correctly. The one exception is the Notepad editor that comes free with Windows.

  3. Use nonbreaking space characters.

    XSLT does not treat character &#xA0; (nonbreaking space) as normal whitespace. In particular, xsl:strip-space and normalize-space( ) both ignore this character. If you need to strip whitespace most of the time but have specific instances when it should remain in place, you might try to use this character in the XML input. Nonbreaking space is particularly useful for HTML output, but may be of lesser value in other contexts (depending on how the renderer handles it).
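
The combination described in item 1 can be sketched as follows. This stylesheet is an illustration rather than part of the original recipe. It assumes the review sample from the Solution as input and assumes that the only thing between </b> and <i> in the comment is a single space, as the annotation there suggests.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="text"/>

<!-- Strip whitespace-only text nodes everywhere by default... -->
<xsl:strip-space elements="*"/>
<!-- ...but keep those whose parent is comment, where a lone space
     separates the inline b and i elements -->
<xsl:preserve-space elements="comment"/>

<xsl:template match="comment">
  <!-- Collapse the alignment whitespace; the preserved space
       between b and i survives as a single space -->
  <xsl:value-of select="normalize-space(.)"/>
  <xsl:text>&#xa;</xsl:text>
</xsl:template>

<!-- Suppress all other text in the document -->
<xsl:template match="text()"/>

</xsl:stylesheet>
```

Without the xsl:preserve-space declaration, the space between the two inline elements would be stripped as a whitespace-only node, and the two sentences would run together in the output. The more specific name comment wins over the wildcard * because it has a higher default priority.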

Discussion

The solution section lists techniques for managing whitespace. However, knowing the XSLT rules that underlie the techniques is also useful.

The most important rule to know applies to both the stylesheet and the input document(s):

  1. A text node is never stripped unless it contains only whitespace characters (#x20, #x9, #xD, or #xA).

    Although they are not all that common, you should also understand the effect of xml:space attributes in both the stylesheet and the input document(s).

  2. If a text node’s ancestor element has an xml:space attribute with a value of preserve, and no closer ancestor element has xml:space with a value of default, then whitespace-only text nodes are not stripped.

    The chapter now looks at the rules for stylesheets and source documents separately. For stylesheets, your options are simple.

  3. The only stylesheet element in which whitespace-only text nodes are preserved by default is xsl:text. Here, “by default” means unless otherwise specified using xml:space="preserve", as stated in rule 2. See Example 5-4 and Example 5-5.

Example 5-4. Stylesheet demonstrating the effect of xsl:text and xml:space="preserve"

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
   
<xsl:output method="text"/>
   
<xsl:strip-space elements="*"/>
   
<xsl:template match="numbers">
Without xml:space="preserve":
<xsl:apply-templates mode="without-preserve"/>
With xml:space="preserve":
<xsl:apply-templates mode="with-preserve"/>
</xsl:template>
   
<xsl:template match="number" mode="without-preserve">
  <xsl:value-of select="."/><xsl:text> </xsl:text>
</xsl:template>
   
<xsl:template match="number" mode="with-preserve" xml:space="preserve">
  <xsl:value-of select="."/><xsl:text> </xsl:text>
</xsl:template>
   
</xsl:stylesheet>

Example 5-5. Output

Without xml:space="preserve":
10 3.5 4.44 77.7777
With xml:space="preserve":
   
  10
   
  3.5
   
  4.44
   
  77.7777

The only whitespace introduced by the first number match is the single space contained in the xsl:text element. However, when you use xml:space="preserve" in the second number match template, you pick up all the whitespace contained in the element including the two line breaks (the first is after the <xsl:template ...> and the second is after the </xsl:text>).
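
Rule 2 applies to source documents as well as stylesheets. As a sketch, the following variation on the Example 5-1 input wraps two of the numbers in a group element; group is a hypothetical name introduced only for this illustration.

```xml
<numbers>
  <number>10</number>
  <!-- xml:space="preserve" protects the whitespace-only text nodes
       inside group, even from <xsl:strip-space elements="*"/> -->
  <group xml:space="preserve">
    <number>3.5</number>
    <number>4.44</number>
  </group>
  <number>77.7777</number>
</numbers>
```

With <xsl:strip-space elements="*"/> in effect, the whitespace-only text nodes directly under numbers are still stripped, but those inside group are kept because their nearest xml:space declaration says preserve. A descendant element carrying xml:space="default" would turn stripping back on for its own content.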

For source documents, the rules are as follows:

  • Initially, the list of elements in which whitespace is preserved includes all elements in the document.

  • If an element name matches a NameTest in an xsl:strip-space element, then it is removed from the list of whitespace-preserving element names.

  • If an element name matches a NameTest in an xsl:preserve-space element, then it is added to the list of whitespace-preserving element names.

A NameTest is a simple name (e.g., doc), a name with a namespace prefix (e.g., my:doc), a wildcard (e.g., *), or a wildcard with a namespace prefix (e.g., my:*). The default priority and import precedence rules apply when conflicts exist between xsl:strip-space and xsl:preserve-space:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
xmlns:my="http://www.ora.com/XSLTCookbook/ns/my">
   
<!-- Strip whitespace in all elements -->
<xsl:strip-space elements="*"/>
   
<!-- except those in the "my" namespace -->
<xsl:preserve-space elements="my:*"/>
   
<!-- and those named foo -->
<xsl:preserve-space elements="foo"/>

</xsl:stylesheet>

See Also

One of the most complete discussions of whitespace handling is in Michael Kay’s XSLT Programmer’s Reference (Wrox, 2001). The book also includes good coverage of the import precedence rules.
