2. Strings

Chapter 2. Strings

I believe everybody in the world should have guns. Citizens should have bazookas and rocket launchers too. I believe that all citizens should have their weapons of choice. However, I also believe that only I should have the ammunition. Because frankly, I wouldn’t trust the rest of the goobers with anything more dangerous than [a] string.

—Scott Adams

Introduction

When it comes to manipulating strings, XSLT 1.0 certainly lacks the heavy artillery of Perl. XSLT is a language optimized for processing XML markup, not strings. However, since XML is simply a structured form of text, string processing is inevitable in all but the most trivial transformation problems. Unfortunately, XSLT 1.0 has only nine standard functions for string processing. Java, on the other hand, has about two dozen, and Perl, the undisputed king of modern text-processing languages, has a couple dozen plus a highly advanced regular-expression engine.

With the emergence of XSLT 2.0 implementations, XSLT developers can dispense with their Perl string envy. XPath 2.0 now provides 20 functions related to string processing. The functions include support for regular expressions. In addition, XSLT 2.0 adds facilities for parsing unstructured text via regular expressions so it can be converted to proper XML.

XSLT 1.0 programmers have two choices when they need to perform advanced string processing. First, they can call out to external functions written in Java or some other language supported by their XSLT processor. This can be extremely convenient if portability is not an issue and fairly heavy-duty string manipulation is needed. Second, they can implement the advanced string-handling functionality directly in XSLT. This chapter shows that quite a bit of common string manipulation can be done within the confines of XSLT 1.0 and also how the same problems are more easily handled in XSLT 2.0.

You can implement advanced string functions in XSLT 1.0 by combining the capabilities of the native string functions and by exploiting the power of recursion, which is an integral part of all advanced uses of XSLT. In fact, recursion is such an important technique in XSLT that it is worthwhile to look through some of these recipes even if you have no intention of implementing your string-processing needs directly in XSLT.

This book also refers to the excellent work of EXSLT.org, a community initiative that helps standardize extensions to the XSLT language. You may want to check out their site at http://www.exslt.org.

Tip

When I implement a solution in XSLT 2.0 that is more than a line or so of code, I use the new XSLT 2.0 ability to write first-class XPath functions in XSLT. By contrast, the 1.0 solutions use named templates, which you can invoke only via xsl:call-template.

2.1. Testing If a String Ends with Another String

Problem

You need to test if a string ends with a particular substring.

Solution

XSLT 1.0

substring($value, (string-length($value) - string-length($substr)) + 1) = $substr

XSLT 2.0

ends-with($value, $substr)

Discussion

XSLT 1.0 contains a native starts-with() function but no ends-with(). This is rectified in 2.0. However, as the previous 1.0 code shows, ends-with() can be implemented easily in terms of substring() and string-length(). The code simply extracts the last string-length($substr) characters from the target string and compares them to the substring.

Warning

Programmers accustomed to having the first position in a string start at index 0 should note that XSLT strings start at index 1.
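
If you need the 1.0 test in several places, one option is to wrap the expression in a named template. The following is only a sketch: the template name ends-with, its parameter names, and the sample values are illustrative assumptions, not part of the recipe.

```xml
<!-- Hypothetical wrapper around the XSLT 1.0 expression from the Solution -->
<xsl:template name="ends-with">
  <xsl:param name="value"/>
  <xsl:param name="substr"/>
  <!-- Positions are 1-based, so the suffix starts at (length difference + 1) -->
  <xsl:value-of select="substring($value,
                          string-length($value) - string-length($substr) + 1)
                        = $substr"/>
</xsl:template>

<!-- Usage: the string value of $is-xml is 'true' -->
<xsl:variable name="is-xml">
  <xsl:call-template name="ends-with">
    <xsl:with-param name="value" select="'report.xml'"/>
    <xsl:with-param name="substr" select="'.xml'"/>
  </xsl:call-template>
</xsl:variable>
```

Because a named template returns a result tree fragment in XSLT 1.0, the caller compares against the string, as in test="$is-xml = 'true'".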

2.2. Finding the Position of a Substring

Problem

You want to find the index of a substring within a string rather than the text before or after the substring.

Solution

XSLT 1.0

<xsl:template name="string-index-of">
     <xsl:param name="input"/>
     <xsl:param name="substr"/>
<xsl:choose>
     <xsl:when test="contains($input, $substr)">
          <xsl:value-of select="string-length(substring-before($input, $substr))+1"/>
     </xsl:when>
     <xsl:otherwise>0</xsl:otherwise>
</xsl:choose>
</xsl:template>

XSLT 2.0

<xsl:function name="ckbk:string-index-of">
  <xsl:param name="input"/>
  <xsl:param name="substr"/>
  <xsl:sequence select="if (contains($input, $substr)) 
                        then string-length(substring-before($input, $substr))+1 
                        else 0"/>
</xsl:function>

Discussion

The position of a substring within another string is simply the length of the string preceding it plus 1. If you are certain that the target string contains the substring, then you can simply use string-length(substring-before($input, $substr))+1. However, in general, you need a way to handle the case in which the substring is not present. Here, zero is chosen as an indication of this case, but you can use another value such as -1 or NaN.
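
As a quick illustration of how the two versions are invoked, consider the following sketch. The variable name pos and the sample strings are assumptions chosen for the example, and the ckbk prefix must be bound to a namespace in your stylesheet.

```xml
<!-- XSLT 1.0: invoke the named template and capture its output -->
<xsl:variable name="pos">
  <xsl:call-template name="string-index-of">
    <xsl:with-param name="input" select="'hello world'"/>
    <xsl:with-param name="substr" select="'world'"/>
  </xsl:call-template>
</xsl:variable>
<!-- 'hello ' has length 6, so number($pos) is 7 -->

<!-- XSLT 2.0: the same lookup as an ordinary XPath function call -->
<xsl:value-of select="ckbk:string-index-of('hello world', 'world')"/>
```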

2.3. Removing Specific Characters from a String

Problem

You want to strip certain characters (e.g., whitespace) from a string.

Solution

XSLT 1.0

Use translate with an empty replace string. For example, the following code can strip whitespace from a string:

translate($input," &#x9;&#xa;&#xd;", "")

XSLT 2.0

Using translate() is still a good idea in XSLT 2.0 because it will usually perform best. However, some string removal tasks are much more naturally implemented using regular expressions and the new replace() function:

(: \s matches all whitespace characters :)
replace($input,"\s","")

Discussion

translate() is a versatile string function that is often used to compensate for missing string-processing capabilities in XSLT 1.0. Here you use the fact that translate() will not copy characters in the input string that are in the from string but do not have a corresponding character in the to string.

You can also use translate to remove all but a specific set of characters from a string. For example, the following code removes all non-numeric characters from a string:

translate($string, 
          translate($string,'0123456789',''),'')

The inner translate() removes all characters of interest (e.g., numbers) to obtain a from string for the outer translate(), which removes these non-numeric characters from the original string.
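
To make the two steps concrete, here is a small worked example. The sample value is an assumption picked for illustration.

```xml
<xsl:variable name="string" select="'(555) 123-4567'"/>

<!-- Step 1: the inner translate() strips the digits, leaving '() -' -->
<xsl:value-of select="translate($string, '0123456789', '')"/>

<!-- Step 2: the outer translate() removes those leftover characters
     from the original string, leaving '5551234567' -->
<xsl:value-of select="translate($string,
                        translate($string, '0123456789', ''), '')"/>
```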

Sometimes you do not want to remove all occurrences of whitespace, but instead want to remove leading, trailing, and redundant internal whitespace. XPath has a built-in function, normalize-space( ), which does just that. If you ever needed to normalize based on characters other than spaces, then you might use the following code (where C is the character you want to normalize):

translate(normalize-space(translate($input,"C "," C")),"C "," C")

However, this transformation won’t work quite right if the input string contains whitespace characters other than spaces; i.e., tab (#x9), newline (#xA), and carriage return (#xD). The reason is that the code swaps space with the character to normalize, and then normalizes the resulting spaces and swaps back. If nonspace whitespace remains after the first transformation, it will also be normalized, which might not be what you want. Then again, the applications of non-whitespace normalizing are probably rare anyway. Here you use this technique to remove extra - characters:

<xsl:template match="/">
  <xsl:variable name="input" 
       select=" '---this --is-- the way we normalize non-whitespace---' "/>
 <xsl:value-of 
      select="translate(normalize-space(
                                 translate($input,'- ',' -')),'- ',' -')"/>
</xsl:template>

The result is:

this -is- the way we normalize non-whitespace

XSLT 2.0

Another more powerful way to remove undesired characters from a string is the use of the XSLT 2.0 replace() function, which harnesses the power of regular expressions. Here we use replace() to normalize non-whitespace without the caveats of our XSLT 1.0 solution:

<xsl:template match="/">
 <xsl:variable name="input" 
      select=" '---this --is-- the way we normalize non-whitespace---' "/>
<xsl:value-of select="replace(replace($input,'-+','-'),'^-|-$','')"/>
</xsl:template>

This code uses two calls to replace(). The inner call replaces multiple occurrences of - with a single -, and the outer call removes leading and trailing - characters.

This chapter introduces one of the veteran programmer’s favorite tools for advanced string manipulation: regular expressions (affectionately known as regex). The addition of regex capabilities to XSLT was on the top 10 list of almost every XSLT developer I know. This sidebar is intended for those developers who have not had the pleasure of working with regular expressions or who are too intimidated by them. This is not an exhaustive reference, but it should get you going.

A regex is a string that encodes a pattern to match in another string. The simplest pattern is a literal string itself — that is, the string “foo” can be used as a regular expression. It will match the string “foobar” starting at the first character. However, the real power of regular expressions is revealed only when you begin to wield the special meta-characters recognized by the language.

The most important meta-characters are those used to construct wildcards.

A period or dot (.) matches a single character.
A character class ([aeiou], [a-z], or [a-zA-Z]) matches a list, range, or combination of lists and ranges of characters.
Some character classes that are common are given special abbreviations. For example, \s is an abbreviation for whitespace characters including space, tab, carriage return, and newline, and \d is short for [0-9]. When there is a backslash abbreviation for a character class, it is often the case that the uppercase version inverts the match. So, for example, \S matches non-whitespace and \D matches a non-digit. This is not universally true. For example, \n matches a newline, but \N does not mean non-newline (this also goes for \t, tab, and \r, carriage return).
One can negate a character class by beginning it with a ^. For example, [^aeiou] matches any character except these lowercase vowels. This also applies to ranges; [^0-9] is the same as \D.
Literals and wildcards are often mixed together. For example, d[aeiou]g matches "dag", "deg", "dig", "dog", and "dug", as well as any longer string that has these as substrings.
Equally important are the repetition metacharacters that allow preceding characters, wildcards, or combinations thereof to match repeatedly.
The * metacharacter means to match the previous expression zero or more times. Hence, be* matches strings containing "b", "be", "bee", "beee", and so on. (10)* matches strings containing "10", "1010", "101010", and so on. Here the parentheses act as a grouping construct. If you remove the parentheses, you get 10*, and the repetition applies only to the 0.
The + metacharacter means to match the previous expression one or more times. Hence, be+ matches strings containing "be", "bee", "beee", and so on, but not "b".
The ? metacharacter means to match the previous expression zero or one time. Hence, be? matches strings containing "b" and "be".
Very often one needs to be specific with respect to where a regular expression matches. In particular, you will often only want to match a pattern at the start (^) or end ($) of a string, and sometimes you will want to match only if the pattern is anchored at both the start and the end. For example, "^be+" will match "bee keeper" but not "has been". The regex "be+$" will match "to be or not to be" but not "be he alive or be he dead". Further, "^be+$" will match "be" and "bee" but not "been" or "Abe".
The regex machinery presented thus far can handle most of the matching tasks you are likely to encounter. However, there are some so-called context-sensitive matches that cannot be handled by simple regex patterns. Consider wanting to match numbers that start and end with the same digit (11, 909, 3233, etc.). Pure regular expressions can’t do this, but most regex engines, including the one specified for XPath 2.0, provide extensions to make this possible.
The facility requires two conventions. The first requires you to mark the portion of the pattern you wish to reference later as a captured group using parentheses, and the second requires you to reference the group by an index variable. For example, (\d)\d*\1 is a regex that matches any number that starts and ends with the same digit. The group is the first digit (\d) and the reference is \1, which means “whatever the first group matched.” As you might guess, you can have multiple groups such as (\d)(\d)\1\2, which will match numbers like "1212" and "9999" but not "1213" or "1221". Back references like \1, \2, etc. are used with the XPath 2.0 matches() function. A similar notation using a $ instead of a \ is reserved for cases where the reference occurs outside of the regular expression itself. This occurs in the function replace(), where you want to refer to groups in the matching regex from the replacement string. For example, replace($someText, '(\d)\d*', '$1') will replace each sequence of one or more digits in $someText with the first digit in that sequence. This facility is also available in the xsl:analyze-string instruction. We discuss these facilities in more detail in Recipe 2.6 and Recipe 2.10.

If you want to explore the world of regular expressions in more depth, you should check out Mastering Regular Expressions, Second Edition by Jeffrey E. F. Friedl (O’Reilly, 1999). If you want more depth on XSLT 2.0’s regex flavor, consider XPath 2.0 by Michael Kay (Wrox, 2004) or the W3C recommendation at http://www.w3.org/TR/xquery-operators#string.match and http://www.w3.org/TR/xmlschema-2/#regexs.
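
The patterns discussed in this sidebar can be tried out directly with the XPath 2.0 matches() and replace() functions. The following sketch reuses sample strings from the text; the template name regex-demo is hypothetical, and the anchors in the back-reference pattern are my addition so that the whole number must match.

```xml
<xsl:template name="regex-demo">
  <!-- true: the string starts with b followed by one or more e -->
  <xsl:value-of select="matches('bee keeper', '^be+')"/>

  <!-- false: 'be' occurs, but not at the start of the string -->
  <xsl:value-of select="matches('has been', '^be+')"/>

  <!-- true: \1 must repeat whatever digit the first group captured -->
  <xsl:value-of select="matches('3233', '^(\d)\d*\1$')"/>

  <!-- 'a1b4': each run of digits is replaced by its first digit ($1) -->
  <xsl:value-of select="replace('a123b456', '(\d)\d*', '$1')"/>
</xsl:template>
```

Note that backslashes need no extra escaping inside an XSLT select attribute; the regex is written exactly as it appears in the prose.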

2.4. Finding Substrings from the End of a String

Problem

XSLT does not have any functions for searching strings in reverse.

Solution

XSLT 1.0

Using recursion, you can emulate a reverse search with a search for the last occurrence of substr. Using this technique, you can create a substring-before-last and a substring-after-last:

<xsl:template name="substring-before-last">
  <xsl:param name="input" />
  <xsl:param name="substr" />
  <xsl:if test="$substr and contains($input, $substr)">
    <xsl:variable name="temp" select="substring-after($input, $substr)" />
    <xsl:value-of select="substring-before($input, $substr)" />
    <xsl:if test="contains($temp, $substr)">
      <xsl:value-of select="$substr" />
      <xsl:call-template name="substring-before-last">
        <xsl:with-param name="input" select="$temp" />
        <xsl:with-param name="substr" select="$substr" />
      </xsl:call-template>
    </xsl:if>
  </xsl:if>
</xsl:template>
   
<xsl:template name="substring-after-last">
<xsl:param name="input"/>
<xsl:param name="substr"/>
   
<!-- Extract the string which comes after the first occurrence -->
<xsl:variable name="temp" select="substring-after($input,$substr)"/>
   
<xsl:choose>
     <!-- If it still contains the search string then recursively process -->
     <xsl:when test="$substr and contains($temp,$substr)">
          <xsl:call-template name="substring-after-last">
               <xsl:with-param name="input" select="$temp"/>
               <xsl:with-param name="substr" select="$substr"/>
          </xsl:call-template>
     </xsl:when>
     <xsl:otherwise>
          <xsl:value-of select="$temp"/>
     </xsl:otherwise>
</xsl:choose>
</xsl:template>

XSLT 2.0

XSLT 2.0 does not add reverse versions of substring-before/after, but one can get the desired effect using the versatile tokenize( ) function that uses regular expressions:

<xsl:function name="ckbk:substring-before-last">
    <xsl:param name="input" as="xs:string"/>
    <xsl:param name="substr" as="xs:string"/>
    <xsl:sequence 
       select="if ($substr) 
               then 
                  if (contains($input, $substr)) then 
                  string-join(tokenize($input, $substr)
                    [position() ne last()],$substr) 
                  else ''
               else $input"/>
</xsl:function>

<xsl:function name="ckbk:substring-after-last">
    <xsl:param name="input" as="xs:string"/>
    <xsl:param name="substr" as="xs:string"/>
    <xsl:sequence 
    select="if ($substr) 
            then
               if (contains($input, $substr))
               then tokenize($input, $substr)[last()] 
               else '' 
            else $input"/>
</xsl:function>

In both functions, we have to test whether the substring is empty because tokenize() does not allow an empty search pattern. Unfortunately, these implementations won’t work exactly like their native counterparts. This is because tokenize() treats its second argument as a regular expression, not a literal string. This could lead to some surprises. You can fix this by having the function escape the special characters used in regular expressions. You can switch this behavior on and off via a third Boolean argument. The original two-argument version and this new three-argument version can coexist because XSLT allows functions to be overloaded (a function is defined by its name and its arity or number of arguments):

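<xsl:function name="ckbk:substring-before-last">
    <xsl:param name="input" as="xs:string"/>
    <xsl:param name="substr" as="xs:string"/>
    <xsl:param name="mask-regex" as="xs:boolean"/>
    <!-- Backslash-escape regex metacharacters when masking is requested -->
    <xsl:variable name="matchstr" 
               select="if ($mask-regex) 
                          then replace($substr,'([.+?*^$])','\\$1')
                          else $substr"/>
    <!-- Test and split on the pattern, but rejoin with the original string -->
    <xsl:sequence 
       select="if ($substr) 
               then 
                  if (matches($input, $matchstr)) then 
                  string-join(tokenize($input, $matchstr)
                    [position() ne last()],$substr) 
                  else ''
               else $input"/>
</xsl:function>

<xsl:function name="ckbk:substring-after-last">
    <xsl:param name="input" as="xs:string"/>
    <xsl:param name="substr" as="xs:string"/>
    <xsl:param name="mask-regex" as="xs:boolean"/>
    <!-- Backslash-escape regex metacharacters when masking is requested -->
    <xsl:variable name="matchstr" 
               select="if ($mask-regex) 
                          then replace($substr,'([.+?*^$])','\\$1')
                          else $substr"/>
    <!-- Test and split on the pattern rather than on the raw string -->
    <xsl:sequence 
       select="if ($substr) 
               then
                  if (matches($input, $matchstr))
                  then tokenize($input, $matchstr)[last()] 
                  else '' 
               else $input"/>
</xsl:function>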
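
To see the kind of surprise that an unescaped pattern can cause, consider searching for a literal dot. This is a minimal sketch with sample values of my own choosing; it uses only the two-argument function and a direct call to tokenize().

```xml
<!-- '.' is a regex wildcard, so every character is treated as a separator
     and all tokens are empty: the result is '' rather than 'c' -->
<xsl:value-of select="ckbk:substring-after-last('a.b.c', '.')"/>

<!-- Escaping the dot by hand makes tokenize() split on literal dots,
     so the last token is 'c' -->
<xsl:value-of select="tokenize('a.b.c', '\.')[last()]"/>
```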

Discussion

Both XSLT string-searching functions (substring-before and substring-after) begin searching at the start of the string. Sometimes you need to search a string from the end. The simplest way to do this in XSLT is to apply the built-in search functions recursively until the last instance of the substring is found.

Warning

There was a nasty “gotcha” in my first attempt at these templates, which you should keep in mind when working with recursive templates that search strings. Recall that contains($anything,'') will always return true! For this reason, I make sure that I also test the existence of a non-null $substr value in the recursive invocations of substring-before-last and substring-after-last. Without these checks, the code will go into an infinite loop for null search input or overflow the stack on implementations that do not handle tail recursion.

Another algorithm is divide and conquer. The basic idea is to split the string in half. If the search string is in the second half, then you can discard the first half, thus turning the problem into a problem half as large. This process repeats recursively. The tricky part is when the search string is not in the second half because you may have split the search string between the two halves. Here is a solution for substring-before-last:

<xsl:template name="str:substring-before-last"> 
   
  <xsl:param name="input"/>
  <xsl:param name="substr"/>
  
  <xsl:variable name="mid" select="ceiling(string-length($input) div 2)"/>
  <xsl:variable name="temp1" select="substring($input,1, $mid)"/>
  <xsl:variable name="temp2" select="substring($input,$mid +1)"/>
  <xsl:choose>
    <xsl:when test="$temp2 and contains($temp2,$substr)">
      <!-- search string is in second half so just append first half -->
      <!-- and recurse on second -->
      <xsl:value-of select="$temp1"/>
      <xsl:call-template name="str:substring-before-last">
        <xsl:with-param name="input" select="$temp2"/>
        <xsl:with-param name="substr" select="$substr"/>
      </xsl:call-template>
    </xsl:when>
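    <!-- search string straddles the boundary or ends the first half: -->
    <!-- keep the text before the boundary window, then apply -->
    <!-- substring-before to the window only -->
    <xsl:when test="contains(substring($input,
                                       $mid - string-length($substr) +1),
                                       $substr)">
      <xsl:value-of select="substring($input, 1,
                                      $mid - string-length($substr))"/>
      <xsl:value-of select="substring-before(
                              substring($input,
                                        $mid - string-length($substr) +1),
                              $substr)"/>
    </xsl:when>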
    <!--search string is in first half so throw away second half-->
    <xsl:when test="contains($temp1,$substr)">
      <xsl:call-template name="str:substring-before-last">
      <xsl:with-param name="input" select="$temp1"/>
      <xsl:with-param name="substr" select="$substr"/>
      </xsl:call-template>
    </xsl:when>
    <!-- No occurrences of search string so we are done -->
    <xsl:otherwise/>
  </xsl:choose>
  
</xsl:template>

As it turns out, divide and conquer is of little or no advantage unless you search large texts (roughly 4,000 characters or more). You might have a wrapper template that chooses the appropriate algorithm based on the length or switches from divide and conquer to the simpler algorithm when the subpart becomes small enough.
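
A wrapper along those lines might look like the following sketch. The template name substring-before-last-auto and the threshold parameter are hypothetical; the default of 4000 simply reflects the rough figure mentioned above and should be tuned for your processor. The str prefix must be bound to a namespace in the stylesheet.

```xml
<!-- Hypothetical dispatcher: picks an algorithm based on input length -->
<xsl:template name="substring-before-last-auto">
  <xsl:param name="input"/>
  <xsl:param name="substr"/>
  <!-- Assumed cutoff; measure on your own data before relying on it -->
  <xsl:param name="threshold" select="4000"/>
  <xsl:choose>
    <!-- Large input: use the divide-and-conquer template -->
    <xsl:when test="string-length($input) &gt;= $threshold">
      <xsl:call-template name="str:substring-before-last">
        <xsl:with-param name="input" select="$input"/>
        <xsl:with-param name="substr" select="$substr"/>
      </xsl:call-template>
    </xsl:when>
    <!-- Small input: the simple recursive template is sufficient -->
    <xsl:otherwise>
      <xsl:call-template name="substring-before-last">
        <xsl:with-param name="input" select="$input"/>
        <xsl:with-param name="substr" select="$substr"/>
      </xsl:call-template>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>
```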

2.5. Duplicating a String N Times

Problem

You need to duplicate a string N times, where N is a parameter. For example, you might need to pad out a string with spaces to achieve alignment.

Solution

XSLT 1.0

A nice solution is a recursive approach that doubles the input string until it is the required length while being careful to handle cases in which $count is odd:

<xsl:template name="dup">
     <xsl:param name="input"/>
     <xsl:param name="count" select="2"/>
     <xsl:choose>
          <xsl:when test="not($count) or not($input)"/>
          <xsl:when test="$count = 1">
               <xsl:value-of select="$input"/>
          </xsl:when>
          <xsl:otherwise>
               <!-- If $count is odd append an extra copy of input -->
               <xsl:if test="$count mod 2">
                    <xsl:value-of select="$input"/>
               </xsl:if>
               <!-- Recursively apply template after doubling input and 
               halving count -->
               <xsl:call-template name="dup">
                    <xsl:with-param name="input" 
                         select="concat($input,$input)"/>
                    <xsl:with-param name="count" 
                         select="floor($count div 2)"/>
               </xsl:call-template>     
          </xsl:otherwise>
     </xsl:choose>
</xsl:template>

XSLT 2.0

In 2.0, we can duplicate quite easily with a for expression. We overload dup to replicate the behavior of the defaulted argument in the XSLT 1.0 implementation:

<xsl:function name="ckbk:dup">
    <xsl:param name="input" as="xs:string"/>
    <xsl:sequence select="ckbk:dup($input,2)"/>
  </xsl:function>

  <xsl:function name="ckbk:dup">
    <xsl:param name="input" as="xs:string"/>
    <xsl:param name="count" as="xs:integer"/>
    <xsl:sequence select="string-join(for $i in 1 to $count return $input,'')"/>
  </xsl:function>
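
For the padding use case mentioned in the Problem section, the two-argument ckbk:dup can be combined with concat(). The function ckbk:pad-right below is a hypothetical helper, not part of the recipe; it assumes the xs and ckbk prefixes are declared as in the code above.

```xml
<!-- Hypothetical helper: right-pad $input with spaces up to $width -->
<xsl:function name="ckbk:pad-right" as="xs:string">
  <xsl:param name="input" as="xs:string"/>
  <xsl:param name="width" as="xs:integer"/>
  <!-- If $input is already wide enough, the range "1 to $count" inside
       ckbk:dup is empty and no padding is added -->
  <xsl:sequence select="concat($input,
                          ckbk:dup(' ', $width - string-length($input)))"/>
</xsl:function>
```

With this sketch, ckbk:pad-right('abc', 6) yields 'abc' followed by three spaces.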

Discussion

XSLT 1.0

The most obvious way to duplicate a string $count times is to figure out a way to concatenate the string to itself $count-1 times. This can be done recursively by the following code, but this code will be expensive unless $count is small, so it is not recommended:

<xsl:template name="slow-dup">
     <xsl:param name="input"/>
     <xsl:param name="count" select="1"/>
     <xsl:param name="work" select="$input"/>
     <xsl:choose>
          <xsl:when test="not($count) or not($input)"/>
          <xsl:when test="$count=1">
               <xsl:value-of select="$work"/>
          </xsl:when>
          <xsl:otherwise>
               <xsl:call-template name="slow-dup">
                    <xsl:with-param name="input" select="$input"/>
                    <xsl:with-param name="count" select="$count - 1"/>
                    <xsl:with-param name="work"
                         select="concat($work,$input)"/>
               </xsl:call-template>               
          </xsl:otherwise>
     </xsl:choose>
</xsl:template>

A better approach is shown in the “Solution” section. The solution limits the number of recursive calls and concatenation to the order of log2($count) by repeatedly doubling the input and halving the count as long as count is greater than 1. The slow-dup implementation is awkward since it requires an artificial work parameter to keep track of the original input. It may also result in stack growth due to recursion of $count-1 and requires $count-1 calls to concat(). Contrast this to dup that limits stack growth to floor(log2($count)) and requires only ceiling(log2($count)) calls to concat().
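
It may help to trace the doubling by hand. The following call uses sample values of my own choosing, and the comment walks through what the dup template from the Solution does with them.

```xml
<!-- Trace of dup with input 'ab' and count 5:
     call 1: count is 5 (odd),  emit 'ab', recurse with 'abab' and 2
     call 2: count is 2 (even), emit nothing, recurse with 'abababab' and 1
     call 3: count is 1,        emit 'abababab'
     output: 'ab' followed by 'abababab', that is, five copies of 'ab' -->
<xsl:call-template name="dup">
  <xsl:with-param name="input" select="'ab'"/>
  <xsl:with-param name="count" select="5"/>
</xsl:call-template>
```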

Tip

The slow-dup technique has the redeeming quality that it can also be used to duplicate structure in addition to strings if we replace xsl:value-of with xsl:copy-of. The faster dup has no advantage in this case because the copies are passed around as parameters, which is expensive.

Another solution based on, but not identical to, code from EXSLT str:padding is the following:

<xsl:template name="dup">
  <xsl:param name="input"/>
  <xsl:param name="count" select="1"/>
  <xsl:choose>
    <xsl:when test="not($count) or not($input)" />
    <xsl:otherwise>
      <xsl:variable name="string" 
                      select="concat($input, $input, $input, $input, 
                                     $input, $input, $input, $input,
                                     $input, $input)"/>
      <xsl:choose>
        <xsl:when test="string-length($string) >= 
                         $count * string-length($input)">
          <xsl:value-of select="substring($string, 1, 
                              $count * string-length($input))" />
        </xsl:when>
        <xsl:otherwise>
          <xsl:call-template name="dup">
            <xsl:with-param name="input" select="$string" />
            <xsl:with-param name="count" select="$count div 10" />
          </xsl:call-template>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>

This implementation makes ten copies of the input. If this approach accomplishes more than is required, it trims the result to the required size. Otherwise, it applies the template recursively. This solution is slower because it will often do more concatenations than necessary and it uses substring(), which may be slow on some XSLT implementations. See Recipe 2.7 for an explanation. It does have an advantage for processors that do not optimize tail recursion since it reduces the number of recursive calls significantly.

2.6. Reversing a String

Problem

You need to reverse the characters of a string.

Solution

XSLT 1.0

This template reverses $input in a subtle yet effective way:

<xsl:template name="reverse">
     <xsl:param name="input"/>
     <xsl:variable name="len" select="string-length($input)"/>
     <xsl:choose>
          <!-- Strings of length less than 2 are trivial to reverse -->
          <xsl:when test="$len &lt; 2">
               <xsl:value-of select="$input"/>
          </xsl:when>
          <!-- Strings of length 2 are also trivial to reverse -->
          <xsl:when test="$len = 2">
               <xsl:value-of select="substring($input,2,1)"/>
               <xsl:value-of select="substring($input,1,1)"/>
          </xsl:when>
          <xsl:otherwise>
               <!-- Swap the recursive application of this template to 
               the first half and second half of input -->
               <xsl:variable name="mid" select="floor($len div 2)"/>
               <xsl:call-template name="reverse">
                    <xsl:with-param name="input"
                         select="substring($input,$mid+1,$mid+1)"/>
               </xsl:call-template>
               <xsl:call-template name="reverse">
                    <xsl:with-param name="input"
                         select="substring($input,1,$mid)"/>
               </xsl:call-template>
          </xsl:otherwise>
     </xsl:choose>
</xsl:template>

XSLT 2.0

Reversing is trivial in 2.0.

<xsl:function name="ckbk:reverse">
    <xsl:param name="input" as="xs:string"/>
    <xsl:sequence select="codepoints-to-string(
                           reverse(string-to-codepoints($input)))"/>
  </xsl:function>

Discussion

XSLT 1.0

The algorithm shown in the solution is not the most obvious, but it is efficient. In fact, this algorithm successfully reverses even very large strings, whereas other more obvious algorithms either take too long or fail with a stack overflow. The basic idea behind this algorithm is to swap the first half of the string with the second half and to keep applying the algorithm to these halves recursively until you are left with strings of length two or less, at which point the reverse operation is trivial. The following example illustrates how this algorithm works. At each step, I placed a + where the string was split and concatenated.

reverse("abcdef") (input)
reverse("def") + reverse("abc")
reverse("ef") + "d" + reverse("bc") + "a"
"f" + "e" + "d" + "c" + "b" + "a"
fedcba (result)

Considering more obvious XSLT implementations of reverse is instructive because they provide lessons in how and how not to implement recursive solutions in other contexts.

One of the worst algorithms is probably the one that many would think of on their first try. The idea is to swap the first and last character of the string, continue to the second and next to last, and so on until you reach the middle, at which point you are done. A C programmer might come up with this solution, since it is a perfectly efficient iterative solution in a language like C in which you can read and write individual characters of the string randomly and iteration rather than recursion is the norm. However, in XSLT you must implement this algorithm, shown in Example 2-1, in a recursive fashion, and you do not have the luxury of manipulating variables in place.

Example 2-1. A very poor implementation of reverse

<xsl:template name="reverse">   
     <xsl:param name="input"/>
     <xsl:variable name="len" select="string-length($input)"/>
     <xsl:choose>
          <!-- Strings of length less than 2 are trivial to reverse -->
          <xsl:when test="$len &lt; 2">
               <xsl:value-of select="$input"/>
          </xsl:when>
          <!-- Strings of length 2 are also trivial to reverse -->
          <xsl:when test="$len = 2">
               <xsl:value-of select="substring($input,2,1)"/>
               <xsl:value-of select="substring($input,1,1)"/>
          </xsl:when>
          <xsl:otherwise>
               <!-- Concatenate the last + reverse(middle) + first -->
               <xsl:value-of select="substring($input,$len,1)"/>
               <xsl:call-template name="reverse">
                    <xsl:with-param name="input"
                         select="substring($input,2,$len - 2)"/> 
               </xsl:call-template>
               <xsl:value-of select="substring($input,1,1)"/>
          </xsl:otherwise>
     </xsl:choose>
</xsl:template>

A major problem with this solution makes it useless for all but very short strings. The problem is that the solution is not tail recursive (see the Tail Recursion sidebar for an explanation of tail recursion). Many XSLT processors (such as Saxon) optimize for tail recursion, so you are advised to structure your code to benefit from this significant optimization. Example 2-2 makes this version of reverse tail recursive by moving only the last character in the string to the front on each recursive call. This puts the recursive call at the end and thus subject to the optimization.

Example 2-2. An inefficient tail recursive implementation

<xsl:template name="reverse">   
     <xsl:param name="input"/>
     <xsl:variable name="len" select="string-length($input)"/>
     <xsl:choose>
          <!-- Strings of length less than 2 are trivial to reverse -->
          <xsl:when test="$len &lt; 2">
               <xsl:value-of select="$input"/>
          </xsl:when>
          <!-- Strings of length 2 are also trivial to reverse -->
          <xsl:when test="$len = 2">
               <xsl:value-of select="substring($input,2,1)"/>
               <xsl:value-of select="substring($input,1,1)"/>
          </xsl:when>
          <!-- Concatenate the last + reverse(rest) -->
          <xsl:otherwise>
            <xsl:value-of select="substring($input,$len,1)"/>
              <xsl:call-template name="reverse">
               <xsl:with-param name="input" select="substring($input,1,$len - 1)"/> 
              </xsl:call-template>
          </xsl:otherwise>
     </xsl:choose>
</xsl:template>

This change prevents reverse from overflowing the stack, but it is still inefficient for large strings. First, notice that each step results in the movement of only a single character. Second, each recursive call must process a string that is just one character shorter than the current string. For very large strings, this call will potentially overstress the memory management subsystem of the XSLT implementation. In editing this recipe, Jeni Tennison pointed out that another method of making the version tail recursive would pass the remaining (reverse) string and $len as parameters to the template. This, in general, is a good strategy for achieving tail recursion. In this particular case, it improved matters but did not perform as well as the implementation in the Solution section.
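A minimal sketch of this parameter-passing strategy might look like the following. The template name reverse-acc and the parameter name result are hypothetical, and the sketch omits $len for brevity, so treat it as an illustration of the general technique rather than the exact version that was timed.

```xslt
<xsl:stylesheet version="1.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Hypothetical accumulator-style reverse: $result holds the
       characters reversed so far -->
  <xsl:template name="reverse-acc">
    <xsl:param name="input"/>
    <xsl:param name="result" select="''"/>
    <xsl:choose>
      <!-- Nothing left to consume, so the accumulator is the answer -->
      <xsl:when test="not($input)">
        <xsl:value-of select="$result"/>
      </xsl:when>
      <xsl:otherwise>
        <!-- Move the first character to the front of the accumulator;
             the recursive call is the last instruction executed -->
        <xsl:call-template name="reverse-acc">
          <xsl:with-param name="input" select="substring($input, 2)"/>
          <xsl:with-param name="result"
               select="concat(substring($input, 1, 1), $result)"/>
        </xsl:call-template>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

  <!-- Driver: reverses the sample string 'abc', producing 'cba' -->
  <xsl:template match="/">
    <xsl:call-template name="reverse-acc">
      <xsl:with-param name="input" select="'abc'"/>
    </xsl:call-template>
  </xsl:template>
</xsl:stylesheet>
```

Because no output is produced until the input is exhausted, nothing remains to be done after the recursive call returns, which is what allows a processor to turn the recursion into a loop.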

An important goal in all recursive implementations is to try to structure the algorithm so that each recursive call sets up a subproblem that is no more than half as large as the current problem. This setup causes the recursion to “bottom out” more quickly. Following this advice results in the solution to reverse, shown in Example 2-3.

Example 2-3. An efficient (but not ideal) implementation

<xsl:template name="reverse">
     <xsl:param name="input"/>
   
     <xsl:variable name="len" select="string-length($input)"/>
     <xsl:choose>
          <xsl:when test="$len &lt; 2">
               <xsl:value-of select="$input"/>
          </xsl:when>
          <xsl:otherwise>
               <xsl:variable name="mid" select="floor($len div 2)"/>
               <xsl:call-template name="reverse">
                    <xsl:with-param name="input"
                         select="substring($input,$mid+1,$mid+1)"/>
               </xsl:call-template>
               <xsl:call-template name="reverse">
                    <xsl:with-param name="input"
                         select="substring($input,1,$mid)"/>
               </xsl:call-template>
          </xsl:otherwise>
     </xsl:choose>
</xsl:template>

This solution is the first one I came up with, and it works well even on large strings (1,000 characters or more). It has the added benefit of being shorter than the implementation shown in the “Solution” section. The only difference is that this implementation considers only strings of length zero or one as trivial. The slightly faster implementation cuts the number of recursive calls in half by also trivially dealing with strings of length two.

All the implementations shown here actually perform the same number of concatenations, and I do not believe there is any way around this without leaving the confines of XSLT. However, my testing shows that on a string of length 1,000, the best solution is approximately 5 times faster than the worst. The best and second-best solutions differ by only a factor of 1.3.

Tail Recursion

A recursive call is tail recursive if, when the call returns, the returned value is immediately returned from the function. The term “tail” refers to the fact that the recursive call comes at the end. Tail recursion is important because it can be implemented more efficiently than general recursion. A general recursive call must establish a new stack frame to store local variables and other bookkeeping items. Thus, a general recursive implementation can quickly exhaust the stack space on large inputs. However, tail-recursive implementations can be transformed internally into iterative solutions by an XSLT processor capable of recognizing tail recursion.

XSLT 2.0

The XSLT 1.0 solution manipulates the string as substrings because there is no way to get to the Unicode character level. The 2.0 solution uses the functions string-to-codepoints and codepoints-to-string, which is probably faster in most 2.0 implementations because internally strings are just arrays of Unicode integer values.
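For illustration, a codepoint-based reverse could be sketched as follows. The ckbk namespace URI used here is a placeholder chosen for this example, not necessarily the one used elsewhere in the book.

```xslt
<xsl:stylesheet version="2.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     xmlns:xs="http://www.w3.org/2001/XMLSchema"
     xmlns:ckbk="http://example.com/ckbk"
     exclude-result-prefixes="xs ckbk">
  <xsl:output method="text"/>

  <!-- Convert to a sequence of codepoints, reverse the sequence,
       and convert back to a string -->
  <xsl:function name="ckbk:reverse" as="xs:string">
    <xsl:param name="input" as="xs:string"/>
    <xsl:sequence
         select="codepoints-to-string(reverse(string-to-codepoints($input)))"/>
  </xsl:function>

  <!-- Driver: reverses the sample string 'abc', producing 'cba' -->
  <xsl:template match="/">
    <xsl:value-of select="ckbk:reverse('abc')"/>
  </xsl:template>
</xsl:stylesheet>
```

No recursion is needed here because reverse() is a standard XPath 2.0 function that operates on any sequence.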

2.7. Replacing Text

Problem

You want to replace all occurrences of a substring within a target string with another string.

Solution

XSLT 1.0

The following recursive template replaces all occurrences of a search string with a replacement string:

<xsl:template name="search-and-replace">
     <xsl:param name="input"/>
     <xsl:param name="search-string"/>
     <xsl:param name="replace-string"/>
     <xsl:choose>
          <!-- See if the input contains the search string -->
          <xsl:when test="$search-string and 
                           contains($input,$search-string)">
          <!-- If so, then concatenate the substring before the search
          string to the replacement string and to the result of
          recursively applying this template to the remaining substring.
          -->
               <xsl:value-of 
                    select="substring-before($input,$search-string)"/>
               <xsl:value-of select="$replace-string"/>
               <xsl:call-template name="search-and-replace">
                    <xsl:with-param name="input"
                    select="substring-after($input,$search-string)"/>
                    <xsl:with-param name="search-string" 
                    select="$search-string"/>
                    <xsl:with-param name="replace-string" 
                        select="$replace-string"/>
               </xsl:call-template>
          </xsl:when>
          <xsl:otherwise>
               <!-- There are no more occurrences of the search string so 
               just return the current input string -->
               <xsl:value-of select="$input"/>
          </xsl:otherwise>
     </xsl:choose>
</xsl:template>

If you want to replace only whole words, then you must ensure that the characters immediately before and after the search string are in the class of characters considered word delimiters. We chose the characters in the variable $punc plus whitespace to be word delimiters:

<xsl:template name="search-and-replace-whole-words-only">
  <xsl:param name="input"/>
  <xsl:param name="search-string"/>
  <xsl:param name="replace-string"/>
  <xsl:variable name="punc" 
    select="concat('.,;:()[  ]!?$@&amp;&quot;',&quot;&apos;&quot;)"/>
     <xsl:choose>
       <!-- See if the input contains the search string -->
       <xsl:when test="$search-string and contains($input,$search-string)">
       <!-- If so, then test that the before and after characters are word 
       delimiters. -->
         <xsl:variable name="before" 
          select="substring-before($input,$search-string)"/>
         <xsl:variable name="before-char" 
          select="substring(concat(' ',$before),string-length($before) +1, 1)"/>
         <xsl:variable name="after" 
          select="substring-after($input,$search-string)"/>
         <xsl:variable name="after-char" 
          select="substring($after,1,1)"/>
         <xsl:value-of select="$before"/>
         <xsl:choose>
          <xsl:when test="(not(normalize-space($before-char)) or 
                    contains($punc,$before-char)) and 
               (not(normalize-space($after-char)) or 
                    contains($punc,$after-char))"> 
            <xsl:value-of select="$replace-string"/>
          </xsl:when>
          <xsl:otherwise>
            <xsl:value-of select="$search-string"/>
          </xsl:otherwise>
         </xsl:choose>
         <xsl:call-template name="search-and-replace-whole-words-only">
          <xsl:with-param name="input" select="$after"/>
          <xsl:with-param name="search-string" select="$search-string"/>
          <xsl:with-param name="replace-string" select="$replace-string"/>
         </xsl:call-template>
       </xsl:when>
    <xsl:otherwise>
       <!-- There are no more occurrences of the search string so 
          just return the current input string -->
       <xsl:value-of select="$input"/>
     </xsl:otherwise>
  </xsl:choose>
</xsl:template>

Tip

Notice how we construct $punc using concat() so it contains both single and double quotes. It would be impossible to do this in any other way because XPath and XSLT, unlike C, do not allow special characters to be escaped with a backslash (\). XPath 2.0 allows the quotes to be escaped by doubling them up.
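As a small sketch of the XPath 2.0 doubling rule, the variable could be written without concat() as shown here. The driver template exists only to demonstrate that both quote characters end up in the string.

```xslt
<xsl:stylesheet version="2.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- The doubled apostrophe at the end of the literal stands for a
       single apostrophe. The double quote is still written as &quot;
       because it delimits the XML attribute. -->
  <xsl:variable name="punc" select="'.,;:()[ ]!?$@&amp;&quot;'''"/>

  <xsl:template match="/">
    <!-- Does $punc contain an apostrophe? -->
    <xsl:value-of select="contains($punc, '''')"/>
    <xsl:text> </xsl:text>
    <!-- Does $punc contain a double quote? -->
    <xsl:value-of select="contains($punc, '&quot;')"/>
  </xsl:template>
</xsl:stylesheet>
```

Run with an XSLT 2.0 processor, both tests should report true.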

XSLT 2.0

The functionality of search-and-replace is built into the 2.0 function replace(). The functionality of search-and-replace-whole-words-only can easily be emulated using a regex that matches words:

<xsl:function name="ckbk:search-and-replace-whole-words-only">
    <xsl:param name="input" as="xs:string"/>
    <xsl:param name="search-string" as="xs:string"/>
    <xsl:param name="replace-string" as="xs:string"/>
    <xsl:sequence select="replace($input, concat('(^|\W)',$search-string,'(\W|$)'), 
    concat('$1',$replace-string,'$2'))"/>
</xsl:function>

Warning

Many regex engines use \b to match word boundaries, but XPath 2.0 does not support this.

Here we build up a regex by surrounding $search-string with (^|\W) and (\W|$) where \W means “not \w” or “not a word character.” The ^ and $ handle the case when the word appears at the beginning or end of the string. We also need to put the matched \W character back into the text using references to the captured groups $1 and $2.

The function replace() is more powerful than the preceding XSLT 1.0 solutions because it uses regular expressions and can remember parts of the match and use them in the replacement via the variables $1, $2, etc. We explore replace() further in Recipe 2.11.
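To illustrate how remembered parts of a match can be reused in the replacement, here is a small sketch that reorders the fields of a date. The sample date and the output format are arbitrary choices made for this example.

```xslt
<xsl:stylesheet version="2.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- Each parenthesized group captures one field of YYYY-MM-DD;
         $1, $2, and $3 refer to the groups in order of appearance,
         so the result is 05/03/1964 -->
    <xsl:value-of
         select="replace('1964-05-03', '(\d+)-(\d+)-(\d+)', '$2/$3/$1')"/>
  </xsl:template>
</xsl:stylesheet>
```

Nothing comparable is possible with the XSLT 1.0 templates, which can only substitute one fixed string for another.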

Discussion

Searching and replacing is a common text-processing task. The solution shown here is the most straightforward implementation of search and replace written purely in terms of XSLT. When considering the performance of this solution, the reader might think it is inefficient. For each occurrence of the search string, the code calls contains(), substring-before(), and substring-after(). Presumably, each function rescans the input string for the search string, so this approach seems to perform two more searches than necessary. After some thought, you might come up with one of the seemingly more efficient solutions shown in Example 2-4 and Example 2-5.

Example 2-4. Using a temp string in a failed attempt to improve search and replace

<xsl:template name="search-and-replace">
     <xsl:param name="input"/>
     <xsl:param name="search-string"/>
     <xsl:param name="replace-string"/>
     <!-- Find the substring before the search string and store it in a 
     variable -->
     <xsl:variable name="temp" 
          select="substring-before($input,$search-string)"/>
     <xsl:choose>
          <!-- If $temp is not empty or the input starts with the search 
          string then we know we have to do a replace. This eliminates the 
          need to use contains(). -->
          <xsl:when test="$temp or starts-with($input,$search-string)">
               <xsl:value-of select="concat($temp,$replace-string)"/>
               <xsl:call-template name="search-and-replace">
                    <!-- We eliminate the need to call substring-after
                    by using the length of temp and the search string 
                    to extract the remaining string in the recursive 
                    call. -->
                    <xsl:with-param name="input"
                    select="substring($input,string-length($temp)+
                         string-length($search-string)+1)"/>
                    <xsl:with-param name="search-string" 
                         select="$search-string"/>
                    <xsl:with-param name="replace-string" 
                         select="$replace-string"/>
               </xsl:call-template>
          </xsl:when>
          <xsl:otherwise>
               <xsl:value-of select="$input"/>
          </xsl:otherwise>
     </xsl:choose>
</xsl:template>

Example 2-5. Using a temp integer in a failed attempt to improve search and replace

<xsl:template name="search-and-replace">
     <xsl:param name="input"/>
     <xsl:param name="search-string"/>
     <xsl:param name="replace-string"/>
     <!-- Find the length of the sub-string before the search string and 
     store it in a variable -->
     <xsl:variable name="temp" 
     select="string-length(substring-before($input,$search-string))"/>
     <xsl:choose>
     <!-- If $temp is not 0 or the input starts with the search 
     string then we know we have to do a replace. This eliminates the 
     need to use contains(). -->
          <xsl:when test="$temp or starts-with($input,$search-string)">
               <xsl:value-of select="substring($input,1,$temp)"/>
               <xsl:value-of select="$replace-string"/>
                    <!-- We eliminate the need to call substring-after
                    by using temp and the length of the search string 
                    to extract the remaining string in the recursive 
                    call. -->
               <xsl:call-template name="search-and-replace">
                    <xsl:with-param name="input"
                         select="substring($input,$temp + 
                              string-length($search-string)+1)"/>
                    <xsl:with-param name="search-string"
                         select="$search-string"/>
                    <xsl:with-param name="replace-string"
                         select="$replace-string"/>
               </xsl:call-template>
          </xsl:when>
          <xsl:otherwise>
               <xsl:value-of select="$input"/>
          </xsl:otherwise>
     </xsl:choose>
</xsl:template>

The idea behind both attempts is that if you remember the spot where substring-before() finds a match, then you can use this information to eliminate the need to call contains() and substring-after(). You are forced to introduce a call to starts-with() to disambiguate the case in which substring-before() returns the empty string; this can happen when the search string is absent or when the input string starts with the search string. However, starts-with() is presumably faster than contains() because it doesn’t need to scan past the length of the search string. The idea that distinguishes the second attempt from the first is the thought that storing an integer offset might be more efficient than storing the entire substring.

Alas, these supposed optimizations fail to produce any improvement when using the Xalan XSLT implementation and actually produce timing results that are an order of magnitude slower on some inputs when using either Saxon or XT! My first hypothesis regarding this unintuitive result was that the use of the variable $temp in the recursive call interfered with Saxon’s tail-recursion optimization (see Recipe 2.6). However, by experimenting with large inputs that have many matches, I failed to cause a stack overflow. My next suspicion was that for some reason, XSLT substring() is actually slower than the substring-before() and substring-after() calls. Michael Kay, the author of Saxon, indicated that Saxon’s implementation of substring() was slow due to the complicated rules that XSLT substring must implement, including floating-point rounding of arguments, handling special cases where the start or end point are outside the bounds of the string, and issues involving Unicode surrogate pairs. In contrast, substring-before() and substring-after() translate more directly into Java.

The real lesson here is that optimization is tricky business, especially in XSLT where there can be a wide disparity between implementations and where new versions continually apply new optimizations. Unless you are prepared to profile frequently, it is best to stick with simple solutions. An added advantage of obvious solutions is that they are likely to behave consistently across different XSLT implementations.

2.8. Converting Case

Problem

You want to convert an uppercase string to lowercase or vice versa.

Solution

XSLT 1.0

Use the XPath translate() function. This code, for example, converts from upper- to lowercase:

translate($input,'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz')

This example converts from lower- to uppercase:

translate($input, 'abcdefghijklmnopqrstuvwxyz','ABCDEFGHIJKLMNOPQRSTUVWXYZ')

XSLT 2.0

Use the XPath 2.0 functions upper-case() and lower-case():

upper-case($input)
lower-case($input)

Discussion

This recipe is, of course, trivial. However, I include it as an opportunity to discuss the XSLT 1.0 solution’s shortcomings. Case conversion is trivial as long as your text is restricted to a single locale. In English, you rarely, if ever, need to deal with special characters containing accents or other complicated case conversions in which a single character must convert to two characters. The most common example is German, in which the lowercase ß (eszett) is converted to an uppercase SS. Many modern programming languages provide case-conversion functions that are sensitive to locale, but XSLT does not support this concept directly. This is unfortunate, considering that XSLT has other features supporting internationalization.

A slight improvement can be made by defining general XML entities for each type conversion, as shown in the following example:

<?xml version="1.0" encoding="UTF-8"?>   
<!DOCTYPE stylesheet [
     <!ENTITY UPPERCASE "ABCDEFGHIJKLMNOPQRSTUVWXYZ">
     <!ENTITY LOWERCASE "abcdefghijklmnopqrstuvwxyz">
     <!ENTITY UPPER_TO_LOWER " '&UPPERCASE;' , '&LOWERCASE;' ">
     <!ENTITY LOWER_TO_UPPER " '&LOWERCASE;' , '&UPPERCASE;' ">
]>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
     <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
   
     <xsl:template match="/">
     <xsl:variable name="test"
          select=" 'The rain in Spain falls mainly on the plain' "/>
     <output>
          <lowercase>
               <xsl:value-of
                    select="translate($test,&UPPER_TO_LOWER;)"/>
          </lowercase>
          <uppercase>
               <xsl:value-of
                    select="translate($test,&LOWER_TO_UPPER;)"/>
          </uppercase>
     </output>
     </xsl:template>
   
</xsl:stylesheet>

These entity definitions accomplish three things. First, they make it easier to port the stylesheet to another locale because only the definitions of the entities UPPERCASE and LOWERCASE need to be changed. Second, they compact the code by eliminating the need to list all letters of the alphabet twice. Third, they make the intent of the translate call obvious to someone inspecting the code. Some purists might object to the macro-izing away of translate()’s third parameter, but I like the way it makes the code read. If you prefer to err on the pure side, then use translate($test, '&UPPERCASE;', '&LOWERCASE;'). The quotes are required because the entities expand to bare text, not to XPath string literals.

I have not seen entities used very often in other XSLT books; however, I believe the technique has merit. In fact, one benefit of XSLT being written in XML syntax is that you can exploit all features of XML, and entity definition is certainly a useful one. If you intend to use this technique and plan to write more than a few stylesheets, then consider placing common entity definitions in an external file and include them as shown in Example 2-6. You can also store these values in global variables in an external stylesheet and import them as needed. This alternative is preferred by many XSLT veterans.

Example 2-6. standard.ent

<!ENTITY UPPERCASE "ABCDEFGHIJKLMNOPQRSTUVWXYZ">   
<!ENTITY LOWERCASE "abcdefghijklmnopqrstuvwxyz">
<!ENTITY UPPER_TO_LOWER " '&UPPERCASE;' , '&LOWERCASE;' ">
<!ENTITY LOWER_TO_UPPER " '&LOWERCASE;' , '&UPPERCASE;' ">
<!-- others... -->

Then use a parameter entity defined in terms of the external standard.ent file, as shown in Example 2-7.

Example 2-7. A stylesheet using standard.ent

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE stylesheet [
     <!ENTITY % standard SYSTEM "standard.ent">
     %standard;
]>
<xsl:stylesheet version="1.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<!-- ... -->
</xsl:stylesheet>
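The global-variable alternative to entities could look roughly like the following sketch. The variable names simply mirror the entity names and are otherwise arbitrary. In practice, the two xsl:variable declarations would sit in a shared stylesheet (say, a hypothetical case.xsl) that other stylesheets pull in with xsl:import; they are shown inline here so the example stands on its own.

```xslt
<xsl:stylesheet version="1.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>

  <!-- Candidates for a shared, imported stylesheet -->
  <xsl:variable name="UPPERCASE" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'"/>
  <xsl:variable name="LOWERCASE" select="'abcdefghijklmnopqrstuvwxyz'"/>

  <xsl:template match="/">
    <xsl:variable name="test"
         select="'The rain in Spain falls mainly on the plain'"/>
    <output>
      <lowercase>
        <xsl:value-of select="translate($test, $UPPERCASE, $LOWERCASE)"/>
      </lowercase>
      <uppercase>
        <xsl:value-of select="translate($test, $LOWERCASE, $UPPERCASE)"/>
      </uppercase>
    </output>
  </xsl:template>
</xsl:stylesheet>
```

Unlike the entity approach, this version needs no DOCTYPE declaration, and the variables are visible to ordinary XPath expressions.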

Steve Ball’s implementation of case conversion works in virtually all cases by including all the most common Unicode characters in the upper- and lowercase strings and taking special care to handle the German ß correctly.

XSLT 2.0

The new XPath 2.0 functions upper-case() and lower-case() resolve most of the issues with case conversion that can occur in non-English alphabets. The one exception is Unicode locale-sensitive conversions. It is best not to use these functions for case-insensitive comparison. Rather, use compare() with a collation that ignores case. Saxon 8.x users can find information about collations at http://www.saxonica.com/documentation/conformance/collation-uri.html and http://www.saxonica.com/documentation/extensions/instructions/collation.html.

2.9. Tokenizing a String

Problem

You want to break a string into a list of tokens based on the occurrence of one or more delimiter characters.

Solution

XSLT 1.0

Jeni Tennison implemented this solution (but the comments are my doing). The tokenizer returns each token as a token element whose text content is the token. It also defaults to character-level tokenization if the delimiter string is empty:

<xsl:template name="tokenize">
  <xsl:param name="string" select="''" />
  <xsl:param name="delimiters" select="' &#x9;&#xA;'" />
  <xsl:choose>
     <!-- Nothing to do if empty string -->
    <xsl:when test="not($string)" />
   
     <!-- No delimiters signals character level tokenization. -->
    <xsl:when test="not($delimiters)">
      <xsl:call-template name="_tokenize-characters">
        <xsl:with-param name="string" select="$string" />
      </xsl:call-template>
    </xsl:when>
    <xsl:otherwise>
      <xsl:call-template name="_tokenize-delimiters">
        <xsl:with-param name="string" select="$string" />
        <xsl:with-param name="delimiters" select="$delimiters" />
      </xsl:call-template>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>
   
<xsl:template name="_tokenize-characters">
  <xsl:param name="string" />
  <xsl:if test="$string">
    <token><xsl:value-of select="substring($string, 1, 1)" /></token>
    <xsl:call-template name="_tokenize-characters">
      <xsl:with-param name="string" select="substring($string, 2)" />
    </xsl:call-template>
  </xsl:if>
</xsl:template>
   
<xsl:template name="_tokenize-delimiters">
  <xsl:param name="string" />
  <xsl:param name="delimiters" />
  <xsl:param name="last-delimit"/> 
  <!-- Extract a delimiter -->
  <xsl:variable name="delimiter" select="substring($delimiters, 1, 1)" />
  <xsl:choose>
     <!-- If the delimiter is empty we have a token -->
    <xsl:when test="not($delimiter)">
      <token><xsl:value-of select="$string"/></token>
    </xsl:when>
     <!-- If the string contains at least one delimiter we must split it -->
    <xsl:when test="contains($string, $delimiter)">
      <!-- If it starts with the delimiter we don't need to handle the -->
       <!-- before part -->
      <xsl:if test="not(starts-with($string, $delimiter))">
         <!-- Handle the part that comes before the current delimiter -->
         <!-- with the next delimiter. If there is no next the first test -->
         <!-- in this template will detect the token -->
        <xsl:call-template name="_tokenize-delimiters">
          <xsl:with-param name="string" 
                          select="substring-before($string, $delimiter)" />
          <xsl:with-param name="delimiters" 
                          select="substring($delimiters, 2)" />
        </xsl:call-template>
      </xsl:if>
       <!-- Handle the part that comes after the delimiter using the -->
       <!-- current delimiter -->
      <xsl:call-template name="_tokenize-delimiters">
        <xsl:with-param name="string" 
                        select="substring-after($string, $delimiter)" />
        <xsl:with-param name="delimiters" select="$delimiters" />
      </xsl:call-template>
    </xsl:when>
    <xsl:otherwise>
       <!-- No occurrences of current delimiter so move on to next -->
      <xsl:call-template name="_tokenize-delimiters">
        <xsl:with-param name="string" 
                        select="$string" />
        <xsl:with-param name="delimiters" 
                        select="substring($delimiters, 2)" />
      </xsl:call-template>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>
   
</xsl:stylesheet>

XSLT 2.0

Use the XPath 2.0 tokenize() function covered in Recipe 2.11.
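As a rough sketch, the following stylesheet wraps each token in a token element, mirroring the output of the XSLT 1.0 template. The sample string and the tokens wrapper element are assumptions made for this example.

```xslt
<xsl:stylesheet version="2.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/">
    <tokens>
      <!-- normalize-space() trims leading and trailing whitespace,
           which would otherwise produce empty tokens; the regex \s+
           then splits on runs of whitespace -->
      <xsl:for-each
           select="tokenize(normalize-space('  a quick  brown fox '), '\s+')">
        <token><xsl:value-of select="."/></token>
      </xsl:for-each>
    </tokens>
  </xsl:template>
</xsl:stylesheet>
```

This should yield four token elements containing a, quick, brown, and fox.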

Discussion

Tokenization is a common string-processing task. In languages with powerful regular-expression engines, tokenization is trivial. In this area, languages such as Perl, Python, JavaScript, and Tcl currently outshine XSLT. However, this recipe shows that XSLT can deal with tokenization if you must stay within the bounds of pure XSLT. If you are willing to use extensions, then you can defer to another language for low-level string manipulations such as tokenization.

If you use the XSLT approach and your processor does not optimize for tail recursion, then you may want to use a divide-and-conquer algorithm for character tokenization:

<xsl:template name="_tokenize-characters">
  <xsl:param name="string" />
  <xsl:param name="len" select="string-length($string)"/>
  <xsl:choose>
    <!-- Nothing to do for an empty string; this also stops the recursion -->
    <xsl:when test="$len &lt; 1"/>
    <xsl:when test="$len = 1">
      <token><xsl:value-of select="$string"/></token>
    </xsl:when>
    <xsl:otherwise>
      <!-- Tokenize the first half -->
      <xsl:call-template name="_tokenize-characters">
        <xsl:with-param name="string" 
                        select="substring($string, 1, floor($len div 2))" />
        <xsl:with-param name="len" select="floor($len div 2)"/>
      </xsl:call-template>
      <!-- Tokenize the second half -->
      <xsl:call-template name="_tokenize-characters">
        <xsl:with-param name="string" 
                        select="substring($string, floor($len div 2) + 1)" />
        <xsl:with-param name="len" select="ceiling($len div 2)"/>
      </xsl:call-template>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>

2.10. Making Do Without Regular Expressions

Problem

You would like to perform regular-expression-like operations in XSLT 1.0, but you don’t want to resort to nonstandard extensions.

Solution

Several common regular-expression-like matches can be emulated in native XPath 1.0. Table 2-1 lists regular-expression matches in Perl syntax along with their XSLT/XPath equivalents. The single character “C” is a proxy for any user-specified single character, and the string “abc” is a proxy for any user-supplied string of nonzero length.

Table 2-1. Regular-expression matches

Perl match	XSLT/XPath equivalent
$string =~ /^C*$/	translate($string,'C','') = ''
$string =~ /^C+$/	$string and translate($string,'C', '') = ''
$string =~ /C+/	contains($string,'C')
$string =~ /C{2,4}/	contains($string,'CC') and not(contains($string,'CCCCC'))
$string =~ /^abc/	starts-with($string,'abc')
$string =~ /abc$/	substring($string, string-length($string) - string-length('abc') + 1) = 'abc'
$string =~ /abc/	contains($string,'abc')
$string =~ /^[^C]*$/	translate($string,'C','') = $string
$string =~ /^\s+$/	not(normalize-space($string))
$string =~ /\s/	translate(normalize-space($string),' ','') != $string
$string =~ /^\S+$/	translate(normalize-space($string),' ','') = $string

Discussion

When it comes to brevity and power, nothing beats a good regular-expression engine. However, many simple matching operations can be emulated by more cumbersome yet effective XPath expressions. Many of these matches are facilitated by translate(), which removes extraneous characters so the match can be implemented as an equality test. Another useful application of translate is its ability to count the number of occurrences of a specific character or set of characters. For example, the following code counts the number of numeric characters in a string:

string-length(translate($string, 
          translate($string,'0123456789',''),''))

If it is unclear what this code does, refer to Recipe 2.3. Alternatively, you can write:

string-length($string) - 
string-length(translate($string,'0123456789',''))

This code trades a translate() call for an additional string-length() and a subtraction. It might be slightly faster.
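To make the two counting idioms concrete, here is a small worked sketch using the arbitrary sample string 'a1b22c', which contains three digits.

```xslt
<xsl:stylesheet version="1.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <xsl:variable name="string" select="'a1b22c'"/>
    <!-- The inner translate() strips the digits, leaving 'abc'; the
         outer translate() then strips those letters, leaving '122' -->
    <xsl:value-of select="string-length(translate($string,
                          translate($string,'0123456789',''),''))"/>
    <xsl:text> </xsl:text>
    <!-- Total length (6) minus the length without digits (3) -->
    <xsl:value-of select="string-length($string) -
                          string-length(translate($string,'0123456789',''))"/>
  </xsl:template>
</xsl:stylesheet>
```

Both expressions evaluate to 3 for this input.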

An important way in which these XPath expressions differ from their Perl counterparts is that in Perl, special variables are set as a side effect of matching. These variables allow powerful string-processing techniques that are way beyond the scope of XSLT. If anyone attempted to mate Perl and XSLT into a hybrid language, I would want to be one of the first alpha users!

The good news is that XPath 2.0 now supports regular expressions. I cover this welcome addition in Recipe 2.11, next.

2.11. Exploiting Regular Expressions

Problem

You have heard that regular expressions (regex) are a powerful new tool in XSLT 2.0, but you are unsure how to harness this power.

Solution

Matching text patterns

The most basic application of regex is matching text patterns. You can use matches() in a template pattern to extend XSLT’s matching capabilities into the text of a node:

<!-- -->

<!-- A date in the form May 3, 1964 -->  
<xsl:template match="birthday[matches(.,'^[A-Z][a-z]+\s[0-9]+,\s[0-9]+$')]">
   <!-- ... -->
</xsl:template>

<!-- A date in the form 1964-05-03 -->
<xsl:template match="birthday[matches(.,'^[0-9]+-[0-9]+-[0-9]+$')]">
   <!-- ... -->
</xsl:template>
 
<!-- A date in the form 3 May 1964 -->
<xsl:template match="birthday[matches(.,'^[0-9]+\s[A-Z][a-z]+\s[0-9]+$')]">
   <!-- ... -->
</xsl:template>

Alternatively, you can use matches in an xsl:if or xsl:choose instruction:

<xsl:choose>
   <xsl:when test="matches($date,'^[A-Z][a-z]+\s[0-9]+,\s[0-9]+$')">
   </xsl:when>
   <xsl:when test="matches($date,'^[0-9]+-[0-9]+-[0-9]+$')">
   </xsl:when>
   <xsl:when test="matches($date,'^[0-9]+\s[A-Z][a-z]+\s[0-9]+$')">
   </xsl:when>
</xsl:choose>

Tokenizing stylized text

Often one uses regex to split a string into tokens:

(: Break an ISO date (YYYY-MM-DD) into a sequence consisting of year, month, day :)
tokenize($date, '-') 

(: Break an ISO dateTime (YYYY-MM-DDThh:mm:ss) into a sequence consisting of year, month, day, hour, 
min, sec :)
tokenize($date, '-|T|:') 

(: Break a sentence into words :)
tokenize($text, '\W+')

Replacing and augmenting text

There are two ways to use the XPath replace() function.

The first is simply to replace patterns in a string with other text. Sometimes you will replace the pattern with the empty string ('') because you want to strip the text that matches the pattern:

(: Replace the day of the month in an ISO date with 01 :)
replace($date,'\d\d$','01')

(: Strip away all but the year in an ISO date :)
replace($date,'-\d\d-\d\d$','')

The second way you use replace is to insert text into the string where a pattern matches while leaving the matched part intact. It may seem counterintuitive that you can use a function called replace to perform an insertion; however, this is exactly the effect you can achieve by using back reference variables.

(: Insert a space after punctuation characters that are not followed by a space :)
replace($text, '([,;:])(\S)', '$1 $2')

Parsing text to convert to XML

More powerful than either tokenize() or replace() is the new XSLT 2.0 xsl:analyze-string instruction. This instruction allows one to go beyond textual substitution and build up XML content from text. See Chapter 6 for recipes using xsl:analyze-string.
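As a brief taste of xsl:analyze-string, the following sketch turns an ISO date into structured XML. The element names date, year, month, day, and unparsed are assumptions made for this example.

```xslt
<xsl:stylesheet version="2.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/">
    <xsl:variable name="date" select="'1964-05-03'"/>
    <!-- The regex attribute is an attribute value template, so curly
         braces would have to be doubled; this pattern avoids them -->
    <xsl:analyze-string select="$date" regex="^(\d+)-(\d+)-(\d+)$">
      <xsl:matching-substring>
        <date>
          <year><xsl:value-of select="regex-group(1)"/></year>
          <month><xsl:value-of select="regex-group(2)"/></month>
          <day><xsl:value-of select="regex-group(3)"/></day>
        </date>
      </xsl:matching-substring>
      <!-- Text that does not match the pattern is passed through
           in a marker element -->
      <xsl:non-matching-substring>
        <unparsed><xsl:value-of select="."/></unparsed>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>
</xsl:stylesheet>
```

Where replace() can only return another string, the matching-substring branch can construct arbitrary elements from the captured groups.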

Discussion

Regular expressions (or simply regex) are such a rich and powerful tool for text processing that one could write a whole book dedicated to them. In fact, someone did. Jeffrey E. F. Friedl’s book Mastering Regular Expressions (O’Reilly) is a classic on the topic, and I highly recommend it.

Regular expressions derive their power from pattern matching. Interestingly, pattern matching is also at the heart of XSLT’s power. Where XSLT is ideally suited to matching patterns in the structure of an XML document, regular expressions are optimized for matching patterns in ad hoc text. However, the pattern language of regular expressions is more intricate than the XPath expressions used in XSLT. This is unavoidable simply because ad hoc text lacks the uniform tree structure of XML.

The keys to mastering regular expressions are practice and judicious borrowing from example expressions designed by others. Besides Friedl’s book, one can find sample regex patterns in many of the books on Perl and online at RegExLib.com (http://regexlib.com/).

2.12. Using the EXSLT String Extensions

Problem

You have good reason to use extension functions for string processing, but you are concerned about portability.

Solution

You may find that your XSLT processor already implements string functions defined by the EXSLT community (http://www.exslt.org/). At the time of publication, these functions are:

node-set str:tokenize(string input, string delimiters?)

The str:tokenize function splits up a string and returns a node set of token elements, each containing one token from the string.

The first argument is the string to be tokenized. The second argument is a string consisting of a number of characters. Each character in this string is taken as a delimiting character. The string given by the first argument is split at any occurrence of any of these characters.

If the second argument is omitted, the default is the string &#x9;&#xA;&#xD;&#x20; (i.e., whitespace characters).

If the second argument is an empty string, the function returns a set of token elements, each of which holds a single character.
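
The following XSLT 1.0 sketch assumes a processor that implements str:tokenize (xsltproc, which is built on libxslt, is one such processor). The sample string is hypothetical. Note that, unlike the XPath 2.0 tokenize() function, the second argument is a set of delimiter characters and not a regular expression.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:str="http://exslt.org/strings"
    extension-element-prefixes="str">

  <xsl:output method="text"/>

  <!-- Runs against any input document; the input itself is ignored -->
  <xsl:template match="/">
    <!-- Split at every hyphen, letter T, or colon; each token comes back
         as a token element, so the result can be iterated like any node set -->
    <xsl:for-each select="str:tokenize('2005-12-24T18:30:00', '-T:')">
      <xsl:value-of select="."/>
      <xsl:text>&#xA;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```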

node-set str:replace(string, object search, object replace)

The str:replace function replaces any occurrences of search strings within a string with replacement nodes to create a node set.

The first argument gives the string within which strings are to be replaced.

The second argument is an object that specifies a search string list. If the second argument is a node set, then the search string list consists of the results of converting each node in the node set to a string with the string() function, listed in document order. If the second argument is not a node set, then the second argument is converted to a string with the string() function, and the search string list consists of this string only.

The third argument is an object that specifies a replacement node list. If the third argument is a node set, then the replacement node list consists of the nodes in the node set in document order. If the third argument is not a node set, then the replacement node list consists of a single text node whose string value is the same as the result of converting the third argument to a string with the string() function.

string str:padding(number, string?)

The str:padding function creates a padding string of a certain length.

The first argument gives the length of the padding string to be created.

The second argument gives the string used to create the padding. This string is repeated as many times as is necessary to create a string of the length specified by the first argument; if the string is more than a character long, it may have to be truncated to produce the required length. If no second argument is specified, it defaults to a space (" "). If the second argument is an empty string, str:padding returns an empty string.

string str:align(string, string, string?)

The str:align function aligns a string within another string.

The first argument gives the target string to be aligned. The second argument gives the padding string within which it will be aligned.

If the target string is shorter than the padding string, then a range of characters in the padding string are replaced with those in the target string. Which characters are replaced depends on the value of the third argument, which gives the type of alignment. It can be left, right, or center. If no third argument is given or if it is not one of these values, then it defaults to left alignment.

With left alignment, the range of characters replaced by the target string begins with the first character in the padding string. With right alignment, the range of characters replaced by the target string ends with the last character in the padding string. With center alignment, the range of characters replaced by the target string is in the middle of the padding string so that either the number of unreplaced characters on either side of the range is the same or there is one less on the left than on the right.

If the target string is longer than the padding string, then it is truncated to be the same length as the padding string and returned.
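
str:padding and str:align are natural partners: the first builds the field and the second places text within it. The sketch below assumes a processor that implements both functions; the 10-character field of dots and the target string abc are made up for illustration. The expected lines in the comments follow from the alignment rules described above.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:str="http://exslt.org/strings"
    extension-element-prefixes="str">

  <xsl:output method="text"/>

  <!-- Runs against any input document; the input itself is ignored -->
  <xsl:template match="/">
    <!-- A field of ten dots to align within -->
    <xsl:variable name="field" select="str:padding(10, '.')"/>

    <!-- Left alignment: abc....... -->
    <xsl:value-of select="str:align('abc', $field, 'left')"/>
    <xsl:text>&#xA;</xsl:text>

    <!-- Right alignment: .......abc -->
    <xsl:value-of select="str:align('abc', $field, 'right')"/>
    <xsl:text>&#xA;</xsl:text>

    <!-- Center alignment, one less on the left when the split is uneven:
         ...abc.... -->
    <xsl:value-of select="str:align('abc', $field, 'center')"/>
    <xsl:text>&#xA;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```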

string str:encode-uri(string)

The str:encode-uri function returns an encoded URI. The str:encode-uri function does not encode the following characters: ":", "/", ";", and "?".

URI encoding replaces each unsafe or reserved character with "%", immediately followed by two hexadecimal digits (0-9, A-F) giving the ISO Latin 1 code for that character.

string str:decode-uri(string)

The str:decode-uri function decodes a string that has been URI-encoded. See str:encode-uri for an explanation.

string str:concat(node-set)

The str:concat function takes a node set and returns the concatenation of the string values of the nodes in that set. If the node set is empty, it returns an empty string.

node-set str:split(string, string?)

The str:split function splits up a string and returns a node set of token elements, each containing one token from the string. The first argument is the string to be split. The second is a pattern string. The string given by the first argument is split at any occurrence of this pattern.

If the second argument is omitted, the default is the string &#x20; (i.e., a space).

If the second argument is an empty string, the function returns a set of token elements, each of which holds a single character.
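
The difference from str:tokenize is that the second argument is treated as one multi-character separator rather than as a set of delimiter characters. A minimal sketch, assuming a processor that implements str:split and using a made-up sample string:

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:str="http://exslt.org/strings"
    extension-element-prefixes="str">

  <xsl:output method="text"/>

  <!-- Runs against any input document; the input itself is ignored -->
  <xsl:template match="/">
    <!-- The whole string ' and ' is the separator, giving the tokens
         one, two, and three -->
    <xsl:for-each select="str:split('one and two and three', ' and ')">
      <xsl:value-of select="."/>
      <xsl:text>&#xA;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

Passing the same second argument to str:tokenize would instead split at every space and at every a, n, and d in the input.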

Discussion

Using the EXSLT string functions does not guarantee portability, since currently no XSLT implementation supports them all. In fact, according to the EXSLT web site, some functions have no current implementation. The EXSLT team makes up for this by providing native XSLT, JavaScript, and/or MSXML implementations whenever possible.

A good reason for using EXSLT is that the members of the EXSLT team are very active in the XSLT community and many implementations will probably support most of their extensions eventually. It is also possible that some of their work will be incorporated into a future standard XSLT release.
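
One way to hedge against uneven support is to test for an extension with the standard function-available() function and fall back to plain XSLT when it is missing. The sketch below is one possible arrangement, not a prescribed pattern; the choice of str:padding and the five-asterisk fallback are illustrative.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:str="http://exslt.org/strings"
    extension-element-prefixes="str">

  <xsl:output method="text"/>

  <!-- Runs against any input document; the input itself is ignored -->
  <xsl:template match="/">
    <xsl:choose>
      <!-- Use the extension only when the processor provides it -->
      <xsl:when test="function-available('str:padding')">
        <xsl:value-of select="str:padding(5, '*')"/>
      </xsl:when>
      <xsl:otherwise>
        <!-- Portable fallback: take a prefix of a long enough literal -->
        <xsl:value-of select="substring('**********', 1, 5)"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>
```

This works because an XSLT 1.0 processor must not signal an error merely because an expression mentions an unavailable extension function; the error arises only if the function is actually called.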
