Strings are probably the most commonly used type of atomic value in queries. This chapter discusses constructing and comparing strings and provides an overview of the many built-in functions that manipulate strings. It also explains string- and text-related features such as whitespace handling and internationalization.
The xs:string Type
The basic string type that is intended to represent generic character data is called, appropriately, xs:string. The xs:string type is not the default type for untyped values. If a value is selected from an input document with no schema, the value is given the type xs:untypedAtomic, not xs:string. However, it is easy enough to cast an untyped value to xs:string. In fact, you can cast a value of any atomic type to xs:string, and you can cast an xs:string value to other atomic types, provided that the string is a valid lexical representation for the target type.
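The casting behavior described above can be sketched as follows. This is a minimal illustration, not taken from the original text; the variable name $untyped and the sample value are assumptions chosen for the example.

```xquery
xquery version "3.1";
(: $untyped is a hypothetical value standing in for data from a schema-less document :)
let $untyped := xs:untypedAtomic("0042")
return (
  $untyped cast as xs:string,         (: "0042": the characters are kept as-is :)
  xs:string(xs:integer($untyped)),    (: "42": the integer has no leading zeros :)
  "0042" cast as xs:integer,          (: 42 :)
  "abc" castable as xs:integer        (: false: not a valid integer representation :)
)
```

The castable as expression is a convenient way to test whether a string-to-type cast would succeed before attempting it.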
The xs:string type is a primitive type from which a number of other types are derived. All the operations and functions that can be performed on xs:string values can also be performed on values whose types are restrictions of xs:string. This includes user-defined types that appear in a schema, as well as built-in derived types such as xs:token, xs:language, and xs:ID. For a complete explanation of the built-in types, see Appendix B.
There are three common ways to construct strings: using string literals, the xs:string constructor, and the string function.
Strings can be included in queries as literals, using double or single quotes. For example, ($name = "Priscilla") and string-length('query') are valid expressions that contain string literals. If a literal value is enclosed in quotes, it is automatically assumed to be a string as opposed to a number.
Between quotes, you can escape the surrounding quote character by including it twice. For example, the literal expression "inner ""quotes""!" evaluates to the string inner "quotes"!. This is true for both single and double quotes.
In string literals, you can use single character references that use XML syntax. For example, &#x20; can be used to include a space. You can also use the predefined entity references &lt;, &gt;, &amp;, &quot;, and &apos;. For example, you can specify the string literal "PB&amp;J" to represent the string PB&J. In fact, ampersands must be escaped with &amp; in string literals.
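The escaping rules for string literals can be seen side by side in a short sketch. The sample strings other than "PB&amp;J" and the doubled-quote example are assumptions added for illustration.

```xquery
(
  "inner ""quotes""!",          (: doubled double quotes yield one quote character each :)
  'it''s',                      (: the same rule applies to single quotes :)
  "PB&amp;J",                   (: the string PB&J :)
  string-length("a&#x20;b"),    (: 3: the character reference is a single space :)
  "&lt;tag&gt;"                 (: the five-character string <tag> :)
)
```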
The xs:string Constructor and the string Function
There is a standard constructor for strings named xs:string. The xs:string constructor, like all constructors, accepts either a single atomic value or a single node. If it is an atomic value, it simply returns that value cast as an xs:string.
Some types have special rules about how their values are formatted when they are cast to xs:string. For example, integers have their leading zeros stripped, and xs:hexBinary values have their letters converted to uppercase. In addition, when values of most non-string types are cast to xs:string, their whitespace is collapsed. This means that consecutive whitespace characters are replaced by a single space, and leading and trailing whitespace is removed. The rules (if any) for each type are described in Appendix B.
If the xs:string constructor is passed a node, it uses atomization to extract the typed value of the node, and then casts it to xs:string. For an attribute, this is simply its value. For an element, it is the character data of the element itself and all its descendants, concatenated together in document order.
In addition, there is a built-in function named string that has almost identical behavior. One difference is that if you use the string function with no arguments, it will use the current context item.
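As a rough sketch of the two approaches, the following compares the xs:string constructor with the string function on atomic values and untyped nodes. The $prod element is a hypothetical fragment invented for this example.

```xquery
(: $prod is an assumed, schema-less sample element :)
let $prod :=
  <product dept="WMN"><number>0557</number><name>Fleece <i>Pullover</i></name></product>
return (
  xs:string(0042),           (: "42": the integer value has no leading zeros :)
  xs:string($prod/@dept),    (: "WMN": the attribute value :)
  xs:string($prod/name),     (: "Fleece Pullover": text of the element and its descendants :)
  $prod/number/string()      (: "0557": no argument, so the context item is used :)
)
```

Note that the untyped number element keeps its leading zero, because its content is never treated as an integer.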
Starting in version 3.1, there is an additional type of expression, called a string constructor, that allows you to create literal strings with intermingled expressions. This is especially useful for generating strings that are in the syntax of languages such as JSON, HTML or CSS that use curly brackets, angle brackets, quotation marks, or other strings that are delimiters in XQuery 3.1.
String constructors are delimited by ``[ and ]``. Within these delimiters, expressions known as interpolations can appear, delimited by `{ and }`. The rest of the characters are considered literal characters. For example:
let $prod1 := <product dept="WMN">
  <number>557</number>
  <name language="en">Fleece Pullover</name>
  <colorChoices>navy black</colorChoices>
</product>
return ``[Name: `{$prod1/name}`, Number: `{$prod1/number}`]``
returns the string:
Name: Fleece Pullover, Number: 557
String constructors can be used to generate strings in HTML syntax. For example:
let $prod1 := <product dept="WMN">
  <number>557</number>
  <name language="en">Fleece Pullover</name>
  <colorChoices>navy black</colorChoices>
</product>
return ``[<h1>`{$prod1/name}`</h1>
<p>Number: `{$prod1/number}`</p>
<h2>Colors</h2>
`{for $color in $prod1/colorChoices/tokenize(.)
  return ``[<li>`{$color}`</li>
]``
}`
]``
returns the following as a string:
<h1>Fleece Pullover</h1>
<p>Number: 557</p>
<h2>Colors</h2>
<li>navy</li>
<li>black</li>
In the above example, the string constructor that outputs the color is nested within an interpolation inside another string constructor. Such nesting makes string constructors a powerful tool for templating the syntax of other languages, especially non-XML languages.
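String constructors are equally handy for JSON syntax, where curly brackets and double quotes would otherwise need careful handling. The following is a minimal sketch; the variables $name and $number are assumptions for the example, and the result is simply a string, not a parsed JSON value.

```xquery
xquery version "3.1";
(: $name and $number are hypothetical sample values :)
let $name := "Fleece Pullover"
let $number := 557
return ``[{"name": "`{$name}`", "number": `{$number}`}]``
(: returns the string {"name": "Fleece Pullover", "number": 557} :)
```

Only a curly bracket that is paired with a backtick starts or ends an interpolation, so the outer JSON brackets are treated as literal characters.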
Several functions, summarized in Table 18-1, are available for comparing and matching strings.
Function name | Description |
---|---|
compare | Compares two strings, optionally based on a collation, returning -1, 0, or 1 |
codepoint-equal | Compares two strings based on codepoints, returning a Boolean value |
starts-with | Determines whether a string starts with another string |
ends-with | Determines whether a string ends with another string |
contains | Determines whether a string contains another string |
contains-token | Determines whether a string contains another string surrounded by whitespace |
matches | Determines whether a string matches a regular expression |
Strings can be compared using the comparison operators: =, !=, >, <, >=, and <=. For example, "abc" < "def" evaluates to true.
The comparison operators use the default collation, as described in “Collations”. You can also use the compare function, which fulfills the same role as the comparison operators but allows you to explicitly specify a collation. The compare function accepts two string arguments and returns -1 if the first argument is less than the second, 0 if the two are equal, or 1 if the first argument is greater than the second.
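The following sketch contrasts the operators with the compare and codepoint-equal functions. It assumes that the implementation's default collation orders these plain ASCII strings in the usual alphabetical way; the third call names the Unicode Codepoint Collation explicitly, so it does not depend on the default.

```xquery
(
  "abc" < "def",                   (: true :)
  compare("abc", "def"),           (: -1: first argument is less :)
  compare("abc", "abc"),           (: 0: equal :)
  compare("def", "abc",
    "http://www.w3.org/2005/xpath-functions/collation/codepoint"),   (: 1 :)
  codepoint-equal("abc", "abc")    (: true :)
)
```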
Four functions test whether a string contains the characters of another string. They are the contains, contains-token, starts-with, and ends-with functions. Each of them returns a Boolean value and takes two strings as arguments: the first is the containing string being tested, and the second is the contained string. (The contains-token function will also accept a sequence of multiple strings as its first argument.) Table 18-2 shows some examples of these functions.
Example | Return value |
---|---|
contains("query", "ery") | true |
contains("query", "x") | false |
contains-token("xml query", "query") | true |
contains-token( ("xml", "query"), "query") | true |
starts-with("query", "que") | true |
starts-with("query", "u") | false |
ends-with("query", "y") | true |
ends-with("query ", "y") | false |
The matches function determines whether a string matches a pattern. It accepts two string arguments: the string being tested and the pattern itself. The pattern is a regular expression, whose syntax is covered in Chapter 19. There is also an optional third argument, which can be used to set additional options in the interpretation of the regular expression, such as multi-line processing and case sensitivity. These options are described in detail in “Using Flags”. Table 18-3 shows examples of the matches function.
Example | Return value |
---|---|
matches("query", "q") | true |
matches("query", "qu") | true |
matches("query", "xyz") | false |
matches("query", "q.*") | true |
matches("query", "[a-z]{5}") | true |
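
Two points that the table implies are worth spelling out: a pattern matches if it is found anywhere in the string unless it is anchored, and matching is case-sensitive unless the i flag is passed as the third argument. A small sketch, with sample strings chosen for illustration:

```xquery
(
  matches("query", "ery"),       (: true: the pattern may match anywhere :)
  matches("query", "^ery"),      (: false: ^ anchors the match to the start :)
  matches("Query", "^q"),        (: false: matching is case-sensitive by default :)
  matches("Query", "^q", "i")    (: true: the i flag makes it case-insensitive :)
)
```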
Three functions are available to return part of a string. The substring function returns a substring based on a starting position (starting at 1, not 0) and optionally a length. For example:
substring("query", 2, 3)
returns the string uer. If no length is specified, the function returns the rest of the string. For example:
substring("query", 2)
returns uery.
The substring-before function returns all the characters of a string that occur before the first occurrence of another specified string. The substring-after function returns all the characters of a string that occur after the first occurrence of another specified string. Table 18-4 shows examples of the substring functions.
Example | Return value |
---|---|
substring("query", 2, 3) | uer |
substring("query", 2) | uery |
substring-before("query", "er") | qu |
substring-before("queryquery", "er") | qu |
substring-after("query", "er") | y |
substring-after("queryquery", "er") | yquery |
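
A common use of these functions is splitting a simple name/value pair. The sketch below assumes a hypothetical string $pair; it also shows that substring-before returns a zero-length string when the second argument does not occur at all.

```xquery
(: $pair is an assumed sample value :)
let $pair := "color=navy"
return (
  substring-before($pair, "="),    (: "color" :)
  substring-after($pair, "="),     (: "navy" :)
  substring-before($pair, ":"),    (: "": the string contains no colon :)
  substring($pair, 1, 5)           (: "color": positions start at 1 :)
)
```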
The length of a string can be determined using the string-length function. It accepts a single string and returns its length as an integer. Whitespace is significant, so leading and trailing whitespace characters are counted. Table 18-5 shows some examples.
Example | Return value |
---|---|
string-length("query") | 5 |
string-length(" query ") | 7 |
string-length(normalize-space(" query ")) | 5 |
string-length("") | 0 |
string-length(" ") | 1 |
Six functions, summarized in Table 18-6, concatenate and split apart strings.
Name | Description |
---|---|
concat | Concatenates two or more strings |
string-join | Concatenates a sequence of strings, optionally using a separator |
tokenize | Breaks a single string into a sequence of strings, using a specified separator |
analyze-string | Splits a string based on parts that match and don’t match a pattern |
codepoints-to-string | Converts a sequence of Unicode codepoint values to a string |
string-to-codepoints | Converts a string to a sequence of Unicode codepoint values |
Strings can be concatenated together using one of two functions: concat or string-join. The concat function accepts individual string arguments and concatenates them together. This function is unique in that it accepts a variable number of arguments. For example:
concat("a", "b", "c")
returns the string abc. The string-join function, on the other hand, accepts a sequence of strings. For example:
string-join( ("a", "b", "c"))
also returns the string abc. In addition, string-join allows a separator to be passed as the second argument. For example:
string-join( ("a", "b", "c"), "/")
returns the string a/b/c.
Starting in version 3.0, there is also a string concatenation operator, the double vertical bar (||). This has the same effect as the concat function but is slightly more convenient syntactically. For example:
"a" || "b" || "c"
returns the string abc. As with the concat function, the operands can be single nodes or atomic values of any type. They are atomized (if necessary) and cast to xs:string before concatenation. A single operand of this operator cannot, however, be a sequence of multiple values. For that, the string-join function is still the best option.
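The three approaches can be combined, as in the following sketch. The $sizes sequence is an assumption made for the example; the point is that non-string atomic values are cast automatically, while a multi-item sequence has to go through string-join first.

```xquery
xquery version "3.1";
(: $sizes is a hypothetical sequence of strings :)
let $sizes := ("S", "M", "L")
return (
  concat("Size: ", 10, "/", 12),             (: "Size: 10/12" :)
  "Size: " || 10 || "/" || 12,               (: "Size: 10/12" :)
  string-join($sizes, ", "),                 (: "S, M, L" :)
  "Sizes: " || string-join($sizes, ", ")     (: "Sizes: S, M, L" :)
)
```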
Strings can be split apart, or tokenized, using the tokenize function. This function breaks a string into a sequence of strings, using a regular expression to designate the separator character(s). For example:
tokenize("a/b/c", "/")
returns a sequence of three strings: a, b, and c. Regular expressions such as \s, which represents a whitespace character (space, line feed, carriage return, or tab), and \W, which represents a non-word character (for example, punctuation or whitespace), are often used with this function. A list of useful regular expressions for tokenization can be found in Appendix A, in the “tokenize” section. Table 18-7 shows some examples of the tokenize function.
Example | Return value |
---|---|
tokenize("a b c", "\s") | ("a", "b", "c") |
tokenize("a b c", "\s+") | ("a", "b", "c") |
tokenize("a-b--c", "-") | ("a", "b", "", "c") |
tokenize("-a-b-", "-") | ("", "a", "b", "") |
tokenize("a/ b/ c", "[/\s]+") | ("a", "b", "c") |
tokenize("2015-12-25T12:15:00", "[-T:]") | ("2015", "12", "25", "12", "15", "00") |
tokenize("Hello, there.", "\W+") | ("Hello", "there", "") |
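
Leading and trailing separators produce zero-length strings in the result, which is easy to overlook. The sketch below uses assumed sample values to show that behavior and two ways around it: normalizing the whitespace first, or using the single-argument form of tokenize available in version 3.1.

```xquery
xquery version "3.1";
(: $date is a hypothetical sample value :)
let $date := "2015-12-25"
let $parts := tokenize($date, "-")
return (
  count($parts),                                (: 3 :)
  $parts[1],                                    (: "2015" :)
  tokenize("  a  b ", "\s+"),                   (: four strings: "", "a", "b", "" :)
  tokenize(normalize-space("  a  b "), " "),    (: two strings: "a", "b" :)
  tokenize("  a  b ")                           (: two strings: "a", "b" :)
)
```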
The analyze-string function can also be used to split apart strings. This function is especially useful if you want to keep both the matching and non-matching parts of the string (as opposed to tokenize, which throws away the delimiters).
In order to provide a structured result that contains both matches and non-matches, the analyze-string function returns an XML element named fn:analyze-string-result that contains elements called fn:match for each part of a string that matches the regular expression, and fn:non-match for each part that does not match.
For example, the following:
analyze-string("can be reached at 231-555-1212 or", "\d{3}-\d{3}-\d{4}")
will return the following XML, which could then be traversed to, for example, tag the phone number but keep the surrounding text:
<fn:analyze-string-result xmlns:fn="http://www.w3.org/2005/xpath-functions">
  <fn:non-match>can be reached at </fn:non-match>
  <fn:match>231-555-1212</fn:match>
  <fn:non-match> or</fn:non-match>
</fn:analyze-string-result>
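One way to traverse that result is sketched below: each fn:match is wrapped in an element and each fn:non-match is copied as text. The element names p and phone and the variable $text are assumptions made for this illustration.

```xquery
xquery version "3.1";
(: $text, <p>, and <phone> are hypothetical names chosen for the example :)
let $text := "can be reached at 231-555-1212 or"
return
  <p>{
    for $part in analyze-string($text, "\d{3}-\d{3}-\d{4}")/*
    return
      if ($part/self::fn:match)
      then <phone>{string($part)}</phone>
      else string($part)
  }</p>
(: returns <p>can be reached at <phone>231-555-1212</phone> or</p> :)
```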
Strings can be constructed from a sequence of Unicode codepoint values (expressed as integers) using the codepoints-to-string function. For example:
codepoints-to-string( (97, 98, 99) )
returns the string abc. The string-to-codepoints function performs the opposite; it converts a string to a sequence of codepoints. For example:
string-to-codepoints("abc")
returns a sequence of three integers: 97, 98, and 99.
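Used together, the two functions make character-level manipulation possible. As a small sketch (the sample strings are assumptions), a string can be reversed by reversing its codepoints:

```xquery
(
  codepoints-to-string(reverse(string-to-codepoints("query"))),    (: "yreuq" :)
  codepoints-to-string((88, 81, 117, 101, 114, 121)),              (: "XQuery" :)
  string-to-codepoints("")                                         (: the empty sequence :)
)
```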
Four functions can be used to manipulate the characters of a string. They are listed in Table 18-8.
Function name | Description |
---|---|
upper-case | Translates a string into uppercase equivalents |
lower-case | Translates a string into lowercase equivalents |
translate | Replaces individual characters with other individual characters |
replace | Replaces characters that match a regular expression with a specified string |
The upper-case and lower-case functions are used to convert a string to all uppercase or lowercase. For example, upper-case("Query") returns QUERY. The mappings between lowercase and uppercase characters are determined by Unicode case mappings. If a character does not have a corresponding uppercase or lowercase character, it is included in the result string unchanged. Table 18-9 shows some examples.
Example | Return value |
---|---|
upper-case("query") | QUERY |
upper-case("Query") | QUERY |
lower-case("QUERY-123") | query-123 |
lower-case("Query") | query |
The translate function is used to replace individual characters in a string with other individual characters. It takes three arguments:
The string to be translated
The list of characters to be replaced (as a string)
The list of replacement characters (as a string)
Each character in the second argument is replaced by the character in the same position in the third argument. For example:
translate("**test**321", "*123", "-abc")
returns the string --test--cba.
If the second argument is longer than the third argument, the extra characters in the second argument are simply omitted from the result. For example:
translate("**test**321", "*123", "-")
returns the string --test--.
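A few more cases, sketched with assumed sample strings, show the two typical uses of translate: swapping one character for another and stripping unwanted characters by leaving the third argument empty.

```xquery
(
  translate("2015-12-25", "-", "/"),           (: "2015/12/25" :)
  translate("(231) 555-1212", "() -", ""),     (: "2315551212": all four characters removed :)
  translate("abcabc", "ab", "ba")              (: "bacbac": a and b are swapped :)
)
```

Because each character is mapped independently, swapping two characters works in a single call, which would not be possible with two successive replace calls.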
The replace function is used to replace non-overlapping substrings that match a regular expression with a specified replacement string. It takes three arguments:
The string to be manipulated
The pattern, which uses the regular expression syntax described in Chapter 19
The replacement string
While it is nice to have the power of regular expressions, you don’t have to be familiar with regular expressions to replace a particular sequence of characters; you can simply specify the string you want replaced for the $pattern argument, as long as it doesn’t contain any special characters.
An optional fourth argument allows for additional options in the interpretation of the regular expression, such as multi-line processing and case sensitivity. Table 18-10 shows some examples.
Example | Return value |
---|---|
replace("query", "r", "as") | queasy |
replace("query", "qu", "quack") | quackery |
replace("query", "[ry]", "l") | quell |
replace("query", "[ry]+", "l") | quel |
replace("query", "z", "a") | query |
replace("query", "query", "") | A zero-length string |
XQuery also supports variables in the replacement text, which allow parenthesized sub-expressions to be referenced by number. You can use the variables $1 through $9 to represent the first nine parenthesized expressions in the pattern. This is very useful when replacing strings, on the condition that they come directly before or after another string. For example, if you want to change instances of the word Chap to the word Sec, but only those that are followed by a space and a digit, you can use the function call:
replace("Chap 2...Chap 3...Chap 4...", "Chap (\d)", "Sec $1.0")
which returns Sec 2.0...Sec 3.0...Sec 4.0.... Sub-expressions are discussed in more detail in “Using Sub-Expressions with Replacement Variables”.
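The following sketch, using assumed sample strings, shows several sub-expressions being reordered, the fourth (flags) argument, and the escaping needed when the pattern or the replacement contains a special character.

```xquery
(
  replace("2015-12-25", "(\d{4})-(\d{2})-(\d{2})", "$2/$3/$1"),    (: "12/25/2015" :)
  replace("Query", "QUERY", "xquery", "i"),    (: "xquery": case-insensitive match :)
  replace("a.b.c", "\.", "-"),                 (: "a-b-c": the period is escaped in the pattern :)
  replace("cost", "cost", "\$5")               (: "$5": a literal dollar sign is escaped in the replacement :)
)
```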
Whitespace handling varies by implementation and depends on whether the implementation uses schema validation, and how it chooses to handle whitespace in element content. Every XML parser normalizes the whitespace in attribute values, replacing carriage returns, line feeds, and tabs with spaces. XML Schema processors may further normalize whitespace of an attribute or element value based on its type. During XML Schema validation, whitespace is preserved in values of type xs:string (and some of its derived types), but collapsed in all others.
Within string literals in queries, whitespace is always significant. For example, the expression string-length(" x ") evaluates to 3, not 1.
The normalize-space function collapses whitespace in a string. Specifically, it performs the following steps:
Replaces each carriage return (#xD), line feed (#xA), and tab (#x9) character with a single space (#x20)
Collapses all consecutive spaces into a single space
Removes all leading and trailing spaces
Table 18-11 shows some examples.
Example | Return value |
---|---|
normalize-space("query") | query |
normalize-space(" query ") | query |
normalize-space("xml query") | xml query |
normalize-space(" xml query ") | xml query |
normalize-space(" ") | A zero-length string |
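
Since tabs and line breaks are hard to show in a table, the sketch below writes them as character references. The sample strings are assumptions for illustration.

```xquery
(
  normalize-space("  xml&#x9;&#xA;  query "),      (: "xml query": tab and line feed are collapsed too :)
  string-length(normalize-space("&#xD;&#xA;")),    (: 0: a string of only whitespace becomes empty :)
  normalize-space(" a  b ") = "a b"                (: true :)
)
```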
XML, through its support for Unicode, is designed to allow for many natural languages. XQuery provides several functions and mechanisms that support multiple natural languages: collations, the normalize-unicode function, and the lang function.
Collations are used to specify the order in which characters should be compared and sorted. Characters can be sorted simply based on their codepoints, but this has some limitations. Different languages and locales alphabetize the same set of characters differently. In addition, an uppercase letter and its lowercase equivalent may need to be sorted together. For example, if you sort on codepoints alone, a lowercase a comes after an uppercase Z.
Collations are not just for sorting. They can be used to equate two strings that contain equivalent values. Some languages and locales may consider two different characters or sequences of characters to be equivalent. For example, a collation may equate the German character ß with the two letters ss. This type of comparison comes into play when using, for example, the contains function, which determines whether one string contains the characters of another string.
Collations in XQuery are identified by URIs. The URI serves only as a name and does not necessarily point to a resource on the Web, although it might.
All XQuery implementations recognize at least three collation URIs:
http://www.w3.org/2005/xpath-functions/collation/codepoint
The Unicode Codepoint Collation that simply compares strings based only on Unicode codepoints.
http://www.w3.org/2005/xpath-functions/collation/html-ascii-case-insensitive
The HTML ASCII case-insensitive collation that compares strings in a case-insensitive manner, where the uppercase letters A to Z are considered equivalent to the lowercase letters a to z. Other characters are compared based on their Unicode code points. This is defined by HTML and is used, for example, to compare HTML class attributes.
http://www.w3.org/2013/collation/UCA
The Unicode Collation Algorithm is a far more sophisticated collation. The collation URI can be followed by a number of query parameters, for example:
http://www.w3.org/2013/collation/UCA?lang=sv;numeric=yes
indicates that the language is Swedish and that sequences of consecutive digits should be sorted as numbers. While processors are required to recognize this collation URI, they are by default not required to fully support it and can use a different collation as a fallback.
Some implementations support additional collations. You should consult the documentation of your XQuery implementation to determine which collations are supported, including more details about what query parameters are supported for the Unicode Collation Algorithm and what fallback collations are used. If an unsupported collation URI is specified in a query, an error is raised.
There are several ways to specify a collation. Some XQuery functions, such as compare and distinct-values, accept a $collation argument that allows you to specify the collation URI. For example:
distinct-values(doc("catalog.xml")//@dept, "http://datypic.com/collation/custom")
In addition, you can specify a collation in the order by and group by clauses of a FLWOR. For example:
order by $d collation "http://datypic.com/collation/custom"
or:
group by $d collation "http://datypic.com/collation/custom"
You can also specify a default collation in the query prolog. This default is used by some functions as well as order by and group by clauses when no collation keyword is specified. The default collation is also used in operations that do not allow you to specify collation, such as those using the comparison operators =, !=, <, <=, >, and >=. The syntax of a default collation declaration is shown in Figure 18-1.
An example is:
declare default collation "http://datypic.com/collation/custom";
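In the order by and group by clauses and in the default collation declaration, the collation URI must be a literal value in quotes (not an evaluated expression); the $collation argument of a function, by contrast, is an ordinary string expression. In every case, it should be a syntactically valid URI. If a relative URI is provided, it is relative to the static base URI, which is described in “Static base URI”.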
Alternatively, the implementation may have a built-in default collation, or allow a user to specify one, through means other than the query prolog. The default collation can be obtained using the default-collation function, which takes no arguments.
As a last resort, if no $collation argument is provided, no default collation is specified, and the implementation does not provide a default collation, then the simple Unicode Codepoint Collation is used.
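The interplay between a declared default and an explicit $collation argument can be sketched as follows. This example assumes an XQuery 3.1 processor that supports the HTML ASCII case-insensitive collation; the variable $codepoint is introduced only for readability.

```xquery
xquery version "3.1";
declare default collation
  "http://www.w3.org/2005/xpath-functions/collation/html-ascii-case-insensitive";

(: $codepoint is a hypothetical helper variable holding a standard collation URI :)
let $codepoint := "http://www.w3.org/2005/xpath-functions/collation/codepoint"
return (
  "Query" = "QUERY",                        (: true: the operator uses the declared default :)
  compare("Query", "QUERY", $codepoint),    (: 1: lowercase u has a higher codepoint than uppercase U :)
  compare("Z", "a", $codepoint),            (: -1: uppercase Z sorts before lowercase a :)
  default-collation()                       (: the URI declared in the prolog :)
)
```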
Although it is possible in XML to use an xml:lang attribute to indicate the natural language of character data, use of this attribute has no effect on the collation algorithm used in XQuery. Unlike SQL, the choice of collation depends entirely on the user writing the query, and not on any properties of the data.
Unicode normalization allows text to be compared without regard to subtle variations in character representation. It replaces certain characters with equivalent representations. Two normalized values can then be compared to determine whether they are the same. Unicode normalization is also useful for allowing character strings to be sorted appropriately.
The normalize-unicode function performs Unicode normalization on a string. It takes two arguments: the string to be normalized and, optionally, the normalization form to use (NFC if none is specified). The normalization form controls which characters are replaced. Some characters may be replaced by equivalent characters, while others may be decomposed to an equivalent representation that has two or more codepoints.
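As a minimal sketch of why normalization matters, the following compares two representations of the letter e with an acute accent. The variable names are assumptions, and only the NFC form is used because it is the one form every implementation must support.

```xquery
(: $composed and $decomposed are hypothetical sample values :)
let $composed := "&#xE9;"          (: a single precomposed codepoint :)
let $decomposed := "e&#x301;"      (: the letter e followed by a combining acute accent :)
return (
  codepoint-equal($composed, $decomposed),           (: false: the codepoints differ :)
  codepoint-equal(
    normalize-unicode($composed, "NFC"),
    normalize-unicode($decomposed, "NFC")),          (: true: both normalize to one codepoint :)
  string-length(normalize-unicode($decomposed))      (: 1: NFC is the default form :)
)
```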
It is possible to test the language of an element based on the existence of an xml:lang attribute among its ancestors. This is accomplished using the lang function.
The lang function accepts as arguments the language for which to test and, optionally, the node to be tested. The function returns true if the relevant xml:lang attribute of the node (or the context node if no second argument is specified) has a value that matches the argument. The function returns false if the relevant xml:lang attribute does not match the argument, or if there is no relevant xml:lang attribute.
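The matching rules can be sketched with a small, hypothetical document: the attribute may be inherited from an ancestor, the comparison ignores case, and a sublanguage such as fr-CA matches a test for fr.

```xquery
(: $doc is an assumed sample fragment :)
let $doc :=
  <catalog xml:lang="en">
    <desc>Fleece Pullover</desc>
    <desc xml:lang="fr-CA">Pull en molleton</desc>
  </catalog>
return (
  lang("en", $doc/desc[1]),    (: true: xml:lang is inherited from catalog :)
  lang("fr", $doc/desc[2]),    (: true: fr-CA is a sublanguage of fr :)
  lang("FR", $doc/desc[2]),    (: true: the comparison is case-insensitive :)
  lang("en", $doc/desc[2])     (: false: the nearest xml:lang is fr-CA :)
)
```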