So far in this book, the examples have assumed that the inputs to the query are individual XML documents that are accessed through the doc function. This chapter goes into further detail about the various options for accessing input documents in XQuery. It also describes output documents, including the many different options for serializing query results to files.
While XQuery is typically associated with querying and returning XML, it can also work with text files and JSON documents.
XML documents are by far the most common input to XQuery. Technically, the input might not be an entire XML document; it might be a document fragment, such as an element or sequence of elements, possibly with children. It might not be a physical XML file at all; it might be data retrieved from an XML database, or an in-memory XML representation that was generated from non-XML data.
If the input document is physically stored in XML syntax, it must be well-formed XML. This means that it must comply with XML syntax rules, such as that every start tag has an end tag, there is no overlap among elements, and special characters are used appropriately. It must also use namespaces appropriately. This means that if colons are used in element or attribute names, the part before the colon must be a prefix that is bound to a namespace using a namespace declaration.
Whether it is physically stored as an XML document or not, an input document must conform to other constraints on XML documents. For example, an element cannot have two attributes with the same name, and element and attribute names cannot contain special characters other than hyphens, underscores, and periods.
Starting in version 3.0, it is possible to open text documents with XQuery. This may include comma- or tab-delimited formats, fixed-length formats, property files containing name/value pairs, or any other structured text format. The regular expression functions such as tokenize and analyze-string make it relatively straightforward to parse these text files into an instance of the XQuery data model that can be queried.
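As a rough illustration of this kind of parsing, the following sketch turns a comma-delimited file into XML elements. The file name products.csv and its three-column layout (number, department, name) are assumptions made for the example. It uses unparsed-text-lines, the line-oriented companion of the unparsed-text function, and a naive split on commas that does not handle quoted fields.

```xquery
xquery version "3.0";

(: Hypothetical input file, one product per line, for example:
   557,WMN,Fleece Pullover :)
<catalog>{
  for $line in unparsed-text-lines("products.csv")
  let $fields := tokenize($line, ",")
  (: skip blank lines and lines without the three expected fields :)
  where count($fields) ge 3
  return
    <product dept="{$fields[2]}">
      <number>{$fields[1]}</number>
      <name>{$fields[3]}</name>
    </product>
}</catalog>
```

For formats with more irregular structure, analyze-string may be a better fit than tokenize, because it reports both the matching and the non-matching parts of each line.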
Starting in version 3.1, maps and arrays are added to the XQuery data model, along with the ability to parse JSON documents into these new item types. As with XML, there is no particular requirement that the input document be physically stored in JSON syntax; it could be passed in memory from another process.
A single query can access many input documents. There are several ways that input documents can be accessed from within a query. They are described in this section. Most of these methods have some implementation-defined aspects. You should consult the documentation for your XQuery implementation to determine which of these methods are appropriate for accessing input documents.
The doc function can be used to open one XML input document based on its URI. It takes as an argument a single URI as a string, and returns the document node of the resource associated with the specified URI. To test if an XML document exists before accessing it, you can call the doc-available function, which returns true if you can access the document (without errors) via the doc function.
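A minimal sketch of this guard pattern follows. The URI order.xml and the missing element used as a fallback are placeholders chosen for the example.

```xquery
let $uri := "order.xml"  (: hypothetical relative URI :)
return
  if (doc-available($uri))
  then doc($uri)                (: the document node :)
  else <missing uri="{$uri}"/>  (: placeholder fallback, not a standard element :)
```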
For text documents, the unparsed-text function will open a text document and return the contents of that document as a string. The function unparsed-text-available can be used to test for the existence of a text document and to ensure there are no encoding or other errors that might arise when accessing it.
The json-doc function is used to access JSON documents. It typically returns a map that represents the outermost JSON object. There is no separate function for testing for the existence of JSON documents, but the unparsed-text-available function can be used for this purpose, since a JSON document is a text file.
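The following sketch combines the two functions. It assumes a hypothetical file order.json whose outermost value is a JSON object with a number key. Note that unparsed-text-available only confirms that the file can be read as text; json-doc can still raise an error if the content is not valid JSON.

```xquery
xquery version "3.1";

let $uri := "order.json"  (: hypothetical; assumed to hold {"number": "00299432", ...} :)
return
  if (unparsed-text-available($uri))
  then
    let $order := json-doc($uri)  (: a map, given the assumed content :)
    return $order?number          (: look up the "number" key :)
  else ()
```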
The doc, json-doc, and unparsed-text functions return a single resource associated with the specified URI. For example:
doc("http://datypic.com/input/order.xml")
returns the document node of the document whose URI is http://datypic.com/input/order.xml. Relative URI references are also allowed, as in:
doc("order.xml")
If the specified URI is a relative URI, it is resolved based on the static base URI. The static base URI may be set by the processor outside the scope of the query, typically to the URI of the file from which the query was read. Alternatively, it may be declared in the query prolog by using a base URI declaration.
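As a small sketch of the second option, the base URI declaration below makes the relative reference resolve to http://datypic.com/input/order.xml. Whether that URI is then dereferenced or looked up in a catalog still depends on the implementation.

```xquery
declare base-uri "http://datypic.com/input/";

(: resolved against the static base URI declared above :)
doc("order.xml")
```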
If you are accessing documents on a filesystem, your implementation may require you to precede the filename with file:///, use forward slashes to separate directory names, and/or escape each space in the filename with %20. For example:
doc("file:///C:/Documents%20and%20Settings/my%20order.xml")
Processors interpret the URI of the input document in different ways. Some, like Saxon, will dereference the URI, that is, go out to the URL and retrieve the resource at that location. Other implementations, such as those embedded in XML databases, consider the URIs to be just names. The processor might take the name and look it up in an internal catalog to find the document associated with that name. Many processors provide user hooks or configuration options allowing the behavior to be controlled by the application, and the result may also depend on the configuration of the underlying environment (for example, HTTP proxy settings).
Implementations also have some leeway in how they handle errors when retrieving documents, how they handle different MIME types, and whether they validate XML documents against a schema or DTD.
The collection function returns the items that make up a collection, identified by a URI. Traditionally, collections have contained only document nodes. However, starting in version 3.1, they can contain any items, for example, strings, or maps retrieved from JSON documents.
Exactly how the URI is associated with the items is defined by the implementation. For example, one implementation might accept a URI that is the name of a directory on a filesystem, and return the document nodes of the XML documents stored in that directory. Another implementation might associate a URI with a particular database or subset of a database. A third might allow you to specify the URI of an XML document that contains a sort of “manifest”—a list of URIs for the XML documents in the collection.
The function takes as an argument a single URI. For example, the function call:
collection("http://datypic.com/orders")
might return all the document nodes of the XML documents associated with the collection http://datypic.com/orders. It is also possible to use the function without any parameters, as in collection(), to retrieve a default collection as defined by the implementation.
To get a list of the document URIs in a collection, you can use the uri-collection function. It provides a more efficient way of determining what is in a collection without having to retrieve the whole collection, which may take time and memory.
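The sketch below lists the contents of a collection without opening any of its documents. The collection-summary and document element names are invented for the example, and what the collection URI resolves to remains implementation-defined.

```xquery
xquery version "3.0";

(: The collection URI is the one used in the text; what it resolves to
   is implementation-defined. :)
let $coll := "http://datypic.com/orders"
return
  <collection-summary count="{count(uri-collection($coll))}">{
    for $uri in uri-collection($coll)
    return <document uri="{$uri}"/>
  }</collection-summary>
```

A query could then pass only the URIs of interest to the doc function rather than calling collection on the whole set.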
Another way to access input is to rely on the context being set by the processor outside the query. In this case, it may not be necessary to use functions like doc or collection, unless you want to open secondary data sources.
For example, a hypothetical XQuery implementation might allow you to set the context node in the Java code that executes the query, as in:
Document catalogDocument = new Document(new File("catalog.xml"));
String query = "catalog/product[@dept = 'ACC']";
List productElements = catalogDocument.evaluateXQuery(query);
In that case, the XQuery expression catalog/product might be evaluated in the context of the catalog document node itself. If the processor had not set the context node, a path expression starting with catalog/product would not be valid.
Another implementation might allow you to select a document to query in a user interface, in which case, it uses that document as the context node.
The processor can bind external variables to input documents or document fragments. These variables can then be used in the query to access the input document. For example, an implementation might allow an external variable named $input to be defined, and allow the user to specify a document to be associated with that variable. The hypothetical query processor could be invoked from the command line using:
xquery -input catalog.xml
and the query could use expressions like $input/catalog/product to retrieve the product elements. The name $input is provided as an example; the implementation could use any name for the variable, or require you to declare it as an external variable in your query.
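For implementations that require the declaration, a query along the following lines would work. The variable name $input and the document-node() type are assumptions carried over from the hypothetical example; how a value is bound to the variable depends on the processor.

```xquery
(: $input must be supplied by the processor when the query is run :)
declare variable $input as document-node() external;

$input/catalog/product
```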
It is also possible to set the context for a query by using a context item declaration in the prolog. This is essentially a way of identifying the main input source for the query. This method is used in conjunction with, not instead of, the other methods of accessing input documents described in this section. For example:
declare context item := doc("catalog.xml");
for $prod in catalog/product
return $prod/number
In this case, the context is set to the document node of the catalog.xml document, so the XPath catalog/product is evaluated relative to that. The syntax of the context item declaration is shown in Figure 13-1.
Using the external keyword means that the context is set by the processor outside the query. However, it is possible to specify a default value that is used when no context is set. For example:
declare context item external := doc("catalog.xml");
for $prod in catalog/product
return $prod/number
A context item declaration can only appear in a main module (not in a library module) and there can only be one of them.
Serialization is the process of writing the results of a query out, whether it be an XML document, JSON document, or text file, or even a string representing XML or JSON syntax. In your query, you construct (or select) a number of items (for example, XML elements and attributes) to include in the results. These results conform to the data model described in “The XQuery Data Model”. However, the data model does not define the details of the output syntax and format to be used, for example whether to indent the output or how to encode special characters. Some of these details can be controlled with serialization parameters.
Six different serialization methods identify the desired overall output format. Implementations may support additional methods, which are expressed as a qualified name in an implementation-defined namespace. (Implementations are not required to support serialization at all, but most do.)
xml
The default method is xml, which writes out an XML document. Typically this is a well-formed XML document with a single root element, but it is also possible to serialize a fragment of an XML document, for example a sequence of several elements.
xhtml
The xhtml output method indicates that the output should be XHTML, meaning a well-formed XML document that uses HTML tags, but that has certain differences from the XML output to ensure browser compatibility, namely:
Elements that should always be empty, like br and hr, are serialized using a minimized tag, but with a space before the forward slash, as in <hr />.
Elements that happen to be empty but could contain content according to the HTML specification are not serialized with a minimized tag, and instead have separate start and end tags, as in <p></p>.
Indentation is suppressed inside mixed content elements such as p or td. This avoids any inadvertent whitespace being introduced.
html
The html output method indicates that the output should be HTML, and is not necessarily well-formed XML. The differences from the xhtml output method are:
Elements that should always be empty, like br and hr, are serialized with a start tag and no matching end tag, as in <hr>.
Content of the script and style elements is not escaped, even if it normally would be in well-formed XML. For example, the script element can contain an unescaped < (less than) character.
Attributes defined as boolean attributes in the HTML specification are output using their minimized form, with an attribute name but no value. For example, <input type="checkbox" checked="checked"> would be output as <input type="checkbox" checked> because checked is a boolean attribute.
Predefined HTML character entity references may be used in the output, for example &nbsp; instead of &#160; or the non-breaking space character itself.
No XML declaration is included.
text
When using the text output method, the results are converted to their string value. This output method would typically be used to output atomic values (especially strings). The results can include XML structures, but the output will only include the content of the elements, not their tags or attributes. For example, if the results of the query are:
<desc lang="en">Our <i>favorite</i> shirt!</desc>
The output will be:
Our favorite shirt!
Most of the serialization parameters described in the next section do not apply to the text output method. The ones that are relevant are byte-order-mark, encoding, item-separator, media-type, normalization-form, and use-character-maps.
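The behavior of the text method can be tried out with the serialize function, which is covered later in this chapter. This sketch assumes a processor that supports version 3.1, since it passes the parameters as a map.

```xquery
xquery version "3.1";

(: Tags and attributes are dropped; only the text content is kept.
   Returns the string: Our favorite shirt! :)
serialize(
  <desc lang="en">Our <i>favorite</i> shirt!</desc>,
  map { "method": "text" }
)
```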
json
If the json output method is chosen, the output will be in JSON syntax. Maps in XQuery will be serialized to JSON objects, and arrays will be serialized to JSON arrays. More detail on serializing JSON is provided in “JSON”.
Most of the serialization parameters described in the next section do not apply to the json output method. The ones that are relevant are allow-duplicate-names, byte-order-mark, encoding, indent, json-node-output-method, media-type, normalization-form, and use-character-maps.
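As a sketch, a query that produces JSON might look like the following. The map content is invented for the example, and the output declarations it relies on are described later in this chapter. The order in which the keys appear in the output is not guaranteed, because maps are unordered.

```xquery
xquery version "3.1";

declare namespace output = "http://www.w3.org/2010/xslt-xquery-serialization";
declare option output:method "json";
declare option output:indent "yes";

(: serialized as a JSON object containing a number, a string, and an array :)
map {
  "number": 557,
  "name": "Fleece Pullover",
  "colors": array { "navy", "black" }
}
```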
adaptive
The purpose of the adaptive output method is to allow any instance of the XQuery data model to be serialized without errors. This is useful, for example, for debugging, or for displaying the results of a query to a user.
It uses a combination of the other output methods, depending on the item type. It serializes XML nodes as it would using the xml output method. If it encounters attribute nodes or namespace nodes with no parent, it serializes them as strings. It serializes arrays and maps using the json output method, with a few minor exceptions to avoid raising serialization errors, such as allowing the value INF for numbers.
The adaptive output method also allows for the serialization of function items, which is not supported in any other output method. These are serialized as strings that contain just their name and their arity, for example exists#1. If it is an anonymous function, the string (anonymous) is used in place of the name, for example, (anonymous)#2.
Implementations can also choose to output additional debugging information, such as the types of values, or information about non-fatal serialization errors.
Serialization parameters can be used to control certain syntactic features of the output, such as the encoding used, whether an XML declaration is included at the beginning of an XML document, or whether the results are indented. This section describes the serialization parameters. The next two sections show how to set the values for serialization parameters in your query.
Support for serialization parameters is somewhat implementation-dependent, since not all implementations have to support all parameters, and they can define their own default values and even their own parameters (using names in an implementation-defined namespace). (Default values listed in this section are the ones recommended by XQuery, but may be overridden by your implementation.)
Wherever the value yes is listed below, the equivalent value true or 1 can be used instead. Likewise, wherever the value no is listed below, the equivalent value false or 0 can be used instead.
allow-duplicate-names
Whether to allow duplicate names in the same object in JSON output. Valid values are yes or no; the default is no. If it is set to no, a serialization error will be raised if a map contains two keys with the same string value. This could arise if the keys have different typed values in the XQuery map, for example the string 1 and the integer 1.
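The situation described above can be reproduced with a short sketch. The map below has two distinct keys in XQuery terms, but both become the JSON name "1" when serialized. It uses the serialize function with a parameter map, as described later in this chapter.

```xquery
xquery version "3.1";

(: Without "allow-duplicate-names": true(), this call would raise a
   serialization error, because both keys serialize to the name "1". :)
serialize(
  map { "1": "string key", 1: "integer key" },
  map { "method": "json", "allow-duplicate-names": true() }
)
```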
byte-order-mark
Whether a byte order mark (BOM) should precede the serialized results. Valid values are yes or no.
cdata-section-elements
A space-separated list of qualified element names whose contents should be enclosed in a CDATA section in XML output. For example, if your DocBook output contains some escaped XML examples, the output might be written as:
<programlisting>&lt;p>Example p content&lt;/p></programlisting>
with escaped less-than characters. If you list the programlisting element in the cdata-section-elements parameter, the output will be more readable:
<programlisting><![CDATA[<p>Example p content</p>]]></programlisting>
The output XML document is exactly equivalent; CDATA sections are merely used for convenience when editing documents.
doctype-public
A public identifier to be included in a document type declaration of the output. For example, specifying -//NLM//DTD Journal Publishing DTD v2.3 20070202//EN as the value of this parameter and journalpublishing.dtd as the value of the doctype-system parameter will result in the following line appearing at the beginning of the output (assuming article is the outermost element):
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
This parameter can be used alone (if the output method is html), or in conjunction with the doctype-system parameter.
doctype-system
A system identifier to be included in a document type declaration of the output, to indicate the location of a DTD. For example, specifying journalpublishing.dtd as the value of this parameter will result in the following line appearing at the beginning of the output (assuming article is the outermost element):
<!DOCTYPE article SYSTEM "journalpublishing.dtd">
This parameter can be used alone, or in conjunction with the doctype-public parameter.
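As a sketch of how the two doctype parameters might be set together, the following query uses output declarations (described later in this chapter) with the identifiers from the examples above. The empty article element stands in for real query results.

```xquery
xquery version "3.0";

declare namespace output = "http://www.w3.org/2010/xslt-xquery-serialization";
declare option output:method "xml";
declare option output:doctype-public "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN";
declare option output:doctype-system "journalpublishing.dtd";

(: placeholder result; article becomes the name in the DOCTYPE :)
<article/>
```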
encoding
The encoding to be used for the results. The default is either UTF-8 or UTF-16 (which one is implementation-defined). Implementations are required to support at least these two values, and some support other values.
For example, Saxon supports the encoding US-ASCII, which causes all non-ASCII characters to be output using character references. Using this encoding, the character © would appear as a numeric character reference such as &#169; in the results.
escape-uri-attributes
Whether to perform URI escaping on attributes that are defined in (X)HTML to be URIs, such as href and src. Values are yes or no; the default is yes. This parameter applies only to html and xhtml output types.
html-version
A decimal number between 1.0 and 5.0 indicating the version of HTML. If the output method is xhtml, this parameter refers to the version of HTML, while the version parameter indicates the version of XML. If the output method is html, this value is used for the version of HTML if this parameter is specified, otherwise the version parameter indicates the version of HTML.
include-content-type
Whether to include a meta element that specifies the content type, as in:
<meta http-equiv="Content-Type" content="text/html; charset=EUC-JP" />
Valid values are yes or no; the default is yes. This parameter applies only to html and xhtml output types.
indent
Whether to pretty-print the results, i.e., put line breaks and indenting spaces between elements. Valid values are yes or no; the default is no.
Indentation can be impacted by the output method and by any schemas that you are using with queries. No indentation will be performed on elements that can have mixed content (allow both children and text content) according to a schema against which it has been validated. If the output method is xhtml or html, the processor knows which elements can contain mixed content without a schema, and it suppresses indentation for those. It also does not indent elements such as pre and script where whitespace can be significant.
However, if no schema is used and the output method is xml, indentation should be avoided for documents that contain mixed content, because unintended whitespace can be inserted. It is also possible to turn indenting off only for specific elements using the suppress-indentation parameter.
item-separator
A string to be inserted between every top-level item in the sequence being serialized. This applies to all items (including, for example, element nodes), but it is most useful for text nodes and atomic values. For example, specifying , (a comma) for this parameter will cause the result of the query:
doc("catalog.xml")//number/string(.)
to be serialized as 557,563,443,784. Otherwise, it will be output as 557 563 443 784, with spaces inserted between the values.
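The same effect can be sketched without an input document by serializing a literal sequence. This assumes a version 3.1 processor, since the parameters are passed to the serialize function as a map.

```xquery
xquery version "3.1";

(: Returns the string: 557,563,443,784 :)
serialize(
  (557, 563, 443, 784),
  map { "method": "text", "item-separator": "," }
)
```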
json-node-output-method
The method to use to serialize XML nodes into strings in JSON output, namely one of xml, xhtml, html, or text. The default is xml for most implementations. This is described in more detail in “Serializing JSON”. Some implementations will support additional values expressed as QNames.
media-type
The media type (MIME type), as defined by RFC 2046. The charset parameter should not be included. This might be used, for example, if the serialized result were being sent over HTTP. The processor could choose to set the media type in the HTTP header to this value.
method
The type of output, namely one of xml, xhtml, html, text, json, or adaptive. The default is xml for most implementations. Some implementations will support additional values expressed as QNames.
normalization-form
Valid values are NFC, NFD, NFKC, NFKD, fully-normalized, none, or an implementation-defined value. Unicode normalization is discussed in “Unicode Normalization”.
omit-xml-declaration
Whether to omit the XML declaration in the output. Valid values are yes or no. The XML declaration is the optional first line of an XML document that looks like this:
<?xml version="1.0" encoding="UTF-8"?>
standalone
The value of the standalone parameter of the XML declaration. Valid values are yes, no, or omit. Use omit to not include the standalone parameter.
suppress-indentation
A space-separated list of qualified element names for which to suppress indentation in the output. This is useful for elements that can contain mixed content, where unintended whitespace could be inserted if indentation is used. For example, the following p element has no whitespace within it:
<p><b>Number:</b><i>1</i></p>
However, when serialized with indentation, the result will be:
<p>
   <b>Number:</b>
   <i>1</i>
</p>
There is now whitespace between the colon and the number 1, which would appear in an HTML page as Number: 1 instead of Number:1. To avoid this, the element name p could be specified in the suppress-indentation parameter.
A safer, broader approach for output that could contain mixed content is to set the indent parameter to no, thus suppressing indentation for all elements.
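A sketch of the targeted approach is shown below, using output declarations (described later in this chapter). The surrounding div element is added for the example so that there is something left for the processor to indent.

```xquery
xquery version "3.0";

declare namespace output = "http://www.w3.org/2010/xslt-xquery-serialization";
declare option output:indent "yes";
(: p is exempt from indentation, so no whitespace is added inside it :)
declare option output:suppress-indentation "p";

<div><p><b>Number:</b><i>1</i></p></div>
```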
undeclare-prefixes
When using XML 1.1, use yes to instruct the processor to insert namespace “undeclarations,” e.g., xmlns:prod="", to indicate that a namespace declaration is no longer in scope. This is rarely, if ever, necessary. Valid values are yes or no; the default is no.
use-character-maps
A list of characters and associated replacement strings used to perform character substitution. Each character is replaced by the replacement string during serialization. This is most often used when the output is similar to HTML or XML but needs to be non-well-formed, for example JSP pages or HTML that contains scripts.
version
The version of the output format. If the output method is xml or xhtml, the value will be 1.0 or 1.1, corresponding to the version of XML. For the html output method, it refers to the version of HTML (a number between 1.0 and 5.0), although that can also be specified in the separate html-version parameter.
Starting in version 3.0, serialization parameters can be specified in a query prolog by using an option declaration. The syntax of an option declaration is shown in Figure 13-2.
Options have namespace-qualified names, which means that the prefixes used must be declared, and processors recognize them by their namespace. All the serialization parameters have the namespace http://www.w3.org/2010/xslt-xquery-serialization, and the local name is the name used in the previous section, e.g., version or byte-order-mark. When an option declaration uses a name in the serialization namespace, it is also known as an output declaration. Output declarations can only appear in a main module, not in a library module.
Example 13-1 shows how you might use an option declaration to specify serialization parameters. The namespace declaration binds the serialization namespace to the prefix output. This is required because it is not a predeclared namespace. All the serialization parameter values are in quotes, even the numeric ones like html-version.
xquery version "3.0";
declare namespace output = "http://www.w3.org/2010/xslt-xquery-serialization";
declare namespace prod = "http://datypic.com/prod";
declare option output:method "xml";
declare option output:indent "yes";
declare option output:cdata-section-elements "prod:desc prod:name";
declare option output:omit-xml-declaration "yes";
The settings have to be unique. For example, it is not allowed to have two different output declarations for the output:indent parameter. The use-character-maps parameter is the only one that cannot be specified using an output declaration, because it has a complex structure. It must be specified using a separate XML document, as described in the next section.
Option declarations are not just for serialization parameters, although that is their most common use. They can be used for other implementation-defined settings, as described in “The Option Declaration”.
Serialization parameters can also be specified in a separate XML document, whose URI is provided as the value of output:parameter-document in the prolog, as shown in Example 13-2. If a relative URI is provided, it is evaluated relative to the static base URI.
xquery version "3.0";
declare namespace output = "http://www.w3.org/2010/xslt-xquery-serialization";
declare option output:parameter-document "parameters.xml";
Example 13-3 shows an example of the XML structure of the parameters document. It consists of a root output:serialization-parameters element with a child for each serialization parameter that needs to be specified. The output:use-character-maps element has one output:character-map child for each character in the map.
<output:serialization-parameters
    xmlns:output="http://www.w3.org/2010/xslt-xquery-serialization"
    xmlns:prod="http://datypic.com/prod">
  <output:method value="xml"/>
  <output:version value="1.0"/>
  <output:indent value="yes"/>
  <output:cdata-section-elements value="prod:desc prod:name"/>
  <output:use-character-maps>
    <output:character-map character="«" map-string="&lt;%"/>
    <output:character-map character="»" map-string="%>"/>
  </output:use-character-maps>
</output:serialization-parameters>
As with output declarations, the settings in the separate XML document have to be unique. The characters used in the character map also have to be unique. Additional implementation-defined serialization parameters can be specified using XML elements in other namespaces.
Two functions, serialize and transform, allow you to specify serialization parameters as a map. For the serialize function, the map is the second argument ($params). For the transform function, the map is passed as the value of a map entry whose key is serialization-params as part of its $options argument.
The map can have an entry for each of the serialization parameters, where the key is the parameter name as a string, with the following valid values:
For serialization parameters that allow yes or no (allow-duplicate-names, byte-order-mark, escape-uri-attributes, include-content-type, indent, omit-xml-declaration, standalone, undeclare-prefixes), an xs:boolean value where true() represents yes and false() represents no. For the standalone parameter, the empty sequence represents omit.
For serialization parameters that allow decimal numbers (html-version), an xs:decimal value.
For serialization parameters that allow a list of QNames (cdata-section-elements, suppress-indentation), a sequence of xs:QName values.
For method and json-node-output-method, an xs:string or xs:QName value, as appropriate.
For use-character-maps, a map with one entry per character to be mapped, where the key is the character, and the value is the string to be substituted (both xs:string values).
For all other serialization parameters, an xs:string value.
Additional implementation-defined entries can appear in the map. They are identified with keys that are namespace-qualified names (QNames) that are in an implementation-defined namespace.
Example 13-4 shows a parameter map that is equivalent to the XML document in the previous example. The map is then used in a call to the serialize function.
declare namespace prod = "http://datypic.com/prod";
let $map := map {
  "method": "xml",
  "version": "1.0",
  "indent": true(),
  "cdata-section-elements": (xs:QName("prod:desc"), xs:QName("prod:name")),
  "use-character-maps": map { "«": "<%", "»": "%>" }
}
let $element := <prod:name>Fleece Pullover</prod:name>
return serialize($element, $map)
Errors occasionally occur during serialization. They may be the result of conflicting serialization parameters or a query that returns results that cannot be serialized. For example:
doc("catalog.xml")//@dept
is a perfectly valid query, but it will return a sequence of attribute nodes. This cannot be serialized using the xml output method and will raise error SENR0001. Serialization errors all start with the letters SE and are listed in Appendix C.
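When serialization is performed inside the query with the serialize function, such an error can be caught like any other dynamic error. The following is a sketch that assumes a version 3.1 processor; the inline product element stands in for the catalog document, and the message text is invented for the example.

```xquery
xquery version "3.1";

declare namespace err = "http://www.w3.org/2005/xqt-errors";

try {
  (: a parentless attribute node cannot be serialized with the xml method :)
  serialize(<product dept="ACC"/>/@dept, map { "method": "xml" })
} catch err:SENR0001 {
  "Cannot serialize an attribute node: " || $err:description
}
```

Errors raised when the processor serializes the final query result to a file occur outside the query and cannot be caught this way.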
So far this section has described the serialization of the results of the query to a file. It is also possible to serialize a sequence of items to a string by using the serialize function. For example, the function call:
serialize(doc("catalog.xml")//product[1]/name)
will return the string:
"<name language="en">Fleece Pullover</name>"
The two-argument version of this function allows you to specify values for the serialization parameters. The second argument can take the form of an XML document as described in “Specifying Serialization Parameters by Using a Separate XML Document”, or a map that contains values for the parameters. This function is described in more detail, with examples, in Appendix A, within the “serialize” section.
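As a brief sketch of the XML form of the second argument, the call below passes an inline output:serialization-parameters element. The literal name element stands in for the node selected from catalog.xml in the earlier example.

```xquery
xquery version "3.0";

declare namespace output = "http://www.w3.org/2010/xslt-xquery-serialization";

(: the parameters element uses the same structure as a parameters document :)
let $params :=
  <output:serialization-parameters>
    <output:method value="xml"/>
    <output:omit-xml-declaration value="yes"/>
    <output:indent value="no"/>
  </output:serialization-parameters>
return serialize(<name language="en">Fleece Pullover</name>, $params)
```

The map form shown in Example 13-4 is usually more convenient in version 3.1, while the element form also works with version 3.0 processors.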