Chapter 13. Inputs and Outputs

So far in this book, the examples have assumed that the inputs to the query are individual XML documents that are accessed through the doc function. This chapter goes into further detail about the various options for accessing input documents in XQuery. It also describes output documents, including the many different options for serializing query results to files.

Types of Input and Output Documents

While XQuery is typically associated with querying and returning XML, it can also work with text files and JSON documents.

XML

XML documents are by far the most common input to XQuery. Technically, the input might not be an entire XML document; it might be a document fragment, such as an element or sequence of elements, possibly with children. It might not be a physical XML file at all; it might be data retrieved from an XML database, or an in-memory XML representation that was generated from non-XML data.

If the input document is physically stored in XML syntax, it must be well-formed XML. This means that it must comply with XML syntax rules, such as that every start tag has an end tag, there is no overlap among elements, and special characters are used appropriately. It must also use namespaces appropriately. This means that if colons are used in element or attribute names, the part before the colon must be a prefix that is bound to a namespace using a namespace declaration.

Whether it is physically stored as an XML document or not, an input document must conform to other constraints on XML documents. For example, an element cannot have two attributes with the same name, and element and attribute names cannot contain special characters other than hyphens, underscores, and periods.

Text

Starting in version 3.0, it is possible to open text documents with XQuery. This may include comma- or tab-delimited formats, fixed-length formats, property files containing name/value pairs, or any other structured text format. The regular expression functions such as tokenize and analyze-string make it relatively straightforward to parse these text files into an instance of the XQuery data model that can be queried.
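As a minimal sketch of this approach, the following query splits comma-delimited text into lines and fields, then builds an element for each line. The sample data is an inline string invented for illustration; a real query would more likely read the text from a file with the unparsed-text function described later in this chapter.

```xquery
xquery version "3.0";
(: Hypothetical comma-delimited data; &#10; is a line feed character :)
let $text := "557,Fleece Pullover&#10;563,Floppy Sun Hat"
(: Split the text into lines, then split each line into fields :)
for $line in tokenize($text, "\n")
let $fields := tokenize($line, ",")
return <product number="{$fields[1]}">{$fields[2]}</product>
```

Once the text has been turned into elements like these, it can be queried with ordinary path expressions.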

JSON

Starting in version 3.1, maps and arrays are added to the XQuery data model, along with the ability to parse JSON documents into these new item types. As with XML, there is no particular requirement that the input document be physically stored in JSON syntax; it could be passed in memory from another process.
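The following sketch shows the idea using the parse-json function on a JSON string. The JSON content and its key names are made up for this example, and the lookup operator (?) is used to pull values out of the resulting map and array.

```xquery
xquery version "3.1";
(: Hypothetical JSON text; in practice it might come from a file or another process :)
let $order := parse-json('{"number": 557, "depts": ["ACC", "WMN"]}')
(: ?number looks up a map entry; ?1 selects the first member of the array :)
return ($order?number, $order?depts?1)
```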

Accessing Input Documents

A single query can access many input documents, and there are several ways to access them from within a query, which are described in this section. Most of these methods have some implementation-defined aspects, so you should consult the documentation for your XQuery implementation to determine which of them are appropriate for accessing input documents.

Accessing a Single Document with a Function

The doc function can be used to open one XML input document based on its URI. It takes as an argument a single URI as a string, and returns the document node of the resource associated with the specified URI. To test if an XML document exists before accessing it, you can call the doc-available function, which returns true if you can access the document (without errors) via the doc function.

For text documents, the unparsed-text function will open a text document and return the contents of that document as a string. The function unparsed-text-available can be used to test for the existence of a text document and to ensure there are no encoding or other errors that might arise when accessing it.

The json-doc function is used to access JSON documents. It typically returns a map that represents the outermost JSON object. There is no separate function for testing for the existence of JSON documents, but the unparsed-text-available function can be used for this purpose, since a JSON document is a text file.
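A query might guard its input access along the following lines. The filenames order.xml and order.json are placeholders, and how relative URIs are resolved depends on your processor. Note that unparsed-text-available only confirms that the resource can be read as text, not that it contains valid JSON.

```xquery
xquery version "3.1";
(: Open each hypothetical input only if it appears to be accessible :)
let $xml := if (doc-available("order.xml"))
            then doc("order.xml")
            else ()
let $json := if (unparsed-text-available("order.json"))
             then json-doc("order.json")
             else ()
(: Report which of the two inputs could be opened :)
return (exists($xml), exists($json))
```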

Resolving URIs of input documents

The doc, json-doc, and unparsed-text functions return a single resource associated with the specified URI. For example:

doc("http://datypic.com/input/order.xml")

returns the document node of the document whose URI is http://datypic.com/input/order.xml. Relative URI references are also allowed, as in:

doc("order.xml")

If the specified URI is a relative URI, it is resolved based on the static base URI. The static base URI may be set by the processor outside the scope of the query, typically to the URI of the file from which the query was read. Alternatively, it may be declared in the query prolog by using a base URI declaration.
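For instance, a base URI declaration might look like the following sketch. The base URI shown is only an illustration; with it in place, the relative reference order.xml would be resolved to http://datypic.com/input/order.xml.

```xquery
xquery version "3.0";
(: Illustrative base URI; relative URI references are resolved against it :)
declare base-uri "http://datypic.com/input/";
doc("order.xml")
```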

If you are accessing documents on a filesystem, your implementation may require you to precede the filename with file:///, use forward slashes to separate directory names, and/or escape each space in the filename with %20. For example:

doc("file:///C:/Documents%20and%20Settings/my%20order.xml")

Processors interpret the URI of the input document in different ways. Some, like Saxon, will dereference the URI, that is, go out to the URL and retrieve the resource at that location. Other implementations, such as those embedded in XML databases, consider the URIs to be just names. The processor might take the name and look it up in an internal catalog to find the document associated with that name. Many processors provide user hooks or configuration options allowing the behavior to be controlled by the application, and the result may also depend on the configuration of the underlying environment (for example, HTTP proxy settings).

Implementations also have some leeway in how they handle errors when retrieving documents, how they handle different MIME types, and whether they validate XML documents against a schema or DTD.

Accessing a Collection

The collection function returns the items that make up a collection, identified by a URI. Traditionally, collections have contained only document nodes. However, starting in version 3.1, they can contain any items, for example, strings, or maps retrieved from JSON documents.

Exactly how the URI is associated with the items is defined by the implementation. For example, one implementation might accept a URI that is the name of a directory on a filesystem, and return the document nodes of the XML documents stored in that directory. Another implementation might associate a URI with a particular database or subset of a database. A third might allow you to specify the URI of an XML document that contains a sort of “manifest”—a list of URIs for the XML documents in the collection.

The function takes as an argument a single URI. For example, the function call:

collection("http://datypic.com/orders")

might return all the document nodes of the XML documents associated with the collection http://datypic.com/orders. It is also possible to use the function without any parameters, as in collection(), to retrieve a default collection as defined by the implementation.

To get a list of the document URIs in a collection, you can use the uri-collection function. It provides a more efficient way of determining what is in a collection without having to retrieve the whole collection, which may take time and memory.
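One possible use of uri-collection is to filter the URIs before opening any documents, as in the following sketch. It assumes that the implementation associates the example URI with a collection and that the URIs it returns can be opened with the doc function, neither of which is guaranteed.

```xquery
xquery version "3.0";
(: Whether this URI identifies a collection is implementation-defined :)
for $uri in uri-collection("http://datypic.com/orders")
(: Open only the resources whose URIs end in .xml :)
where ends-with(string($uri), ".xml")
return doc($uri)
```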

Setting the Context Outside the Query

Another way to access input is to rely on the context being set by the processor outside the query. In this case, it may not be necessary to use functions like doc or collection, unless you want to open secondary data sources.

For example, a hypothetical XQuery implementation might allow you to set the context node in the Java code that executes the query, as in:

Document catalogDocument = new Document(new File("catalog.xml"));
String query = "catalog/product[@dept = 'ACC']";
List productElements = catalogDocument.evaluateXQuery(query);

In that case, the XQuery expression catalog/product might be evaluated in the context of the catalog document node itself. If the processor had not set the context node, a path expression starting with catalog/product would not be valid.

Another implementation might allow you to select a document to query in a user interface, in which case, it uses that document as the context node.

Using Variables

The processor can bind external variables to input documents or document fragments. These variables can then be used in the query to access the input document. For example, an implementation might allow an external variable named $input to be defined, and allow the user to specify a document to be associated with that variable. The hypothetical query processor could be invoked from the command line using:

xquery -input catalog.xml

and the query could use expressions like $input/catalog/product to retrieve the product elements. The name $input is provided as an example; the implementation could use any name for the variable, or require you to declare it as an external variable in your query.
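If the implementation requires the variable to be declared, the query might start as in the following sketch. The type document-node() is an assumption about what the processor binds to the variable.

```xquery
xquery version "3.0";
(: The value of $input is supplied by the processor outside the query :)
declare variable $input as document-node() external;
$input/catalog/product
```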

Setting the Context in the Prolog

It is also possible to set the context for a query by using a context item declaration in the prolog. This is essentially a way of identifying the main input source for the query. This method is used in conjunction with, not instead of, the other methods of accessing input documents described in this section. For example:

declare context item := doc("catalog.xml");
for $prod in catalog/product
return $prod/number

In this case, the context is set to the document node of the catalog.xml document, so the XPath catalog/product is evaluated relative to that. The syntax of the context item declaration is shown in Figure 13-1.

Figure 13-1. Syntax of a context item declaration

Using the external keyword means that the context is set by the processor outside the query. However, it is possible to specify a default value that is used when no context is set. For example:

declare context item external := doc("catalog.xml");
for $prod in catalog/product
return $prod/number

A context item declaration can only appear in a main module (not in a library module) and there can only be one of them.
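A context item declaration can also state the expected type of the context item. The following sketch combines a type, the external keyword, and a default value; the choice of document-node() as the type is just an example.

```xquery
xquery version "3.0";
(: The context item must be a document node; catalog.xml is used if none is supplied :)
declare context item as document-node() external := doc("catalog.xml");
catalog/product/number
```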

Serializing Output

Serialization is the process of writing out the results of a query, whether as an XML document, a JSON document, a text file, or even a string representing XML or JSON syntax. In your query, you construct (or select) a number of items (for example, XML elements and attributes) to include in the results. These results conform to the data model described in “The XQuery Data Model”. However, the data model does not define the details of the output syntax and format to be used, for example whether to indent the output or how to encode special characters. Some of these details can be controlled with serialization parameters.

Serialization Methods

Six different serialization methods identify the desired overall output format. Implementations may support additional methods, which are expressed as a qualified name in an implementation-defined namespace. (Implementations are not required to support serialization at all, but most do.)

xml

The default method is xml, which writes out an XML document. Typically this is a well-formed XML document with a single root element, but it is also possible to serialize a fragment of an XML document, for example a sequence of several elements.

xhtml

The xhtml output method indicates that the output should be XHTML, meaning a well-formed XML document that uses HTML tags, but that has certain differences from the XML output to ensure browser compatibility, namely:

  • A meta element is inserted that indicates the content type.

  • Elements that should always be empty, like br and hr, are serialized using a minimized tag, but with a space before the forward slash, as in <hr />.

  • Elements that happen to be empty but could contain content according to the HTML specification are not serialized with a minimized tag, and instead have separate start and end tags, as in <p></p>.

  • Indentation is suppressed inside mixed content elements such as p or td. This avoids any inadvertent whitespace being introduced.

html

The html output method indicates that the output should be HTML, and is not necessarily well-formed XML. The differences from the xhtml output method are:

  • Elements that should always be empty, like br and hr, are serialized with a start tag and no matching end tag, as in <hr>.

  • Content of the script and style elements is not escaped, even if it normally would be in well-formed XML. For example, the script element can contain an unescaped < (less than) character.

  • Attributes defined as boolean attributes in the HTML specification are output using their minimized form, with an attribute name but no value. For example, <input type="checkbox" checked="checked"> would be output as <input type="checkbox" checked> because checked is a boolean attribute.

  • Predefined HTML character entity references may be used in the output, for example &nbsp; instead of &#160; or the non-breaking space character itself.

  • No XML declaration is included.

text

When using the text output method, the results are converted to their string value. This output method would typically be used to output atomic values (especially strings). The results can include XML structures, but the output will only include the content of the elements, not their tags or attributes. For example, if the results of the query are:

<desc lang="en">Our <i>favorite</i> shirt!</desc>

The output will be:

Our favorite shirt!

Most of the serialization parameters described in the next section do not apply to the text output method. The ones that are relevant are byte-order-mark, encoding, item-separator, media-type, normalization-form, and use-character-maps.

json

If the json output method is chosen, the output will be in JSON syntax. Maps in XQuery will be serialized to JSON objects, and arrays will be serialized to JSON arrays. More detail on serializing JSON is provided in “JSON”.

Most of the serialization parameters described in the next section do not apply to the json output method. The ones that are relevant are allow-duplicate-names, byte-order-mark, encoding, indent, json-node-output-method, media-type, normalization-form, and use-character-maps.
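As a rough illustration, the following call uses the serialize function (covered later in this chapter) to produce a JSON string from a map. The map content is invented for this example. The order in which the entries of a map are written is not guaranteed, so no particular output is shown here.

```xquery
xquery version "3.1";
(: The map becomes a JSON object and the array becomes a JSON array :)
serialize(
  map { "number": 557, "depts": array { "ACC", "WMN" } },
  map { "method": "json", "indent": false() }
)
```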

adaptive

The purpose of the adaptive output method is to allow any instance of the XQuery data model to be serialized without errors. This is useful, for example, for debugging, or for displaying the results of a query to a user.

It uses a combination of the other output methods, depending on the item type. It serializes XML nodes as it would using the xml output method. If it encounters attribute nodes or namespace nodes with no parent, it serializes them as strings. It serializes arrays and maps using the json output method, with a few minor exceptions to avoid raising serialization errors, such as allowing the value INF for numbers.

The adaptive output method also allows for the serialization of function items, which is not supported in any other output method. These are serialized as strings that contain just their name and their arity, for example exists#1. If it is an anonymous function, the string (anonymous) is used in place of the name, for example, (anonymous)#2.

Implementations can also choose to output additional debugging information, such as the types of values, or information about non-fatal serialization errors.
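The following sketch passes a mixed sequence to the serialize function with the adaptive method. The items are arbitrary examples. A sequence like this would raise errors under the xml or json output methods; the exact text produced by the adaptive method can vary by implementation.

```xquery
xquery version "3.1";
(: An element, two atomic values, a map, and a function item in one sequence :)
serialize(
  (<a/>, 1, "text", map { "k": true() }, exists#1),
  map { "method": "adaptive" }
)
```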

Serialization Parameters

Serialization parameters can be used to control certain syntactic features of the output, such as the encoding used, whether an XML declaration is included at the beginning of an XML document, or whether the results are indented. This section describes the serialization parameters. The next two sections show how to set the values for serialization parameters in your query.

Support for serialization parameters is somewhat implementation-dependent, since not all implementations have to support all parameters, and they can define their own default values and even their own parameters (using names in an implementation-defined namespace). (Default values listed in this section are the ones recommended by XQuery, but may be overridden by your implementation.)

Wherever the value yes is listed below, the equivalent value true or 1 can be used instead. Likewise, wherever the value no is listed below, the equivalent value false or 0 can be used instead.

allow-duplicate-names

Whether to allow duplicate names in the same object in JSON output. Valid values are yes or no; the default is no. If it is set to no, a serialization error will be raised if a map contains two keys with the same string value. This could arise if the keys have different typed values in the XQuery map, for example the string 1 and the integer 1.
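The situation described above could be reproduced with a sketch like the following, where the map is a contrived example. The keys "1" and 1 are distinct in XQuery but would both be written as the name "1" in JSON, so the call is expected to succeed only because allow-duplicate-names is set to true.

```xquery
xquery version "3.1";
(: A string key and an integer key that serialize to the same JSON name :)
serialize(
  map { "1": "a", 1: "b" },
  map { "method": "json", "allow-duplicate-names": true() }
)
```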

byte-order-mark

Whether a byte order mark (BOM) should precede the serialized results. Valid values are yes or no.

cdata-section-elements

A space-separated list of qualified element names whose contents should be enclosed in a CDATA section in XML output. For example, if your DocBook output contains some escaped XML examples, the output might be written as:

<programlisting>&lt;p>Example p content&lt;/p></programlisting>

with escaped less-than characters. If you list the programlisting element in the cdata-section-elements parameter, the output will be more readable:

<programlisting><![CDATA[<p>Example p content</p>]]></programlisting>

The output XML document is exactly equivalent; CDATA sections are merely used for convenience when editing documents.
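One way to try this out is with the serialize function and a parameter map, both described later in this chapter. In the sketch below, the programlisting element is in no namespace, which is an assumption made to keep the example short.

```xquery
xquery version "3.1";
(: Request a CDATA section for the content of programlisting elements :)
serialize(
  <programlisting>&lt;p>Example p content&lt;/p></programlisting>,
  map { "cdata-section-elements": xs:QName("programlisting") }
)
```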

doctype-public

A public identifier to be included in a document type declaration of the output. For example, specifying -//NLM//DTD Journal Publishing DTD v2.3 20070202//EN as the value of this parameter and journalpublishing.dtd as the value of the doctype-system parameter will result in the following line appearing at the beginning of the output (assuming article is the outermost element):

<!DOCTYPE article 
PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" 
       "journalpublishing.dtd">

This parameter can be used alone (if the output method is html), or in conjunction with the doctype-system parameter.

doctype-system

A system identifier to be included in a document type declaration of the output, to indicate the location of a DTD. For example, specifying journalpublishing.dtd as the value of this parameter will result in the following line appearing at the beginning of the output (assuming article is the outermost element):

<!DOCTYPE article SYSTEM "journalpublishing.dtd">

This parameter can be used alone, or in conjunction with the doctype-public parameter.

encoding

The encoding to be used for the results. The default is either UTF-8 or UTF-16 (which one is implementation-defined). Implementations are required to support at least these two values, and some support other values.

For example, Saxon supports the encoding US-ASCII, which causes all non-ASCII characters to be output using character references. Using this encoding, the character © would appear as &#xa9; in the results.

escape-uri-attributes

Whether to perform URI escaping on attributes that are defined in (X)HTML to be URIs, such as href and src. Values are yes or no; the default is yes. This parameter applies only to html and xhtml output types.

html-version

A decimal number between 1.0 and 5.0 indicating the version of HTML. If the output method is xhtml, this parameter refers to the version of HTML, while the version parameter indicates the version of XML. If the output method is html, this value is used for the version of HTML if this parameter is specified, otherwise the version parameter indicates the version of HTML.

include-content-type

Whether to include a meta element that specifies the content type, as in:

<meta http-equiv="Content-Type" content="text/html; charset=EUC-JP" />

Valid values are yes or no; the default is yes. This parameter applies only to html and xhtml output types.

indent

Whether to pretty-print the results, i.e., put line breaks and indenting spaces between elements. Valid values are yes or no; the default is no.

Indentation can be affected by the output method and by any schemas that you are using with queries. No indentation will be performed on elements that can have mixed content (that is, that allow both children and text content) according to a schema against which the document has been validated. If the output method is xhtml or html, the processor knows which elements can contain mixed content without a schema, and it suppresses indentation for those. It also does not indent elements such as pre and script where whitespace can be significant.

However, if no schema is used and the output method is xml, indentation should be avoided for documents that contain mixed content, because unintended whitespace can be inserted. It is also possible to turn indenting off only for specific elements using the suppress-indentation parameter.

item-separator

A string to be inserted between every top-level item in the sequence being serialized. This applies to all items (including, for example, element nodes), but it is most useful for text nodes and atomic values. For example, specifying , (a comma) for this parameter will cause the result of the query:

doc("catalog.xml")//number/string(.)

to be serialized as 557,563,443,784. Otherwise, it will be output as 557 563 443 784, with spaces inserted between the values.
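Using the output declarations described later in this chapter, a complete query along these lines might look like the following sketch. It assumes that your processor supports version 3.1 and honors output declarations.

```xquery
xquery version "3.1";
declare namespace output = "http://www.w3.org/2010/xslt-xquery-serialization";
(: Write plain text, with a comma between the top-level items :)
declare option output:method "text";
declare option output:item-separator ",";
doc("catalog.xml")//number/string(.)
```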

json-node-output-method

The method to use to serialize XML nodes into strings in JSON output, namely one of xml, xhtml, html, or text. The default is xml for most implementations. This is described in more detail in “Serializing JSON”. Some implementations will support additional values expressed as QNames.

media-type

The media type (MIME type), as defined by RFC 2046. The charset parameter should not be included. This might be used, for example, if the serialized result were being sent over HTTP. The processor could choose to set the media type in the HTTP header to this value.

method

The type of output, namely one of xml, xhtml, html, text, json, or adaptive. The default is xml for most implementations. Some implementations will support additional values expressed as QNames.

normalization-form

Valid values are NFC, NFD, NFKC, NFKD, fully-normalized, none, or an implementation-defined value. Unicode normalization is discussed in “Unicode Normalization”.

omit-xml-declaration

Whether to omit the XML declaration in the output. Valid values are yes or no. The XML declaration is the optional first line of an XML document that looks like this:

<?xml version="1.0" encoding="UTF-8"?>

standalone

The value of the standalone parameter of the XML declaration. Valid values are yes, no, or omit. Use omit to not include the standalone parameter.

suppress-indentation

A space-separated list of qualified element names for which to suppress indentation in the output. This is useful for elements that can contain mixed content, where unintended whitespace could be inserted if indentation is used. For example, the following p element has no whitespace within it:

<p><b>Number:</b><i>1</i></p>

However, when serialized with indentation, the result will be:

<p>
  <b>Number:</b>
  <i>1</i>
</p>

There is now whitespace between the colon and the number 1, which would appear in an HTML page as Number: 1 instead of Number:1. To avoid this, the element name p could be specified in the suppress-indentation parameter.

A safer, broader approach for output that could contain mixed content is to set the indent parameter to no, thus suppressing indentation for all elements.
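The effect of suppress-indentation can be explored with a sketch like the following, which uses the serialize function and a parameter map as described later in this chapter. The enclosing div element is added here only so that there is something left to indent; exactly how the indented elements are laid out depends on the processor.

```xquery
xquery version "3.1";
(: Indent the output in general, but leave the content of p untouched :)
serialize(
  <div><p><b>Number:</b><i>1</i></p></div>,
  map {
    "indent": true(),
    "suppress-indentation": xs:QName("p")
  }
)
```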

undeclare-prefixes

When using XML 1.1, use yes to instruct the processor to insert namespace “undeclarations,” e.g., xmlns:prod="", to indicate that a namespace declaration is no longer in scope. This is rarely, if ever, necessary. Valid values are yes or no; the default is no.

use-character-maps

A list of characters and associated replacement strings used to perform character substitution. Each character is replaced by the replacement string during serialization. This is most often used when the output is similar to HTML or XML but needs to be non-well-formed, for example JSP pages or HTML that contains scripts.

version

The version of the output format. If the output method is xml or xhtml, the value will be 1.0 or 1.1, corresponding to the version of XML. For the html output method, it refers to the version of HTML (a number between 1.0 and 5.0), although that can also be specified in the separate html-version parameter.

Specifying Serialization Parameters by Using Option Declarations

Starting in version 3.0, serialization parameters can be specified in a query prolog by using an option declaration. The syntax of an option declaration is shown in Figure 13-2.

Figure 13-2. Syntax of an option declaration

Options have namespace-qualified names, which means that the prefixes used must be declared, and processors recognize them by their namespace. All the serialization parameters have the namespace http://www.w3.org/2010/xslt-xquery-serialization, and the local name is the name used in the previous section, e.g., version or byte-order-mark. When an option declaration uses a name in the serialization namespace, it is also known as an output declaration. Output declarations can only appear in a main module, not in a library module.

Example 13-1 shows how you might use an option declaration to specify serialization parameters. The namespace declaration binds the serialization namespace to the prefix output. This is required because it is not a predeclared namespace. All the serialization parameter values are in quotes, even the numeric ones like html-version.

Example 13-1. Output declarations
xquery version "3.0";
declare namespace output = "http://www.w3.org/2010/xslt-xquery-serialization"; 
declare namespace prod = "http://datypic.com/prod"; 
declare option output:method "xml"; 
declare option output:indent "yes"; 
declare option output:cdata-section-elements "prod:desc prod:name"; 
declare option output:omit-xml-declaration "yes";

The settings have to be unique. For example, it is not allowed to have two different output declarations for the output:indent parameter. The use-character-maps parameter is the only one that cannot be specified using an output declaration, because it has a complex structure. It must be specified using a separate XML document, as described in the next section.

Option declarations are not just for serialization parameters, although that is their most common use. They can be used for other implementation-defined settings, as described in “The Option Declaration”.

Specifying Serialization Parameters by Using a Separate XML Document

Serialization parameters can also be specified in a separate XML document, whose URI is provided as the value of output:parameter-document in the prolog, as shown in Example 13-2. If a relative URI is provided, it is evaluated relative to the static base URI.

Example 13-2. Specifying the name of a parameters document
xquery version "3.0";
declare namespace output = "http://www.w3.org/2010/xslt-xquery-serialization"; 
declare option output:parameter-document "parameters.xml";

Example 13-3 shows an example of the XML structure of the parameters document. It consists of a root output:serialization-parameters element with a child for each serialization parameter that needs to be specified. The output:use-character-maps element has one output:character-map child for each character in the map.

Example 13-3. Serialization parameters in a separate XML document (parameters.xml)
<output:serialization-parameters
   xmlns:output="http://www.w3.org/2010/xslt-xquery-serialization"
   xmlns:prod="http://datypic.com/prod">
  <output:method value="xml"/>
  <output:version value="1.0"/>
  <output:indent value="yes"/>
  <output:cdata-section-elements value="prod:desc prod:name"/>
  <output:use-character-maps>
    <output:character-map character="«" map-string="&lt;%"/>
    <output:character-map character="»" map-string="%&gt;"/>
  </output:use-character-maps>
</output:serialization-parameters>

As with output declarations, the settings in the separate XML document have to be unique. The characters used in the character map also have to be unique. Additional implementation-defined serialization parameters can be specified using XML elements in other namespaces.

Specifying Serialization Parameters by Using a Map

Two functions, serialize and transform, allow you to specify serialization parameters as a map. For the serialize function, the map is the second argument ($params). For the transform function, the map is passed as the value of a map entry whose key is serialization-params as part of its $options argument.

The map can have an entry for each of the serialization parameters, where the key is the parameter name as a string, with the following valid values:

  • For serialization parameters that allow yes or no (allow-duplicate-names, byte-order-mark, escape-uri-attributes, include-content-type, indent, omit-xml-declaration, standalone, undeclare-prefixes), an xs:boolean value where true() represents yes and false() represents no. For the standalone parameter, the empty sequence represents omit.

  • For serialization parameters that allow decimal numbers (html-version), an xs:decimal value.

  • For serialization parameters that allow a list of QNames (cdata-section-elements, suppress-indentation), a sequence of xs:QName values.

  • For method and json-node-output-method, an xs:string or xs:QName value, as appropriate.

  • For use-character-maps, a map with one entry per character to be mapped, where the key is the character, and the value is the string to be substituted (both xs:string values).

  • For all other serialization parameters, an xs:string value.

Additional implementation-defined entries can appear in the map. They are identified with keys that are namespace-qualified names (QNames) that are in an implementation-defined namespace.

Example 13-4 shows a parameter map that is equivalent to the XML document in the previous example. The map is then used in a call to the serialize function.

Example 13-4. Serialization parameters in a map
declare namespace prod = "http://datypic.com/prod"; 
let $map := map {
   "method": "xml",
   "version": "1.0",
   "indent": true(),
   "cdata-section-elements": (xs:QName("prod:desc"),xs:QName("prod:name")),
   "use-character-maps": map {
                           "«":"&lt;%",
                           "»":"%&gt;"
                         }
}
let $element := <prod:name>Fleece Pullover</prod:name>
return serialize($element,$map)

Serialization Errors

Errors occasionally occur during serialization. They may be the result of conflicting serialization parameters or a query that returns results that cannot be serialized. For example:

doc("catalog.xml")//@dept

is a perfectly valid query, but it will return a sequence of attribute nodes. This cannot be serialized using the xml output method and will raise error SENR0001. Serialization errors all start with the letters SE and are listed in Appendix C.
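When serialization is performed inside the query with the serialize function, such an error can be caught like any other dynamic error. The following sketch assumes a processor that supports try/catch expressions (version 3.0 or later); the attribute and the fallback message are invented for illustration.

```xquery
xquery version "3.1";
declare namespace err = "http://www.w3.org/2005/xqt-errors";
(: A parentless attribute node cannot be serialized with the xml method :)
try {
  serialize(attribute dept { "ACC" })
} catch err:SENR0001 {
  "The result contains an attribute node that cannot be serialized as XML"
}
```

An alternative is to avoid the error altogether by returning the string values of the attributes, or by wrapping them in an element.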

Serializing to a String

So far this section has described the serialization of the results of the query to a file. It is also possible to serialize a sequence of items to a string by using the serialize function. For example, the function call:

serialize(doc("catalog.xml")//product[1]/name)

will return the string:

"<name language="en">Fleece Pullover</name>"

The two-argument version of this function allows you to specify values for the serialization parameters. The second argument can take the form of an XML document as described in “Specifying Serialization Parameters by Using a Separate XML Document”, or a map that contains values for the parameters. This function is described in more detail, with examples, in Appendix A, within the “serialize” section.
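As a brief sketch of the two-argument form, the following call passes the parameters as an output:serialization-parameters element. The element being serialized is constructed inline to keep the example self-contained.

```xquery
xquery version "3.0";
declare namespace output = "http://www.w3.org/2010/xslt-xquery-serialization";
(: The second argument uses the same structure as a parameters document :)
serialize(
  <name language="en">Fleece Pullover</name>,
  <output:serialization-parameters>
    <output:omit-xml-declaration value="yes"/>
    <output:indent value="no"/>
  </output:serialization-parameters>
)
```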
