Documents are the heart of XML. Any amount of usable XML is presented as a document, often stored in a file. One of the very first things you must understand in order to use XML is how to create a well-formed document. In this section, we examine the syntactic components of a document, starting with the individual characters and looking at how they are viewed when building larger syntactic constructs. Then we look at the constructs defined for all documents by the XML recommendation.
The XML Specification defines a character as “an atomic unit of text as specified by ISO/IEC 10646.” (Remember, ISO/IEC 10646 is more commonly referred to as Unicode.) Of course, this explanation is exactly what you should say at a party if someone asks. One of the goals of both standardization and XML is to make documents easily understandable by platforms around the globe. As such, simple things like ASCII characters can become quite complex.
Regardless, the specification states that legal characters are “tab, carriage return, line feed,” along with the legal characters of Unicode and ISO/IEC 10646. If you were to write an XML parser, the topic of characters and standardization would be of incredible importance to you. For the rest of us, it’s usually enough to choose an XML parser that gets it right.
You can declare the character encoding used in an XML document using the optional XML declaration:
<?xml version="1.0" encoding="UTF-8"?>
For an external entity that is not a document itself, a variation of the XML declaration, called an encoding declaration, is used:
<?xml encoding="UTF-8"?>
More information on the XML declaration is provided in “The Document Prolog” later in this chapter. For now, let’s look at some of the most widely used character sets and encodings. (A character set that can be mapped into Unicode can be considered an encoding of Unicode, even if it does not directly support everything defined in Unicode.)
The American Standard Code for Information Interchange (ASCII) is a 7-bit text format (meaning that it takes a sequence of seven 1’s and 0’s to form a character). ASCII is understood by virtually every computer in use. Unicode extends ASCII, so the first 128 characters of Unicode coincide with the first 128 characters of ASCII.
The character set ISO-8859-1 is also known as Latin-1. The ISO-8859-1 set is very widely used as it contains support for most (but not all) Western European languages. The first 256 characters of Unicode are identical to ISO-8859-1 for compatibility reasons. The first 128 characters of ISO-8859-1 are identical to ASCII. The second 128 are a combination of control characters, special characters, and accented letters. ISO-8859-1 was inspired by DEC Multinational Character Set, but there are a few differences. There are also various ISO-8859-X sets with support for additional languages and characters.
UCS Transformation Format, 8-bit (UTF-8), is documented in IETF RFC 2279 by F. Yergeau. UTF-8 is the most popular complete encoding of Unicode.
UTF-8 extends ASCII to some degree. The first 128 positions of UTF-8 are transparently encoded to their ASCII counterparts. Since Unicode can supposedly support over 2 billion characters (way beyond 128), getting it to fit in a stream of discrete 8-bit bytes requires some encoding. UTF-8 solves this problem by representing each Unicode character with a unique sequence of bytes. In a UTF-8 stream, ASCII characters occupy only one byte, whereas all other characters are represented by two or more bytes in the stream. Your XML declaration using UTF-8 appears as follows:
<?xml version="1.0" encoding="UTF-8"?>
The most detailed information for dealing with UTF-8 encoding comes from the RFC.
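The variable-length behavior described above can be checked from Python. This is a minimal sketch assuming a modern Python 3 interpreter; the sample characters and the helper name utf8_length are chosen purely for illustration.

```python
# Sample characters: A, e-acute, the euro sign, and an emoji outside the
# 16-bit range. The selection is arbitrary and only for illustration.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]


def utf8_length(ch):
    """Return how many bytes one character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


for ch in samples:
    print("U+%04X occupies %d byte(s)" % (ord(ch), utf8_length(ch)))
```

An ASCII character produces the same single byte in both ASCII and UTF-8, which is why plain ASCII files are already valid UTF-8.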
The specification states “text consists of intermingled character data and markup.” The main point here is that every character within an XML document is either character data (the actual information content we’re most interested in, such as an address or item quantity), or it is markup (containing all of the special characters needed to create start tags, end tags, entities, comments, CDATA delimiters, DTDs, processing instructions, and declarations). All the characters together constitute text.
Character data in the content of elements is “any string of characters that does not contain the start-delimiter of any markup.” Clearly, it is important to know the difference between the two, since it is markup that allows our programs to interpret the character data correctly.
All markup begins with one of two characters: the less-than sign (<) and the ampersand (&). All markup that begins with the less-than sign ends with the greater-than sign (>), and markup that begins with an ampersand ends with a semi-colon (;). These are the only special characters you need to be aware of most of the time. In some situations, the single-quote (') and double-quote (") characters need special attention. This does not mean that your documents and data cannot include these characters, only that they require some special encoding in the XML text. Any Unicode character can be part of the character data.
One result is you’re unable to use literal special characters such as ampersands (&), or angle brackets (<, >) within your text. For example, the following would confound an XML processor:
<question>Is 5 < 7 ≤ 9?</question>
The text of the question element contains characters not allowed by the specification. The < is expected to start a new markup component, so the following space is interpreted as a syntax error. The less-than sign is used to start a variety of markup constructs, the most common of which are the element start and end tags. The ampersand is used to mark entity references.
In order to use these special characters within your XML document, you’ll need to encode them using entity or character references. To turn the example into proper XML, we need to use this:
<question>Is 5 &lt; 7 &#8804; 9?</question>
Entity references are discussed later in this chapter, although many of you who have worked with HTML will find them familiar as they include &apos; ('), &quot; ("), &lt; (<), &gt; (>), and &amp; (&). XML allows you to define your own entities as well, and they can contain more than a single character, but those five are defined by the XML specification and do not need to be defined specially for your documents. Character references are slightly different in that they specify individual Unicode characters without attempting to use mnemonic identifiers for them. A character reference you might have seen used in HTML would be something like &#174; (®, the registered trademark symbol). In XML, the numeric portion of the reference may be given using hexadecimal digits as well if the letter x is inserted between the sharp sign and the first digit. The reference &#xAE; also refers to the registered trademark symbol. Character references cannot be defined by authors, and they always refer to Unicode characters by the ordinal value assigned to them in the Unicode specification.
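To see both kinds of reference at work, the sketch below escapes a string with the standard library's xml.sax.saxutils.escape and then parses it back with xml.etree.ElementTree. It assumes Python 3; the sample sentence and the question element are illustrative only.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Text containing all three characters that escape() rewrites.
raw = "Is 5 < 7 & 7 > 3?"

# escape() replaces &, <, and > with the predefined entity references.
escaped = escape(raw)
print(escaped)

# Append a decimal and a hexadecimal character reference for the
# registered trademark symbol, then let the parser resolve everything.
doc = "<question>%s &#174; &#xAE;</question>" % escaped
text = ET.fromstring(doc).text
print(text)
```

The parser hands the application the original characters, so the escaping is purely a property of the serialized text.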
The XML specification defines several small lexical details, but perhaps one of the most important is the name. Names are tokens composed of some combination of legal characters including letters, digits, underscores, hyphens, periods, or colons; the first character of a name cannot be a digit, hyphen, or period. Names are used for naming anything that needs a name in XML, including element types, attributes, and entities. Some names cannot be used in day-to-day XML markup. First, names beginning with the string xml (in any mixture of upper- and lowercase) are “reserved for standardization in [the XML specification] or future versions of this specification.” Secondly, when naming your elements, you must avoid use of the colon (:), as it is the basis for XML namespaces (a method of prefixing element names with tokens to give them domain context). While the XML 1.0 specification allows colons in element and attribute names, the more recent Namespaces specification assigns a particular syntactic significance that constrains their use. In other words, if you’re defining a whole class of elements related specifically to books, such as bookTitle or bookAuthor, it’s better to use capitalization, hyphenation, or underscores to separate the words (such as book_title, book-title, or bookTitle) as opposed to using the colon, such as book:Title. Using an expression like book:Title leads XML processors to believe that you are referring to a Title element within the namespace whose URI is bound to the prefix book. Of course, it may be that Namespaces are appropriate for your application, in which case you should take the time to read the Namespaces specification very carefully and define any that are needed.
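The effect of the colon is easy to observe with a namespace-aware parser. The following sketch uses xml.etree.ElementTree from the Python 3 standard library; the namespace URI and the helper name parses are made up for illustration.

```python
import xml.etree.ElementTree as ET

# The URI is a placeholder; any string bound to the prefix would do.
doc = '<book:Title xmlns:book="http://example.com/ns/books">Python and XML</book:Title>'
root = ET.fromstring(doc)

# ElementTree reports the tag as {namespace-URI}local-name.
print(root.tag)


def parses(text):
    """Return True if the namespace-aware parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# A prefix with no namespace binding is rejected; an underscore is fine.
print(parses("<book:Title/>"))
print(parses("<book_title/>"))
```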
When working with XML-based markup languages, it can be difficult to know how to treat whitespace. For many applications, whitespace can be handled as just more normal character data, while this is not sufficient for others. The problem most often manifests itself when presentation to the user is being controlled by the application. While the XML specification does not attempt to solve the problem, it does provide a way to include a hint for processing tools and applications that the whitespace in a particular element should be preserved as given, rather than treated as malleable space.
The easiest way to visualize the problem is to consider the way program source code is most commonly presented in HTML. Most HTML authors wrap source code in a pre element:
<pre>
def hello():
    print "Hello, world!"
</pre>
This is certainly the easiest way to present source code in HTML. Now consider what happens if, instead of using a pre element, we use a paragraph, or p, element:
<p>
def hello():
    print "Hello, world!"
</p>
This creates a very different effect in most web browsers, typically causing the entire program text to be shown on a single line with only a single space separating each word, even though the example includes multiple lines and multiple adjacent spaces.
The solution looks simple, at least for HTML. Simply use a pre element when we want to preserve whitespace. This obvious solution unfortunately has an equally obvious problem: it only works for HTML, not for arbitrary XML-based markup languages. A solution is needed that also works for a non-HTML document like this:
<Poem>
  Ode to a node,
  Nested beneath its tree,
  Snug as a bug in its XML rug
  Dreaming of the W3C.
</Poem>
How is an XML tool to know that the line breaks and other presentation for a poem are significant?
The XML specification defines an attribute called xml:space that you can attach to an element to communicate to the application that whitespace should be preserved. It is the responsibility of the client application to act on this information and indeed preserve whitespace when handling or formatting the data. A typical compliant XML parser passes the whitespace from the document through to the application regardless of whether the xml:space attribute has been seen (in either the document or the schema). An application can use the attribute to determine just what manipulations it can perform on the document content.
The value of the xml:space attribute can be either default or preserve. If the value is default, the application is allowed to treat the whitespace in whatever way it normally would; the XML specification imposes no limitations on how the whitespace is affected in this case. If, however, the value is preserve, the application is expected to avoid interfering with the whitespace in the element to which the attribute applies, as well as all child elements, until it encounters a child that specifies a value for xml:space. At that point, the child’s value for xml:space takes precedence for itself and its descendants.
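The division of labor between parser and application might look like the sketch below. It assumes Python 3 and xml.etree.ElementTree, which reports the attribute under the reserved XML namespace URI; the display_text function is a hypothetical application-level rule, not something the specification defines, and it ignores inheritance from ancestors for brevity.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:space under the reserved "xml" namespace URI.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

preserved = ET.fromstring(
    '<Poem xml:space="preserve">\n  Ode to a node,\n  Nested beneath its tree,\n</Poem>'
)
plain = ET.fromstring("<Poem>\n  Ode to a node,\n  Nested beneath its tree,\n</Poem>")


def display_text(element):
    """Hypothetical rule: collapse whitespace unless xml:space says preserve."""
    text = element.text or ""
    if element.get(XML_SPACE) == "preserve":
        return text
    return " ".join(text.split())


# The parser delivers the line breaks either way; only the application's
# treatment of them differs.
print(repr(display_text(preserved)))
print(repr(display_text(plain)))
```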
The xml:space attribute can be used in a couple of different ways. The first is to simply include it in the document instance, which is sufficient for well-formed XML. The first line of our poem becomes:
<Poem xml:space="preserve">
While this seems reasonable for small quantities of XML text, it proves unworkable for large volumes of documents that are edited by humans. Think about what HTML would be like if we had to always include a special attribute to get the effect of the pre element! For this reason, the xml:space attribute is most often used by including it in the document schema. In a DTD, we would write something like this:
<!ATTLIST Poem xml:space (default|preserve) 'preserve'>
Attribute list declarations will be discussed in more detail in Section 2.6.3 later in this chapter.
From a practical point of view, most applications that parse XML look at the names of the elements to determine what to do with the character data contained therein. For example, while parsing the text of a book formatted in XML, you may come across a code element that tells you to preserve the whitespace within that section. If you look carefully, however, often the document type specifies that xml:space has a default value of preserve for those elements.
The specification is straightforward where end-of-line handling is concerned. An XML parser must pass characters to applications with normalized line endings. That is, the two-character sequence 0x0D 0x0A, or a standalone 0x0D character not followed by 0x0A, is converted to a single 0x0A character. For the less hexadecimal among us, it means that typical formatting codes such as \r\n and \r are converted to \n. And for those of you who have never used those weird backslash characters, it means that text coming from platforms that commonly use carriage-returns plus linefeed characters to terminate lines (such as Windows) is converted to use only linefeed characters.
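Line-ending normalization can be observed directly. This is a small sketch assuming Python 3 and the expat-based parser behind xml.etree.ElementTree; the lines element is an invented example.

```python
import xml.etree.ElementTree as ET

# One document mixing Windows (\r\n), bare carriage return (\r),
# and Unix (\n) line endings.
raw = b"<lines>one\r\ntwo\rthree\nfour</lines>"

text = ET.fromstring(raw).text

# Every line break reaches the application as a single \n.
print(repr(text))
```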
An attribute named xml:lang is provided by the specification and can be placed inside documents to indicate the language used in the content. Again, this attribute must be declared in valid documents, much like xml:space. The values that can be used within this attribute are defined in IETF RFC 1766, or in a later version. Most language codes have two letters, such as en for English, but dialects may be specified using a hyphen and an additional two-letter code; United States English can be specified as en-US, while the Queen’s English can be specified as en-GB.
An XML document contains a prolog, which includes everything that precedes the single element that is the document content. The prolog consists of an optional declaration called the XML declaration, followed by an optional Document Type Declaration, followed by any number (including zero!) of comments and processing instructions. So the prolog may be completely empty, but often contains the XML declaration as a matter of good form. The Document Type Declaration is required if the document is intended to conform to a DTD.
The XML declaration looks much like a processing instruction, but is slightly different because of a special purpose it serves. Since XML requires that all documents are Unicode (but does not constrain the encoding of the Unicode characters to bytes in the data stream that contains the document), there must be a way to determine that encoding. Some encodings can be recognized by the leading bytes of the data stream. A set of specific rules for determining the encoding from the leading bytes of the data stream is given as part of the XML recommendation. For many encodings, however, that is not possible. The XML specification states that in those cases where the encoding is not known a priori (as when the encoding is returned in the headers of an HTTP response), the document must be encoded in UTF-8 or include an XML declaration that specifies the encoding. The declaration always includes the version of the XML specification to which the document conforms (only XML 1.0 has been defined at this time). A typical XML declaration would look like:
<?xml version="1.0" encoding="iso-8859-1"?>
This declares that the document is encoded in the character set ISO 8859-1, more commonly known as Latin-1. It’s entirely legal to omit the encoding from the declaration as well, so the minimal declaration looks like this:
<?xml version="1.0"?>
I’m sure this already appears on coffee mugs.
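The role of the encoding declaration shows up as soon as a document contains a non-ASCII byte. The sketch below assumes Python 3 and xml.etree.ElementTree; the name element, the sample text, and the try_parse helper are illustrative only.

```python
import xml.etree.ElementTree as ET

text = "Jos\u00e9"  # contains one character outside ASCII

# The same Latin-1 bytes, once with and once without an encoding declaration.
declared = (
    '<?xml version="1.0" encoding="iso-8859-1"?><name>%s</name>' % text
).encode("iso-8859-1")
undeclared = ("<name>%s</name>" % text).encode("iso-8859-1")


def try_parse(data):
    """Return the element text, or None if the bytes are not well-formed."""
    try:
        return ET.fromstring(data).text
    except ET.ParseError:
        return None


print(try_parse(declared))
# Without a declaration the parser assumes UTF-8, and the lone Latin-1
# byte is not a valid UTF-8 sequence.
print(try_parse(undeclared))
```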
After the XML declaration, a Document Type Declaration may appear. Note that this is different from the Document Type Definition, although the first two words and obvious abbreviations are the same. To avoid confusion, the acronym “DTD” is never used to refer to this; it is usually called the “DOCTYPE declaration.” If given, this declaration specifies the name of the document element, and may specify both internal and external components of the DTD. Let’s look at the simplest form of this declaration:
<!DOCTYPE book>
This tells us that the document element is of the type named book, but nothing else; this is not very useful by itself. There are actually two additional components to this declaration, each of which is optional, but one or both must be provided for the declaration to be particularly useful. Let’s look at an example that contains both of these components:
<!DOCTYPE book SYSTEM "http://xml.example.com/dtds/book.dtd" [
  <!ENTITY myCompany "Super Mega Ultra Corporation">
]>
Here, we include a specification for an external subset of the DTD (the SYSTEM and the quoted string), and an internal subset enclosed in brackets.
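An entity declared in the internal subset becomes usable in the document body. The sketch below assumes Python 3 and xml.etree.ElementTree, which does not validate or fetch external subsets, so the SYSTEM identifier is deliberately left out; the sentence inside the book element is invented.

```python
import xml.etree.ElementTree as ET

# Only the internal subset is used here; the parser never retrieves
# an external DTD, so none is referenced.
doc = """<!DOCTYPE book [
  <!ENTITY myCompany "Super Mega Ultra Corporation">
]>
<book>Published by &myCompany;.</book>"""

book = ET.fromstring(doc)

# The entity reference has been replaced by its declared text.
print(book.text)
```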
If the Document Type Declaration is given, the name of the document type must match the name of the root element. If you declare your document type as <!DOCTYPE Tool [...]>, then your root element must be Tool. Furthermore, all the specific relationships in the DTD concerning nesting, character data, and attributes must be enforced against the document if it is to pass the test for validity.
If you decide to use both the internal and external subsets, the internal subset overrules the external. That is, the rules contained within the DTD inside your XML document prevail over rules for the same construct in an external DTD subset.
An element’s name communicates its type. The attributes contained within a start tag are not recognized in any particular order. The specification sees no difference between <name first="Chris" last="Jones"> and <name last="Jones" first="Chris">.
There are several constraints to keep in mind when working with tags. First, there is a constraint on attributes: they must be unique. No attribute name can appear twice in the same start tag. Next, if the document is to be considered valid, the attributes must have been declared, and the values must be of the types specified. Additionally, attribute values cannot be, nor can they contain, external entity references. Finally, an attribute of a start tag, or its entity replacement text, must not contain the character <. As for end tags, the specification requires only that they exactly match the start tag’s name. Attributes are not allowed in end tags.
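Several of these well-formedness constraints can be checked with a few lines of Python. This is a sketch assuming Python 3 and xml.etree.ElementTree; the well_formed helper is a name invented for the example, and validity constraints (which need a DTD) are outside what this parser checks.

```python
import xml.etree.ElementTree as ET


def well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Attribute order carries no meaning: both forms yield the same mapping.
a = ET.fromstring('<name first="Chris" last="Jones"/>')
b = ET.fromstring('<name last="Jones" first="Chris"/>')
print(a.attrib == b.attrib)

# Each of these breaks one of the constraints described above.
print(well_formed('<name first="Chris" first="Jones"/>'))  # duplicate attribute
print(well_formed('<name first="a<b"/>'))                  # literal < in a value
print(well_formed("<name></Name>"))                        # end tag does not match
```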
Elements can contain just about any type of character data, as long as it is not confused with surrounding XML markup itself. This has been addressed earlier in this chapter in Section 2.5.2.
Empty elements are elements without content. They may contain attributes as shown in this example:
<names>
  <name first="Chris" last="Jones"/>
  <name/>
</names>
This XML represents two well-formed name elements. Both are empty, but the first expresses two attributes as well.
The specification defines literals as “any quoted string not containing the quotation mark used as a delimiter for that string.” Functionally, literals are used to indicate the content for an internal entity and the values of attributes. Typically, attribute and value combinations look like this:
<account refnum="23908403"/>
In this example, refnum is an attribute of the account element and has a value of 23908403. Either single or double quotation marks may be used, with the restriction that whichever is used to quote the value may not be directly used in the value, though it may be included using entity references or numeric character references.
As an example of an attribute value that contains both types of quotation marks, let’s use this phrase:
The cat said "The dog yelled 'Help!,' then I pounced."
Encoded as an attribute, we end up with this:
<talltale text='The cat said "The dog yelled &apos;Help!,&apos; then I pounced."'/>
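When attribute values are generated by a program, the quoting can be delegated to a library. The sketch below assumes Python 3 and uses xml.sax.saxutils.quoteattr, which makes its own choice of delimiter: for a value containing both kinds of quotation mark it wraps the value in double quotes and escapes the inner double quotes, the mirror image of the hand-written form above. Both forms parse to the same value.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

phrase = 'The cat said "The dog yelled \'Help!,\' then I pounced."'

# quoteattr() returns the value already wrapped in quotation marks,
# with the conflicting quote character escaped.
quoted = quoteattr(phrase)
print("<talltale text=%s/>" % quoted)

# Round trip: the parser gives back the original phrase.
element = ET.fromstring("<talltale text=%s/>" % quoted)
print(element.get("text") == phrase)
```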
Comments in XML are similar to comments in HTML. The specification states that comments can reside anywhere outside of other markup. A simple XML comment looks like this:
<!-- This is a comment. -->
Since comments are not allowed inside other markup, you can’t embed a comment inside an XML start tag:
<book name="Python and XML" <!--comment here-->>
This type of expression is not allowed by the XML specification. Interestingly enough, comments can appear inside a DTD. In addition, comments are not considered part of the document’s character data. A couple of other caveats are that the double-hyphen (--) cannot be used inside the text of a comment as the characters --> are used to indicate that the comment is being closed. Since one of the goals of XML is to avoid the syntactic difficulties of preceding markup languages, XML simply does not allow a double-hyphen within the body of comments. Entities and other markup are not handled within the text of a comment, so you can use the characters special to the rest of XML in your comments without worry that they’ll cause syntax errors in your data. The correct version of the earlier comment element is as follows:
<book name="Python and XML">
  This book is about the Python programming language and XML markup language.
  <!--comment here-->
</book>
By placing the comment inside the element instead of in the start tag, we’ve made it follow the rules.
Processing Instructions (PI) allow an XML document to pass instructions to a handling application. The XML processor does not consider Processing Instructions to be part of the document’s character data. The point of PIs is to hand information to an application. For example, if you are communicating an urgent piece of news and want the receiving application to present some sort of alert to the user, you might place the following instruction within the XML, so that varying applications can act accordingly (e.g., a Palm VII could beep, an X Window application could raise an alert box, and so on):
<?newsAlert title="Martians Invade"?>
In this example, newsAlert is commonly referred to as the target; the rest of the text does not have a special name. The distinction between the two portions of the processing instruction is entirely a matter of convention; the specification mandates only the leading <?, trailing ?>, and the lack of the character pair ?> within the PI. (Note that most of the APIs used to work with PIs refer to the two parts as the target and the data.) There is no specific syntax associated with the content of processing instructions, though it is recommended practice to begin each with a target (usually the name of the tool expected to handle it). It is becoming common for applications to expect the content following the target to look much like a series of attributes with values, which are commonly referred to as pseudo-attributes. Clients of this XML document are able to handle or ignore the PI in whatever way is appropriate to them. Processing Instructions are useful because they provide an XML-oriented way of passing events between applications or adding annotations to the data that are specific to particular applications. Historically, PIs were used in the SGML community to encode instructions to formatting applications, with semantics such as “add a page break here.”
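The target and data split is visible through the DOM. This sketch assumes Python 3 and xml.dom.minidom from the standard library; the news document element is invented so that the PI has a document to live in.

```python
from xml.dom import Node, minidom

# A PI placed before the document element; <news/> is a made-up root.
doc = minidom.parseString('<?newsAlert title="Martians Invade"?><news/>')

pi = doc.firstChild

# The DOM exposes the two conventional parts as separate attributes.
print(pi.target)
print(pi.data)
```

The data comes back as one uninterpreted string; treating it as pseudo-attributes is left to the application.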
A CDATA section is used to escape special characters in character data in your document. For example:
<![CDATA[The <ool <utter Knife & Sharpening Set]]>
This is actually an encoding of the character data:
The <ool <utter Knife & Sharpening Set
Without using a CDATA section, this must be encoded using general entities or character references:
The &lt;ool &lt;utter Knife &amp; Sharpening Set
The CDATA section is a good way to escape longer stretches of text that contain many characters that would otherwise be treated as markup if included directly in the text. Note that a CDATA section starts with the markup '<![CDATA['; no whitespace is allowed around the word CDATA. Once inside a CDATA section, no XML syntax is recognized until the characters ']]>' are encountered. Entity and character references aren’t resolved or recognized, so the text &#174; does not resolve to the registered trademark symbol, though it would in normal character data or in a CDATA attribute value.
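Both points can be confirmed with a parser: a CDATA section and its entity-encoded equivalent deliver the same character data, and a character reference inside CDATA stays literal. The sketch assumes Python 3 and xml.etree.ElementTree; the item and note element names are placeholders.

```python
import xml.etree.ElementTree as ET

# The same character data written two ways.
with_cdata = ET.fromstring(
    "<item><![CDATA[The <ool <utter Knife & Sharpening Set]]></item>"
)
with_refs = ET.fromstring(
    "<item>The &lt;ool &lt;utter Knife &amp; Sharpening Set</item>"
)
print(with_cdata.text == with_refs.text)

# Inside CDATA the reference is just six ordinary characters;
# outside it resolves to the registered trademark symbol.
literal = ET.fromstring("<note><![CDATA[&#174;]]></note>")
resolved = ET.fromstring("<note>&#174;</note>")
print(literal.text)
print(resolved.text)
```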