As discussed earlier, Document Type Definitions, or DTDs, are the form of document types specified by the XML 1.0 recommendation. Though there are alternatives, DTDs remain one of the most common ways of specifying a document type. In this section, we discuss the syntax of the various declarations that can occur in the Document Type Declaration; these can all appear in both the internal and external subsets.
Entities are sources of data that are used to compose a larger construct. Most, called general entities, are used to construct documents, but some, known as parameter entities, are used to construct the document type itself. Both are defined using an entity declaration in the Document Type Definition. Each kind of entity is defined in a separate namespace; there can be a general entity named myEntity and a parameter entity of the same name, and the names do not clash.
Entities can be declared more than once; the first definition for a name takes precedence. Because the internal subset is read before the external subset, this allows the internal subset to override a definition provided in the external subset; when used with parameter entities, this mechanism can be used to extend DTDs. Document type extension generally works best when the DTD being extended has been carefully designed with this in mind. The DocBook DTD for technical documentation is an excellent example of this.
General entities can take a variety of forms: they may be parsed entities, consisting of XML text, or unparsed, such as an image stored as a Portable Network Graphics (PNG) file. The text of a parsed entity may be included in the entity declaration, or it may reside in an external source. The body of an unparsed entity is always stored externally. Most entities used with XML are parsed entities; unparsed constructs, such as images, are typically referenced using an absolute or relative URL rather than by a named entity.
Parsed general entities are used to define substitution text for a (typically) shorter name. Recall that in XML, text includes not only character data, but markup as well, so the substitution can actually insert additional structure into the document as long as all structures are complete within the substitution. When the document is parsed, the parser resolves each entity reference into its substitution text, and evaluates the document based on how it looks after the entities have been resolved. A simple internal entity needs nothing more than a name and its replacement text:
<!ENTITY sandwich "Crabby Patty">
In your document, any reference to &sandwich; inserts the replacement text “Crabby Patty” into the document. For example:
I am hungry for a &sandwich;.
This sentence renders as:
I am hungry for a Crabby Patty.
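A nonvalidating parser is enough to watch this substitution happen. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module, whose expat-based parser expands internal general entities declared in the internal subset. The menu root element, the second replacement text, and the expand helper are hypothetical names chosen for illustration. The second document declares sandwich twice to show that the first declaration is the one that counts.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the internal subset declares the entity and the
# content refers to it.
DOC = """<!DOCTYPE menu [
  <!ENTITY sandwich "Crabby Patty">
]>
<menu>I am hungry for a &sandwich;.</menu>"""

# The same entity declared twice; the first declaration is binding.
DOC_REDECLARED = """<!DOCTYPE menu [
  <!ENTITY sandwich "Crabby Patty">
  <!ENTITY sandwich "Chum Burger">
]>
<menu>I am hungry for a &sandwich;.</menu>"""


def expand(document: str) -> str:
    """Parse the document and return the root element's character data."""
    return ET.fromstring(document).text


print(expand(DOC))
print(expand(DOC_REDECLARED))
```

Both calls should print the rendered sentence shown above, since the later declaration of sandwich is ignored.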
External entities are defined using an entity declaration that gives a URL to an external resource containing the replacement text:
<!ENTITY legal SYSTEM "http://www.example.com/legal.xml">
Assuming the resource at that URL contains a single legal element, any reference to &legal; within a document yields:
<legal>Copyright 2001, Example Corporation</legal>.
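How an external entity is fetched is up to the application. The sketch below assumes Python's standard xml.parsers.expat module and uses its ExternalEntityRefHandler hook; to stay self-contained it looks the system identifier up in a dictionary instead of touching the network. The RESOURCES table, the doc root element, and the parse helper are hypothetical, and the body of legal.xml is assumed to be the text shown above.

```python
import xml.parsers.expat

# Stand-in for the network: maps system identifiers to replacement text.
# The content is the assumed body of legal.xml.
RESOURCES = {
    "http://www.example.com/legal.xml":
        "<legal>Copyright 2001, Example Corporation</legal>",
}

DOC = """<!DOCTYPE doc [
  <!ENTITY legal SYSTEM "http://www.example.com/legal.xml">
]>
<doc>&legal;</doc>"""


def parse(document):
    """Return (element names, character data) seen while parsing."""
    names, text = [], []
    parser = xml.parsers.expat.ParserCreate()

    def resolve(context, base, system_id, public_id):
        # Parse the replacement text with a child parser, which inherits
        # the handlers installed on the parent parser.
        child = parser.ExternalEntityParserCreate(context)
        child.Parse(RESOURCES[system_id], True)
        return 1  # nonzero tells expat the entity was handled

    parser.StartElementHandler = lambda name, attrs: names.append(name)
    parser.CharacterDataHandler = text.append
    parser.ExternalEntityRefHandler = resolve
    parser.Parse(document, True)
    return names, "".join(text)


names, text = parse(DOC)
print(names, text)
```

A real resolver would open the URL (or a local catalog entry) instead of consulting a dictionary, and should be careful about which locations it is willing to read.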
Like internal entities, external entities replace symbols with the appropriate text. Sometimes this must be done when the text uses characters that would otherwise be considered markup (such as the special characters <, >, and & in your XML). Other times, entities are used to make boilerplate information that is normally maintained somewhere else available to the document.
Parameter entities are different in both usage and applicability. They can only be used to create the Document Type Definition, and not to directly compose the document. The syntax of an XML document does not allow parameter entities to be referred to from within the document content, but only allows their use within the internal and external DTD subsets. There are no unparsed parameter entities. A nonvalidating parser may skip external parameter entities, but validating parsers are required to parse all referenced parameter entities.
The declaration for a parameter entity looks much like the declaration for a general entity, with just a couple of additional characters added:
<!ENTITY % node-decls SYSTEM "node-decls.dtd">
What this declaration has that the general entity declaration doesn’t is a percent sign (%) between the keyword ENTITY and the name of the entity, with whitespace on both sides to set it off (the whitespace is required). This parameter entity would be used like this:
%node-decls;
Note that the reference to the parameter entity uses the percent sign instead of the ampersand to mark the beginning of the name; this is necessary since the two sets of names may overlap.
The effect of parameter entity replacement is much like that of general entity replacement. The replacement text effectively replaces the entity reference, and interpretation of the document type continues using the modified text.
The usefulness of parameter entities is highest when working with modularized document types, which can provide carefully designed extension mechanisms using parameter entities. A large DTD, such as the industry-standard DocBook DTD for software documentation, can be customized by creating a new document type that simply defines several parameter entities and then incorporates the standard DocBook definition. Since the entity declarations in the customization layer override the definitions provided by DocBook, this mechanism can be used to either extend or restrict the specific document type in ways that are suitable for a specific project.
Element type declarations are used to constrain an element’s content. They indicate what element types can be used as children of the element, and show how the children may be arranged. Element type declarations may look like this:
<!ELEMENT br EMPTY>
<!ELEMENT generic ANY>
<!ELEMENT name (address+)>
<!ELEMENT para (#PCDATA | list | picture)*>
We can break up the declaration into particular syntactic components, each with a specific purpose:
<!ELEMENT name content-model>
The text <!ELEMENT tells the parser that this is an element type declaration. name gives a name to the element type; this allows it to be referenced from elsewhere in the Document Type Definition. The content-model is used to specify what can appear as content of the element, whether it can contain character data, other elements, or both. No element type may be declared more than once.
It is interesting to note that there is not a place for attributes to be declared. While attributes are associated with element types, they are defined using attribute declarations, described later in this chapter, in Section 2.6.3.
A content model describes what elements are allowed as children of the declared element type, in what order and combination they are allowed, and whether arbitrary character data is allowed.
The content models of all elements can be broken into two categories:
Element content
This describes content made up only of elements. For example, you might define an address element that allows no character data, but instead requires child elements. The specification defines content particles that “consist of names, choice lists of content particles, or sequence lists of content particles.”
Mixed content
This content may contain character data. This is the most common arrangement in text documents:
<news title="XML from Outer Space">
  This article describes XML transmissions from outer space.
  <h1>Not a Meteor</h1>
  <para>Contrary to earlier reports, the XML that has landed from outer space is not a meteor.</para>
</news>
In this example, elements and character data are mixed beneath the news element.
Elements that have a mixed content model are not required to allow other elements as content. In fact, an element type with only character data in the content model may be completely empty; there is no way to specify that there must be characters in the character data.
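To see how mixed content reaches an application, the following sketch walks the news example with Python's standard xml.etree.ElementTree module, which stores character data in the text and tail attributes of the surrounding elements. The mixed_items helper is a hypothetical name, and whitespace-only runs are dropped purely to keep the output readable.

```python
import xml.etree.ElementTree as ET

DOC = """<news title="XML from Outer Space">
  This article describes XML transmissions from outer space.
  <h1>Not a Meteor</h1>
  <para>Contrary to earlier reports, the XML that has landed from
  outer space is not a meteor.</para>
</news>"""


def mixed_items(element):
    """List the direct content of an element as (kind, value) pairs."""
    items = []
    # Character data before the first child lives in element.text.
    if element.text and element.text.strip():
        items.append(("text", element.text.strip()))
    for child in element:
        items.append(("element", child.tag))
        # Character data following a child lives in child.tail.
        if child.tail and child.tail.strip():
            items.append(("text", child.tail.strip()))
    return items


for item in mixed_items(ET.fromstring(DOC)):
    print(item)
```

The result interleaves one run of character data with the h1 and para children, which is exactly what a mixed content model permits.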
Let’s take another look at our example element declarations:
<!ELEMENT br EMPTY>
The first two declarations are simple. The content model of the first, EMPTY, can be used to describe an empty br element as found in XHTML. It can contain no child elements and no character data. Later editions of the XML recommendation make explicit that, for a document to be valid, such an element may not contain comments, processing instructions, or whitespace either. An element type declared as EMPTY is considered a degenerate special case of element content.
<!ELEMENT generic ANY>
Next, we have an element named generic that can contain any kind of element defined in the document type (this does not allow undefined element types!). In addition to other elements, character data is allowed as well, so a content model of ANY is mixed content.
<!ELEMENT name (address+)>
The third example is simple, but very different from the others. Instead of a simple name such as ANY or EMPTY, the model is described by something that closely resembles a regular expression. In this particular example, we have a name element that requires one or more address elements to be included. This form of content model is perhaps the most commonly used and allows for fine control. Content models can take on varying levels of complexity, but the goal is always the same: to define the content that is allowed or expected within the element.
The content model is written inside parentheses, with commas indicating a sequence. Vertical bar characters (|) indicate a choice. For example:
<!ELEMENT name (first, last)>
This element type requires a first child element followed by a last child element, and nothing else. If you want to offer a choice between first or last, but not allow both, use a vertical bar:
<!ELEMENT name (first | last)>
These expressions can be nested within each other as well:
<!ELEMENT order (sku, quantity, (account | name), price)>
The above order element requires a child sku element, followed by a quantity element, then followed by either an account or a name element, and finally followed by a price element.
Additionally, the operators +, *, and ? can be appended to a name or a parenthesized group to indicate how many times it may occur, that is, whether it is repeatable or even required. Without a modifier, the element must appear exactly once in that location. The operators are explained in the following list:
+
Content must appear one or more times.
*
Content may appear zero or more times.
?
Content may appear zero times or one time.
For example, to require an order element to contain exactly one account, followed by one or more sku elements, then one or more price elements, and optionally a single shipping address (ship), you could use an element type declaration such as the following:
<!ELEMENT order (account, sku+, price+, ship?)>
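Because element content models resemble regular expressions so closely, they can be checked with one. The sketch below is an illustration only, not how a validating parser is implemented: it translates a content model into a Python regular expression over the sequence of child element names. The compile_model and allows helpers are hypothetical, and the translation assumes pure element content (no #PCDATA, EMPTY, or ANY).

```python
import re

# Rough approximation of an XML name, sufficient for ASCII examples.
NAME = re.compile(r"[A-Za-z_:][\w:.-]*")


def compile_model(model: str) -> "re.Pattern[str]":
    """Translate an element-content model into a compiled regex."""
    # Each name becomes a non-capturing group matching "name;".
    pattern = NAME.sub(lambda m: "(?:" + re.escape(m.group(0)) + ";)", model)
    # Sequence is plain concatenation in a regex, so commas and
    # whitespace disappear; |, +, *, ? and parentheses carry over as is.
    pattern = re.sub(r"[,\s]", "", pattern)
    return re.compile(pattern)


def allows(model: str, children: list) -> bool:
    """True if the list of child element names satisfies the model."""
    encoded = "".join(name + ";" for name in children)
    return compile_model(model).fullmatch(encoded) is not None


ORDER = "(account, sku+, price+, ship?)"
print(allows(ORDER, ["account", "sku", "sku", "price"]))
print(allows(ORDER, ["account", "price"]))
```

The first call should report that the children satisfy the model, and the second should not, because at least one sku is required. The same helper accepts nested groups such as (sku, quantity, (account | name), price).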
To combine character data with elements, use the or operator (|) to specify mixed content, as shown here:
<!ELEMENT paragraph (#PCDATA | list | picture)*>
This paragraph element type allows any number of runs of character data, list elements, and picture elements, in any order; the asterisk makes the group repeatable. #PCDATA can only be combined with elements using the or operator in a group that has a * modifier, and it can only occur in the outermost parenthesized group of a content model.
As discussed earlier, attributes are used to provide name/value combinations as properties of elements. Attributes can appear only in start tags and empty element tags. An attribute-list declaration would be a part of a DTD, used to validate the XML document. An example follows:
<!ATTLIST news
          title  CDATA #REQUIRED
          author CDATA #IMPLIED>
This is an attribute-list declaration that indicates that any news element is required to have a title attribute consisting of character data, and may optionally have an author attribute, also consisting of character data.
The specification states that attribute types are of three kinds: string, tokenized, and enumerated. In the earlier attribute list example, you saw that a news element required a title attribute with the string type CDATA.
There are several tokenized attribute types:
ID
A unique identifier for this element. The identifier must be a name unique in the current document instance.
IDREF
Must match an ID somewhere in the XML document.
IDREFS
A list of one or more names, separated by spaces. Each must match an ID in the document.
ENTITY
Matches the name of an unparsed entity declared in the document.
ENTITIES
A space-separated list containing one or more entity names.
NMTOKEN
Seldom used; the value must match the Nmtoken production (one or more name characters) as defined in the XML recommendation; refer to the recommendation for more information.
NMTOKENS
A list of one or more space-separated NMTOKEN values; this is the least used attribute type.
The remaining attribute types, the enumerated types, are defined in the attribute list itself. An enumerated type is a type that takes a name from a defined list of names, in which the list is given in an attribute declaration. Each distinct set of names forms a separate type, but these types do not have names of their own. An example should help clarify this:
<!ATTLIST ship type (sloop | frigate | dinghy) #IMPLIED>
This declaration defines an attribute named type that may have a value of dinghy, frigate, or sloop, but no other value. The element <ship type="yacht"/> would trigger a validation failure.
An attribute declaration allows the document type to specify a default value for an attribute if the attribute is missing. It can also indicate whether the attribute may be omitted from the document. Let’s look at a more interesting example of an attribute declaration:
<!ATTLIST chapter
          synopsis CDATA #IMPLIED
          author   CDATA #REQUIRED
          email    CDATA "[email protected]"
          version  CDATA #FIXED "1.0"
          type     (normal|reference|appendix) "normal">
The synopsis attribute is required to be a string (CDATA) if it is given at all, but it is not required, and does not have a default value because it is marked as #IMPLIED. (Most of the attributes in HTML are declared this way.) The #REQUIRED constraint means just what it says; the author attribute must be specified in the document. Because it is a string, it may be empty. If a string value is specified instead of #IMPLIED or #REQUIRED, as with the email attribute in our example attribute list, it becomes the default value that is used if no value is given in the document.
The #FIXED constraint can only be used in conjunction with a default value, which we see for the version attribute. When this constraint is used, the document is allowed to include the attribute, but the value must match that given by the default exactly, though it may be encoded using a different mixture of characters, entity references, and character references. If the value differs, an error is reported by the parser.
The type attribute is an example of an enumerated type, similar to what we looked at earlier. Default values and constraints are specified for enumerated types in the same way as for other types, with the additional constraint that if a value is specified, it must be one of the names included in the enumeration.
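Attribute defaulting can be observed even without validation. The sketch below assumes Python's standard xml.parsers.expat module, which by default reports attributes supplied by declarations in the internal subset along with those written in the document. It uses a trimmed version of the chapter attribute list; the author value "Pat" and the reported_attributes helper are hypothetical. Note that expat does not validate, so it will not complain about a missing #REQUIRED attribute or a mismatched #FIXED value.

```python
import xml.parsers.expat

# Trimmed version of the chapter attribute list, placed in the internal
# subset so a nonvalidating parser is sure to read it.
DOC = """<!DOCTYPE chapter [
  <!ATTLIST chapter
            author  CDATA #REQUIRED
            version CDATA #FIXED "1.0"
            type    (normal|reference|appendix) "normal">
]>
<chapter author="Pat"/>"""


def reported_attributes(document):
    """Return the attribute dict reported for the root element."""
    seen = []
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: seen.append(attrs)
    parser.Parse(document, True)
    return seen[0]


print(reported_attributes(DOC))
```

Only author appears in the start tag, yet the reported attributes should also include version and type with their declared defaults.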
ID attributes offer some unique behavior. Let’s create an attribute for the news element we defined previously:
<!ATTLIST news newsID ID #REQUIRED>
With this attribute list, news elements are required to have a newsID attribute. The allowed values are governed by the rules of the ID tokenized type. Specifically, the ID value is a name (as defined in this chapter in Section 2.5.2.1) and must not appear more than once in an XML document as the value of any attribute of type ID. In other words, ID values must uniquely represent an element within the document. Consider a legal example:
<news newsID="id39">Text</news>
<news newsID="id40">Text</news>
Since the values of ID attributes are required to be unique within a document, the following is illegal:
<news newsID="id39">Text</news>
<news newsID="id39">Text</news>
Additionally, no element may have more than one ID attribute specified. The XML recommendation enforces this through a validity constraint that forbids an element type from declaring more than one attribute of the ID type, so at most one ID value can be specified for any element. As a result, some of the programming APIs can use the values of ID attributes to retrieve specific elements from a document.
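Python's standard library parsers do not validate, so they will not report a duplicate ID on their own. As a rough sketch of the uniqueness rule, the following function checks it by hand with xml.etree.ElementTree. It assumes the caller already knows which attribute is declared as type ID (newsID here); the feed wrapper element and the duplicate_ids helper are hypothetical.

```python
from collections import Counter
import xml.etree.ElementTree as ET

LEGAL = """<feed>
  <news newsID="id39">Text</news>
  <news newsID="id40">Text</news>
</feed>"""

ILLEGAL = """<feed>
  <news newsID="id39">Text</news>
  <news newsID="id39">Text</news>
</feed>"""


def duplicate_ids(document, id_attribute="newsID"):
    """Return the ID values that occur more than once, sorted."""
    root = ET.fromstring(document)
    counts = Counter(
        element.get(id_attribute)
        for element in root.iter()
        if element.get(id_attribute) is not None
    )
    return sorted(value for value, count in counts.items() if count > 1)


print(duplicate_ids(LEGAL))
print(duplicate_ids(ILLEGAL))
```

The legal document yields an empty list, while the illegal one reports id39.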
What is most interesting about ID attributes, however, is not the attributes themselves, but the IDREF attribute type. While a particular value may only appear once in a document as an ID type, it may appear any number of times as the value of an IDREF or IDREFS attribute. In particular, attributes of those types may only take values that also appear as the value of an ID attribute somewhere in the same document. (An IDREFS attribute can take a value that is a space-separated list of ID values, each of which must exist in the document.) These values can be used to forge internal links between elements that a validating parser must check. This can be very convenient when a basic tree structure is not sufficient to model your data; the ID, IDREF, and IDREFS attributes can be used to extend the data model to include connected, directed graphs with typed arcs.
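The reference check a validating parser performs can be approximated in a few lines. The sketch below again uses xml.etree.ElementTree and assumes the attribute types are known in advance: newsID is taken to be of type ID, and related is a hypothetical IDREFS attribute introduced only for this example, as are the feed wrapper and the dangling_references helper.

```python
import xml.etree.ElementTree as ET

DOC = """<feed>
  <news newsID="id39" related="id40">Text</news>
  <news newsID="id40" related="id39 id41">Text</news>
</feed>"""


def dangling_references(document, id_attribute="newsID",
                        idrefs_attribute="related"):
    """Return referenced names that match no ID value in the document."""
    root = ET.fromstring(document)
    ids = {
        element.get(id_attribute)
        for element in root.iter()
        if element.get(id_attribute) is not None
    }
    missing = []
    for element in root.iter():
        # An IDREFS value is a space-separated list of names.
        for name in element.get(idrefs_attribute, "").split():
            if name not in ids:
                missing.append(name)
    return missing


print(dangling_references(DOC))
```

Here id39 and id40 refer to each other, forming a small graph on top of the tree, while id41 is reported because no element carries that ID.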