Chapter 15. The XML language

Friendly Tutorial

  • Syntactic details

  • The prolog and the document instance

  • XML declaration

  • Elements and attributes

XML’s central concepts are quite simple, and this chapter outlines the most important of them. Essentially, it gives you what you need to know to actually create XML documents. In subsequent chapters you will learn how to combine them, share text between them, format them, and validate them.

Before looking at actual XML markup (don’t worry, we’ll get there soon!) we should consider some syntactic constructs that will recur throughout our discussion of XML documents. By syntax we mean the combination of characters that make up an XML document. This is analogous to the distinction between sounds of words and the things that they mean. Essentially, we are talking about where you can put angle brackets, quote marks, ampersands, and other characters and where you cannot! Later we will talk about what they mean when you put them together.

After that, we will discuss the components that make up an XML document instance[1]. We will look at the distinction between the prolog (information XML parsers need to know about your document) and the instance (the representation of the actual document itself).

Syntactic details

XML documents are composed of characters from the Unicode character set. Any such sequence of characters is called a string. The characters in this book can be thought of as one long (but interesting) string of text. Each chapter is also a string. So is each word. XML documents are similarly made up of strings within strings.

Natural languages such as English have a particular syntax. The syntax allows you to combine words into grammatical sentences. XML also has syntax. It describes how you combine strings into well-formed XML documents. We will describe the basics of XML’s syntax in this section.

Case-sensitivity

XML is case-sensitive. That means that if the XML specification says to insert the word “ELEMENT”, it means that you should insert “ELEMENT” and not “element” or “Element” or “ElEmEnT”.

So mind your “p’s” and “q’s” and “P’s” and “Q’s”. Our authoritative laboratory testing by people in white coats indicates that exactly 74.5% of all XML errors are related to case-sensitivity mistakes. Of course XML is also spelling-sensitive and typo-sensitive, so watch out for these and other products of human fallibility.

Note that although XML is case-sensitive it is not case-prejudiced. Anywhere that you have the freedom to create your own names or text, you can choose to use upper- or lower-case text, as you prefer.

For instance, when you create your own document types you will be able to choose element-type names. A particular name could be all upper-case (SECTION), all lower-case (section) or mixed-case (SeCtION). But because XML is case-sensitive, all occurrences of a particular element-type name would have to use the same case. It is good practice to create a simple convention such as all lower-case or all upper-case so that you do not have to depend on your memory.
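If you would like to see case-sensitivity in action, here is a minimal sketch in Python. It assumes the standard library's xml.etree.ElementTree parser, which is our choice for illustration and not something this chapter prescribes; the helper name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as a well-formed document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Start-tag and end-tag use the same case.
print(is_well_formed("<section>Hello</section>"))
# Mixed case is fine too, as long as both tags agree.
print(is_well_formed("<SeCtION>Hello</SeCtION>"))
# The end-tag differs from the start-tag only in case.
print(is_well_formed("<section>Hello</SECTION>"))
```

The first two documents should be accepted and the third rejected, because to the parser “section” and “SECTION” are simply different names.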

Markup and data

The constructs such as tags, entity references, and declarations are called markup. These are the parts of your document that are supposed to be understood by the XML parser. The parts that are between the markup constitute the character data. While the XML parser rips apart and analyzes markup, it merely passes the character data to the application.

Recall that the parser is the part of the program dedicated to separating the document into its constituent parts. The application is the “rest” of the program. In a word processor, the application is the part that lets you edit the document; in a spreadsheet it is the part that lets you crunch the numbers.

We haven’t explained all of the parts of markup yet, but they are easy to recognize. All of them start with less-than (<) or ampersand (&) characters. Everything else is character data.

White space

There is a set of characters called white space characters that XML parsers treat differently in XML markup. They are the “invisible” characters: space (Unicode/ASCII 32), tab (Unicode/ASCII 9), carriage return (Unicode/ASCII 13) and line feed (Unicode/ASCII 10). These correspond roughly to the spacebar, tab, and Enter keys on your keyboard.

When the XML specification says that white space is allowed at a particular point, you may put as many of these characters as you want in any combination. Just as you might put two lines between paragraphs in a word processor to make a printed document readable, you may put two carriage returns in certain places in an XML document to make your source file more readable and maintainable. When the document is processed, those characters will be ignored.

In other places, white space will be significant. For instance you would not want the parser to strip out the spaces between the words in your document! Thatwouldmakeithardtoread. So white space outside of markup is always preserved.
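The difference between the two kinds of white space can be sketched with a few lines of Python. This assumes the standard library's xml.etree.ElementTree parser; the note element and its kind attribute are made up for the example.

```python
import xml.etree.ElementTree as ET

# The extra spaces inside the tags (around "=" and before ">") are part of the markup.
# The double space and the line feed between the tags are character data.
doc = ET.fromstring('<note   kind = "x"  >two  spaces\nand a line feed</note >')

print(doc.attrib)      # the white space around "=" did not end up in the value
print(repr(doc.text))  # the white space in the character data is still there
```

White space inside the tags disappears during parsing, while every space and line feed in the content reaches the application untouched.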

Names

When you use XML you will often have to give things names. You will name logical structures with element-type names, particular elements with IDs, and so forth. XML names have certain common features. They are not nearly as flexible as character data.

Letters or underscores can be used anywhere in a name. There are thousands of characters that XML version 1.0 considers a “letter” because it includes characters from every language including ideographic ones like Japanese Kanji. XML version 1.1 is even more liberal: it treats a character as a “letter” unless it is from a small list designated as punctuation.[2] Characters that can be used anywhere in a name are known in XML terms as name start characters. They are called this because they may be used at the start of names as well as in later positions.

This implies that there must be characters that can go in a name but cannot be the first character. You may include digits, hyphens and full-stop (.) characters in a name, but you may not start the name with one of them. These are known as name characters. Other characters, like various white space and Western punctuation characters, cannot be part of a name at all. Examples of these non-name characters include the tilde (~), caret (^) and space ( ).

You should not make names that begin with the string “xml” or some case-insensitive variant like “XML” or “XmL”. Such names are reserved for use by the XML standards themselves.

Like almost everything else in XML, names are matched case-sensitively. Names may not contain white space, punctuation or other “funny” characters other than those listed above. The remaining “ordinary” characters (including letters from non-Latin alphabets) are the name start characters described earlier, which may occur anywhere in a name.
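One rough way to try out the naming rules is to ask a parser whether it accepts a candidate as an element-type name. The sketch below assumes Python's standard xml.etree.ElementTree parser; the helper is_valid_element_name is hypothetical and only tests a name by using it in an empty-element tag.

```python
import xml.etree.ElementTree as ET

def is_valid_element_name(name):
    """Return True if the parser accepts name as an element-type name."""
    try:
        ET.fromstring("<" + name + "/>")
        return True
    except ET.ParseError:
        return False

# Letters and underscores may start a name; digits, hyphens and
# full stops may only appear after the first character.
for candidate in ["section", "_private", "chapter-2.1",
                  "2nd-chapter", "-dash", ".dot", "a~b", "my name"]:
    print(candidate, is_valid_element_name(candidate))
```

The first three candidates should be accepted; the rest either start with a character that is not a name start character or contain a non-name character.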

Prolog vs. instance

Most document representations start with a header that contains information about the actual document and how to interpret its representation. This is followed by the representation of the real document.

For instance, HTML has a HEAD element that can contain the TITLE and META elements. After the HEAD element comes the BODY. This is where the representation of the actual document resides. Similarly, email messages have “header lines” that describe who the message came from, to whom it is addressed, how it is encoded, and other things.

An XML document is similarly broken up into two main parts: a prolog and a document instance. The prolog provides information about the interpretation of the document instance, such as the version of XML and the document type to which it conforms. The document instance follows the prolog. It contains the actual document data organized as a hierarchy of elements.

The document instance

The actual content of an XML document goes in the document instance. It is called this because if it has a document type definition or schema definition, it is an instance of the class of documents defined by that DTD or schema. Just as a particular person is an instance of the class of “people”, a particular memo is an instance of the class of “memo documents”.

The formal definition of “memo document” is in the memo DTD or schema definition.

What the tags reveal

Example 15-1 is an example of a small XML document.

Example 15-1. Small XML document

<?xml version="1.0"?>
<!DOCTYPE memo SYSTEM "memo.dtd">
<memo>
<from>
   <name>Paul Prescod</name>
   <email>[email protected]</email>
</from>
<to>
   <name>Charles Goldfarb</name>
   <email>[email protected]</email>
</to>
<subject>Another Memo Example</subject>
<body>
<paragraph>Charles, I wanted to suggest that we
<emphasis>not</emphasis> use the typical memo example in
our book. Memos tend to be used anywhere a small, simple
document type is needed, but they are just
<emphasis>so</emphasis> boring!
</paragraph>
</body>
</memo>

Tree structure

Because a computer cannot understand the data of the document, it looks primarily at the tags, the markup beginning with the less-than and ending with the greater-than symbol. The tags delimit the beginning and end of various elements. The computer thinks of the elements as a sort of tree. It is the XML parser’s job to separate the markup from the character data and hand both to the application.

Figure 15-1 shows a graphical view of the logical structure of the document. The memo element is called either the document element or the root element.

Figure 15-1. The memo XML document viewed as a tree

The document element (memo) represents the document as a whole. Every other element represents a component of the document. The from and to elements are meant to indicate the sender and recipient of the memo. The name elements represent people’s names. Continuing in this way, the logical structure of the document is apparent.
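To make the tree view concrete, here is a small sketch that parses a shortened version of the memo and lists its element-type names, indented by depth. It assumes Python's standard xml.etree.ElementTree module; the outline function is a hypothetical helper, and the DOCTYPE line and email elements are left out to keep the example self-contained.

```python
import xml.etree.ElementTree as ET

MEMO = """<memo>
<from><name>Paul Prescod</name></from>
<to><name>Charles Goldfarb</name></to>
<subject>Another Memo Example</subject>
<body><paragraph>Memos are <emphasis>so</emphasis> boring!</paragraph></body>
</memo>"""

def outline(element, depth=0):
    """Return the element-type names in document order, indented by depth."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

root = ET.fromstring(MEMO)  # root is the document element, memo
print("\n".join(outline(root)))
```

The indentation of each printed name mirrors its position in the tree: memo at the top, from, to, subject and body below it, and so on down to emphasis.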

Semantics

Experts refer to an element’s real-world meaning as its semantics. In a particular vocabulary, the semantics of a P element might be “paragraph” and in another it might mean “pence”.

If you find yourself reading or writing markup and asking: “But what does that mean?” then you are asking about semantics.

Computers do not yet know anything about semantics. They do not know an HTTP protocol from a supermodel. Vocabulary designers must describe semantics to authors some other way. For instance, they could send email, write a book or make a major motion picture (well, maybe some day).

What the computer does care about is how an element is supposed to look when it is formatted, or how it is to behave if it is interactive, or what to do with the data once it is extracted. These are specified in stylesheets and computer programs.

Elements

XML elements break down into two categories. Most have content, which is to say they contain characters, elements or both, and some do not. Those that do not are called empty elements. Elements within other elements are called subelements.

Elements with content

Example 15-2 is an example of an element with content.

Example 15-2. Simple element

<title>This is the title</title>

Elements with content begin with a start-tag and finish with an end-tag. The “stuff” between the two is the element’s content. In Example 15-2, “This is the title” is the content.

XML start-tags consist of the less-than (<) symbol (“left angle bracket”), the name of the element’s type (sometimes termed a generic identifier or GI), and a greater-than (>) symbol (“right angle bracket”). Start-tags can also include attributes. We will look at those later in the chapter. The start-tag in Example 15-2 is <title> and its element-type name is “title”.

XML end-tags consist of the string “</”, the same generic identifier (or GI) as in the start-tag, and a greater-than (>) symbol. The end-tag in Example 15-2 is </title>.

You must always repeat the generic identifier in the end-tag. This helps you to keep track of which end-tags line up with which start-tags. If you ever forget one or the other, the parser will know immediately, and will alert you that the document is not well-formed.

Note that less-than symbols in content are always interpreted as beginning a tag. If the characters following them would not constitute a tag, then the document is not well-formed.

Caution

The word “tag” is often used imprecisely, sometimes to mean “element-type name”, sometimes “element type”, and sometimes even “element”. XML tags always start with less-than symbols and end with greater-than symbols. Nothing else is a tag. DTDs and schemas do not define tags, they define element types. (See 20.2, “Tag vs. element”, on page 431 for an illustrated explanation.)

Empty elements

It is possible for an element to have no content at all. Such an element is called an empty element. One way to denote an empty element is to merely leave out the content. But as a shortcut, empty elements may also have a different syntax. Because there is no content to delimit, they may consist of a single empty-element tag. That looks like this: <MyEmptyElementTag/>.

The slash at the end indicates that this is an empty-element tag, so there is no content or end-tag coming up. The slash is meant to be reminiscent of the slash in the end-tag of an element with both tags. This is just a shortcut syntax. XML parsers do not treat empty-element tags differently from elements that merely have no content between the start- and end-tag.
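You can check for yourself that the two spellings of an empty element are equivalent. This is a minimal sketch using Python's standard xml.etree.ElementTree module, which is an assumption of ours for the demonstration.

```python
import xml.etree.ElementTree as ET

short_form = ET.fromstring("<MyEmptyElementTag/>")
long_form = ET.fromstring("<MyEmptyElementTag></MyEmptyElementTag>")

for element in (short_form, long_form):
    # Neither form has character data or subelements.
    print(element.tag, element.text, len(element))

# Once parsed, the two elements serialize identically.
print(ET.tostring(short_form) == ET.tostring(long_form))
```

After parsing, nothing remains to tell the application which of the two syntaxes the author typed.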

Usually empty elements have attributes. Occasionally an empty element without attributes will be used to flag a particular location in a document. Example 15-3 is an example of an empty element with an attribute.

Example 15-3. Empty element with attribute

<EMPTY-ELEMENT ATTR="ATTVAL"/>

Remember what the slash at the end means! You will see it often and it is easy to miss when there are attributes like this. The slash indicates that this is an empty element so that the parser need not look for a matching end-tag.

Summary

In summary, elements are either empty or have content. Elements with content are represented by a start-tag, the content, and an end-tag. Empty elements can either have a start-tag and end-tag with nothing in between, or a single empty-element tag. An element’s type is always identified by the generic identifiers in its tags.

We distinguish element types from generic identifiers because the term “generic identifier” refers to the syntax of the XML document, that is, the characters that represent the actual document. The term “element type” refers to a property of a component of the actual document.

Attributes

In addition to content, elements may have attributes. Attributes are a way of attaching characteristics or properties to elements of a document. Attributes have names, just as real-world properties do. They also have values. For instance, two possible attributes of people are their “shoe size” and “IQ” (the attributes’ names), and two possible values are “12” and “112” (respectively).

In a DTD or schema definition, each attribute is defined for a specific element type and is allowed to exhibit a certain type of value. Multiple element types could provide attributes with the same name and it is sometimes convenient to think of them as the “same attribute” even though they technically are not.[3]

Attributes have semantics also. They always mean something. For example, an attribute named height might be provided for person elements (allowed occurrence), exhibit values that are numbers (allowed values), and represent the person’s height in centimeters (semantics).

Here is how attributes of person elements might look:

Example 15-4. Elements with attributes

<person height="165cm">Dale Wick</person>
<person height="165cm" weight="165lb">Bill Bunn</person>

As you can see, the attribute name does not go in quotes, but the attribute value does because it is a literal string.
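An application typically receives attributes as name and value pairs attached to the element. The sketch below shows one way that can look, assuming Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

person = ET.fromstring('<person height="165cm" weight="165lb">Bill Bunn</person>')

print(person.text)            # the element's content
print(person.get("height"))   # the value of a specified attribute, without quotes
print(person.get("IQ"))       # None: this element has no such attribute
print(sorted(person.attrib))  # the names of the attributes that were specified
```

Note that the quotes around each value are delimiters only; they are not part of the value the application sees.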

Literal strings

The data (text other than markup) can contain almost any characters. Obviously, in the content of your document you need to use punctuation and white space characters! But sometimes you also need data characters within markup. For instance, an element might represent a hyperlink and need to have a URL attribute.

Literal strings allow users to use (non-name) data characters within markup. For instance, to specify the URL in the hyperlink, we would need the slash character. Example 15-5 is an example of such an element.

Example 15-5. Literal string in attribute value

<REFERENCE URL="http://www.documents.com/document.xml">...
</REFERENCE>

The string that defines the URL is the literal string. This one starts and ends with double quote characters. Literal strings are always surrounded by either single or double quotes. The quotes are not part of the string. For example, see Example 15-6.

Example 15-6. Quotes within quotes

"This is a double quoted literal."
'This is a single quoted literal.'
"'tis another double quoted literal."
'"And this is single quoted" said the self-referential example.'

ID and IDREF attributes

Sometimes it is important to be able to give a name to a particular occurrence of an element type; that is, to a single element. For instance, to make a simple hypertext link or cross-reference from one element to another, you can name a particular section or figure. Later, you can refer to it by its name.

The target element is labeled with an ID attribute. The other element refers to it with an IDREF attribute. This is shown in Example 15-7.

Example 15-7. Using ID and IDREF attributes

<BOOK>
...
<SECTION ID="Why.XML.Rocks"><TITLE>Features of XML</TITLE>
...
</SECTION>

...
If you want to recall why XML is so great, please see
the section entitled <CROSS-REFERENCE IDREF="Why.XML.Rocks"/>.
...
</BOOK>

Caution

You may see an element-type name, such as SECTION in the above example, referred to as an element name. The real element name – the name of this individual SECTION element – is the value of the element’s ID attribute; in this case, Why.XML.Rocks (See 20.2, “Tag vs. element”, on page 431.)
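Here is a rough sketch of how an application might follow a cross-reference to its target. It assumes Python's standard xml.etree.ElementTree module, which does not read a DTD and so does not know which attributes are declared as IDs; the code simply assumes that the attributes are literally named ID and IDREF, as in Example 15-7. The P element is added only to give the cross-reference somewhere to live.

```python
import xml.etree.ElementTree as ET

BOOK = """<BOOK>
<SECTION ID="Why.XML.Rocks"><TITLE>Features of XML</TITLE></SECTION>
<P>Please see the section entitled <CROSS-REFERENCE IDREF="Why.XML.Rocks"/>.</P>
</BOOK>"""

book = ET.fromstring(BOOK)

# Index every element that carries an ID attribute by its ID value.
targets = {el.get("ID"): el for el in book.iter() if el.get("ID") is not None}

# Look up the target of each cross-reference by its IDREF value.
for ref in book.iter("CROSS-REFERENCE"):
    target = targets[ref.get("IDREF")]
    print(ref.get("IDREF"), "->", target.tag, target.findtext("TITLE"))
```

A validating parser would additionally check that every ID value is unique and that every IDREF matches some ID; this sketch does neither.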

The prolog

XML documents may start with a prolog that describes the XML version (“1.0” or “1.1”, for now), document type, and other characteristics of the document.

The prolog is made up of an XML declaration and a document type declaration, both optional.

Caution

Some Microsoft Office applications require an XML declaration in order to recognize and process the document as XML.

The XML declaration must precede the document type declaration if both are provided. Also, comments, processing instructions, and white space can be mixed in among the two declarations. The prolog ends when the first start-tag begins.

Example 15-8 is a simple prolog.

Example 15-8. A simple prolog

<?xml version="1.0"?>
<!DOCTYPE book SYSTEM "http://www.oasis-open.org/.../docbookx.dtd">

This prolog says that the document conforms to XML version 1.0 and declares adherence to a particular document type, book.

XML declaration

The XML declaration has three parts; the last two are optional. A minimal XML declaration looks like this:

Example 15-9. Minimal XML declaration

<?xml version="1.0"?>

Example 15-10 is a more expansive one, using all of its parts.

Example 15-10. More expansive XML declaration

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

Although the parts have the same syntax as attributes, there is an important difference. The parts are strictly ordered whereas attributes can be specified in any order.
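The ordering rule is easy to try out. The following sketch assumes Python's standard xml.etree.ElementTree parser; is_well_formed is a hypothetical helper, and the doc element with its x and y attributes is made up for the comparison.

```python
import xml.etree.ElementTree as ET

def is_well_formed(data):
    """Return True if the parser accepts data as a well-formed document."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False

# Parts in the required order: version, encoding, standalone.
print(is_well_formed(b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?><doc/>'))
# The same kind of parts, but with encoding placed before version.
print(is_well_formed(b'<?xml encoding="UTF-8" version="1.0"?><doc/>'))

# Attributes, by contrast, may be written in either order.
a = ET.fromstring('<doc x="1" y="2"/>')
b = ET.fromstring('<doc y="2" x="1"/>')
print(a.attrib == b.attrib)
```

The first declaration should be accepted and the second rejected, while the two doc elements carry the same attributes regardless of the order in which they were typed.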

The third part, the standalone document declaration, is rarely used and arguably useless; we’ll say no more about it!

Version info

The version info part of the XML declaration declares the version of XML that is in use. It is required in all XML declarations, although the XML declaration itself is optional. At the time of writing, the only permitted version strings are “1.0” and “1.1”. If you leave out the entire XML declaration (thereby leaving out the version) then your document is presumed to be an XML 1.0 (not 1.1) document.

The XML version information is part of a general trend towards information representations that are self-identifying. This means that you can look at an XML document and (if it has the declaration) know immediately both that it is XML and what version of XML it uses. As more and more document representations become self-identifying, we will be able to stop relying on error-prone identification schemes like file extensions.

Encoding declaration

The encoding declaration part describes the encoding that is used. If omitted, it defaults to a Unicode encoding called UTF-8 which incorporates the commonly-used 7-bit ASCII. Therefore, you need only use the encoding declaration for a national or regional encoding like Russia’s KOI8-R, Western Europe’s ISO-8859-1 or Japan’s Shift-JIS, as shown in Example 15-11.

Example 15-11. Encoding declaration

<?xml version="1.0" encoding="KOI8-R"?>
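To see what the encoding declaration does for a parser, consider this small sketch. It assumes Python's standard xml.etree.ElementTree module and uses ISO-8859-1 rather than KOI8-R simply because the example word is Western European; the word element is made up.

```python
import xml.etree.ElementTree as ET

text = '<?xml version="1.0" encoding="ISO-8859-1"?><word>caf\u00e9</word>'
data = text.encode("iso-8859-1")  # the bytes as they would be stored in a file

print(data)                       # the accented letter is stored as the single byte 0xE9
print(ET.fromstring(data).text)   # the parser decodes it as the declaration says
```

The parser reads the declaration first and then uses the named encoding to turn the remaining bytes back into Unicode characters.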

Document type declaration

After the XML declaration (if present) and before the first element, there may be a document type declaration which declares the document type that is in use in the document. A “book” document type, for example, might be made up of chapters, while a letter document type could be made up of element types such as ADDRESS, SALUTATION, SIGNATURE, and so forth.

The document type declaration is at the heart of the concept of validity, which makes applications based on XML robust and reliable. It includes the markup declarations that express the document type definition (DTD).

The DTD is a formalization of the intuitive idea of a document type. The DTD lists the element types available and can put constraints on the occurrence and content of elements and other details of the document structure. This makes an information system more robust by forcing the documents that are part of it to be consistent.

A schema definition can also be used for this purpose, but other means must be used to associate it with documents. There are several schema languages; we discuss the official W3C one in Chapter 22, “XML Schema (XSDL)”, on page 466.

Entities: Breaking up is easy to do

XML allows flexible organization of document text. The XML constructs that provide this flexibility are called entities. They allow a document to be broken up into multiple storage objects and are important tools for reusing and maintaining text. Entities are used in many publishing-oriented applications of XML but are much less common in machine-to-machine applications.

In simple cases, an entity is like an abbreviation in that it is used as a short form for some text. We call the “abbreviation” the entity name and the long form the entity content. That content could be as short as a character or as long as a chapter. For instance, in an XML document, the entity XSL could have the phrase “Extensible Style Language” as its content. Using a reference to that entity is like using “XSL” as an abbreviation for that phrase – the parser replaces the reference with the content.

You create the entity with an entity declaration. Example 15-12 is an entity declaration for an abbreviation.

Example 15-12. Entity used as an abbreviation

<!ENTITY XSL "Extensible Style Language">

Like other markup declarations, entity declarations occur in the document type declaration section of the document prolog (Example 15-13).

Example 15-13. Entity declarations occur in the document type declaration

<!DOCTYPE mydoc ...[
  <!ENTITY XSL "Extensible Style Language">
  ...other markup declarations ...
]>

Note

You can use entities with schemas. In that case your “DTD” would consist solely of declarations needed for the entities.

Entities can be much more than just abbreviations. Another way to think of an entity is as a box with a label. The label is the entity’s name. The content of the box is some sort of text or data. The entity declaration creates the box and sticks on a label with the name. Sometimes the box holds XML text that is going to be parsed (interpreted according to the rules of the XML notation), and sometimes it holds data, which should not be.

Parsed entities

If the content of an entity is XML text that the parser should parse, the XML spec calls it a parsed entity.

If the content of an entity is data that is not to be parsed, the XML spec calls it an unparsed entity.

The abbreviation in Example 15-12 is a parsed entity. Parsed entities, being XML text, can also contain markup. Example 15-14 is a declaration for a parsed entity with some markup in it.

Example 15-14. Parsed entity with markup

<!ENTITY XSL "<title>Extensible Style Language</title>">

Because the entity content in the example is in the entity declaration, the entity is called an internal entity. Only parsed entities can be internal entities.

External entities

The parser can also fetch content from somewhere on the Web and put that into the box. This is an external entity. For instance, it could fetch a chapter of a book and put it into an entity. This would allow you to reuse the chapter between books. Another benefit is that you could edit the chapter separately with a sufficiently intelligent editor. This would be very useful if you were working on a team project and wanted different people to work on different parts of a document at once. Example 15-15 illustrates.

Example 15-15. External entity declaration

<!ENTITY intro-chapter SYSTEM "http://www.megacorp.com/intro.xml">

Entities also allow you to edit very large documents without running out of memory. Depending on your software and needs, either each volume or even each article in an encyclopedia could be an entity.

Entity references

An author or DTD designer refers to an entity through an entity reference. The XML parser replaces the reference by the content, as if it were an abbreviation and the content was the expanded phrase. This process is called entity inclusion or entity replacement. After the operation we say either that the entity reference has been replaced by the entity content or that the entity content has been included.

Which you would use depends on whether you are talking from the point of view of the entity reference or the entity content. The content of parsed entities is called their replacement text.[4]

Example 15-16 is an example of a parsed entity declaration and its associated reference.

Example 15-16. Entity declaration

<!DOCTYPE MAGAZINE[
...
<!ENTITY title "Hacker Life">
...
]>
<MAGAZINE>
<TITLE>&title;</TITLE>
...
<P>Welcome to the introductory issue of &title;. &title; is
geared to today's modern hacker.</P>
...
</MAGAZINE>

Anywhere in the document instance that the entity reference “&title;” appears, it is replaced by the text “Hacker Life”. It is just as valid to say that “Hacker Life” is included at each point where the reference occurs.

The ampersand character starts all general entity references and the semicolon ends them. The text between is an entity name.
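Entity replacement can be observed directly. The sketch below feeds a shortened version of Example 15-16 to Python's standard xml.etree.ElementTree parser (our assumption for the demonstration), which expands internal entities declared in the document type declaration.

```python
import xml.etree.ElementTree as ET

DOC = """<!DOCTYPE MAGAZINE [
<!ENTITY title "Hacker Life">
]>
<MAGAZINE>
<TITLE>&title;</TITLE>
<P>Welcome to the introductory issue of &title;.</P>
</MAGAZINE>"""

magazine = ET.fromstring(DOC)

# By the time the application sees the text, each reference
# has already been replaced by the entity content.
print(magazine.findtext("TITLE"))
print(magazine.findtext("P"))
```

The application never sees “&title;” at all. If you remove the entity declaration, the same parser reports an error for the undeclared entity instead.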

How entities are used

Here are some examples of what you can do with entities:

  • You could store every chapter of a book in a separate file and link them together as entities.

  • You could “factor out” often-reused text, such as a product name, into an entity so that it is consistently spelled and displayed throughout the document.

  • You could update the product name entity to reflect a new version. The change would be instantly visible anywhere the entity was used.

  • You could create an entity that would represent “legal boilerplate” text (such as a software license) and reuse that entity in many different documents.

Note

We have explained only the basics about entities. For the full story, see the XML Recommendation or The XML Handbook.

Character references

It is not usually convenient to type in characters that are not available on the keyboard. With many text editors, it is not even possible to do so. XML allows you to insert such a character with a character reference.

If, for instance, you wanted to insert a character from the “International Phonetic Alphabet”, you could spend a long time looking for a combination of keyboard, operating system and text editor that would make that straightforward. XML simply allows you to refer to the character by its Unicode number.

Reference by decimal number

Here is an example:

Example 15-17. Decimal character reference

<P>Here is a reference to Unicode character 161: &#161;.</P>

Unicode is a character set. The character numbered 161 in Unicode happens to be the inverted exclamation mark.

Reference by hexadecimal number

Alternatively, you could use the hexadecimal (hex) value of the character number to reference it:

Example 15-18. Hex character reference

<P>Here is a different reference to Unicode character 161: &#xA1;.</P>

Hex is a numbering system often used by computer programmers that translates naturally into the binary codes that computers use. The Unicode Standard book uses hex, so those that have that book will probably prefer this type of character reference over the other (whether they are programmers or not).

Note that character references are not entity references, though they look similar to them. Entities have names and values, but character references only have character numbers. In an XML document, all entities except the predefined ones must be declared. But a character reference does not require a declaration; it is just a really verbose way to type a character (but often the only way).
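A quick way to convince yourself that the decimal and hex forms are interchangeable is to parse both and compare the results. This sketch assumes Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

from_decimal = ET.fromstring("<P>&#161;</P>").text
from_hex = ET.fromstring("<P>&#xA1;</P>").text

# Both references name the same Unicode character.
print(from_decimal == from_hex)
# Its character number, in decimal and in hex.
print(ord(from_decimal), hex(ord(from_decimal)))
```

Neither document needed a declaration for the reference, which is the practical difference from an entity reference.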

Reference by name (via an entity)

Because Unicode numbers are hard to remember, it is often useful to declare entities that stand in for them:

Example 15-19. Entity declaration for a Unicode character

<!ENTITY inverted-exclamation "&#161;">

Suppressing markup recognition

Sometimes when you are creating an XML document, you want to protect certain characters from being interpreted as markup. Imagine, for example, that you are writing a user’s guide to HTML. You would need a way to include an example of markup. Your first attempt might be to create an example element and do something like Example 15-20.

Example 15-20. An invalid approach to HTML examples in XML

<p>HTML documents must start with a DOCTYPE, etc. etc. This
is an example of a small HTML document:
<sample>

  <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
  <HTML>
  A document's title
  <H1>A document's title</H1>
  </HTML>

</sample>
</p>

This will not work, however, because the angle brackets that are supposed to represent HTML markup will be interpreted as if they belonged to the XML document you are creating, not the mythical HTML document in the example. Your XML parser will complain that it is not appropriate to have an HTML DOCTYPE declaration in the middle of an XML document!

There are two solutions to this problem: CDATA sections and predefined entities.

CDATA sections

A construct called a CDATA section allows you to ask the parser to suspend markup recognition in a large chunk of text: “Hands off! This isn’t meant to be interpreted.”

CDATA stands for “character data”. You can mark a section as being character data using the syntax shown in Example 15-21.

Example 15-21. Writing about HTML in a CDATA section

<![CDATA[
<HTML>
This is an example from HTML for Dumbbells!
<p>It may be a pain to write a book about HTML in HTML,
but it is easy in XML!
</HTML>
]]>

The first and last lines mark the start and end, respectively, of the CDATA section. The last line is a delimiter called CDEnd (]]>). It may only be used to close CDATA sections. It must not occur anywhere else in an XML document.
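Here is a minimal sketch of what the application receives from a CDATA section, assuming Python's standard xml.etree.ElementTree parser. The sample element is borrowed from Example 15-20 and the text inside the section is made up.

```python
import xml.etree.ElementTree as ET

sample = ET.fromstring(
    "<sample><![CDATA[<HTML><p>Not parsed as markup</HTML>]]></sample>"
)

print(sample.text)  # the angle brackets arrive as ordinary character data
print(len(sample))  # the number of subelements created from the section
```

The text inside the section would not even be well-formed XML on its own (the p element is never closed), yet the parser accepts it because it does not look for markup there.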

Predefined entities

Predefined entities allow an author to represent individual data characters that would otherwise be interpreted as markup. There are five of them, shown in Table 15-1, along with the markup interpretations that they avoid.

Table 15-1. Predefined entities

Entity reference    Character    Markup not recognized
&amp;               &            Entity or character reference
&lt;                <            Tag
&gt;                >            CDEnd
&apos;              '            Literal
&quot;              "            Literal

We can use references to the predefined entities to insert these characters, instead of typing them directly. Then they will not be interpreted as markup. Example 15-22 demonstrates this.

Example 15-22. Writing about HTML with predefined entities

<p>HTML documents must start with a DOCTYPE, etc. etc. This
is an example of a small HTML document:
<sample>
   &lt;!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
   &lt;HTML>
   &lt;HEAD>
   &lt;TITLE>A document's title
   &lt;/TITLE>
   &lt;/HEAD>
   &lt;/HTML>
</sample>

When your XML parser parses the document, it will replace the entity references with actual characters. It will not interpret the characters it inserts as markup, but as “plain old data characters” (character data).

Predefined entities and CDATA sections only relate to the interpretation of the markup, not to the properties of the real document that the markup represents.
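Both directions can be tried out in a few lines. This sketch uses Python's standard library: `xml.etree.ElementTree` to parse a shortened version of Example 15-22, and `xml.sax.saxutils.escape` to produce entity references from raw text. The sample strings are our own.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Parsing: the entity references become plain data characters.
doc = """<sample>
   &lt;HTML>
   &lt;TITLE>A document's title
   &lt;/TITLE>
   &lt;/HTML>
</sample>"""
sample = ET.fromstring(doc)
text = sample.text          # contains a literal "<HTML>", not an element
child_count = len(sample)   # no child elements were created

# Writing: escape() replaces &, < and > by default.
escaped = escape("<TITLE>Q&A</TITLE>")

# Quote characters are only replaced if you ask for it.
escaped_quotes = escape('say "hi"', {'"': "&quot;", "'": "&apos;"})

print(text)
print(escaped)
print(escaped_quotes)
```

Escaping on output and replacing on input are mirror images, which is why the original characters survive a round trip.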

Comments

Sometimes it is useful to embed information about a document or its markup in a manner that will be ignored by computer processes and renditions of the document. For example, you might insert a note to yourself to clean up the wording of a section, a note to a co-author explaining the reason for a particular section of the document, or a note in a DTD describing the semantics of a particular element. This information can be hidden from the application in a comment. Comments should never be displayed in a browser, indexed in a search engine, or otherwise processed as part of the data of the actual document.

Example 15-23. A comment

<!-- This section is really good! Let's not change it. -->

Comments consist of the characters “<!--” followed by almost anything and ended by “-->”. The “almost anything” in the middle cannot contain the characters “--”. This is a little bit inconvenient, because people often use those two characters as a sort of dash, to separate thoughts. This is another point to be careful of, lest you get bitten.

Comments can go just about anywhere in the instance or the prolog. However, they are not permitted within declarations, tags, or other comments.

Markup is not recognized in comments. You can put less-than and ampersand symbols in them, but they will not be recognized as the start of elements or entity references.
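These rules are easy to verify. The sketch below uses Python's standard-library `xml.dom.minidom`, which keeps comments as nodes so that we can inspect them; both sample documents are our own inventions.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Markup characters inside a comment are not recognized as markup.
good = "<doc><!-- markup like <b> and &amp; is ignored here --><p>text</p></doc>"
dom = minidom.parseString(good)
comment = dom.documentElement.firstChild
is_comment = comment.nodeType == comment.COMMENT_NODE
comment_text = comment.data   # "<b>" and "&amp;" are kept exactly as typed

# Two hyphens in the middle of a comment make the document ill-formed.
bad = "<doc><!-- two hyphens -- are not allowed --></doc>"
try:
    minidom.parseString(bad)
    rejected = False
except ExpatError:
    rejected = True

print(comment_text)
print(rejected)
```

An application is free to ignore the comment node entirely, which is exactly what most of them do.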

Processing instructions

An XML comment is for those occasions where you need to say something to another human being without reference to the DTD or schema, and without changing the way the document looks to readers or applications. A processing instruction (PI) is for those occasions where you need to say something to a computer program without reference to the DTD or schema and without changing the way that the document is processed by other computer programs. This is only supposed to happen rarely.

Processing instructions start with the fixed string “<?”. That is followed by a name and, after that, any characters except for the string that ends the PI, “?>”. The XML declaration shown in Example 15-10 has this form, although the XML specification formally treats it as a separate construct rather than as a PI.

The name at the beginning of the PI is called the PI target. This name would typically be specified in the documentation for the tool or specification. In this case, the PI target is the XML processor itself.

After the PI target comes white space and then some totally proprietary command. This command is not parsed in the usual way at all. Characters that would usually indicate markup are totally ignored. The command is passed directly to the application and it does what it wants to with it.

The command ends when the processor sees the string “?>”. There is absolutely no standard for the characters in the middle. PIs could use attribute syntax for convenience, as the XML declaration does, but they could also choose not to.

Processing instructions are appropriate when you are specifying information about a document that is unrelated to its structure and content. As we describe in 18.10, “Referencing XSLT stylesheets”, on page 412, XSL provides a processing instruction for associating stylesheets. Similarly, Office uses a PI to determine which application to use to open an XML document:

<?mso-application progid="Word.Document"?>

Note that this sort of processing instruction does not really add anything to the content or structure of the document. It says something about how to process the document. It says: “This document has an associated stylesheet (or application).”
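A program that receives a PI sees just two things: the target and the rest. The following sketch uses Python's standard-library `xml.dom.minidom` to show this. The `my-tool` PI and the `doc` element are hypothetical, added only to show that markup characters in a PI pass through unparsed.

```python
from xml.dom import minidom

doc = """<?xml version="1.0"?>
<?mso-application progid="Word.Document"?>
<?my-tool <raw> & unparsed?>
<doc/>"""

dom = minidom.parseString(doc)

# Collect the PIs that precede the root element.
pis = [node for node in dom.childNodes
       if node.nodeType == node.PROCESSING_INSTRUCTION_NODE]

office_pi, tool_pi = pis
target = office_pi.target     # the PI target (a name)
data = office_pi.data         # everything after the target, as one string
raw_data = tool_pi.data       # "<" and "&" were not treated as markup

print(target, data)
print(raw_data)
```

Although the Office PI looks as if it has a `progid` attribute, the parser delivers it as a single string. Interpreting that string is left to the application.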

It is not always obvious what is abstract information and what is merely processing information. If your information must be embedded in documents of many types, or with DTDs or schemas that you cannot change, then processing instructions are typically the appropriate technique.

Office support for the XML language

Office can read and write XML documents that have declarations and other markup besides tags. However, most markup cannot be entered directly; the user interfaces of the products must be used, and they do not support some features of the XML language.

These support characteristics (shown in Table 15-2) may be significant when working with XML documents that are generated or processed by systems other than Office.

Table 15-2. Degree of support for XML language constructs

| Construct | Open and Import | User interface | Save and Export |
| --- | --- | --- | --- |
| Tags | Full: includes tags with attributes and empty-element tags | Full | Full: all tags read in or created through user interface |
| User-defined parsed entities | Full: references to internal and external entities replaced with entity content | None | None: entity content merged into document |
| Processing instructions | Full: PIs allowed anywhere | None | Partial: saves PIs that precede root element |
| Predefined entities | Full: entity references replaced with their data characters | No need | Full: prohibited data characters converted to entity references |
| Character references | Full: character references replaced with their data characters | None | None: characters merged into document; declarations not saved |
| CDATA sections | Full: markup in section content treated as data | None | None: section content merged into document; protected with predefined entity references |
| DTDs | Full (non-validating): default attribute values are processed | None | None: DTD not saved |
| Comments | Full: comments allowed anywhere | None | None: comments not saved |

In many cases, lack of support for a construct can be overcome by simple generic transforms when opening and saving the document.

For example, an opening transform could wrap each CDATA section in start-tags and end-tags such as <saveMarkup:CDATA>, which would preserve the section boundaries. The original markup could then be restored by transforming the document when it is saved, replacing the tags with the CDATA start and end delimiters.

Entity references could be handled similarly by adding an attribute for the system identifier URL, so that the closing transform can restore the entity declaration.

Summary

An XML document is composed of a prolog and a document instance. The prolog is optional, and provides information about how the document is structured both physically (where its parts are) and logically (how its elements fit together). Elements and attributes describe the logical structure while entities describe the physical structure.



[1] Roughly, what the XML spec calls the “root element”.

[2] The two versions differ only in some character set details, which is why XML 1.1 hasn’t been mentioned before.

[3] Unless they are in the same namespace, a situation we discuss in Chapter 16, “Namespaces”, on page 376.

[4] If you are a programmer, you might think of entities as macros and call the process entity expansion.
