2.5. Entities: Placeholders for Content

With the basic parts of XML markup defined, there is one more component we need to look at. An entity is a placeholder for content, which you declare once and can use many times almost anywhere in the document. It doesn't add anything semantically to the markup. Rather, it's a convenience to make XML easier to write, maintain, and read.

Entities can be used for different reasons, but they always eliminate an inconvenience. They do everything from standing in for impossible-to-type characters to marking the place where a file should be imported. You can define entities of your own to stand in for recurring text such as a company name or legal boilerplate. Entities can hold a single character, a string of text, or even a chunk of XML markup. Without entities, XML would be much less useful.

You could, for example, define an entity w3url to represent the W3C's URL. Whenever you enter the entity in a document, it will be replaced with the text http://www.w3.org/.

Figure 2.13 shows the different kinds of entities and their roles. The two major entity types are parameter entities and general entities. Parameter entities are used only in DTDs, so we'll describe them in Chapter 5. In this section, we'll focus on the other type, general entities. General entities are placeholders for any content that occurs at the level of or inside the root element of an XML document.

Figure 2.13. Taxonomy of entities

An entity consists of a name and a value. When an XML parser begins to process a document, it first reads a series of declarations, some of which define entities by associating a name with a value. The value is anything from a single character to a file of XML markup. As the parser scans the XML document, it encounters entity references, which are special markers derived from entity names. For each entity reference, the parser consults a table in memory for something with which to replace the marker. It replaces the entity reference with the appropriate replacement text or markup, then resumes parsing just before that point, so the new text is parsed too. Any entity references inside the replacement text are also replaced; this process repeats as many times as necessary.

Figure 2.14 shows that there are two kinds of syntax for entity references. The first, consisting of an ampersand (&), the entity name, and a semicolon (;), is for general entities. The second, distinguished by a percent sign (%) instead of the ampersand, is for parameter entities.

Figure 2.14. Syntax for entity references

The following is an example of a document that declares three general entities and references them in the text:

<?xml version="1.0"?>
<!DOCTYPE message SYSTEM "/xmlstuff/dtds/message.dtd"
[
  <!ENTITY client "Mr. Rufus Xavier Sasperilla">
  <!ENTITY agent "Ms. Sally Tashuns">
  <!ENTITY phone "<number>617-555-1299</number>">
]>
<message>
<opening>Dear &client;</opening>
<body>We have an exciting opportunity for you! A set of 
ocean-front cliff dwellings in Pi&#241;ata, Mexico have been
renovated as time-share vacation homes. They're going fast! To 
reserve a place for your holiday, call &agent; at &phone;. 
Hurry, &client;. Time is running out!</body>
</message>

The entities &client;, &agent;, and &phone; are declared in the internal subset of this document and referenced in the <message> element. A fourth entity, &#241;, is a numbered character entity that represents the character ñ. This entity is referenced but not declared; no declaration is necessary because numbered character entities are implicitly defined in XML as references to characters in the current character set. (For more information about character sets, see Chapter 7.) The XML parser simply replaces the entity with the correct character.

The previous example looks like this with all the entities resolved:

<?xml version="1.0"?>
<!DOCTYPE message SYSTEM "/xmlstuff/dtds/message.dtd">
<message>
<opening>Dear Mr. Rufus Xavier Sasperilla</opening>
<body>We have an exciting opportunity for you! A set of 
ocean-front cliff dwellings in Piñata, Mexico have been
renovated as time-share vacation homes. They're going fast! To 
reserve a place for your holiday, call Ms. Sally Tashuns at
<number>617-555-1299</number>.
Hurry, Mr. Rufus Xavier Sasperilla. Time is running out!</body>
</message>

All entities (besides predefined ones) must be declared before they are used in a document. Two acceptable places to declare them are in the internal subset, which is ideal for local entities, and in an external DTD, which is more suitable for entities shared between documents. If the parser runs across an entity reference that hasn't been declared, either implicitly (a predefined entity) or explicitly, it can't insert replacement text in the document because it doesn't know what to replace the entity with. This error prevents the document from being well-formed.

2.5.1. Character Entities

Entities that contain a single character are called, naturally, character entities. These fall into several groups:


Predefined character entities

Some characters cannot be used in the text of an XML document because they conflict with the special markup delimiters. For example, angle brackets (<>) are used to delimit element tags. The XML specification provides the following predefined character entities, so you can express these characters safely:

NameValue
amp&
apos'
gt>
lt<
quot"


Numbered character entities

XML supports Unicode, a huge character set with tens of thousands of different symbols, letters, and ideograms. You should be able to use any Unicode character in your document. The problem is how enter a nonstandard character from a keyboard with less than 100 keys, or how to represent one in a text-only editor display. One solution is to use a numbered character entity, an entity whose name is of the form #n, where n is a number that represents the character's position in the Unicode character set.

The number in the name of the entity can be expressed in decimal or hexadecimal format. For example, a lowercase c with a cedilla (ç) is the 231st Unicode character. It can be represented in decimal as &#231; or in hexadecimal as &#xe7;. Note that the hexadecimal version is distinguished with an x as the prefix to the number. The range of characters that can be represented this way starts at zero and goes up to 65,536. We'll discuss character sets and encodings in more detail in Chapter 7.


Named character entities

The problem with numbered character entities is that they're hard to remember: you need to consult a table every time you want to use a special character. An easier way to remember them is to use mnemonic entity names. These named character entities use easy-to-remember names for references like &THORN;, which stands for the Icelandic capital thorn character (Þ).

Unlike the predefined and numeric character entities, you do have to declare named character entities. In fact, they are technically no different from other general entities. Nevertheless, it's useful to make the distinction, because large groups of such entities have been declared in DTD modules that you can use in your document. An example is ISO-8879, a standardized set of named character entities including Latin, Greek, Nordic, and Cyrillic scripts, math symbols, and various other useful characters found in European documents.

2.5.2. Mixed-Content Entities

Entity values aren't limited to a single character, of course. The more general mixed-content entities have values of unlimited length and can include markup as well as text. These entities fall into two categories: internal and external. For internal entities, the replacement text is defined in the entity declaration; for external entities, it is located in another file.

2.5.2.1. Internal entities

Internal mixed-content entities are most often used to stand in for oft-repeated phrases, names, and boilerplate text. Not only is an entity reference easier to type than a long piece of text, but it also improves accuracy and maintainability, since you only have to change an entity once for the effect to appear everywhere. The following example proves this point:

<?xml version="1.0"?>
<!DOCTYPE press-release SYSTEM "http://www.dtdland.org/dtds/reports.dtd" 
[
  <!ENTITY bobco "Bob's Bolt Bazaar, Inc.">
]>
<press-release>
<title>&bobco; Earnings Report for Q3</title>
<par>The earnings report for &bobco; in fiscal
quarter Q3 is generally good. Sales of &bobco; bolts increased 35%
over this time a year ago.</par>
<par>&bobco; has been supplying high-quality bolts to contractors
for over a century, and &bobco; is recognized as a leader in the
construction-grade metal fastener industry.</par>
</press-release>

The entity &bobco; appears in the document five times. If you want to change something about the company name, you only have to enter the change in one place. For example, to make the name appear inside a <companyname> element, simply edit the entity declaration:

<!ENTITY bobco 
  "<companyname>Bob's Bolt Bazaar, Inc.</companyname>">

When you include markup in entity declarations, be sure not to use the predefined character entities (e.g., &lt; and &gt;). The parser knows to read the markup as an entity value because the value is quoted inside the entity declaration. Exceptions to this are the quote-character entity " and the single-quote character entity '. If they would conflict with the entity declaration's value delimiters, then use the predefined entities, e.g., if your value is in double quotes and you want it to contain a double quote.

Entities can contain entity references, as long as the entities being referenced have been declared previously. Be careful not to include references to the entity being declared, or you'll create a circular pattern that may get the parser stuck in a loop. Some parsers will catch the circular reference, but it is an error.

2.5.2.2. External entities

Sometimes you may need to create an entity for such a large amount of mixed content that it is impractical to fit it all inside the entity declaration. In this case, you should use an external entity, an entity whose replacement text exists in another file. External entities are useful for importing content that is shared by many documents, or that changes too frequently to be stored inside the document. They also make it possible to split a large, monolithic document into smaller pieces that can be edited in tandem and that take up less space in network transfers. Figure 2.15 illustrates how fragments of XML and text can be imported into a document.

Figure 2.15. Using external entities to import XML and text

External entities effectively break a document into multiple physical parts. However, all that matters to the XML processor is that the parts assemble into a perfect whole. That is, all the parts in their different locations must still conform to the well-formedness rules. The XML parser stitches up all the pieces into one logical document; with the correct markup, the physical divisions should be irrelevant to the meaning of the document.

External entities are a linking mechanism. They connect parts of a document that may exist on other systems, far across the Internet. The difference from traditional XML links (XLinks) is that for external entities, the XML processor must insert the replacement text at the time of parsing. See Chapter 3 for others kinds of links.

External entities must always be declared, so the parser knows where to find the replacement text. In the following example, a document declares the three external entities &part1;, &part2;, and &part3; to hold its content:

<?xml version="1.0"?>
<!DOCTYPE doc SYSTEM "http://www.dtds-r-us.com/generic.dtd"
[
  <!ENTITY part1 SYSTEM "p1.xml">
  <!ENTITY part2 SYSTEM "p2.xml">
  <!ENTITY part3 SYSTEM "p3.xml">
]>
<longdoc>
  &part1;
  &part2;
  &part3;
</longdoc>

This process is illustrated in Figure 2.16. The file at the top of the pyramid contains the document declarations and external entity references, so we might call it the "master file." The other files are subdocuments, pieces of XML that are not documents in their own right. It would be an error to insert document prologs into each subdocument, because then we would no longer have one logical document.

Figure 2.16. A compound document

Since they are merely portions of a document and lack document prologs of their own, the subdocuments cannot be validated individually (although they may still qualify as well-formed documents without document prologs). The master file can be validated, because its parts are automatically imported by the parser when it sees the external entities. Note, though, that if there is a syntax error in a subdocument, that error will be imported into the whole document. External entities don't shield you from parsing or validity errors.

The syntax just shown for declaring an external entity uses the keyword SYSTEM followed by a quoted string containing a filename. This string is called a system identifier and is used to identify a resource by location. The quoted string is actually a URL, so you can include files from anywhere on the Internet. For example:

<!ENTITY catalog SYSTEM "http://www.bobsbolts.com/catalog.xml">

The system identifier suffers from the same drawback as all URLs: if the referenced item is moved, the link breaks. To avoid that problem, you can use a public identifier in the entity declaration. In theory, a public identifier will endure any location shuffling and still fetch the correct resource. For example:

<!ENTITY faraway PUBLIC "-//BOB//FILE Catalog//EN"
    "http://www.bobsbolts.com/catalog.xml">

Of course, for this to work, the XML processor has to know how to use public identifiers, and it must be able to find a catalog that maps them to actual locations. In addition, there's no guarantee that the catalog is up to date. A lot can go wrong. Perhaps for this reason, the public identifier must be accompanied by a system identifier (here, "http://www.bobsbolts.com/catalog.xml"). If the XML processor for some reason can't handle the public identifier, it falls back on the system identifier. Most web browsers in use today can't deal with public identifiers, so perhaps the backup is a good idea.

2.5.3. Unparsed Entities

The last kind of entity discussed in this chapter is the unparsed entity. This kind of entity holds content that should not be parsed because it contains something other than text and would likely confuse the parser. Unparsed entities are used to import graphics, sound files, and other non-character data.

The declaration for an unparsed entity looks similar to that of an external entity, with some additional information at the end. For example:

<?xml version="1.0"?>
<!DOCTYPE doc [
  <!ENTITY mypic SYSTEM "photos/erik.gif" NDATA GIF>
]>
<doc>
  <para>Here's a picture of me:</para>

  &mypic;

</doc>

This declaration differs from an external entity declaration in that there is an NDATA keyword following the system path information. This keyword tells the parser that the entity's content is in a special format, or notation, other than the usual parsed mixed content. The NDATA keyword is followed by a notation identifier that specifies the data format. In this case, the entity is a graphic file encoded in the GIF format, so the word GIF is appropriate.

The notation identifier must be declared in a separate notation declaration, which is a complex affair discussed in Chapter 5. GIF and other notations are not built into XML, and an XML processor may not know what to do with them. At the very least, the parser will not blindly load the entity's content and attempt to parse it, which offers some protection from errors.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset