2.2. Elements: The Building Blocks of XML

Elements are the named parts into which a document is divided. Dividing a document this way lets each part be rendered differently or picked out by a search engine. An element can act as a container, holding text, other elements, or a mixture of both. This element contains only text:

<flooby>This is text contained inside an element</flooby>

and this element contains both text and elements:

<outer>this is text<inner>more
text</inner>still more text</outer>

Some elements are empty, and contribute information by their position and attributes. There is an empty element inside this example:

<outer>an element can be empty: <nuttin/></outer>
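To see how a parser views these three cases, here is a minimal sketch that assumes Python and its standard-library xml.etree.ElementTree module (the choice of parser is ours, not something XML requires). In that API, text before the first child is stored in the parent's text field, and text that follows a child's end tag is stored in that child's tail field.

```python
import xml.etree.ElementTree as ET

# Mixed content: <outer> holds text, a child element, and more text.
outer = ET.fromstring(
    "<outer>this is text<inner>more\ntext</inner>still more text</outer>"
)
inner = outer[0]

print(repr(outer.text))  # text before the first child
print(repr(inner.text))  # text inside <inner>, line break included
print(repr(inner.tail))  # text after </inner>, still inside <outer>

# An empty element has no text and no children.
doc = ET.fromstring("<outer>an element can be empty: <nuttin/></outer>")
nuttin = doc.find("nuttin")
print(nuttin.text, len(nuttin))
```

Other XML libraries model mixed content differently (for example, as a list of text and element nodes), so treat the text/tail split as a detail of this one library.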

Figure 2.7 shows the syntax for a container element. It begins with a start tag (1) consisting of an angle bracket (<) followed by a name (2). The start tag may contain some attributes (3) separated by whitespace, and it ends with a closing angle bracket (>). An attribute defines a property of the element and consists of a name (4) joined by an equals sign (=) to a value in quotes (5). An element can have any number of attributes, but no two attributes can have the same name. Following the start tag is the element's content (6), which in turn is followed by an end tag (7). The end tag consists of an opening angle bracket, a slash, the element's name, and a closing bracket. The end tag has no attributes, and the element name must match the start tag's name exactly.

Figure 2.7. Container element syntax

As shown in Figure 2.8, an empty element (one with no content) consists of a single tag (1) that begins with an opening angle bracket (<) followed by the element name (2). This is followed by some number of attributes (3), each of which consists of a name (4) and a value in quotes (5), and the element ends with a slash (/) and a closing angle bracket.

Figure 2.8. Empty element syntax
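The two tag forms can be tried out with a short sketch, again assuming Python's standard xml.etree.ElementTree parser. The element and attribute names (note, marker, lang, id, kind) are made up for illustration. The last case shows the parser rejecting two attributes with the same name.

```python
import xml.etree.ElementTree as ET

# Container element: start tag with two attributes, content, end tag.
note = ET.fromstring('<note lang="en" id="n1">remember the milk</note>')
print(note.tag, note.attrib, note.text)

# Empty element: a single tag that carries attributes but no content.
marker = ET.fromstring('<marker id="m1" kind="pagebreak"/>')
print(marker.tag, marker.attrib, marker.text)

# Two attributes with the same name make the document ill-formed.
try:
    ET.fromstring('<note id="n1" id="n2">oops</note>')
    duplicate_ok = True
except ET.ParseError as err:
    duplicate_ok = False
    print("rejected:", err)
```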

An element name must start with a letter or an underscore, and can contain any number of letters, numbers, hyphens, periods, and underscores.[3] Element names can include accented Roman characters; letters from alphabets such as Cyrillic, Greek, Hebrew, Arabic, Thai, Hiragana, Katakana, and Devanagari; and ideograms from Chinese, Japanese, and Korean. The colon symbol is used in namespaces, as explained in Section 2.4, so avoid using it in element names that don't use a namespace. Space, tab, newline, equals sign, and any quote characters are separators for element names, attribute names, and attribute values, so they are not allowed either. Some valid element names are: <Bob>, <chapter.title>, <THX-1138>, or even <_>. XML names are case-sensitive, so <Para>, <para>, and <pArA> are three different elements.

[3] Practically speaking, you should avoid using extremely long element names, in case an XML processor cannot handle names above a certain length. There is no specific number, but probably anything over 40 characters is unnecessarily long.
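The naming rules are easy to check by experiment. The sketch below assumes Python's standard xml.etree.ElementTree parser. The is_well_formed helper is our own name, not part of any library, and the two invalid names are invented for the test.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if xml_text parses as a well-formed XML document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# The valid names from the text, each tried as an empty element.
valid_names = ["Bob", "chapter.title", "THX-1138", "_"]
valid_results = [is_well_formed(f"<{name}/>") for name in valid_names]
print(valid_results)

# A name may not start with a digit or a hyphen.
starts_with_digit = is_well_formed("<1138-THX/>")
starts_with_hyphen = is_well_formed("<-dash/>")
print(starts_with_digit, starts_with_hyphen)

# Names are case-sensitive, so </para> does not close <Para>.
case_mismatch = is_well_formed("<Para>text</para>")
print(case_mismatch)
```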

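There can be no whitespace between the opening angle bracket (or the </ of an end tag) and the element name, and the slash and closing bracket that end an empty element must stay together. Extra whitespace is okay after the element name, between attributes, around the equals sign, and before the closing bracket. This allows you to break a tag across lines to make it more readable. For example: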

<boat
  type="trireme"
><crewmember   class="rower">Dronicus Laborius</crewmember    ></boat>
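As a quick check that whitespace inside tags carries no meaning, the following sketch (assuming Python's standard xml.etree.ElementTree parser) parses a spaced-out boat element and a compact one, then compares what the parser produced. It also confirms that a space right after the opening angle bracket is rejected. The closing </boat> tag is included so that each string is a complete document.

```python
import xml.etree.ElementTree as ET

spaced = ET.fromstring(
    '<boat\n'
    '  type="trireme"\n'
    '><crewmember   class="rower">Dronicus Laborius</crewmember    ></boat>'
)
compact = ET.fromstring(
    '<boat type="trireme">'
    '<crewmember class="rower">Dronicus Laborius</crewmember></boat>'
)

# Whitespace inside the tags is not part of the data, so both
# spellings serialize to the same bytes.
same_tree = ET.tostring(spaced) == ET.tostring(compact)
print(same_tree)

# A space between "<" and the element name is not allowed.
try:
    ET.fromstring('< boat type="trireme"/>')
    leading_space_ok = True
except ET.ParseError:
    leading_space_ok = False
print(leading_space_ok)
```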

There are two rules about the positioning of start and end tags:

  • The end tag must come after the start tag.

  • An element's start and end tags must both reside in the same parent.

To understand the second rule, think of elements as boxes. A box can sit inside or outside another box, but it can't protrude through the other box without making a hole in its side. Thus, the following example of overlapping elements doesn't work:

<a>Don't <b>do</a> this!</b>

These untangled elements are okay:

<a>No problem</a><b>here</b>
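A parser enforces the nesting rule for you. In this sketch, which assumes Python's standard xml.etree.ElementTree parser, the fragments are wrapped in a hypothetical doc element, because a complete XML document needs a single outermost element. The check helper is our own name.

```python
import xml.etree.ElementTree as ET

def check(xml_text):
    """Report whether xml_text is well-formed, with the parser's message if not."""
    try:
        ET.fromstring(xml_text)
        return "well-formed"
    except ET.ParseError as err:
        return f"error: {err}"

# Overlapping: <b> starts inside <a> but ends outside it.
overlapping = check("<doc><a>Don't <b>do</a> this!</b></doc>")

# Side by side: each element closes before the next one opens.
siblings = check("<doc><a>No problem</a><b>here</b></doc>")

# Nested: <b> starts and ends inside <a>.
nested = check("<doc><a>Nesting <b>is</b> fine</a></doc>")

print(overlapping)
print(siblings)
print(nested)
```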

Anything in the content that is not an element is text, or character data. The text can include any character in the character set that was specified in the prolog. However, some characters must be represented in a special way so as not to confuse the parser. For example, the left angle bracket (<) is reserved for element tags. Including it directly in content causes an ambiguous situation: is it the start of an XML tag or is it just data? Here's an example:

<foo>x < y</foo>    yikes!

To resolve this conflict, you need to use a special code in place of the offending character. For the left angle bracket, the code is &lt;. (The equivalent code for the right angle bracket is &gt;.) So we can rewrite the above example like this:

<foo>x &lt; y</foo>

Such a substitution is known as an entity reference. We'll describe entities and entity references in Section 2.5.
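The effect of the substitution can be seen from the parser's side. This sketch assumes Python's standard library: xml.etree.ElementTree for parsing, and the escape function from xml.sax.saxutils for producing the codes when you write text out.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# A bare "<" in content is rejected by the parser.
try:
    ET.fromstring("<foo>x < y</foo>")
    raw_ok = True
except ET.ParseError:
    raw_ok = False
print(raw_ok)

# With &lt; the document parses, and the application sees the plain character.
foo = ET.fromstring("<foo>x &lt; y</foo>")
print(foo.text)

# Going the other way: escape() inserts the codes for you.
escaped = escape("x < y")
print(escaped)
```

The reference exists only in the document's source text. Once parsed, the content is the ordinary character again.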

In XML, all characters in an element's content are preserved as a matter of course, including the whitespace characters space, tab, and newline. Compare this to programming languages such as Perl and C, where whitespace between tokens is essentially ignored, or to markup languages such as HTML, where the browser collapses multiple sequential spaces into a single space and lines can be broken anywhere to suit the formatter. XML, on the other hand, keeps all whitespace in content by default.
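A short experiment makes the point. Assuming Python's standard xml.etree.ElementTree parser and a made-up poem element, the content comes back with its spaces, tab, and newlines intact:

```python
import xml.etree.ElementTree as ET

source = "<poem>  two  leading spaces,\n\ta tab, and a newline\n</poem>"
poem = ET.fromstring(source)

# repr() makes the whitespace characters visible.
print(repr(poem.text))

# Nothing was collapsed or trimmed.
preserved = poem.text == "  two  leading spaces,\n\ta tab, and a newline\n"
print(preserved)
```

Whether that whitespace matters is then up to the application or stylesheet that processes the document.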

Sidebar 1. XML Is Not HTML

If you've had some experience writing HTML documents, you should pay close attention to XML's rules for elements. Shortcuts you can get away with in HTML are not allowed in XML. Some important changes you should take note of include:

  • Element names are case-sensitive in XML. HTML allows you to write tags in whatever case you want.

  • In XML, container elements always require both a start and an end tag. In HTML, on the other hand, you can drop the end tag in some cases.

  • Empty XML elements require a slash before the right bracket (e.g., <example/>), whereas HTML uses a lone start tag with no final slash.

  • XML elements treat whitespace as part of the content, preserving it unless they are explicitly told not to. But in HTML, most elements throw away extra spaces and line breaks when formatting content in the browser.

Unlike many HTML elements, XML elements are based strictly on function, and not on format. You should not assume any kind of formatting or presentational style based on markup alone. Instead, XML leaves presentation for stylesheets, which are separate documents that map the elements to styles.

