Elements are parts of a document. You can separate a document into parts so they can be rendered differently, or used by a search engine. Elements can be containers, with a mixture of text and other elements. This element contains only text:
<flooby>This is text contained inside an element</flooby>
and this element contains both text and elements:
<outer>this is text<inner>more text</inner>still more text</outer>
Some elements are empty, and contribute information by their position and attributes. There is an empty element inside this example:
<outer>an element can be empty: <nuttin/></outer>
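To see how a parser views these three kinds of content, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The choice of Python and of this particular library is an assumption made for illustration; any XML parser would serve. Note that ElementTree stores text that follows a child element in that child's tail property, which is a convention of this library rather than of XML itself.

```python
import xml.etree.ElementTree as ET

# An element that contains only text.
flooby = ET.fromstring("<flooby>This is text contained inside an element</flooby>")
print(flooby.text)

# An element with mixed content: text, a child element, then more text.
outer = ET.fromstring("<outer>this is text<inner>more text</inner>still more text</outer>")
inner = outer[0]
print(outer.text)   # text before <inner>
print(inner.text)   # text inside <inner>
print(inner.tail)   # text after </inner>, which ElementTree keeps on the child

# An empty element has no text and no children.
holder = ET.fromstring("<outer>an element can be empty: <nuttin/></outer>")
nuttin = holder[0]
print(nuttin.tag, nuttin.text, len(nuttin))
```

The empty element still shows up in the tree at its position inside its parent, even though it has no content of its own.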
Figure 2.7 shows the syntax for a container element. It begins with a start tag (1) consisting of an angle bracket (<) followed by a name (2). The start tag may contain some attributes (3) separated by whitespace, and it ends with a closing angle bracket (>). An attribute defines a property of the element and consists of a name (4) joined by an equals sign (=) to a value in quotes (5). An element can have any number of attributes, but no two attributes can have the same name. Following the start tag is the element's content (6), which in turn is followed by an end tag (7). The end tag consists of an opening angle bracket, a slash, the element's name, and a closing bracket. The end tag has no attributes, and the element name must match the start tag's name exactly.
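The rules for container elements can be checked with a parser. The following sketch, again assuming Python's xml.etree.ElementTree, uses a hypothetical book element with made-up attributes and a small helper, is_well_formed, that is not part of any library. It reports whether the parser accepts a document.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text as a complete document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# A start tag with two attributes, some content, and a matching end tag.
book = ET.fromstring('<book lang="en" year="2003">content</book>')
print(book.tag, book.attrib)

# Two attributes with the same name are not allowed.
print(is_well_formed('<book lang="en" lang="fr">content</book>'))  # False

# The end tag's name must match the start tag's name exactly.
print(is_well_formed('<book>content</Book>'))  # False

# The end tag cannot carry attributes.
print(is_well_formed('<book>content</book lang="en">'))  # False
```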
As shown in Figure 2.8, an empty element (one with no content) consists of a single tag (1) that begins with an opening angle bracket (<) followed by the element name (2). This is followed by some number of attributes (3), each of which consists of a name (4) and a value in quotes (5), and the element ends with a slash (/) and a closing angle bracket.
An element name must start with a letter or an underscore, and can contain any number of letters, numbers, hyphens, periods, and underscores.[3] Element names can include accented Roman characters; letters from alphabets such as Cyrillic, Greek, Hebrew, Arabic, Thai, Hiragana, Katakana, and Devanagari; and ideograms from Chinese, Japanese, and Korean. The colon symbol is used in namespaces, as explained in Section 2.4, so avoid using it in element names that don't use a namespace. Space, tab, newline, equals sign, and any quote characters are separators for element names, attribute names, and attribute values, so they are not allowed either. Some valid element names are: <Bob>, <chapter.title>, <THX-1138>, or even <_>. XML names are case-sensitive, so <Para>, <para>, and <pArA> are three different elements.
[3] Practically speaking, you should avoid using extremely long element names, in case an XML processor cannot handle names above a certain length. There is no specific number, but probably anything over 40 characters is unnecessarily long.
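One practical way to test a candidate element name is to hand it to a parser and see whether it is accepted. The sketch below assumes Python's xml.etree.ElementTree; the helper is_valid_element_name is hypothetical and deliberately simple. Because it parses the name as a standalone document, it also rejects names with an undeclared namespace prefix.

```python
import xml.etree.ElementTree as ET

def is_valid_element_name(name):
    """Return True if <name/> parses and the parser reports the same tag name."""
    try:
        return ET.fromstring("<" + name + "/>").tag == name
    except ET.ParseError:
        return False

# The first four names come from the text; the rest break one of the rules.
candidates = ["Bob", "chapter.title", "THX-1138", "_",
              "1138-THX", "-title", "chapter title", "a=b"]
for name in candidates:
    print(name, is_valid_element_name(name))

# Names are case-sensitive: these are three different elements.
doc = ET.fromstring("<doc><Para/><para/><pArA/></doc>")
distinct_names = {child.tag for child in doc}
print(len(distinct_names))  # 3

# For the same reason, a start tag and end tag that differ in case do not match.
try:
    ET.fromstring("<Para>text</para>")
    case_mismatch_accepted = True
except ET.ParseError:
    case_mismatch_accepted = False
print(case_mismatch_accepted)  # False
```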
There can be no space between the opening angle bracket and the element name, but adding extra space anywhere else in the element tag is okay. This allows you to break an element across lines to make it more readable. For example:
<boat
  type="trireme"
    ><crewmember
  class="rower">Dronicus Laborius</crewmember
></boat>
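A quick way to confirm that the extra space inside tags is harmless is to parse a spread-out version and a compact version of the same markup and compare the results. This sketch assumes Python's xml.etree.ElementTree and uses a boat element like the one above, closed with </boat> so that it parses as a complete document.

```python
import xml.etree.ElementTree as ET

# Whitespace and line breaks placed inside the tags themselves.
spread_out = """<boat
    type="trireme"
  ><crewmember
    class="rower"
  >Dronicus Laborius</crewmember
  ></boat
>"""

compact = ('<boat type="trireme"><crewmember class="rower">'
           'Dronicus Laborius</crewmember></boat>')

a = ET.fromstring(spread_out)
b = ET.fromstring(compact)

# Both forms produce the same tree, so they serialize identically.
print(ET.tostring(a) == ET.tostring(b))  # True

# A space between the opening angle bracket and the name is not allowed.
try:
    ET.fromstring("< boat/>")
    space_after_bracket_accepted = True
except ET.ParseError:
    space_after_bracket_accepted = False
print(space_after_bracket_accepted)  # False
```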
There are two rules about the positioning of start and end tags:
The end tag must come after the start tag.
An element's start and end tags must both reside in the same parent.
To understand the second rule, think of elements as boxes. A box can sit inside or outside another box, but it can't protrude through the box without making a hole in the side. Thus, the following example of overlapping elements doesn't work:
<a>Don't <b>do</a> this!</b>
These untangled elements are okay:
<a>No problem</a><b>here</b>
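A parser enforces the box rule for you. In the following sketch, which assumes Python's xml.etree.ElementTree, the examples are wrapped in a hypothetical doc element, because a complete document needs a single outermost element. The helper is_well_formed is likewise made up for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text as a complete document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# <b> starts inside <a> but ends outside it: the boxes overlap.
overlapping = "<doc><a>Don't <b>do</a> this!</b></doc>"

# Side by side, or one wholly inside the other, is fine.
side_by_side = "<doc><a>No problem</a><b>here</b></doc>"
nested = "<doc><a>No <b>problem</b> here</a></doc>"

print(is_well_formed(overlapping))   # False
print(is_well_formed(side_by_side))  # True
print(is_well_formed(nested))        # True
```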
Anything in the content that is not an element is text, or character data. The text can include any character in the character set that was specified in the prolog. However, some characters must be represented in a special way so as not to confuse the parser. For example, the left angle bracket (<) is reserved for element tags. Including it directly in content causes an ambiguous situation: is it the start of an XML tag or is it just data? Here's an example:
<foo>x < y</foo> yikes!
To resolve this conflict, you need to use a special code in place of the offending character. For the left angle bracket, the code is &lt;. (The equivalent code for the right angle bracket is &gt;.) So we can rewrite the above example like this:
<foo>x &lt; y</foo>
Such a substitution is known as an entity reference. We'll describe entities and entity references in Section 2.5.
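In practice, software usually handles this substitution in both directions. The sketch below assumes Python's standard library: xml.etree.ElementTree for parsing and writing, and the escape function from xml.sax.saxutils for preparing raw text. It shows the parser rejecting a bare left angle bracket, turning &lt; back into the character it stands for, and the writer producing the reference again on output.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# A bare left angle bracket in content is rejected by the parser.
try:
    ET.fromstring("<foo>x < y</foo>")
    bare_bracket_accepted = True
except ET.ParseError:
    bare_bracket_accepted = False
print(bare_bracket_accepted)  # False

# With the entity reference, the document parses and the text holds a real "<".
foo = ET.fromstring("<foo>x &lt; y</foo>")
print(foo.text)  # x < y

# escape() prepares raw text for inclusion in a document.
print(escape("x < y"))  # x &lt; y

# When building a document, the library inserts the reference for you.
built = ET.Element("foo")
built.text = "x < y"
serialized = ET.tostring(built, encoding="unicode")
print(serialized)  # <foo>x &lt; y</foo>
```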
In XML, all characters are preserved as a matter of course, including the whitespace characters space, tab, and newline. Compare this to programming languages such as Perl and C, where whitespace characters are essentially ignored. In markup languages such as HTML, multiple sequential spaces are collapsed by the browser into a single space, and lines can be broken anywhere to suit the formatter. XML, on the other hand, keeps all space characters by default.
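You can observe this by parsing an element whose content includes runs of spaces, a tab, and a newline. The sketch assumes Python's xml.etree.ElementTree and a hypothetical verse element. The last step collapses whitespace in the application, roughly the way an HTML browser would, to show that any such cleanup is a choice made after parsing rather than something the XML parser does.

```python
import xml.etree.ElementTree as ET

source = "<verse>  two  spaces,\n\ta tab and a newline  </verse>"
verse = ET.fromstring(source)

# The parser hands back every space, tab, and newline unchanged.
print(repr(verse.text))

# Collapsing whitespace is up to the application. This version also
# trims the leading and trailing space.
collapsed = " ".join(verse.text.split())
print(repr(collapsed))  # 'two spaces, a tab and a newline'
```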