2.7. Well-Formed Documents

XML gives you considerable power to choose your own element types and invent your own grammars to create custom-made markup languages. But this flexibility can be dangerous for XML parsers if they don't have some minimal rules to protect them. A parser dedicated to a single markup language, such as the one inside an HTML browser, can tolerate some sloppiness in markup, because the set of tags is small and there isn't much complexity in a web page. XML processors, by contrast, have to be prepared for any kind of markup language, so a set of ground rules is necessary.

These rules are very simple syntax constraints. All tags must use the proper delimiters; every start tag must be matched by an end tag; elements can't overlap; and so on. Documents that satisfy these rules are said to be well-formed. Some of these rules are listed here.

The first rule is that an element containing text or elements must have start and end tags.

Good:

<list>
  <listitem>soupcan</listitem>
  <listitem>alligator</listitem>
  <listitem>tree</listitem>
</list>

Bad:

<list>
  <listitem>soupcan
  <listitem>alligator
  <listitem>tree
</list>

An empty element's tag must have a slash (/) before the end bracket.

Good:

<graphic filename="icon.png"/>

Bad:

<graphic filename="icon.png">

All attribute values must be in quotes.

Good:

<figure filename="icon.png"/>

Bad:

<figure filename=icon.png/>

Elements may not overlap.

Good:

<a>A good <b>nesting</b>
example.</a>

Bad:

<a>This is <b>a poor</a>
  nesting scheme.</b>

Isolated markup characters may not appear in parsed content. These include <, ]]>, and &.

Good:

<equation>5 &lt; 2</equation>

Bad:

<equation>5 < 2</equation>
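When content is generated by a program, the markup characters are usually escaped automatically rather than by hand. The following is a minimal sketch in Python, assuming the standard library's xml.sax.saxutils.escape is an acceptable way to do this; the helper name make_equation is hypothetical and only the <equation> element comes from the example above.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET


def make_equation(text: str) -> str:
    """Wrap arbitrary text in an <equation> element (hypothetical helper).

    escape() replaces &, < and > with &amp;, &lt; and &gt;, so the
    result contains no isolated markup characters.
    """
    return "<equation>" + escape(text) + "</equation>"


if __name__ == "__main__":
    markup = make_equation("5 < 2")
    print(markup)
    # Parsing the escaped markup gives back the original text.
    print(ET.fromstring(markup).text)
```

Because escape() also rewrites >, the sequence ]]> can never survive in the output, which covers the third item in the list of forbidden characters.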

A final rule stipulates that element names may start only with letters and underscores, and may contain only letters, numbers, hyphens, periods, and underscores. Colons are allowed for namespaces.

Good:

<example-one>
<_example2>
<Example.Three>

Bad:

<bad*characters>
<illegal space>
<99number-start>
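The rules in this section can be checked mechanically, since any conforming XML parser must reject a document that breaks them. The sketch below is one way to try the examples, assuming Python and its standard xml.etree.ElementTree module; the function name is_well_formed and the GOOD and BAD lists are illustrative. The element-name examples are written as empty elements so that each string is a complete document.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts text as a complete XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


GOOD = [
    "<list><listitem>soupcan</listitem><listitem>tree</listitem></list>",
    '<graphic filename="icon.png"/>',
    '<figure filename="icon.png"/>',
    "<a>A good <b>nesting</b> example.</a>",
    "<equation>5 &lt; 2</equation>",
    "<example-one/>",
    "<_example2/>",
    "<Example.Three/>",
]

BAD = [
    "<list><listitem>soupcan<listitem>tree</list>",  # missing end tags
    '<graphic filename="icon.png">',                 # empty element, no slash
    "<figure filename=icon.png/>",                   # unquoted attribute value
    "<a>This is <b>a poor</a> nesting scheme.</b>",  # overlapping elements
    "<equation>5 < 2</equation>",                    # isolated < in content
    "<bad*characters/>",                             # * not allowed in a name
    "<illegal space/>",     # read as name "illegal" plus a valueless attribute
    "<99number-start/>",                             # name starts with a digit
]

if __name__ == "__main__":
    for doc in GOOD + BAD:
        print(is_well_formed(doc), doc)
```

A check like this only reports whether a document is well-formed. It says nothing about whether the elements make sense for a particular markup language.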

Sidebar 2. Why All the Rules?

Web developers who cut their teeth on HTML will notice that XML's syntax rules are much stricter than HTML's. Why all the hassle about well-formed documents? Can't we make parsers smart enough to figure it out on their own? Let's look at the case for requiring end tags in every container element. In HTML, end tags can sometimes be omitted, leaving it up to the browser to decide where an element ends:

<body>
  <p>This is a paragraph.
  <p>This is also a paragraph.
</body>

This is acceptable in HTML because there is no ambiguity about the <p> element. HTML doesn't allow a <p> to reside inside another <p>, so it's clear that the two are siblings. All HTML parsers have built-in knowledge of HTML, referred to as a grammar. In XML, where the grammar is not set in stone, ambiguity can result:

<blurbo>This is one element.
<blurbo>This is another element.

Is the second <blurbo> a sibling or a child of the first? You can't tell because you don't know anything about that element's content model. XML doesn't require you to use a grammar-defining DTD, so the parser can't know the answer either. Because XML parsers have to work in the absence of a grammar, we have to cut them some slack and follow the well-formedness rules.
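To make the ambiguity concrete, here is a small sketch, assuming Python's standard xml.etree.ElementTree parser. It shows that the fragment without end tags is rejected, while the two possible readings, once their end tags are written out, produce different trees. The wrapper element <doc> in the sibling version is an assumption added only so that the document has a single root, and the helper names try_parse and outline are hypothetical.

```python
import xml.etree.ElementTree as ET

# The fragment from the sidebar, with no end tags.
ambiguous = "<blurbo>This is one element.\n<blurbo>This is another element."

# Reading 1: the second <blurbo> is a child of the first.
as_child = (
    "<blurbo>This is one element."
    "<blurbo>This is another element.</blurbo></blurbo>"
)

# Reading 2: the two are siblings (<doc> is an assumed wrapper element).
as_siblings = (
    "<doc><blurbo>This is one element.</blurbo>"
    "<blurbo>This is another element.</blurbo></doc>"
)


def try_parse(text):
    """Return the root element, or None if text is not well-formed."""
    try:
        return ET.fromstring(text)
    except ET.ParseError:
        return None


def outline(elem, depth=0):
    """List element names, indented two spaces per nesting level."""
    lines = ["  " * depth + elem.tag]
    for child in elem:
        lines.extend(outline(child, depth + 1))
    return lines


if __name__ == "__main__":
    print(try_parse(ambiguous))  # None: the parser refuses to guess
    print("\n".join(outline(try_parse(as_child))))
    print("\n".join(outline(try_parse(as_siblings))))
```

With the end tags in place, the parser needs no knowledge of the grammar to recover either structure.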

