Validating XML Documents

One of the benefits of XML is that you can verify the data integrity of an XML document by validating it. A valid XML document must be well-formed and must follow a set of rules. You can check the validity of an XML document against one of two types of rules: Document Type Definitions (DTDs) and schemas.

A valid XML document must have a DTD or a schema associated with it, against which the correctness of the XML document can be verified. For example, an XML document that contains the data structure for some products may define that the root element is <products> and there are five child elements under it: <name>, <description>, <product_id>, <price>, and <supplier_id>. If a document has a products element as its root, but the supplier_id element is missing, the document is not valid, even though it may be well-formed.
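
As a rough illustration of the case just described, the following document is well-formed but would not be valid under such rules, because the supplier_id element is absent. The element values are made up for this sketch.

```xml
<?xml version="1.0"?>
<!-- Well-formed, but not valid under the rules described above:
     the supplier_id element is missing. Values are illustrative. -->
<products>
    <name>ChocChic</name>
    <description>Chocolate wafer</description>
    <product_id>12</product_id>
    <price>2.50</price>
</products>
```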

Specifications for DTDs were published earlier than those of schemas, but schemas are more powerful than DTDs. Both are still widely in use today and we will look at both in turn.

Document Type Definition

You can define a DTD or DTDs in the XML document itself, in an external file, or in both. In the following sections, we first look at DTD basics in an internal DTD, then we discuss documents that have external DTDs. The last subsection talks about entities and attributes.

Note

You can find the formal rules for DTDs in the XML 1.0 specification at www.w3.org/TR/REC-xml.


The first thing to note is that you use <!DOCTYPE> to write a DTD, and it always appears in the XML document prolog. <!DOCTYPE> has several syntaxes; this appendix uses the following:

<!DOCTYPE rootName [DTD]>

where rootName is the name of the root element in the document and [DTD] is the part that defines all elements, that is, the root itself and all other elements nested inside the root. Each element is defined by an <!ELEMENT> declaration. Because an XML document must have a root, the DTD must have at least one <!ELEMENT> declaration, the one that defines the root itself. For example, the following is an XML document with an internal DTD. The DTD dictates that the document must have a root called products, and that <products> can contain only text, with no elements nested inside it:

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE products [
<!ELEMENT products (#PCDATA)>
]>
<products/>

Note that the XML declaration in the prolog contains the standalone attribute with the value of “yes”. This means the document does not depend on any external markup declarations, such as an external DTD. Note also that the DTD defines the <!ELEMENT> for products. #PCDATA stands for parsed character data and means text that does not contain markup.

You can also declare that an element must be empty using the EMPTY keyword. An empty element cannot have a value or a child element, but it can have attributes. For example, the following is the previous XML document with a DTD that states that the root element (products) must be empty:

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE products [
<!ELEMENT products EMPTY>
]>
<products></products>

An XML document with only the root and no other elements is not very useful. You can add further <!ELEMENT> declarations to a DTD to define other elements. For instance, the following is a DTD that states that the XML document must have <products> as its root and that <products> must have exactly one product element. The DTD also states that the product element must have a name element followed by a product_id element:

<!DOCTYPE products [
<!ELEMENT products (product)>
<!ELEMENT product (name,product_id)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT product_id (#PCDATA)>
]>

Listing A.3 shows a valid XML document that uses the previous DTD.

Listing A.3. A valid XML document with an internal DTD
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE products [
<!ELEMENT products (product)>
<!ELEMENT product (name,product_id)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT product_id (#PCDATA)>
]>
<products>
    <product>
        <name>ChicChoc</name>
        <product_id>10</product_id>
    </product>
</products>

When declaring child elements, you can use the following operators that have special meanings. Here x denotes a child element:

  • x*: Zero or more instances of x

  • x+: One or more instances of x

  • x?: Zero or one instance of x

  • x, y: x followed by y

  • x | y: x or y

For example, if you want to say in a DTD that <products> can have zero or more product elements and a product element can have an optional name element but must have a product_id element, use this DTD:

<!DOCTYPE products [
<!ELEMENT products (product)*>
<!ELEMENT product (name?,product_id)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT product_id (#PCDATA)>
]>
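
The operator list earlier also mentions the choice operator (|), which the DTD above does not use. As a sketch, assuming a hypothetical barcode element that may stand in for product_id, a choice could be declared like this:

```xml
<!DOCTYPE products [
<!ELEMENT products (product)*>
<!-- Each product has an optional name, then either a product_id or a barcode -->
<!ELEMENT product (name?, (product_id | barcode))>
<!ELEMENT name (#PCDATA)>
<!ELEMENT product_id (#PCDATA)>
<!-- barcode is a hypothetical element introduced only for this example -->
<!ELEMENT barcode (#PCDATA)>
]>
```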

External DTDs

Using an external DTD makes sense if the DTD is to be used by multiple XML documents. Also, for a long DTD, an external DTD makes the XML document that uses it tidier.

There are two kinds of external DTDs: private and public. Private DTDs are to be used privately by certain people or applications in a group. You specify an external private DTD using the SYSTEM keyword in the <!DOCTYPE>. On the other hand, a public external DTD can be used by anyone. To make an external DTD public, use the PUBLIC keyword in the <!DOCTYPE>.

In practice, the only differences between an external DTD and an internal DTD are that an external DTD lives in a separate file and that this file is referenced from inside the XML document. Listing A.4 shows an XML document that uses a private DTD called products.dtd. Because the products.dtd file is referenced without any information about its path, it must reside in the same directory as the XML document.

Listing A.4. A valid XML document referencing a private, external DTD
<?xml version="1.0" standalone="no"?>
<!DOCTYPE products SYSTEM "products.dtd">
<products>
    <product>
        <name>ChocChic</name>
        <product_id>12</product_id>
    </product>
    <product>
        <name>Waftel Chocolate</name>
        <product_id>15</product_id>
    </product>
</products>

The following is the products.dtd file.

<!ELEMENT products (product)+>
<!ELEMENT product (name, product_id)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT product_id (#PCDATA)>

You can also reference a private external DTD using its Uniform Resource Locator (URL). In this case, you just specify a URL after the SYSTEM keyword:

<!DOCTYPE products SYSTEM
    "http://www.brainysoftware.com/dtd/products.dtd">

A public external DTD is similar to a private DTD, except that you must define a Formal Public Identifier (FPI) after the PUBLIC keyword in the <!DOCTYPE>. The FPI has four fields, separated by double forward slashes (//). The first field in an FPI indicates the formality of the DTD. For a DTD that you define yourself, you use a minus (-) sign. If the DTD has been approved by a non-standards body, you use a plus (+) sign. For a formal standard, use the reference to the standard itself. The second field in an FPI is the name of the organization that maintains the DTD. The third field indicates the type of document being described, and the fourth field specifies the language that the DTD uses. For example, EN stands for English.

This is an example of a <!DOCTYPE> that references an external public DTD:

<!DOCTYPE products PUBLIC "-//bs//Exports//EN"
    "http://brainysoftware.com/products.dtd">

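A widely used real-world example is the document type declaration for XHTML 1.0 Strict. Here the four FPI fields are -, W3C, DTD XHTML 1.0 Strict, and EN, and the URL that follows tells the parser where to find the DTD:

```xml
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
```
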
Entities

You can define entities in a DTD. What, then, is an entity? For a programmer, the best analogy is a constant in a computer program. You declare a constant (using the keyword Const in Visual Basic) and assign it a value so that you can reference the value through the constant from within your code. Likewise, you define an entity in a DTD so that you can use it from anywhere in the XML document. You define an entity using the following syntax:

<!ENTITY name definition>

When the XML document is parsed, the entity will be replaced by the value of the entity. To use the entity, you precede the entity name with the ampersand and add a semicolon after the name. For example, to refer to an entity called myEntity, you write &myEntity;.

As an example, Listing A.5 shows an XML document in which an entity named company is declared in its DTD. The value of the entity is “Cooper Wilson and Co.”

Listing A.5. An XML document with an entity
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE products [
<!ELEMENT products (manufacturer, (product)*)>
<!ELEMENT manufacturer (#PCDATA)>
<!ELEMENT product (name, product_id)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT product_id (#PCDATA)>
<!ENTITY company "Cooper Wilson and Co.">
]>
<products>
    <manufacturer>&company;</manufacturer>
    <product>
        <name>ChocChic</name>
        <product_id>12</product_id>
    </product>
    <product>
        <name>Waftel Chocolate</name>
        <product_id>15</product_id>
    </product>
</products>

When an XML parser reads this XML document, it replaces the entity reference with the entity's value, producing a result like the one in Listing A.6.

Listing A.6. An XML document with an entity’s value
<?xml version="1.0" standalone="yes" ?>
<!DOCTYPE products (View Source for full doctype...)>
<products>
    <manufacturer>Cooper Wilson and Co.</manufacturer>
    <product>
        <name>ChocChic</name>
        <product_id>12</product_id>
    </product>
    <product>
        <name>Waftel Chocolate</name>
        <product_id>15</product_id>
    </product>
</products>

You can use five predefined entities in your XML document without declaring them in the DTD: &apos;, &quot;, &amp;, &lt;, and &gt;.
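
For instance, text that contains an ampersand, an angle bracket, or a quotation mark can be written with these entities. A small sketch with made-up values:

```xml
<?xml version="1.0"?>
<product>
    <!-- &amp; yields an ampersand, &lt; a less-than sign, &quot; a double quote -->
    <name>Choc &amp; Chic</name>
    <description>Weight &lt; 100g, sold as &quot;snack size&quot;</description>
</product>
```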

Attributes

You specify attributes that an element has using the following syntax:

<!ATTLIST elementName
    attributeName_1 type_1 defaultValue_1
    attributeName_2 type_2 defaultValue_2
    .
    .
    .
    attributeName_n type_n defaultValue_n>

For example, to define that the product element can have a supplier_id attribute, you write the code in Listing A.7.

Listing A.7. An XML document with elements and attributes
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE products [
<!ELEMENT products (product)*>
<!ELEMENT manufacturer (#PCDATA)>
<!ELEMENT product (name, product_id)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT product_id (#PCDATA)>
<!ATTLIST product
    supplier_id CDATA #IMPLIED>
]>
<products>
    <product supplier_id="1">
        <name>ChocChic</name>
        <product_id>12</product_id>
    </product>
    <product supplier_id="2">
        <name>Waftel Chocolate</name>
        <product_id>15</product_id>
    </product>
</products>

The default value #IMPLIED means that the attribute is optional.
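
Other defaults are available as well. In the following sketch, #REQUIRED makes supplier_id mandatory, while a quoted value supplies a default that applies when the attribute is omitted. The currency attribute and its enumerated values are hypothetical and serve only to show an enumerated type with a default:

```xml
<!-- supplier_id must now be present on every product element.
     currency (a made-up attribute) may be USD or EUR and defaults to USD. -->
<!ATTLIST product
    supplier_id CDATA #REQUIRED
    currency (USD | EUR) "USD">
```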

Schemas

Like DTDs, schemas validate XML documents. However, schemas are more powerful. Schemas provide the following advantages over DTDs:

  • Additional data types are available using a schema.

  • Schemas support custom data types.

  • A schema uses XML syntax.

  • A schema supports object-oriented concepts such as polymorphism and inheritance.

Note

Schemas are basically XML documents. By convention a schema has an .xsd extension. The term instance document is often used to describe an XML document that conforms to a particular schema. A schema does not have to reside in a file, though. It may be a stream of bytes, a field in a database record, or a collection of XML Infoset “information items.”


When discussing schemas, it is convenient to describe elements as having simple types or complex types. Elements that contain subelements or carry attributes have complex types, whereas elements that contain numbers, strings, dates, and so on, but no subelements or attributes, have simple types. Attributes always have simple types.

The W3C recommendation defines schemas in a three-part document at the following locations:

http://www.w3.org/TR/xmlschema-0/
http://www.w3.org/TR/xmlschema-1/
http://www.w3.org/TR/xmlschema-2/

Each element of a schema typically carries the prefix xsd:, which is associated with the XML Schema namespace through the declaration xmlns:xsd="http://www.w3.org/2001/XMLSchema" that appears in the schema element. By convention, the prefix xsd: denotes the XML Schema namespace, although you can use any prefix; Listing A.8, for example, uses xs:. The same prefix, and hence the same association, also appears on the names of built-in simple types, for example, xsd:string. The purpose of the association is to identify the elements and simple types as belonging to the vocabulary of the XML Schema language rather than the vocabulary of the schema author. For clarity, we just mention the names of elements and simple types and omit the prefix.

Note

Like DTDs, schemas can appear inside an XML document or as external documents. The xsi:schemaLocation and xsi:noNamespaceSchemaLocation attributes specify the location of an external schema referenced by an XML document. Interested readers should read the document at http://www.w3.org/TR/xmlschema-0/.
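
As a sketch of the external case, an instance document can point to its schema with xsi:schemaLocation, whose value pairs a namespace name with a schema location. The xsdBook namespace is borrowed from Listing A.8; the file name book.xsd is an assumption made for this example.

```xml
<!-- xsi:schemaLocation holds a namespace name followed by a schema location.
     book.xsd is a hypothetical file name. -->
<hc:Book Edition='1'
  xmlns:hc='xsdBook'
  xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'
  xsi:schemaLocation='xsdBook book.xsd'>
  <Title>Dogs are from Mars, Cats are from Venus</Title>
  <Author>T. Sakhira</Author>
</hc:Book>
```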


In XML Schema, there is a basic difference between the complex types that allow elements in their content and can carry attributes and the simple types that cannot have element content and cannot carry attributes. There is also a major distinction between definitions that create new types (both simple and complex) and declarations that enable elements and attributes with specific names and types (both simple and complex) to appear in document instances. In this section, we focus on defining complex types and declaring the elements and attributes that appear within them.

You define new complex types using the complexType element; such definitions typically contain a set of element declarations, element references, and attribute declarations. The declarations are not themselves types, but rather an association between a name and the constraints that govern the appearance of that name in documents governed by the associated schema. You declare elements using the element element, and you declare attributes using the attribute element. Listing A.8 is an example of an XML document that uses an inline schema.

Listing A.8. Using an inline schema
<xs:schema
  xmlns:xs='http://www.w3.org/2001/XMLSchema'
  xmlns='xsdBook'
  targetNamespace='xsdBook'
>
  <xs:element name='Book'>
    <xs:complexType>
      <xs:sequence>
        <xs:element name='Title' type='xs:string' maxOccurs='1'/>
        <xs:element name='Author' type='xs:string' maxOccurs='1'/>
      </xs:sequence>
      <xs:attribute name='Edition' type='xs:string' use='optional'/>
    </xs:complexType>
  </xs:element>
</xs:schema>

<hc:Book Edition='1' xmlns:hc='xsdBook'>
  <Title>Dogs are from Mars, Cats are from Venus</Title>
  <Author>T. Sakhira</Author>
</hc:Book>
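
To connect this with the DTD examples, here is a rough sketch of how the products structure from Listing A.7 might be expressed as a schema. Typing product_id as xs:positiveInteger is an assumption made here to illustrate the additional data types a schema offers; the DTD can only say that product_id holds character data.

```xml
<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>
  <xs:element name='products'>
    <xs:complexType>
      <xs:sequence>
        <!-- Zero or more product elements, like (product)* in the DTD -->
        <xs:element name='product' minOccurs='0' maxOccurs='unbounded'>
          <xs:complexType>
            <xs:sequence>
              <xs:element name='name' type='xs:string'/>
              <!-- Assumed type: product_id must be a positive integer -->
              <xs:element name='product_id' type='xs:positiveInteger'/>
            </xs:sequence>
            <!-- Optional attribute, like #IMPLIED in the DTD -->
            <xs:attribute name='supplier_id' type='xs:string' use='optional'/>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```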
