Chapter 3. Valid XML Documents: Creating Document Type Definitions

Chapter 2, "Creating Well-Formed XML Documents," covers what it takes to make an XML document well-formed. However, there's more to creating good XML documents than the simple (although essential) requirement that they be well-formed. Because you can create your own tags when you create an XML application, it's up to you to set their syntax. For example, can a <HOUSE> element contain plain text or only other elements such as <TENANT> or <OWNER>? Must a <BOOK> element contain a <PAGE_COUNT> element, or can it get by without one? It's up to you to decide. Defining and enforcing your own XML syntax is not only good for making sure that your documents are legible; it can also be essential for programs that process those documents in code.
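
As a preview of where this chapter is heading, here is a minimal sketch of how the <BOOK> question might be answered in a DTD. The <TITLE> element and the sample text are made up for illustration, and the declaration syntax is explained later in the chapter.

```xml
<?xml version="1.0"?>
<!DOCTYPE BOOK [
    <!-- A BOOK holds only child elements, no plain text. -->
    <!-- The ? makes PAGE_COUNT optional; remove it to require one. -->
    <!ELEMENT BOOK (TITLE, PAGE_COUNT?)>
    <!ELEMENT TITLE (#PCDATA)>
    <!ELEMENT PAGE_COUNT (#PCDATA)>
]>
<BOOK>
    <TITLE>A Sample Book</TITLE>
</BOOK>
```

With the question mark in place, this document should validate even though it has no <PAGE_COUNT> element.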

XML documents whose syntax has been checked successfully are called valid documents; in particular, an XML document is considered valid if there is a document type definition (DTD) or XML schema associated with it and if the document complies with the DTD or schema. That's all there is to making a document valid. This chapter is all about creating basic DTDs. In the next chapter, I'll elaborate on the DTDs that we create here, showing how to declare entities, attributes, and notations.

You can find the formal rules for DTDs in the XML 1.0 recommendation, http://www.w3.org/TR/REC-xml (which also appears in Appendix A, "The XML 1.0 Specification"). The constraints that documents and DTDs must adhere to in order to produce a valid document are marked with the text "Validity Constraint."

Note that DTDs are all about specifying the structure and syntax of XML documents (not their content). Various organizations can share a DTD to put an XML application into practice. We saw quite a few examples of XML applications in Chapter 1, "Essential XML," and those applications can all be enforced with DTDs that the various organizations make public. We'll see how to create public DTDs in this chapter.

Most XML parsers, like the one in Internet Explorer, require XML documents to be well-formed but not necessarily valid. (Most XML parsers do not require a DTD, but if there is one, validating parsers will use it to validate the XML document.)

In fact, we saw a DTD at the end of the previous chapter. In that chapter, I set up an example XML document named order.xml that stored customer orders. At the end of the chapter, I used the DOMWriter program that comes with IBM's XML for Java package to translate the document into canonical XML; to run it through that program, I needed to add a DTD to the document. Here's what it looked like:

<?xml version = "1.0" standalone="yes"?>
<!DOCTYPE DOCUMENT [
<!ELEMENT DOCUMENT (CUSTOMER)*>
<!ELEMENT CUSTOMER (NAME,DATE,ORDERS)>
<!ELEMENT NAME (LAST_NAME,FIRST_NAME)>
<!ELEMENT LAST_NAME (#PCDATA)>
<!ELEMENT FIRST_NAME (#PCDATA)>
<!ELEMENT DATE (#PCDATA)>
<!ELEMENT ORDERS (ITEM)*>
<!ELEMENT ITEM (PRODUCT,NUMBER,PRICE)>
<!ELEMENT PRODUCT (#PCDATA)>
<!ELEMENT NUMBER (#PCDATA)>
<!ELEMENT PRICE (#PCDATA)>
]>
<DOCUMENT>
    <CUSTOMER>
        <NAME>
            <LAST_NAME>Smith</LAST_NAME>
            <FIRST_NAME>Sam</FIRST_NAME>
        </NAME>
        <DATE>October 15, 2001</DATE>
        <ORDERS>
            <ITEM>
                <PRODUCT>Tomatoes</PRODUCT>
                <NUMBER>8</NUMBER>
                <PRICE>$1.25</PRICE>
            </ITEM>
            <ITEM>
                <PRODUCT>Oranges</PRODUCT>
                <NUMBER>24</NUMBER>
                <PRICE>$4.98</PRICE>
            </ITEM>
        </ORDERS>
    </CUSTOMER>
    <CUSTOMER>
        <NAME>
            <LAST_NAME>Jones</LAST_NAME>
            <FIRST_NAME>Polly</FIRST_NAME>
        </NAME>
        <DATE>October 20, 2001</DATE>
        <ORDERS>
            <ITEM>
                <PRODUCT>Bread</PRODUCT>
                <NUMBER>12</NUMBER>
                <PRICE>$14.95</PRICE>
            </ITEM>
            <ITEM>
                <PRODUCT>Apples</PRODUCT>
                <NUMBER>6</NUMBER>
                <PRICE>$1.50</PRICE>
            </ITEM>
        </ORDERS>
    </CUSTOMER>
    <CUSTOMER>
        <NAME>
            <LAST_NAME>Weber</LAST_NAME>
            <FIRST_NAME>Bill</FIRST_NAME>
        </NAME>
        <DATE>October 25, 2001</DATE>
        <ORDERS>
            <ITEM>
                <PRODUCT>Asparagus</PRODUCT>
                <NUMBER>12</NUMBER>
                <PRICE>$2.95</PRICE>
            </ITEM>
            <ITEM>
                <PRODUCT>Lettuce</PRODUCT>
                <NUMBER>6</NUMBER>
                <PRICE>$11.50</PRICE>
            </ITEM>
        </ORDERS>
    </CUSTOMER>
</DOCUMENT>

In this chapter, I'm going to take this DTD apart to see what makes it tick. This DTD is a pretty substantial one, though, so to show how DTDs work in overview, I'll start with a mini-example:

<?xml version="1.0"?>
<!DOCTYPE THESIS [
    <!ELEMENT THESIS (P*)>
    <!ELEMENT P (#PCDATA)>
]>
<THESIS>
    <P>
        This is my Ph.D. thesis.
    </P>
    <P>
        Pretty good, huh?
    </P>
    <P>
        So, give me a Ph.D. now!
    </P>
</THESIS>

Note the <!DOCTYPE> element here. Technically, this element is not an element at all, but a document type declaration (DTDs are document type definitions). You use document type declarations to indicate the DTD used for the document. The basic syntax for the document type declaration is <!DOCTYPE rootname [DTD]> (there are other variations we'll see in this chapter) where DTD is the document type definition that you want to use. DTDs can be internal or external, as we'll see in this chapter—in this case, the DTD is internal:

<!DOCTYPE THESIS [
    <!ELEMENT THESIS (P*)>
    <!ELEMENT P (#PCDATA)>
]>
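
For comparison, the same two declarations could live in a separate file and be referenced with the SYSTEM keyword, which is one way of attaching an external DTD. This is only a sketch: the file name thesis.dtd is an assumption, not a file from the example above.

```xml
<?xml version="1.0"?>
<!-- thesis.dtd is a hypothetical file holding the two ELEMENT declarations -->
<!DOCTYPE THESIS SYSTEM "thesis.dtd">
<THESIS>
    <P>
        This is my Ph.D. thesis.
    </P>
</THESIS>
```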

This DTD follows the W3C syntax conventions, which means that I specify the syntax for each element with <!ELEMENT>. Using this declaration, you can specify that the contents of an element can be parsed character data (#PCDATA), other elements that you've created, or both. In this example, I'm indicating that the <THESIS> element can contain only <P> elements, and that it can contain zero or more of them, which is what the asterisk (*) after P in <!ELEMENT THESIS (P*)> means.
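
The three kinds of content just mentioned can be sketched as follows. The element names HEADING, SECTION, PARA, and EM are hypothetical and are not part of the thesis example.

```xml
<!-- Text only: parsed character data -->
<!ELEMENT HEADING (#PCDATA)>
<!-- Child elements only: one HEADING followed by zero or more PARA elements -->
<!ELEMENT SECTION (HEADING, PARA*)>
<!-- Both (mixed content): text interleaved with EM elements -->
<!ELEMENT PARA (#PCDATA | EM)*>
<!ELEMENT EM (#PCDATA)>
```

Note that when text and child elements are mixed, XML 1.0 requires this exact form: #PCDATA comes first, the alternatives are separated by vertical bars, and the whole group ends with an asterisk.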

In addition to defining the <THESIS> element, I define the <P> element so that it can only hold text—that is, parsed character data (which is pure text, without any markup), with the term #PCDATA:

<!ELEMENT P (#PCDATA)>

In this way, I've specified the syntax of these two elements, <THESIS> and <P>. A validating XML processor can now check this document against the DTD that the document itself supplies.
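
To see what validation would catch, consider this variation, offered only as an illustrative sketch. It is well-formed, but a validating processor should reject it, because the <TITLE> element (made up for this example) is not declared in the DTD and is not allowed by the content model (P*).

```xml
<?xml version="1.0"?>
<!DOCTYPE THESIS [
    <!ELEMENT THESIS (P*)>
    <!ELEMENT P (#PCDATA)>
]>
<THESIS>
    <!-- Not valid: TITLE is undeclared, and THESIS may contain only P elements -->
    <TITLE>My Thesis</TITLE>
    <P>
        This is my Ph.D. thesis.
    </P>
</THESIS>
```

A nonvalidating parser would typically accept this document without complaint, since it only checks well-formedness.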

And that's what a DTD looks like in overview; now it's time to dig into the full details. We're going to take a look at all of them here and in the next chapter.
