3.2. Specifying Resources

To create a link to an object, we need to identify it. This is usually done by a string of characters called a uniform resource identifier (URI). There are two main categories of URI: the first uniquely identifies a resource based on its location, and the second gives the resource a unique name and relies on a table somewhere in the system to map names to physical locations.

A URI begins with a scheme, a short name that specifies how you're identifying the item. Often, it's a communications protocol like HTTP or FTP. This is followed by a colon (:) and a string of data that uniquely identifies the resource. Whatever the scheme, it must identify one resource uniquely.

The following sections describe the two types of URI in more detail.

3.2.1. Specifying Resources by Location

The type of URI most people are familiar with is the uniform resource locator (URL), which belongs to the first category: it uses location to directly identify a resource. The URL works like the address on a letter, where you specify a country, a state or province, a street address, and optionally an apartment number. Each additional piece of information in the address narrows down the location until it resolves to one place; thus, the postal address makes a good unique identifier.

Similarly, the URL uses the nomenclature of computer networks. This information can include a computer's domain name, its filesystem path,[1] and any other system-specific information that helps locate the resource.

[1] There's no requirement that the path part of a URL be a real filesystem path. Some schemes rely on a completely different kind of path, say a hierarchy of keywords. But in our examples we talk about filesystem paths; they are by far the most common way to locate files on a system.

The URL begins with a scheme that identifies a particular addressing method or communications protocol to be used. Many schemes have been defined, including hypertext transfer (HTTP), file transfer (FTP), and others. For example, an HTTP URL, used for locating web documents, looks like this:

http://address/path

The other parts of the HTTP URL are as follows:


address

The address of the system. The most common way to address a system is with a domain name, which contains a series of names for network levels separated by periods. For example, www.oreilly.com is the domain name for the web server at O'Reilly & Associates. The server exists in the com top-level domain for commercial networks. More specifically, it is part of the oreilly subdomain for the O'Reilly network, on a machine identified as www.


path

The system path to the resource. Within a computer system there can be many thousands of files. A universal system for locating files on a system uses a string called a path, which lists successively deeper directories separated by slashes.[2] For example, the path /documents/work/sched.html locates a file called sched.html in the subdirectory work of the main directory documents.

[2] Different systems have their own internal path representations; for example, MS-DOS uses backslashes (\) and Macintosh uses colons (:). In a URL, the path separator is always a forward slash (/).

Here are some examples of URLs:

http://www.w3c.org/Addressing/
ftp://ftp.fossil-hunters.org/pub/goodsites.pdf
file://www.laffs.com/clownwigs/catalog.txt
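
As a rough illustration of how the scheme, address, and path fit together, the following Python sketch splits the example URLs above into those three parts with the standard library's urllib.parse module. The helper name describe is made up for this example.

```python
from urllib.parse import urlsplit

# Example URLs taken from the text above.
urls = [
    "http://www.w3c.org/Addressing/",
    "ftp://ftp.fossil-hunters.org/pub/goodsites.pdf",
    "file://www.laffs.com/clownwigs/catalog.txt",
]

def describe(url):
    """Return the (scheme, address, path) triple for a URL (hypothetical helper)."""
    parts = urlsplit(url)
    return parts.scheme, parts.netloc, parts.path

for url in urls:
    scheme, address, path = describe(url)
    print(f"{scheme:5} {address:24} {path}")
```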

A URL can be extended to include additional information. A fragment identifier appended to the end of a URL with a hash symbol (#) refers to a location within the file. It can be used with only a few resource types, such as HTML and XML documents. The fragment identifier must be declared inside the target file, in an attribute. In HTML, it's called an anchor, and uses the <a> element like this:

<a name="ziggy">

In XML, you would use an ID attribute in any element you wish:

<section id="ziggy">

To link to either of these elements, simply append a fragment identifier to the URL:

http://cartoons.net/buffoon_archetypes.htm#ziggy

You can also send arguments to programs by appending a question mark (?) followed by the arguments to the URL, separated by ampersands (&). For example, linking to the following URL calls the program clock.cgi and passes it two parameters, zone (the time zone) and format (the output format):

http://www.tictoc.org/cgi-bin/clock.cgi?zone=gmt&format=hhmmss
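
To see how a program might pull these extensions back out of a URL, here is a minimal Python sketch using urllib.parse. The function name fragment_and_args is an assumption made for illustration; the two URLs are the ones shown above.

```python
from urllib.parse import urlsplit, parse_qs

def fragment_and_args(url):
    """Return the fragment identifier and a dict of query arguments (hypothetical helper)."""
    parts = urlsplit(url)
    # parse_qs maps each argument name to a list, since a name may repeat.
    return parts.fragment, parse_qs(parts.query)

print(fragment_and_args("http://cartoons.net/buffoon_archetypes.htm#ziggy"))
print(fragment_and_args("http://www.tictoc.org/cgi-bin/clock.cgi?zone=gmt&format=hhmmss"))
```

Each argument value comes back as a list because the same parameter name is allowed to appear more than once after the question mark.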

The URLs we've described so far are absolute URLs, meaning they are written out in full. Writing every URL this way is cumbersome, but there is a shortcut. Every absolute URL has a base component, consisting of the system and path information, which is itself a URL. For example, the base URL of http://www.oreilly.com/catalog/learnxml/index.html is http://www.oreilly.com/catalog/learnxml/. If the target resource in a link shares part of the base URL with the local resource, you can use a relative URL: an absolute URL with part of the beginning lopped off.

The table below shows some examples of URLs. The URLs in the first column are equivalent to those in the second column. Assume that the source URL is http://www.oreilly.com/catalog/learnxml/index.html.

Relative URL                                   Absolute URL
//www.oreilly.com/catalog/learnxml/desc.html   http://www.oreilly.com/catalog/learnxml/desc.html
../                                            http://www.oreilly.com/catalog/
errata/                                        http://www.oreilly.com/catalog/learnxml/errata/
/                                              http://www.oreilly.com/
/catalog/learnxml/desc.html                    http://www.oreilly.com/catalog/learnxml/desc.html

It's a good idea to use relative URLs wherever possible. Not only is it less to type, but if you ever decide to move an interlinked collection of documents to another place, the links will still be valid since only the base URL will have changed.
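
One way to check how a relative URL resolves is to let a URL library do the work. The sketch below uses Python's urllib.parse.urljoin with the source URL assumed earlier in this section; the wrapper function resolve is a name invented for the example.

```python
from urllib.parse import urljoin

# The source URL assumed in the text above.
BASE = "http://www.oreilly.com/catalog/learnxml/index.html"

def resolve(relative, base=BASE):
    """Resolve a relative URL against the base URL (hypothetical helper)."""
    return urljoin(base, relative)

for rel in ["desc.html", "../", "errata/", "/", "/catalog/learnxml/desc.html"]:
    print(f"{rel:30} -> {resolve(rel)}")
```

Note that each "../" climbs one directory level from the base, so a single "../" already reaches http://www.oreilly.com/catalog/.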

There may be times when you want to set the base URL explicitly. Perhaps the XML processor isn't smart enough to figure it out, or perhaps you want to link to many files in a different location. The attribute xml:base is used to set a default base URL for all relative URLs in its scope, which is the whole subtree of the element it appears in. For example:

<?xml version="1.0"?>
<html>
  <head>
    <title>Book Information</title>
  </head>
  <body>
    <ul xml:base="http://www.oreilly.com/catalog/learnxml/">
      <li><a href="index.html">Main page</a></li>
      <li><a href="desc.html">Description</a></li>
      <li><a href="errata/">Errata</a></li>
    </ul>
    <p xml:base="http://www.coolbooks.com/reviews/">
      There's also a <a href="lxml.html">review of
      the book</a> available.
    </p>
  </body>
</html>

No matter where this document is located, its links will always point to the same place because the base information is hard-coded.
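
The scoping rule can be sketched in a few lines of Python. This is only an illustration, not how any particular XML processor implements it: the function resolved_links and the trimmed-down sample document are assumptions, and the resolution is done by hand while walking the tree with the standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# ElementTree exposes xml:base under the reserved XML namespace.
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"

# A shortened version of the document above (sample data for this sketch).
DOC = """<html>
  <body>
    <ul xml:base="http://www.oreilly.com/catalog/learnxml/">
      <li><a href="index.html">Main page</a></li>
      <li><a href="errata/">Errata</a></li>
    </ul>
    <p xml:base="http://www.coolbooks.com/reviews/">
      <a href="lxml.html">review</a>
    </p>
  </body>
</html>"""

def resolved_links(element, base=""):
    """Walk the tree, carrying the xml:base in scope, and resolve each href."""
    # An xml:base on this element applies to the element and its whole subtree.
    base = urljoin(base, element.get(XML_BASE, ""))
    links = []
    if "href" in element.attrib:
        links.append(urljoin(base, element.attrib["href"]))
    for child in element:
        links.extend(resolved_links(child, base))
    return links

links = resolved_links(ET.fromstring(DOC))
for link in links:
    print(link)
```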

3.2.2. Specifying Resources by Name

The resource-location scheme relies on resources remaining in one place. When the target resource moves from one location to another, the link breaks. Unfortunately, this happens all the time. Files and systems get moved around, renamed, or removed altogether. When that happens, links to those resources are unusable until the source document is updated. To alleviate this problem, a different scheme has been proposed: resource names.

The philosophy behind resource-naming schemes is that a unique name never changes, no matter where the item moves. For example, the typical American citizen has a nine-digit social security number that she will carry throughout her life. Other details will change, such as her driver's license number, her street address, or even her name, but the SSN will not. Whether she lives in Portland, St. Louis, or Walla Walla, the SSN will always point to her.

Location-independent schemes for finding resources eliminate the problem of breaking links, so why aren't they used more frequently? It is certainly more convenient to type a keyword or two in your web browser and have it always bring you to the right place, even if the address has changed. However, such schemes are still new and not well-defined in contrast to more popular direct-addressing methods. Network addresses work because every computer system handles them the same way, by using IP addressing, which is built into the TCP/IP stack of your computer's operating system. A resource-naming scheme requires a means of mapping the unique name to a changing address, perhaps in a configuration file, and it requires software that knows how to look up the addresses.

One common resource-naming scheme used in XML uses an identifier known as the formal public identifier (FPI).[3] An FPI is a text string that describes several traits about a resource. Taken together, this information creates an identifying label. The FPI usually appears in document type declarations (see Chapter 2) and entity declarations (see Chapter 5).

[3] A formal ISO standard: ISO 8879.

The syntax for an FPI is shown in Figure 3.3. An FPI starts with a symbol (1) representing the registration status of the identifier: a plus sign if it's registered and publicly recognizable, a minus sign if it isn't, and ISO if it belongs to the ISO. The symbol is followed by a separator consisting of two slashes (2), and then the owner identifier (3), which is a short string that identifies the owner or maintainer of the entity that the FPI represents.[4] After another separator comes the public text class (4) describing the kind of resource the FPI represents (for example, DTD for a document type definition). The public text class is followed by a space and a short description of the resource (5), such as its name or purpose. Finally, there is another separator followed by a two-letter code specifying the language of the resource, if applicable (6).

[4] Note that if the owner identifier is unregistered, it may not be unique.

Figure 3.3. Formal public identifier syntax

Consider the following example, a formal public identifier, belonging to an unregistered owner, for a DTD written in English:

-//ORA//DTD DocBook Lite XML 1.1//EN

The minus sign (-) means that the organization sponsoring the FPI is not formally registered with a public body such as the ISO.

The institution responsible for maintaining this document is ORA, short for O'Reilly & Associates.

DTD indicates that the type of document being referred to is a document type definition. It's followed by a text description, DocBook Lite XML 1.1, which includes the object's name, version number, and other aspects in a brief string.

The two-letter language code EN names the primary language of the document as English. The language codes are defined in ISO 639.
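
Putting the pieces just described together gives the string -//ORA//DTD DocBook Lite XML 1.1//EN. As a minimal sketch, the Python below splits such a string back into its fields. The FPI class and parse_fpi function are names invented here, and the code handles only the four-field shape walked through above.

```python
from typing import NamedTuple

class FPI(NamedTuple):
    """Fields of a formal public identifier (hypothetical container)."""
    registration: str  # "+" if registered, "-" if not
    owner: str         # owner identifier, e.g. ORA
    text_class: str    # public text class, e.g. DTD
    description: str   # short description of the resource
    language: str      # two-letter language code

def parse_fpi(fpi):
    """Split an FPI on its double-slash separators (sketch, four fields only)."""
    registration, owner, text, language = fpi.split("//")
    # The public text class is separated from the description by one space.
    text_class, _, description = text.partition(" ")
    return FPI(registration, owner, text_class, description, language)

example = parse_fpi("-//ORA//DTD DocBook Lite XML 1.1//EN")
print(example)
```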

To complete the link, an XML processor needs to know how to get the physical location of the resource from the FPI. The mechanism for doing that generally involves looking up the name in a table called a catalog. This is usually a file that resides on your system, containing columns of FPIs and the system paths to the resources. Catalogs used for looking up addresses from FPIs are described formally by the OASIS group in their technical resolution 9401:1997, which you can find at http://www.oasis-open.org/html/a401.htm. An online form for resolving FPIs exists at http://www.ucc.ie/cgi-bin/public.
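
The lookup itself is simple enough to sketch. The Python below reads catalog lines of the form PUBLIC "identifier" "path" into a table and resolves an FPI against it. The file paths and the second catalog entry are made up for illustration, and a real catalog supports more entry types than this sketch handles.

```python
import shlex

# Sample catalog text; the paths and the Memo entry are hypothetical.
CATALOG_TEXT = '''
PUBLIC "-//ORA//DTD DocBook Lite XML 1.1//EN" "/usr/local/dtds/dblite.dtd"
PUBLIC "-//Example//DTD Memo 1.0//EN" "/usr/local/dtds/memo.dtd"
'''

def load_catalog(text):
    """Build a table mapping each FPI to a system path (sketch, PUBLIC entries only)."""
    table = {}
    for line in text.splitlines():
        fields = shlex.split(line)  # honors the double-quoted strings
        if len(fields) == 3 and fields[0] == "PUBLIC":
            table[fields[1]] = fields[2]
    return table

def resolve_fpi(fpi, table):
    """Return the system path for an FPI, or None if the catalog has no entry."""
    return table.get(fpi)

table = load_catalog(CATALOG_TEXT)
print(resolve_fpi("-//ORA//DTD DocBook Lite XML 1.1//EN", table))
```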

In XML, you cannot use an FPI alone in an entity declaration. It must always be followed by a system identifier (the keyword SYSTEM, followed by a system path or URL in quotes). The designers of XML felt it was risky to rely on XML processors to obtain the physical location from the public identifier, and that a hint should be included. This dilutes the value of the public identifier, but is probably a good idea, at least until FPIs are more widely used.

3.2.3. Internal Linking with ID and IDREF

So far, we've talked about how to identify whole resources, but that's just scratching the surface. You might be after a specific piece of data deep inside a document. How do you go about locating one element from among thousands, all of the same type? One simple way is to label it. The ID and IDREF attributes, described next, let you label an element and link to the element with that label.

3.2.3.1. ID: unique identifiers for elements

In the United States, a commonly used unique identifier is the Social Security Number (SSN). No two people in the country can have the same nine-digit SSN (or else one of them is probably doing something they shouldn't be doing). You wouldn't call your pal by her SSN: "Hey, 456-02-9211, can I borrow your car?" But it's a convenient number for institutions such as the government or an insurance company to use as an account number, as it ensures they won't cross two people by mistake. In this same vein, XML provides a special element marker that is guaranteed to match one and only one element per document.

This marker is in the form of an attribute. Attributes have different types, and one of them is ID. When you define an attribute in a DTD as type ID (see Chapter 5 for details on DTDs), the attribute takes on a special significance to the XML parser. The value of the attribute is treated as a unique identifier, a string of characters that may not be used in any other ID attribute in the document, like this:

<sandwich lbl="blt">Bacon, lettuce, tomato on rye</sandwich>
<sandwich lbl="ham-n-chs">Ham and swiss cheese on roll</sandwich>
<sandwich lbl="turkey">Turkey, stuffing,
    cranberry sauce on bulky roll</sandwich>

These three elements all have an lbl attribute defined in a DTD as type ID. Their values must be valid XML names, which contain no spaces, and each is different. It would be an error if two or more lbl attributes had the same value. In fact, no two attributes of type ID can have the same value even if they have different names.
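
A validating parser performs this check for you. To make the rule concrete, here is a rough Python sketch of the same idea. Because xml.etree.ElementTree does not read attribute types from a DTD, the sketch assumes you tell it which attribute names are IDs; the <menu> wrapper element and the deliberately duplicated "blt" value are invented for the example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Sample data: the third sandwich reuses "blt" on purpose.
DOC = """<menu>
  <sandwich lbl="blt">Bacon, lettuce, tomato on rye</sandwich>
  <sandwich lbl="ham-n-chs">Ham and swiss cheese on roll</sandwich>
  <sandwich lbl="blt">Turkey, stuffing, cranberry sauce on bulky roll</sandwich>
</menu>"""

def duplicate_ids(root, id_attrs=("lbl",)):
    """Return ID values used more than once, across all ID-typed attributes."""
    counts = Counter(
        el.get(name)
        for el in root.iter()
        for name in id_attrs
        if el.get(name) is not None
    )
    return sorted(value for value, n in counts.items() if n > 1)

print(duplicate_ids(ET.fromstring(DOC)))
```

Passing several names in id_attrs mirrors the rule that ID values share one pool per document, whatever the attributes are called.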

Let's think about that for a moment. It seems rather strict to require IDs to be different. Why do we need the parser to check for similarity? The reason is that it will save you tons of grief later when you're using the IDs as endpoints for links. In a simple two-sided link, you want to specify one and only one target. If there were two or more with the same identifier, it would be an ambiguous situation with no way to predict where the link will end up.

The problem of ambiguous element labels comes up a lot in HTML. To create a label in an HTML document, you have to have an anchor: an <A> element with a NAME attribute set to some character string. For example:

<A NAME="beginning_of_the_story">

Now, if you make a mistake and have two <A> labels with the same value, HTML has no problem with that. The browser doesn't complain, and the link works just fine. The problem is that you don't know where you'll end up. Perhaps the link will connect with the first instance, or maybe it won't. The HTML specification doesn't say one way or the other. If you're a web designer or author, you may end up pulling your hair out trying to figure out why the link doesn't go where you want it to.

So, by being strict, XML saves us embarrassment and confusion later. We know when we test the validity of the document that all IDs are unique, and all is well with the links—assuming the target can be found. This is the role of IDREF, as we will see later.

Which elements get IDs is up to you, but you should exercise some restraint. Though it may be tempting to give every element its own ID on the remote chance that you might want to link to it, you're better off labeling only major elements. In a book, for example, you would probably add IDs to chapters, sections, figures, and tables, which frequently are the targets of references in the text, but you wouldn't need to give IDs to most inline elements.

You should also be careful about the syntax of your labels. Try to think of names that are easy to remember and relevant to the context, like "vegetables-rutabaga" or "intro-chapter". A hierarchical naming structure can be used to match the actual structure of the document. ID values like "k3828384" or "thingy" are bad because it's nearly impossible to remember what they are or what they stand for. Don't rely on numbers, if you can help it, in case you need to shuffle things around; IDs like "chapter-13" are not a great idea.

3.2.3.2. IDREF: guaranteed, unbroken links

XML provides another special attribute type called IDREF. As its name implies, it's a reference to an ID somewhere in the same document. There is no way in XML to describe the relationship between the referred and referring elements. All we can say is that some relationship exists, which is defined in a stylesheet or processing application. This might seem to be of limited value, but in fact it gives us an extremely simple and effective mechanism for connecting two or more elements without resorting to a complex XLink structure, as described in Section 3.4 later in this chapter.

There's another benefit. We have seen how ID attributes are guaranteed to be unique within a document. IDREF attributes have a guarantee of their own: any ID value referenced by an IDREF must exist in the same document. If an IDREF points to an ID that doesn't exist, a validating parser lets you know, and you can fix the broken link before your document goes live.
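
The check behind this guarantee can be sketched in Python as follows. The attribute names id and idref and the element names in the sample are assumptions for illustration; in a real document the DTD decides which attributes have type ID and IDREF.

```python
import xml.etree.ElementTree as ET

# Sample data: the second reference has no matching ID.
DOC = """<doc>
  <figure id="fig-wumpus"/>
  <xref idref="fig-wumpus"/>
  <xref idref="fig-jabberwock"/>
</doc>"""

def dangling_refs(root, id_attr="id", idref_attr="idref"):
    """Return every IDREF value that matches no ID in the document."""
    ids = {el.get(id_attr) for el in root.iter() if el.get(id_attr) is not None}
    return [
        el.get(idref_attr)
        for el in root.iter()
        if el.get(idref_attr) is not None and el.get(idref_attr) not in ids
    ]

print(dangling_refs(ET.fromstring(DOC)))
```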

What can you use IDs and IDREFs for? Here's a short list of possibilities:

  • Cross-references to parts of a book, such as tables, figures, chapters, and appendixes

  • Indexes and tables of contents for a document with many sections

  • Elements that denote a range and can appear in another element, such as terms in an index that span several pages

  • Links to footnotes and sidebars

  • Cross-references within an object-oriented database whose physical structure may not match its logical structure

For instance, you may have several footnotes in a document that share the same text. In this example, <footnoteref> is an element that links to a <footnote> with the implication that it will inherit the target element's text when the document is processed:

<para>The wumpus<footnote id="donut-warning">
Do not try to feed this animal donuts!</footnote>
lives in caves and hunts unsuspecting computer nerds. It is related
to the jabberwock<footnoteref idref="donut-warning"/>,
which prefers to hunt its prey in the open.</para>
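
How a processing application might follow such a link is sketched below in Python. This is one possible approach, not a prescribed one: the function footnote_text_for_refs is hypothetical, and it simply looks up the text of the <footnote> that each <footnoteref> points to.

```python
import xml.etree.ElementTree as ET

# The paragraph from the example above.
PARA = """<para>The wumpus<footnote id="donut-warning">
Do not try to feed this animal donuts!</footnote>
lives in caves and hunts unsuspecting computer nerds. It is related
to the jabberwock<footnoteref idref="donut-warning"/>,
which prefers to hunt its prey in the open.</para>"""

def footnote_text_for_refs(root):
    """Return the footnote text that each <footnoteref> inherits, in document order."""
    notes = {fn.get("id"): (fn.text or "").strip() for fn in root.iter("footnote")}
    # A KeyError here would mean a footnoteref points to a missing footnote.
    return [notes[ref.get("idref")] for ref in root.iter("footnoteref")]

print(footnote_text_for_refs(ET.fromstring(PARA)))
```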

A subtle point in using IDREF is knowing what to reference. For example, if you want to reference a chapter with the purpose of including its title in the displayed text, should you point to the chapter's title or to the chapter element itself? Usually it is best to refer to the most general element that fits the meaning of your link, in this case the chapter. You may change your mind later and decide to omit the title, displaying instead the chapter number or some other attribute. Let the stylesheet worry about how to find the information it needs for presentation. In the markup, you should concentrate on meaning.
