Chapter 2. XML concepts for Office users

Introductory Discussion

What is XML, really?
Four principles of generalized markup
Abstraction vs. rendition
Elements and attributes
XML and the Web

Virtually every influential company in the software industry is promoting XML as the next step in the Web’s evolution. Why? Because it enables information sharing, and that is the key to electronic commerce, application integration, and many other desirable things.

And now Microsoft has bet the future of the world’s most successful office suite on XML. How can these companies be so confident about something so new? More important: how can you be sure that your time invested in learning and using XML will be profitable?

We can all safely bet on XML because its technology is in fact very old and has been proven effective over several decades and thousands of projects. The easiest way to understand the central ideas of XML is to go back to their source, the Standard Generalized Markup Language (SGML).

XML is, in fact, a streamlined subset of SGML, so SGML’s track record is XML’s as well. SGML enables information interchange within and between some of the world’s largest companies. Its extensible markup technology was first used for document processing, but over time it has become clear that data and documents are the same thing! To be precise, documents are the interchangeable form of data.

If you understand where XML comes from, you’ll better understand what it is, how to use it, and where it and the desktop are going.

Formatting markup

XML comes from a rich history of text processing systems. Markup actually predates the computer. Figure 2-1 shows a marked-up manuscript that might have been submitted to a human who would compose the type for printing.

Figure 2-1. A manuscript “marked up” by hand

The first wave of automated text processing was computer typesetting. An author would type a document and include style codes to describe how the document should be formatted. The computer would read the style codes and the rest of the text and print the document with the described formatting.

The file that contains the data of the document, plus the description of the desired format, is called a rendition. The style codes in the rendition file are called formatting markup.

The system interprets the formatting markup and converts the rendition into something physically perceivable by a human being – a presentation. The presentation medium was originally paper, but eventually electronic display was added.

This scenario isn’t totally history. You and Word work with renditions today. It’s just that Word gives you a nicer interface to manipulate them.

Word’s user interface to the rendition (that is, to the .doc file with the style codes in it) is designed to look like the presentation (the finished paper product). The interface is called a What You See Is What You Get (WYSIWYG) interface. Since a rendition describes a presentation, it is convenient to have the user interface reflect the end-product.

More than convenient. For Word it is essential, as we saw in 1.2.1, “Separating the document representation from the software”, on page 9. Word’s native binary .doc rendition file format is unfathomable to humans, and to virtually all non-Microsoft software.

For interchange (and virus avoidance) Word also offers a plain-text equivalent called Rich Text Format (RTF). While computers can handle it better than .doc, humans won’t find it much of an improvement, as a glance at Example 2-1 will demonstrate.^[1]

Example 2-1. Doug’s article, represented in RTF (article.rtf)

\pard\plain\s1\ql \li0\ri0\sb240\sa60\keepn\widctlpar\aspalpha
\aspnum\faauto\outlinelevel0\adjustright\rin0\lin0\itap0
\pararsid13583126 \f1\fs32\lang1033\langfe1033\kerning32\cgrid
\langnp1033\langfenp1033
{\insrsid13583126\par }
\pard \s1\ql \li0\ri0\sb240\sa60\keepn\widctlpar\aspalpha\aspnum
\faauto\outlinelevel0\adjustright\rin0\lin0\itap0\pararsid5243775
{\insrsid10036224 Sales Update\par }
\pard\plain \ql \li0\ri0\widctlpar\aspalpha\aspnum\faauto
\adjustright\rin0\lin0\itap0\pararsid10036224 \fs24\lang1033
\langfe1033\cgrid\langnp1033\langfenp1033
{\insrsid16529125 Doug Jones}
{\insrsid10036224\par February 3, 2004\par }
{\insrsid10036224\charrsid10036224\par }
\pard\plain \s2\ql \li0\ri0\sb240\sa60\keepn\widctlpar
\aspalpha\aspnum\faauto\outlinelevel1\adjustright\rin0\lin0\itap0
\pararsid12923755 \i\f1\fs28\lang1033\langfe1033\cgrid\langnp1033
\langfenp1033
{\insrsid10036224 A great month!}
{\insrsid1358312\par }
\pard\plain \ql \li0\ri0\widctlpar\aspalpha\aspnum\faauto
\adjustright\rin0\lin0\itap0 \fs24\lang1033\langfe1033\cgrid
\langnp1033\langfenp1033
{\insrsid10036224\par This month\rquote s figures are a }
{\i\insrsid10036224\charrsid10036224 huge}
{\insrsid10036224 improvement over this month last year. We }
{\insrsid12923755 sold 1,342 widgets for a total revenue of
$14,327.}
{\insrsid10036224\par\par }
\pard\plain \s2\ql \li0\ri0\sb240\sa60\keepn\widctlpar\aspalpha
\aspnum\faauto\outlinelevel1\adjustright\rin0\lin0\itap0
\pararsid12923755 \i\f1\fs28\lang1033\langfe1033\cgrid\langnp1033
\langfenp1033
{\insrsid10036224 More work to do\par }
\pard\plain \ql \li0\ri0\widctlpar\aspalpha\aspnum\faauto
\adjustright\rin0\lin0\itap0 \fs24\lang1033\langfe1033\cgrid
\langnp1033\langfenp1033
{\insrsid10036224 Let\rquote s not rest on}{\insrsid12923755 our
past success. Let\rquote s get out there and sell, sell, sell!}
{\insrsid10036224\par }{\insrsid13583126\par }

Generalized markup

Formatting markup is sufficient if your only goal is to create a single rendition and then print it. In 1969, IBM asked a young researcher named Charles Goldfarb (the name may sound familiar) to build a system for editing, searching, managing, and publishing legal documents.

Goldfarb found that there were IBM products for each of these tasks but they could not communicate with each other. They could not share information! Each of them used different markup. They could not read each other’s files, just as you may have had trouble loading WordPerfect files into Word or vice versa.

The problem then, as now, was that to integrate these diverse products, a neutral document representation^[2] was needed for the information – one that wasn’t designed for a single product.

Goldfarb – later joined by two other IBM researchers, Ed Mosher and Ray Lorie – set out to solve this problem. The team recognized (eventually) that the solution would need to satisfy four principles:

neutral data representation (markup language)

Various computer programs and systems would need to be able to read and write information in the same representation.

extensible markup

There is an immeasurable variety of types of information that must be exchanged. The markup language must be extensible enough to support them all.

rule-based markup

There must be a formal way of describing the rules followed by documents of the same type. Computers must be able to read and enforce the rules.^[3]

stylesheets

For sharing to work, it must be possible to create document types that aren’t renditions. The real information must be accessible in the abstract, independent of the formatting – or any other processing – instructions. Ideally, the latter should be in separate stylesheets.

These principles are important far beyond the exchange of traditional documentation. In fact they underlie the exchange of any form of information.

The IBM team’s solution was the Generalized Markup Language (GML), which Goldfarb later drew on for his invention of the Standard Generalized Markup Language (SGML) – the parent of HTML and XML.

In the following sections, we’ll look at each of these principles in a bit more detail and see how they apply to Word.

Neutral data representation (markup language)

The need for a neutral data representation is easy to understand. Tools cannot interchange information if they do not speak the same language.

Extensible markup

The IBM team realized that the neutral representation should be specific to legal documents while at the same time being general enough to be used for things that are completely unrelated to the law. This seems like a paradox but it is not as impossible as it sounds!

Vocabularies

This idea is a little more subtle to grasp, but vital to understanding XML. For example, lawyers and scientists both use Latin, but they do not use the same vocabulary.

Similarly, in Example 2-2 we have Doug’s newsletter article in XML, using a vocabulary that was designed for newsletter articles. In Example 2-3, however, we have the same article represented using a different vocabulary.

Example 2-2. Doug’s article, represented in article XML

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<article xmlns="http://xmlinoffice.com/article"
         type="sales" id="A123">
  <title>Sales Update</title>
  <author>Doug Jones</author>
  <date>February 3, 2004</date>
  <body>
   <section>
     <header>A great month!</header>
     <para>This month's figures are a <em>huge</em> improvement over
this month last year. We sold 1,342 widgets for a total revenue of
$14,327.</para>
   </section>
   <section>
     <header>More work to do</header>
     <para>Let's not rest on our past success. Let's get out there
and sell, sell, sell!</para>
   </section>
  </body>
</article>

Example 2-3. Doug’s article, represented in WordML XML (article WordML.xml)

<w:body>
 <wx:sect>
  <wx:sub-section>
   <w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr></w:p>
  </wx:sub-section>
  <wx:sub-section>
   <w:p>
    <w:pPr><w:pStyle w:val="Heading1"/></w:pPr>
    <w:r><w:t>Sales Update</w:t></w:r></w:p>
   <w:p><w:r><w:t>Doug Jones</w:t></w:r></w:p>
   <w:p><w:r><w:t>February 3, 2004</w:t></w:r></w:p><w:p/>
   <wx:sub-section>
    <w:p>
     <w:pPr><w:pStyle w:val="Heading2"/></w:pPr>
     <w:r><w:t>A great month!</w:t></w:r></w:p>
    <w:p/>
    <w:p>
     <w:r><w:t>This month's figures are a </w:t></w:r>
     <w:r><w:rPr><w:i/></w:rPr><w:t>huge</w:t></w:r>
     <w:r><w:t> improvement over this month last year. We sold 1,342
widgets for </w:t></w:r><w:proofErr w:type="gramStart"/>
     <w:r><w:t>a total</w:t></w:r><w:proofErr w:type="gramEnd"/>
     <w:r><w:t> revenue of $14,327.</w:t></w:r></w:p><w:p/>
   </wx:sub-section>
   <wx:sub-section>
    <w:p>
     <w:pPr><w:pStyle w:val="Heading2"/></w:pPr>
     <w:r><w:t>More work to do</w:t></w:r></w:p>
    <w:p><w:r><w:t>Let's not rest on our past success. Let's get out
there and sell, sell, sell!</w:t></w:r></w:p><w:p/>
    <w:sectPr>
     <w:pgSz w:w="12240" w:h="15840"/>
     <w:pgMar w:top="1440" w:right="1800" w:bottom="1440"
      w:left="1800" w:header="720" w:footer="720" w:gutter="0"/>
     <w:cols w:space="720"/><w:docGrid w:line-pitch="360"/>
    </w:sectPr>
   </wx:sub-section>
  </wx:sub-section>
 </wx:sect>
</w:body>

Before we look at the differences between the two XML representations, let’s use Example 2-2 to learn the basics of XML: elements and tags, content and data, and attributes.

elements and tags

The document has one title element. It begins with a start-tag (<title>) and ends with a corresponding end-tag (</title>). The forward slash differentiates an end-tag from a start-tag. The purpose of the tags is to identify the type of element (title, in this case) and show where it starts and ends.

content and data

The text in between the tags is the content of the element. In the case of title the content is entirely data characters. However, the content of body has only markup that represents elements, and the first para has mixed content: data and an element (em). Elements that occur in the content of elements form a hierarchy. At the top is the document element (or root element): in this case, article.

attributes

An element can have properties besides its element type name and content. These are represented by name-value pairs called attributes. In the example, only article has attributes. The value of the one named id is the name of this particular article element, which is useful when a group of articles is combined in a newsletter. The xmlns attribute declares a namespace, which we explain in 2.6, “Namespaces”, on page 39.
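To see elements, content and attributes from a program's point of view, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree module. The choice of Python, and the shortened one-section copy of Example 2-2, are our assumptions for illustration only; the sketch does not describe any Office feature.

```python
import xml.etree.ElementTree as ET

# A shortened copy of Example 2-2 (one section only, for brevity).
ARTICLE = """<article xmlns="http://xmlinoffice.com/article"
         type="sales" id="A123">
  <title>Sales Update</title>
  <body>
    <section>
      <header>A great month!</header>
      <para>This month's figures are a <em>huge</em> improvement.</para>
    </section>
  </body>
</article>"""

# ElementTree writes a namespaced name as {namespace-URI}local-name.
NS = "{http://xmlinoffice.com/article}"

root = ET.fromstring(ARTICLE)        # the document (root) element
title = root.find(NS + "title")      # a child element of article
para = root.find(f".//{NS}para")     # a descendant, found anywhere below the root
em = para.find(NS + "em")            # the element inside the mixed content

print(root.tag)      # element-type name, qualified by its namespace
print(root.attrib)   # attributes as name-value pairs (xmlns is not listed here)
print(title.text)    # content that is entirely data characters
print(repr(para.text), repr(em.text), repr(em.tail))  # data, element, more data
```

Notice that the parser reports type and id as attributes but handles xmlns separately, as a namespace declaration that qualifies every element-type name in the document.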

The attribute names and element-type names, such as title, comprise a custom vocabulary specifically designed to describe newsletter articles. In contrast, Example 2-3 uses a vocabulary called the Word Markup Language (WordML).^[4]

WordML uses XML, as do thousands of other data representations. However, the WordML vocabulary is unique. It was designed by Microsoft to be the XML equivalent of .doc and RTF. As such, it describes components and properties of the Word formatting model. Even if one of its element-type names were the same as one in some other vocabulary, it would probably not identify the same type of thing. (We explain WordML in detail in Chapter 5, “Rendering and presenting XML documents”, on page 86.)

The reason that generalized markup became successful is that users can create their own vocabularies to meet their needs. But all vocabularies use the standard markup language syntax (and other language constructs), so tools can be developed to do the large amount of common processing that is necessary for all vocabularies.

Abstractions and renditions

Computers are not as smart as we are. If we want the computer to consider a piece of text to be written in a foreign language (for instance for spell-checking purposes) then we must label it explicitly foreign-phrase and not just put it in italics! The “foreign phrase” is the abstraction that we are trying to represent; italics is just a particular rendition of that abstraction for visual presentations. For audible presentations, the rendition might be a voice with an accent.

Formatting markup is specific to a particular use of the information. Search engines cannot do very useful searching on italics because they do not know why something is italicized. It could be a foreign phrase but it could also be a citation of another document.

In contrast, the search engine could do something very helpful with suitably-marked citation elements: it could return a list of those documents that are cited by other documents.

Italics are a form of markup specific to a particular application: formatting. In contrast, the citation element is markup that can be used by a variety of applications. That is why Goldfarb named this form of markup generalized markup. Generalized markup is the alternative to formatting markup and other specialized single-use coding schemes.
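The difference is easy to demonstrate with a few lines of code. The following sketch (Python, standard library only) searches two versions of the same sentence for citations. The element-type names p and i, and the sample sentence, are hypothetical; citation and foreign-phrase are the names used in the discussion above.

```python
import xml.etree.ElementTree as ET

# Formatting markup: both phrases are merely italic.
RENDITION = "<p>See <i>Moby-Dick</i> for the <i>raison d'etre</i> of whaling.</p>"

# Generalized markup: each phrase says what it is.
ABSTRACTION = ("<p>See <citation>Moby-Dick</citation> for the "
               "<foreign-phrase>raison d'etre</foreign-phrase> of whaling.</p>")

def cited_works(xml_text):
    """Return the text of every citation element in the document."""
    return [c.text for c in ET.fromstring(xml_text).iter("citation")]

print(cited_works(RENDITION))    # nothing to find: italics do not say why
print(cited_works(ABSTRACTION))  # the citation is labeled, so it can be found
```

A program given only the rendition could at best guess which italic phrases are citations. With the abstraction, no guessing is needed.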

Caution

Generalized markup is often called structured markup and the act of using it structuring a document. Unfortunately, “structured” is the most misused term in markup languages, with at least four different meanings (see 20.1, “Structured vs. unstructured”, on page 430). This use of it implies that only abstractions have structure, which Figure 2-2 clearly refutes. It shows the Word template ProfessionalReport.dot with the Document Map and Style Area in view. The structural hierarchy is in the left pane and style codes are in the center. A Word template is not just a stylesheet, but also a guide to the structure of the rendition.

Figure 2-2. Structure of a rendition

Rule-based markup

If computer systems are to work with documents reliably, the documents have to follow certain rules. In retrospect we can see that this is important for interchanging information of all sorts, whether it is traditionally considered a document or not.

For instance a courtroom transcript might be required to have the name of the judge, defendant, both attorneys and (optionally) the names of members of the jury (if there is one). Since humans are prone to make mistakes, the computer would have to enforce the rules for us.

In other words the legal markup language should be specified in some formal way that would restrict elements appropriately. If the court stenographer tried to submit a transcript to the system without these elements being properly filled in, the system would check its validity and complain that it was invalid.

Once again, this concept is today very common in the database world. Database people typically have several layers of checking to guarantee that improper data cannot appear in their databases. For instance syntactic checks guarantee that phone numbers are composed of digits and that people’s names are not. Semantic checks ensure that business rules are followed (such as “purchase order numbers must be unique”). The database world calls the set of constraints on the database structure a schema. This word has also caught on in the XML world.

Of course, court transcripts have a different structure from wills, which in turn have a different structure from memos. So you would need to rigorously define what it means for each type of document to be valid. In markup language terminology, each of these is a document type and the formal definition that describes each type is called a document type definition (DTD) or schema definition. These terms refer both to the vocabulary and the constraints on the vocabulary’s use.

Example 2-4 shows a simple DTD expressed with three XML element-type declarations. Example 2-5 shows the equivalent schema definition expressed using the XML Schema Definition Language (XSDL).^[5]

Example 2-4. Markup declarations

<!ELEMENT Q-AND-A (QUESTION,ANSWER)+>
<!-- This allows: question, answer, question, answer ... -->
<!ELEMENT QUESTION (#PCDATA)>
<!-- Questions are just made up of textual data -->
<!ELEMENT ANSWER (#PCDATA)>
<!-- Answers are just made up of textual data -->

Example 2-5. Schema definition

<schema xmlns='http://www.w3.org/2001/XMLSchema'
        xmlns:qa='http://www.q.and.a.com/'
        targetNamespace='http://www.q.and.a.com/'>
 <element name="Q-AND-A">
 <complexType>
  <sequence minOccurs="1" maxOccurs="unbounded">
   <element ref="qa:QUESTION"/>
   <element ref="qa:ANSWER"/>
  </sequence>
 </complexType>
 </element>
<!-- This allows: question, answer, question, answer ... -->
 <element name="QUESTION" type="string"/>
<!-- Questions are just made up of textual data -->
 <element name="ANSWER" type="string"/>
<!-- Answers are just made up of textual data -->
</schema>
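To make "enforcing the rules" concrete, here is a hand-rolled Python sketch that checks a document against the content model of Example 2-4. It is only a stand-in for a real validating parser, which would read the DTD or schema itself. The function name is hypothetical, and the sketch ignores the namespace that Example 2-5 declares.

```python
import xml.etree.ElementTree as ET

def is_valid_q_and_a(xml_text):
    """Check a document against the rules of Example 2-4 by hand."""
    root = ET.fromstring(xml_text)
    children = list(root)
    names = [child.tag for child in children]
    # (QUESTION,ANSWER)+ means one or more QUESTION/ANSWER pairs, in that order.
    expected = ["QUESTION", "ANSWER"] * (len(names) // 2)
    return (
        root.tag == "Q-AND-A"
        and len(names) >= 2
        and names == expected
        # (#PCDATA) means textual data only: no subelements allowed.
        and all(len(child) == 0 for child in children)
    )

good = "<Q-AND-A><QUESTION>Why?</QUESTION><ANSWER>Because.</ANSWER></Q-AND-A>"
bad = "<Q-AND-A><QUESTION>Why?</QUESTION><QUESTION>Why not?</QUESTION></Q-AND-A>"
print(is_valid_q_and_a(good), is_valid_q_and_a(bad))
```

The point of a DTD or schema is that nobody has to write such a function for each document type: the rules are data, and one general-purpose validator can enforce them all.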

Stylesheets

Of course, if you are using XML for publishing, you must still be able to generate high quality print and online renditions of the document. Your readers do not want to read XML text directly. Instead of directly inserting the formatting commands in the XML document, we usually tell the computer how to generate formatted renditions from the XML abstraction.

For example in a print presentation, we can make the content of TITLE elements bold and large, insert page breaks before the beginning of chapters, and turn emphasis, citations and foreign words into italics. These rules are specified in a file called a stylesheet. The stylesheet is where human designers can express their creativity and understanding of formatting conventions. The stylesheet allows the computer to automatically convert the document from the abstraction to a formatted rendition.

Stylesheets for XML invariably conform to the Extensible Stylesheet Language Transformations (XSLT) W3C recommendation. XSLT can do much more processing than formatting markup ever could. Often it is used for tasks that don’t involve formatting at all.

Moreover, XSLT stylesheets are normally written to apply to all documents of a given type, rather than a single document. Just as the DTD or schema sets the rules for markup, an XSLT stylesheet is a set of rules for processing, as depicted in Figure 2-3.^[6]

Figure 2-3. Rule-based processing
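The idea of rule-based processing can be sketched without XSLT. In the Python example below, a dictionary plays the role of the stylesheet: it holds one formatting rule per element type and lives entirely outside the document. This is an illustration of the principle and not XSLT itself; the rule notation (asterisks for bold and italic) and the sample document are our own assumptions.

```python
import xml.etree.ElementTree as ET

# The "stylesheet": one rule per element type, kept separate from the document.
STYLESHEET = {
    "title": lambda text: f"**{text.upper()}**\n",   # bold and large
    "para": lambda text: f"{text}\n",
    "em": lambda text: f"*{text}*",                   # emphasis becomes italics
    "citation": lambda text: f"*{text}*",             # so do citations
}

def render(element):
    """Render an element by applying its rule to its rendered content."""
    content = element.text or ""
    for child in element:
        content += render(child) + (child.tail or "")
    # Element types without a rule pass their content through unchanged.
    rule = STYLESHEET.get(element.tag, lambda text: text)
    return rule(content)

DOC = ("<article><title>Sales Update</title>"
       "<para>A <em>huge</em> improvement.</para></article>")
print(render(ET.fromstring(DOC)))
```

Because the rules are keyed by element type, the same stylesheet applies to every document of that type, and a different stylesheet can produce a different rendition without touching the documents.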

Elements and the logical structure

Most documents (for example books and magazines) can easily be broken down into components (chapters and articles). These can also be broken down into components (titles, paragraphs, figures and so forth). And those components can be broken down into components until we get to the textual data itself – words and sentences. At this point we would typically stop breaking the document into components unless we were interested in linguistic research.

It turns out that every document can be viewed this way, though some fit the model more naturally than others. In fact all information can be viewed this way...with the same caveat!

In XML, these components are called elements. Each element represents a logical component of a document. Elements can contain other elements and can also contain the words and sentences that you would usually think of as the text of the document. XML calls this text the document’s character data. This hierarchical view of XML documents is demonstrated in Figure 2-4.

Figure 2-4. Hierarchical views of documents

Markup professionals call this the tree structure of the document. The element that contains all of the others (e.g. Book, Article or Memo) is known as the root element. This name captures the fact that it is the only element that does not “hang” off of some other element. The root element is also referred to as the document element because it holds the entire logical document within it. The terms root element and document element are interchangeable.

The elements that are contained in the root are called its subelements. They may contain subelements themselves. If they do, we will call them branches. If they do not, we will call them leaves.

Thus, the Chapter and Section elements are branches (because they have subelements), but the Paragraph and Title elements are leaves (because they only contain character data).
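A short Python sketch shows how a program might walk such a tree and sort the subelements into branches and leaves. The sample Book document and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET

BOOK = """<Book>
  <Chapter>
    <Title>Getting started</Title>
    <Section>
      <Title>Installation</Title>
      <Paragraph>Run the installer.</Paragraph>
    </Section>
  </Chapter>
</Book>"""

def classify(root):
    """Sort every element below the root into branches and leaves."""
    branches, leaves = [], []
    for element in root.iter():          # visits elements in document order
        if element is root:
            continue                     # the root is neither: nothing contains it
        # len(element) is the number of subelements it contains.
        (branches if len(element) else leaves).append(element.tag)
    return branches, leaves

branches, leaves = classify(ET.fromstring(BOOK))
print("branches:", branches)
print("leaves:  ", leaves)
```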

Elements can also have extra information attached to them called attributes. Attributes describe properties of elements. For instance a CIA-record element might have a security attribute that gives the security rating for that element. A CIA database might only release certain records to certain people depending on their security rating. It is somewhat of a judgement call which aspects of a document should be represented with elements and which should be represented with attributes.

Real-world documents do not always fit this tree model perfectly. They often have non-hierarchical features such as cross-references or hypertext links from one section of the tree to another. XML can represent these structures too.

A WordML document also has an element structure. But as the WordML document type is a complex rendition with over 400 element types, the structure isn’t usually as clear as the abstraction in Figure 2-4.

Well-formedness and validity

Every language has rules about what is or is not correct in the language. In human languages that takes many forms: words have a particular correct pronunciation (or range of pronunciations) and they can be combined in certain ways to make valid sentences (grammar). Similarly XML has two different notions of “correct”. The first is merely that the markup is intelligible: the XML equivalent of “getting the pronunciation right”. A document with intelligible markup is called a well-formed document. One important goal of XML was that these basic rules should be simple so that they could be strictly adhered to.

The XML equivalent of “using the right words in the right place” is called validity and is related to the notion of document types. A document is valid if it declares conformance to a DTD in a document type declaration and actually conforms to that DTD.

A document could also (or exclusively) be identified as conforming to a schema. If it actually does conform, it is said to be schema valid (commonly shortened to “valid”).
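The two notions of "correct" can be told apart in code. The sketch below uses the parser in Python's standard library, which checks well-formedness but does not validate; the function name and sample documents are our own.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the markup is intelligible to an XML parser."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<memo><to>Doug</to></memo>"))   # tags nest properly
print(is_well_formed("<memo><to>Doug</memo></to>"))   # end-tags overlap
# Well-formed, yet it would not be valid against the DTD in Example 2-4,
# because an ANSWER appears without a preceding QUESTION.
print(is_well_formed("<Q-AND-A><ANSWER>42</ANSWER></Q-AND-A>"))
```

The last document passes because well-formedness says nothing about which elements are allowed where. Only a validating parser, given the DTD or schema, can reject it.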

XML and the World Wide Web

When the Web began it was pretty simple. You entered a Web address into your browser and it displayed the page from that address. The address was a Uniform Resource Locator (URL) and the page was marked up in HTML.

The W3C maintains a DTD for HTML, but many HTML documents don’t conform to it. So browsers attempted to cope with errors rather than report them to the users (who couldn’t correct them in any case).

The Web has since diverged from that simple model, both in terms of the addresses and the delivered pages. The addressing model has gotten richer, and the pages mostly don’t exist until they hit your browser.

Web addresses: URL and URN and URI and IRI

“U R kidding!”, U might think, but we R not!

There really are four different things that look like URLs, and act like them as well. You can safely treat them as equivalent when reading this book, unless we make a point of the difference in a specific context.

They are:

URI

A Uniform Resource Identifier (URI) is the basic form of address on the Web.

URL

The Uniform Resource Locator (URL) is the most common form of URI.

URN

A newer form of URI, Uniform Resource Name (URN), isn’t location-dependent and perhaps will reduce the number of broken links. However, it has yet to catch on because it requires more sophisticated software support (although that doesn’t stop it being used to declare namespaces, as we shall see shortly).

IRI

An Internationalized Resource Identifier (IRI) is a form of URI that allows non-ASCII characters.

Now when you see URI in the text, you’ll know that it isn’t a typo!

Web services

Web addresses can be extended to become URI references, meaning they have parameters after the actual URI, as in: http://www.amazon.com/exec/obidos/ASIN/0130651982/ref%3Dase_charlesfgoldfars/103-3982805-1512612

The reason for the parameters is that large websites no longer keep repositories of static HTML pages. Instead, they analyze the parameters to determine the information you’ve requested and they generate a page that contains the results. The Web address has become a request and the Web page has become a Web service.

In practice, though, Web-based services that are actually offered under the name “Web services” return XML documents, not HTML. The services may offer a REST interface and/or a SOAP interface.^[7]

REST

A REST Web services interface is the same as the interface to any Web resource: your service request is a URI reference that addresses the result document.

SOAP

The SOAP interface treats the service request and response as messages in a business transaction. The result document is encapsulated in the response message; it is not a Web resource and the service request is not its address.

Office products access a REST Web service as they would any network-addressable XML document; REST requires no special treatment (although you may want to write a macro to help users build a particularly complex parameter string).

Therefore, any explicitly-identified Web service support in Office is only for SOAP; in fact, for a specific variety of SOAP called “document-style”. That interface is more complex than REST, for two reasons:

The service request could be an arbitrarily complex XML document.
Both the request and the response are wrapped in a SOAP message, itself a complex XML document that is dynamically customized for each Web service operation.

In this book, we’ll mention SOAP or REST explicitly unless the interface to the Web service isn’t important in context. And whenever we mention external XML documents, that would also include those returned by REST Web services.
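Since a REST service request is just a Web address with a parameter string, building one is a matter of string assembly. Here is a minimal Python sketch using the standard library's urllib.parse module. The service address and the parameter names are invented for illustration; a real service defines its own.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_request(base, **params):
    """Append a parameter string to the base address of a service."""
    return f"{base}?{urlencode(params)}"

# Hypothetical weather service and parameters.
uri = build_request("http://example.com/weather", city="Boston", units="metric")
print(uri)

# The service, for its part, analyzes the parameters to decide what to return.
print(parse_qs(urlsplit(uri).query))
```

urlencode also escapes characters that are not allowed in an address, which is the fiddly part of building a complex parameter string by hand.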

XHTML

Another way in which Web pages are changing is that a new version of HTML is starting to be used. It retains the HTML vocabulary, but uses XML syntax.

XHTML – unlike HTML – tends to be valid, as it is usually generated by software. It is used in several components of Office.

Namespaces

There is a problem that arises when you allow anybody to pick names as XML does. The problem is that different people in different places will invariably use the same names for different things. This makes it very difficult to build systems that work with documents from multiple independent sources because a publisher could use the element-type name PAR to mean paragraph while a military vocabulary could use an element type with the same name to mean paratrooper. A mathematician might use an element type with that name to label paradoxes!

Prefixes

There is a standard that addresses this problem. It is known as Namespaces in XML.

A namespace is a conceptual universe within which a defined term – a name – is unique. An XML namespace is slightly different: it is divided into namespace partitions for different kinds of names so that, for example, an element-type name can be the same as an attribute name.

Within a namespace partition (also called a symbol space), each name is unique. It is declared only once and the declared definition applies wherever in the scope of the namespace partition the name is used.

A vocabulary is a namespace. Without the namespaces standard, a document type could only have a single vocabulary, consisting of its element-type and attribute names. The standard lets you mix vocabularies in a document by creating vocabulary nicknames (i.e. abbreviations) that you can prefix to the names to show the vocabulary to which they belong.

For example, you might choose to prefix names from a meteorological information vocabulary with met:. So a document might have elements such as met:temperature, met:humidity and so forth. It could also have health:temperature, from a different vocabulary. To the computer, met:temperature and health:temperature are clearly different names.

Identifiers

Of course, this solution appears to create its own problem: How can you be sure that different XML designers will use met to refer to the same vocabulary? The short and happy answer is: They don’t have to!

That’s because within your document met – which, remember, is just a nickname – is associated with an unambiguous identifier of the vocabulary. That identifier is a URI reference and could look something like http://www.weatherworld.com. Other documents might use a different abbreviation for that vocabulary or use met: to prefix a different vocabulary.

In practice, though, vocabulary developers recommend prefixes and people tend to use the recommended ones.

Note

The URI of a namespace does not have to address a real Web page. The Web addressing system is just being used as a namespace within which all URIs – and therefore namespace names – are unique.
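A namespace-aware parser makes the point about nicknames visible, because it replaces each prefix with the identifier it stands for. The Python sketch below shows this with two small documents. The health vocabulary's URI and the second document's w prefix are hypothetical; the weather URI is the one used above.

```python
import xml.etree.ElementTree as ET

DOC_A = """<report xmlns:met="http://www.weatherworld.com"
        xmlns:health="http://example.org/health">
  <met:temperature>21</met:temperature>
  <health:temperature>37</health:temperature>
</report>"""

# A different document that chose a different nickname for the same vocabulary.
DOC_B = """<report xmlns:w="http://www.weatherworld.com">
  <w:temperature>19</w:temperature>
</report>"""

def child_names(xml_text):
    """Return the names of the root's children as the parser sees them."""
    return [child.tag for child in ET.fromstring(xml_text)]

names_a = child_names(DOC_A)
names_b = child_names(DOC_B)
print(names_a)
print(names_b)
```

The two temperature elements in the first document get different names, while met:temperature and w:temperature turn out to be the same name, because only the URI counts.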

Other XML constructs

There are a few more XML constructs that you may need to know about, but which you will use less often – if at all – than those previously discussed. The details, including information about the nature and extent of Word’s support for them, can be found in Chapter 15, “The XML language”, on page 350.

There are two, however, that show up in the examples that illustrate Part Two: processing instructions (PIs) and comments. They are both markup that is not part of the document structure or data. They are essentially messages: to software in the case of PIs, and to people in the case of comments.

Processing instructions

Example 2-6 shows two processing instructions. The first word in each is the PI target, a nickname for the program for which the PI is intended.

Example 2-6. Processing instructions

<?xml version="1.0" encoding="UTF-8"?>
<?mso-application progid="Word.Document"?>

The first PI is intended for the XML processor itself. It is called an XML declaration and it tells the processor the character encoding that is being used. Other software, such as Office, looks for this PI to recognize that a document is an XML document.

The second PI is intended for Microsoft Office applications. It identifies the application that created the document.

Comments

Comments can be used as reminders to yourself or as messages to other people working on the document. They are discarded when a document is parsed.

Example 2-7. Comment

<!-- Help, I'm a prisoner in a publishing house! -->
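To see how software treats these two constructs, consider the following Python sketch, which uses two parsers from the standard library. The memo document and the function names are hypothetical. How much of this markup a program gets to see depends on the parser: here the DOM parser consumes the XML declaration itself and reports the remaining PI, while the ElementTree parser, by default, drops the comment.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

DOC = b"""<?xml version="1.0" encoding="UTF-8"?>
<?mso-application progid="Word.Document"?>
<memo>Hello<!-- Help, I'm a prisoner in a publishing house! --></memo>"""

def processing_instructions(xml_bytes):
    """Return (target, data) for each PI that precedes or follows the root."""
    doc = minidom.parseString(xml_bytes)
    return [(node.target, node.data) for node in doc.childNodes
            if node.nodeType == node.PROCESSING_INSTRUCTION_NODE]

def reserialized(xml_bytes):
    """Parse with ElementTree's default settings and write the result back out."""
    return ET.tostring(ET.fromstring(xml_bytes), encoding="unicode")

print(processing_instructions(DOC))  # the PI target and the rest of the PI
print(reserialized(DOC))             # the comment is gone
```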

More on XML

If you plan to use XML products other than Office, develop your own schemas, or use Office to share data with enterprise systems and Web services, you’ll want to learn more about XML. As our readers will have differing experience with XML and different requirements for its use, there seemed no sensible way to intersperse detailed XML education with the Office XML tasks that are the focus of the book.

Instead we’ve put the detailed tutorials and references on the XML language and related standards where you can easily find them when you need them. They are in Part 3, “XML Tutorials”, on page 348. The book’s Table of Contents and Index can guide you to specific subjects, and we also provide appropriate cross-references to the tutorials from the Office XML task chapters in Part Two.

^[1]Whitespace added for attempted readability. Prolog with font definitions, style definitions, etc. omitted for the same reason.

^[2]Sometimes called a file format.

^[3]History is being compressed here somewhat. Computers that could “read and enforce the rules” didn’t enter the picture until SGML.

^[4]Whitespace added for readability.

^[5]XSDL is the only schema language that Office 2003 can support. We explain it in detail in Chapter 22, “XML Schema (XSDL)”, on page 466.

^[6]We teach you how to create an XSLT stylesheet in Chapter 18, “XSL Transformations (XSLT)”, on page 392.

^[7]REST and SOAP are explained in Chapter 19, “Web services introduction”, on page 414.
