True HTML output

The previous section outlines a number of issues that are raised when attempting to output an HTML document that conforms to the XML standard. To avoid these problems, the standard includes a mechanism for outputting documents that do not conform to the XML standard. The Output element is used to specify another format. Currently, the value of its Method attribute must be 'xml' (the default option), 'html' or 'text'. This element must appear directly within the Stylesheet (or Transform) element:

<stylesheet ...>
  <output method="html" ... />
  ...
</stylesheet>

To specify that the processor should output true HTML data, the value 'html' is used. When the 'html' value is entered in the Method attribute, HTML tags are recognized and processed intelligently.

Note that the Method attribute of the Output element should already default to 'html' if the document element to be output is named 'html', 'HTML', 'Html', or some other case variant, belongs to no namespace, and is preceded by nothing (or at most by whitespace characters).
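
As an illustration of this default rule, the following minimal sketch omits the Output element altogether. The namespace URI is the standard XSLT 1.0 one; the page content is invented for the example. Because the outermost element output is an unprefixed 'html' element in no namespace, a conforming processor should select the HTML method by itself:

```xml
<?xml version="1.0"?>
<!-- No xsl:output element here: the output method is left to the default rule. -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <!-- The document element is 'html', in no namespace, with nothing before it. -->
    <html>
      <body>
        <p>first line<br/>second line</p>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>
```

It is nevertheless safer to state the method explicitly, since the default reverts to 'xml' as soon as any one of these conditions is not met.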

HTML output characteristics

Perhaps the most significant change when outputting in HTML mode is that empty elements are reformatted to use HTML conventions. For example, the empty element BR is output as a normal start-tag, and the end-tag is omitted:

   ... first line<breakline/>second line ...


<xsl:template match="breakline">
  <BR/>
</xsl:template>


   ... first line<BR>second line ...

Special markup characters are the same in HTML as in XML and serve the same purpose. In HTML, '&amp;' is used to represent '&', '&lt;' is used to represent '<' and '&gt;' is used to represent '>' wherever the actual character could confuse the browser as it attempts to identify entity references and element tags. However, an HTML document does not use these escape sequences in some special cases, and so the XSLT processor must output the original characters in these circumstances. Specifically, the content of the SCRIPT and STYLE elements conforms to formats other than HTML (typically JavaScript and CSS respectively), so it is output without escaping:

<SCRIPT> ...  ( A < B ) ... </SCRIPT>
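
A stylesheet that could produce output of this kind is sketched below. The stylesheet is itself an XML document, so the '<' character must still be escaped there. The function name 'swap' and the paragraph text are invented for the example:

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>
  <xsl:template match="/">
    <HTML>
      <HEAD>
        <!-- Written as &lt; here, but should be output as a raw '<'. -->
        <SCRIPT>if ( A &lt; B ) { swap(); }</SCRIPT>
      </HEAD>
      <BODY>
        <!-- Ordinary element content: still output as &lt;. -->
        <P>A &lt; B</P>
      </BODY>
    </HTML>
  </xsl:template>
</xsl:stylesheet>
```

The comments belong to the stylesheet only and are not copied to the output.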

In HTML, an attribute is allowed to contain the '<' character. The entity reference '&lt;' should therefore be transformed to '<':

<xsl:template ...>
  <IMG ... ALT="figure showing that 6 &lt; 7"></IMG>
</xsl:template>


   ... <IMG ... ALT="figure showing that 6 < 7"> ...

Recognizing HTML elements

Even when HTML mode has been set using the Output element, this mode is only used for elements that belong to no namespace (elements whose names are not mapped to a namespace URI). To distinguish HTML elements from XSLT elements, it is therefore usual to map the XSLT namespace to a prefix and to leave the HTML elements unprefixed:

<xsl:stylesheet ... xmlns:xsl="...">
  <xsl:template...>
    <HTML>
      <HEAD><TITLE>The Page Title</TITLE></HEAD>
      <xsl:apply-templates/>
    </HTML>
  </xsl:template>
</xsl:stylesheet>

A number of HTML versions have been released since the language was first standardized, as HTML 2.0, in 1995. The set of tags differs between versions, with later versions generally allowing more tags. The Version attribute specifies the version of HTML to use for the output. The default value, '4.0', specifies HTML 4.0 output; other valid options include '3.2' and '2.0':

<output ... version="4.0" />

In future, this attribute value may be used to warn an XSLT processor that it is about to process a version of HTML that was released after the program was created, and which it therefore does not understand. In this case, it may warn the user of the problem, or even refuse to process the document.

Elements that are deemed to be HTML elements because they belong to no namespace, but are not recognized as elements belonging to the specified version of HTML, are nevertheless treated as HTML elements. But, because they are unknown, they cannot be processed intelligently. The fallback behaviour is to treat such elements as in-line, non-empty elements:

<P>This paragraph contains an <UNKNOWN>unknown</UNKNOWN>
element.</P>

HTML element tag names are not case-sensitive. A tag in the stylesheet will be recognized as an HTML tag if the name matches, regardless of the case used. The following examples both produce BR elements (without the end-tag):

<xsl:template...>
  <BR></BR>broken text<br></br>
  <xsl:apply-templates/>
</xsl:template>


   ... <BR>broken text<br> ...

HTML processing instructions differ slightly from XML processing instructions. The question-mark delimiter is not needed at the end of the tag (the Processing Instruction element is discussed in Chapter 12):

<template...>
  <processing-instruction name="SPECIAL">some-instruction</processing-instruction>
  <apply-templates/>
</template>


   ... <?SPECIAL some-instruction> ...

When an HTML attribute can have only one value, and the presence or absence of the value acts as a simple switch, the convention is to include only the value in the element start-tag (omitting the attribute name, equals symbol and enclosing quotes). The single permitted value is the same as the attribute name. For example, the OL element (ordered list) may take the 'COMPACT' attribute to close up items in the list:

<xsl:template...>
  <OL COMPACT="COMPACT"><xsl:apply-templates/></OL>
</xsl:template>


   <OL COMPACT> ... </OL>

DOCTYPE declaration

The HTML file produced can be given a document type declaration by specifying the Doctype Public and/or the Doctype System attributes:

<output ... doctype-system="html4.0.dtd" />


   <!DOCTYPE HTML SYSTEM "html4.0.dtd">
   <HTML>...</HTML>


<output ... doctype-public="-//html..."
            doctype-system="html4.0.dtd" />


   <!DOCTYPE HTML PUBLIC "-//html..." "html4.0.dtd">
   <HTML>...</HTML>
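
For a concrete case, the sketch below uses the public and system identifiers that the W3C publishes for the strict HTML 4.01 DTD. These identifiers are an assumption chosen for illustration, not values taken from the examples above:

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Both identifiers given: a PUBLIC declaration should be produced. -->
  <xsl:output method="html"
              doctype-public="-//W3C//DTD HTML 4.01//EN"
              doctype-system="http://www.w3.org/TR/html4/strict.dtd"/>
  <xsl:template match="/">
    <HTML>
      <HEAD><TITLE>The Page Title</TITLE></HEAD>
      <BODY><xsl:apply-templates/></BODY>
    </HTML>
  </xsl:template>
</xsl:stylesheet>
```

The declaration should appear immediately before the HTML start-tag, with the public identifier followed by the system identifier.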

Characters

The Encoding attribute in the Output element specifies the output character encoding. If a HEAD element has been created, the processor should add a META element to it that records this encoding as the 'charset' parameter within the 'content' attribute:

<output ... encoding="ISO-8859-1" media-type="text/html"/>


   <HEAD>
     <META http-equiv="Content-Type"
           content="text/html; charset=ISO-8859-1">
   </HEAD>
   ...

Note that the Media Type attribute is also used, to supply the first part of the 'content' attribute value.

The HTML format allows for the possible use of entity references for many extended characters. The reference names are taken from sets defined by the ISO, and so are often the same as those defined in XML source documents. For example, '&eacute;' is almost universally used to represent the e-acute character.
