The .NET Framework does not provide a SAX parser implementation. While the SAX API is the de facto standard for event-driven XML processing in Java, Microsoft has chosen a fundamentally different approach for .NET.
As a SAX parser processes XML input, application callbacks are used to signal events such as the start or end of an XML element; this approach is considered a push model.
The .NET approach, based on the abstract System.Xml.XmlReader class, provides a pull model, wherein the application invokes XmlReader members to control how and when the parser progresses through the XML document. This is analogous to a forward-only cursor that provides read-only access to the XML source. As with SAX, XmlReader is a noncaching parser.
The .NET Framework provides three concrete implementations of the XmlReader class: XmlTextReader, XmlValidatingReader, and XmlNodeReader. All classes are members of the System.Xml namespace.
The XmlTextReader class provides the most direct .NET alternative to a Java nonvalidating SAXParser. The XmlTextReader ensures that an XML document is well formed but will not perform validation against a DTD or an XML schema.
The XmlTextReader is a concrete implementation of the abstract XmlReader class but also provides a number of nonoverridden members, which we highlight as they are discussed.
The XmlTextReader class provides a set of overloaded constructors offering flexibility for specifying the source of the XML document to be parsed. For example, the following statement creates a new XmlTextReader using the SomeXmlFile.xml file located in the assembly directory as a source:
XmlTextReader xmlReader = new XmlTextReader("SomeXmlFile.xml");
Alternatively, if the XML data were contained in a String variable named SomeXml, we could use a StringReader as the XmlTextReader source, as in the following example:
XmlTextReader xmlReader = new XmlTextReader(new StringReader(SomeXml));
The principal XmlTextReader constructors are summarized in Table 11-3.
Table 11-3. The Principal XmlTextReader Constructors
Constructor | Comment |
---|---|
XmlTextReader(Stream) | Creates an XmlTextReader pulling XML from a System.IO.Stream derivative such as FileStream, MemoryStream, or NetworkStream. Streams are discussed in Chapter 10. |
XmlTextReader(String) | Creates an XmlTextReader pulling XML from a file with the specified URL. |
XmlTextReader(TextReader) | Creates an XmlTextReader pulling XML from a System.IO.TextReader such as StreamReader or StringReader. Readers are discussed in Chapter 10. |
Following creation, the XmlTextReader cursor is positioned before the first XML node in the source.
The XmlTextReader class exposes properties that both control the behavior of the reader and give the programmer access to the state of the reader; these properties are discussed in the following sections.
The XmlTextReader.ReadState property provides read-only access to the current state of the XmlTextReader. Upon creation, the XmlTextReader has a state of Initial. The state changes to Interactive once read operations are performed. The XmlTextReader will maintain an Interactive state until the end of the input file is reached or an error occurs. The ReadState property returns one of the following values from the System.Xml.ReadState enumeration, listed in Table 11-4.
Table 11-4. The System.Xml.ReadState Enumeration
Value | Comment |
---|---|
Closed | The XmlTextReader has been closed using the Close method. |
EndOfFile | The end of the input source has been reached. This state can also be tested using the XmlTextReader.EOF property. |
Error | An error has occurred that prevents further read operations. |
Initial | The XmlTextReader has been created, but no read operations have been called. |
Interactive | Read operations have been called at least once, and further read operations can be attempted. |
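The state transitions described above can be observed directly. The following is a minimal sketch, assuming a small in-memory document supplied through a StringReader; the ReadStateDemo class and its Trace method are names invented for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class ReadStateDemo
{
    // Records the ReadState value at each stage of a reader's life.
    public static List<ReadState> Trace(string xml)
    {
        List<ReadState> states = new List<ReadState>();
        XmlTextReader rdr = new XmlTextReader(new StringReader(xml));

        states.Add(rdr.ReadState);   // before any read operation
        rdr.Read();
        states.Add(rdr.ReadState);   // after the first read operation
        while (rdr.Read()) { }       // consume the rest of the input
        states.Add(rdr.ReadState);   // input exhausted
        rdr.Close();
        states.Add(rdr.ReadState);   // after Close

        return states;
    }

    public static void Main()
    {
        foreach (ReadState state in Trace("<root><child/></root>"))
        {
            Console.WriteLine(state);
        }
    }
}
```

For a well-formed document, the four recorded values should be Initial, Interactive, EndOfFile, and Closed, in that order.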
If a source stream contains more than one XML document, the ResetState method must be used to reinitialize the XmlTextReader prior to parsing the second and subsequent documents; the ResetState method sets the ReadState property to ReadState.Initial.
The XmlTextReader has a number of properties that control the way XML files are parsed. Table 11-5 summarizes these properties.
Table 11-5. The XmlTextReader Properties
Property | Comments |
---|---|
Namespaces | Controls whether the XmlTextReader supports namespaces in accordance with the W3C "Namespaces in XML" recommendation. The default value is true. |
WhitespaceHandling | Controls how the XmlTextReader handles white space. The property must be set to a value of the System.Xml.WhitespaceHandling enumeration. Valid values are All, None, and Significant. The Whitespace and SignificantWhitespace node types are described later in the Working with XML Nodes section. The default value is All. Not inherited from XmlReader. |
Normalization | Controls whether the XmlTextReader normalizes white space and attribute values in accordance with the "Attribute-Value Normalization" section of the W3C XML 1.0 specification. The default value is false. Not inherited from XmlReader. |
XmlResolver | Controls the System.Xml.XmlResolver to use for resolving DTD references. By default, an instance of the System.Xml.XmlUrlResolver is used. Not inherited from XmlReader. |
The Namespaces property must be set before the first read operation (when the ReadState property is ReadState.Initial), or an InvalidOperationException will be thrown; the other properties can be set at any time while the XmlTextReader is not in a ReadState.Closed state and will affect future read operations.
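As an illustration of how these properties affect parsing, the sketch below counts the nodes returned for the same document under two WhitespaceHandling settings. The WhitespaceDemo class, the CountNodes method, and the sample document are assumptions made for this example only.

```csharp
using System;
using System.IO;
using System.Xml;

public static class WhitespaceDemo
{
    // Counts every node returned by Read under the given white space setting.
    public static int CountNodes(string xml, WhitespaceHandling handling)
    {
        XmlTextReader rdr = new XmlTextReader(new StringReader(xml));
        rdr.WhitespaceHandling = handling;   // set before reading begins

        int count = 0;
        while (rdr.Read())
        {
            count++;
        }
        rdr.Close();
        return count;
    }

    public static void Main()
    {
        // The line breaks and indentation produce Whitespace nodes.
        string xml = "<root>\n  <a>x</a>\n</root>";
        Console.WriteLine("All  : {0}",
            CountNodes(xml, WhitespaceHandling.All));
        Console.WriteLine("None : {0}",
            CountNodes(xml, WhitespaceHandling.None));
    }
}
```

With WhitespaceHandling.None, only the two start tags, the text node, and the two end tags are returned; with WhitespaceHandling.All, the white space between the tags is returned as additional nodes.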
The XML source represents a hierarchy of nodes that the XmlTextReader retrieves sequentially. Progress through the XML is analogous to the use of a cursor that moves through the XML nodes. The node currently under the cursor is the current node.
The XmlTextReader exposes information about the current node through the properties of the XmlTextReader instance, although not all properties apply to all node types.
As the XmlTextReader reads a node, it identifies the node type. Each node type is assigned a value from the System.Xml.XmlNodeType enumeration. The XmlTextReader.NodeType property returns the type of the current node.
The node types include those defined in the W3C "DOM Level 1 Core" specification and five nonstandard extension types added by Microsoft. Node type values that can be returned by XmlTextReader.NodeType include the following: Attribute, CDATA, Comment, DocumentType, Element, EndElement, EntityReference, None, ProcessingInstruction, SignificantWhitespace, Text, Whitespace, and XmlDeclaration. The five types that are not defined in DOM (None, EndElement, SignificantWhitespace, Whitespace, and XmlDeclaration) are summarized in Table 11-6.
Table 11-6. The Microsoft-Specific Node Types
XmlNodeType | Description |
---|---|
None | Indicates that there is no current node. Either no read operations have been executed or the end of the XML input has been reached. |
EndElement | Represents the end tag of an XML element—for example, </book>. |
SignificantWhitespace | Represents white space between markup in a mixed content model or white space within an xml:space='preserve' scope. |
Whitespace | Represents white space in the content of an element. |
XmlDeclaration | Represents the declaration node <?xml version="1.0"...>. |
The Name and LocalName properties of the XmlTextReader return names of the current node. The Name property returns the qualified node name, including any namespace prefix. The LocalName property returns the node name with any namespace prefix stripped off. The name returned by the Name and LocalName properties depends on the current node type, summarized by the following list:
Attribute. The name of the attribute
DocumentType. The document type name
Element. The tag name
EntityReference. The entity reference name
ProcessingInstruction. The processing instruction target
XmlDeclaration. The string literal xml
Other node types. String.Empty
The XmlTextReader.Prefix property returns the namespace prefix of the current node, or String.Empty if it doesn’t have one.
The XmlTextReader.Value property returns the text value of the current node. The value of a node depends on the node type and is summarized in the following list:
Attribute. The value of the attribute
CDATA. The content of the CDATA section
Comment. The content of the comment
DocumentType. The internal subset
SignificantWhitespace. The white space within an xml:space='preserve' scope
Text. The content of the text node
Whitespace. The white space between markup
ProcessingInstruction. The entire content excluding the target
XmlDeclaration. The content of the declaration
Other node types. String.Empty
The XmlTextReader.HasValue property returns true if the current node is a type that returns a value; otherwise, it returns false.
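To make the naming and value properties concrete, the following sketch records NodeType, Name, LocalName, Prefix, and Value for each node. The NodeInfoDemo class, the Describe method, and the urn:example namespace are hypothetical and exist only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class NodeInfoDemo
{
    // Builds one "type|name|localname|prefix|value" line per node.
    public static List<string> Describe(string xml)
    {
        List<string> lines = new List<string>();
        XmlTextReader rdr = new XmlTextReader(new StringReader(xml));
        while (rdr.Read())
        {
            // Only ask for Value when the node type carries one.
            string value = rdr.HasValue ? rdr.Value : String.Empty;
            lines.Add(String.Format("{0}|{1}|{2}|{3}|{4}",
                rdr.NodeType, rdr.Name, rdr.LocalName, rdr.Prefix, value));
        }
        rdr.Close();
        return lines;
    }

    public static void Main()
    {
        string xml = "<p:item xmlns:p=\"urn:example\">hi</p:item>";
        foreach (string line in Describe(xml))
        {
            Console.WriteLine(line);
        }
    }
}
```

For the Element node, Name includes the p prefix while LocalName does not; for the Text node, the name properties are empty and Value holds the text.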
Other information about the current node available through XmlTextReader properties is summarized in Table 11-7.
Table 11-7. Other Node Properties Available Through XmlTextReader
Property | Comments |
---|---|
AttributeCount | Gets the number of attributes on the current node. Valid node types: Element, DocumentType, XmlDeclaration. |
BaseURI | Gets a String containing the base Uniform Resource Identifier (URI) of the current node. Valid node types: All. |
CanResolveEntity | Always returns false for an XmlTextReader. See the Unimplemented Members section later in this chapter for details. |
Depth | Gets the depth of the current node in the XML source. Valid node types: All. |
HasAttributes | Returns true if the current node has attributes. Always returns false for node types other than Element, DocumentType, and XmlDeclaration. |
IsEmptyElement | Returns true if the current node is an empty Element type ending in /> (for example: <SomeElement/>). For all other node types and nonempty Element nodes, IsEmptyElement returns false. |
LineNumber | Gets the current line number of the XML source. Line numbers begin at 1. Not inherited from XmlReader; implements the System.Xml.IXmlLineInfo.LineNumber interface member. Valid node types: All. |
LinePosition | Gets the current line position of the XML source. Line positions begin at 1. Not inherited from XmlReader; implements the System.Xml.IXmlLineInfo.LinePosition interface member. Valid node types: All. |
NamespaceURI | Gets the namespace URI of the current node. Valid node types: Element and Attribute. |
QuoteChar | Gets the quotation mark character used to enclose the value of an Attribute node. For nonattribute nodes, QuoteChar always returns a double quotation mark ("). |
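Two of these properties, Depth and IsEmptyElement, are enough to print an indented outline of a document. The sketch below is one way to do so; the OutlineDemo class, the Outline method, and the sample element names are assumptions for this example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class OutlineDemo
{
    // Returns one line per element, indented two spaces per level of depth.
    public static List<string> Outline(string xml)
    {
        List<string> lines = new List<string>();
        XmlTextReader rdr = new XmlTextReader(new StringReader(xml));
        while (rdr.Read())
        {
            if (rdr.NodeType == XmlNodeType.Element)
            {
                string indent = new String(' ', rdr.Depth * 2);
                string suffix = rdr.IsEmptyElement ? " (empty)" : "";
                lines.Add(indent + rdr.Name + suffix);
            }
        }
        rdr.Close();
        return lines;
    }

    public static void Main()
    {
        foreach (string line in Outline("<root><a><b/></a><c></c></root>"))
        {
            Console.WriteLine(line);
        }
    }
}
```

Note that only the b element is flagged as empty: an element written as a start tag immediately followed by an end tag, such as the c element here, is not reported by IsEmptyElement.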
Operations that change the location of the cursor are collectively referred to as read operations. All read operations move the cursor relative to the current node. Note that while attributes are nodes, they are not returned as part of the normal node stream and never become the current node using the read operations discussed in this section; accessing attribute nodes is covered in the following section.
The simplest cursor operation is the XmlTextReader.Read method, which attempts to move the cursor to the next node in the XML source and returns true if successful. If there are no further nodes, Read returns false. The following code fragment visits every node in an XmlTextReader and displays the node name on the console:
    XmlTextReader rdr = new XmlTextReader("MyXmlFile.xml");
    // Visit every node, including noncontent and EndElement nodes
    while (rdr.Read()) {
        System.Console.WriteLine("Inspecting node : {0}", rdr.Name);
    }
Using the Read method is the simplest way to process the nodes in an XML document but is often not the desired behavior because all nodes, including noncontent and EndElement nodes, are returned.
The MoveToContent method determines whether the current node is a content node. Content nodes include the following node types: Text, CDATA, Element, EndElement, EntityReference, and EndEntity. If the current node is a noncontent node, the cursor skips over all nodes until it reaches a content node or the end of the XML source. The MoveToContent method returns the XmlNodeType of the new current node (XmlNodeType.None if the end of the input is reached).
The Skip method causes the cursor to be moved to the next sibling of the current node; all child nodes will be ignored. For nodes with no children, the Skip method is equivalent to calling Read.
The IsStartElement method calls MoveToContent and returns true if the current node is a start tag or an empty element. Overloaded versions of the IsStartElement method support the provision of a name or local name and a namespace URI; if names are specified, the method will return true if the new current node name matches the specified name. The following code fragment demonstrates the use of IsStartElement to display the name of each element start tag:
    XmlTextReader rdr = new XmlTextReader("MyXmlFile.xml");
    while (rdr.Read()) {
        // IsStartElement skips forward to the next content node before testing it
        if (rdr.IsStartElement()) {
            System.Console.WriteLine("Inspecting node : {0}", rdr.Name);
        }
    }
The ReadStartElement method calls IsStartElement followed by the Read method; if the result of the initial call to IsStartElement is false, an XmlException is thrown. ReadStartElement provides the same set of overloads as IsStartElement, allowing an element name to be specified.
The ReadEndElement method checks that the current node is an end tag and then advances the cursor to the next node; an XmlException is thrown if the current node is not an end tag.
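The read operations above are often combined to walk a document whose structure is known in advance. The following sketch assumes a hypothetical document with a book element containing a title element and a notes element; the NavigationDemo class and ReadTitle method are illustrative names, not part of the class library.

```csharp
using System;
using System.IO;
using System.Xml;

public static class NavigationDemo
{
    // Extracts the title text and skips the notes subtree entirely.
    public static string ReadTitle(string xml)
    {
        XmlTextReader rdr = new XmlTextReader(new StringReader(xml));

        rdr.ReadStartElement("book");    // verifies <book>, moves to <title>
        rdr.ReadStartElement("title");   // verifies <title>, moves to its text
        string title = rdr.ReadString(); // reads the text of the title
        rdr.ReadEndElement();            // verifies </title>, moves on

        rdr.MoveToContent();             // ensure we are on a content node
        rdr.Skip();                      // jump over <notes> and its children
        rdr.ReadEndElement();            // verifies </book>

        rdr.Close();
        return title;
    }

    public static void Main()
    {
        string xml = "<book><title>Dune</title>"
                   + "<notes><n>a</n><n>b</n></notes></book>";
        Console.WriteLine(ReadTitle(xml));
    }
}
```

If the document does not have the expected shape, ReadStartElement or ReadEndElement throws an XmlException, so this style suits input whose structure is fixed.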
The XmlTextReader also includes a number of methods to return content from the current node and its descendants; these are summarized in Table 11-8.
Table 11-8. XmlTextReader Methods That Return Content from the Current Node
Method | Comments |
---|---|
ReadInnerXml() | Returns a String containing the raw content (including markup) of the current node. The start and end tags are excluded. |
ReadOuterXml() | The same as ReadInnerXml except that the start and end tags are included. |
ReadString() | Returns the contents of an element or text node as a String. Node types other than Element and Text return String.Empty. The cursor advances past the text that is read and stops at the next markup, typically the end tag. |
ReadChars() | Reads the text contents (including markup) of an element node into a specified char array a section at a time and returns the number of characters read. Subsequent calls to ReadChars continue reading from where the previous call finished. ReadChars returns 0 when no more content is available. Nonelement nodes always return 0. The cursor is not moved. |
ReadBase64() | Like ReadChars but reads and decodes Base64-encoded content. |
ReadBinHex() | Like ReadChars but reads and decodes BinHex-encoded content. |
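The difference between ReadInnerXml and ReadOuterXml is easiest to see side by side. This sketch assumes a hypothetical document in which an item element sits inside a root element; the RawContentDemo class and its methods are invented for the example.

```csharp
using System;
using System.IO;
using System.Xml;

public static class RawContentDemo
{
    // Creates a reader and positions the cursor on the <item> start tag.
    private static XmlTextReader OnItem(string xml)
    {
        XmlTextReader rdr = new XmlTextReader(new StringReader(xml));
        rdr.ReadStartElement("root");
        return rdr;
    }

    // Raw content of <item>, excluding its own start and end tags.
    public static string Inner(string xml)
    {
        XmlTextReader rdr = OnItem(xml);
        string result = rdr.ReadInnerXml();
        rdr.Close();
        return result;
    }

    // Raw content of <item>, including its own start and end tags.
    public static string Outer(string xml)
    {
        XmlTextReader rdr = OnItem(xml);
        string result = rdr.ReadOuterXml();
        rdr.Close();
        return result;
    }

    public static void Main()
    {
        string xml = "<root><item id=\"1\"><b>bold</b> text</item></root>";
        Console.WriteLine("Inner: {0}", Inner(xml));
        Console.WriteLine("Outer: {0}", Outer(xml));
    }
}
```

Two separate readers are used because each call consumes the item element; a second call on the same reader would operate on whatever node follows it.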
Three types of node support attributes: Elements, XmlDeclarations, and DocumentType declarations. The XmlTextReader doesn’t treat attributes as normal nodes; attributes are always read as part of the containing node.
The XmlTextReader class offers two mechanisms to access the attributes of the current node; we discuss both approaches in the following sections.
The attribute values of the current node can be accessed through the XmlTextReader class using both methods and indexers; the attribute is specified either by name or by index position. A URI can be specified for attributes contained in a namespace. The members used to access attribute values are summarized in Table 11-9.
Table 11-9. XmlTextReader Members Used to Access Attribute Values
Member | Comment |
---|---|
<XmlTextReader>[int], GetAttribute(int) | Indexer and method alternatives that get the value of an attribute based on its index. |
<XmlTextReader>[String], GetAttribute(String) | Indexer and method alternatives that get the value of an attribute by name. |
<XmlTextReader>[String, String], GetAttribute(String, String) | Indexer and method alternatives that get the value of an attribute in a specific namespace by name. |
The XmlTextReader class provides support for accessing attributes as independent nodes, allowing attribute information to be accessed using the XmlTextReader class properties described earlier in this section.
If the current node has attributes, the XmlTextReader.MoveToFirstAttribute method returns true and moves the cursor to the first attribute of the current node; the first attribute becomes the current node. If the current node has no attributes, the method returns false and the cursor remains unmoved.
Once the cursor is positioned on an attribute node, calling MoveToNextAttribute will move it to the next attribute node. If another attribute exists, the method will return true; otherwise, the method returns false and the position of the cursor remains unchanged. If the current node is not an attribute, but a node with attributes, the MoveToNextAttribute method has the same effect as MoveToFirstAttribute.
The MoveToAttribute method provides three overloads for moving the cursor directly to a specific attribute node. These overloads are summarized in Table 11-10.
Table 11-10. The Overloaded Versions of the MoveToAttribute Method
Method | Comments |
---|---|
MoveToAttribute(int) | Moves the cursor to the attribute node at the specified index. If there is no attribute at the specified index, an ArgumentOutOfRangeException is thrown and the cursor is not moved. |
MoveToAttribute(String) | Moves the cursor to the attribute node with the specified name. This method returns true if the named attribute exists; otherwise, it returns false and the cursor doesn’t move. |
MoveToAttribute(String, String) | Same as MoveToAttribute(String) but also allows a namespace URI to be specified for the target attribute. |
The MoveToElement method moves the cursor back to the node containing the attributes.
The following example demonstrates attribute access using both direct and node access. The example class parses an XML file loaded from the http://localhost/test.xml URL and determines whether any element type node contains an attribute named att1. If the att1 attribute is present, the program displays its value on the console; otherwise, the program displays a list of all attributes and values of the current element.
XML Input (assumed to be located at http://localhost/test.xml):
    <?xml version='1.0'?>
    <root>
        <node1 att1='abc' att2='def'/>
        <node2 att2='ghi' att3='jkl' att4='mno'/>
        <node3 att1='uvw' att2='xyz'/>
    </root>
Example code:
    using System;
    using System.Xml;

    public class xmltest {
        public static void Main () {
            String myFile = "http://localhost/test.xml";
            XmlTextReader rdr = new XmlTextReader(myFile);

            while (rdr.Read()) {
                if (rdr.NodeType == XmlNodeType.Element) {
                    Console.WriteLine("Inspecting node : {0}",rdr.Name);
                    // Direct access: the indexer returns null if att1 is absent
                    if (rdr["att1"] != null){
                        Console.WriteLine(" att1 = {0}",rdr["att1"]);
                    } else {
                        // Node access: visit each attribute in turn
                        while(rdr.MoveToNextAttribute()) {
                            Console.WriteLine(" {0} = {1}",
                                rdr.Name, rdr.Value);
                        }
                        // Return the cursor to the containing element
                        rdr.MoveToElement();
                    }
                }
            }
        }
    }

Output:

    Inspecting node : root
    Inspecting node : node1
     att1 = abc
    Inspecting node : node2
     att2 = ghi
     att3 = jkl
     att4 = mno
    Inspecting node : node3
     att1 = uvw
The XmlTextReader.GetRemainder method returns a System.IO.TextReader containing the remaining XML from a partially parsed source. Following the GetRemainder call, the XmlTextReader.ReadState property is set to ReadState.EndOfFile and the EOF property returns true.
Instances of XmlTextReader should be closed using the Close method. This releases any resources used while reading and sets the ReadState property to the value ReadState.Closed.
Because XmlTextReader doesn’t validate XML, the IsDefault, CanResolveEntity, and ResolveEntity members inherited from XmlReader exhibit default behavior as described in Table 11-11.
Table 11-11. The Default Behavior of Unimplemented XmlReader Methods in XmlTextReader
Member | Comments |
---|---|
IsDefault | Always returns false. XmlTextReader doesn’t expand default attributes defined in schemas. |
CanResolveEntity | Always returns false. XmlTextReader cannot resolve entity references. |
ResolveEntity() | Throws a System.InvalidOperationException. XmlTextReader cannot resolve general entity references. |
The XmlValidatingReader class is a concrete implementation of XmlReader that validates an XML source against one of the following:
Document type definitions as defined in the W3C Recommendation "Extensible Markup Language (XML) 1.0"
MSXML Schema specification for XML-Data Reduced (XDR) schemas
XML Schema as defined in the W3C Recommendations "XML Schema Part 0: Primer," "XML Schema Part 1: Structures," and "XML Schema Part 2: Datatypes," collectively referred to as XML Schema Definition (XSD)
The functionality of XmlValidatingReader is predominantly the same as XmlTextReader, described in the "XmlTextReader" section earlier in this chapter. However, XmlValidatingReader includes a number of new members and some members that operate differently than in XmlTextReader; these differences are the focus of this section.
The most commonly used XmlValidatingReader constructor takes an XmlReader instance as the source of XML. The following statements demonstrate the creation of an XmlValidatingReader from an XmlTextReader:
    XmlTextReader rdr = new XmlTextReader("SomeXmlFile.xml");
    XmlValidatingReader vRdr = new XmlValidatingReader(rdr);
The XmlValidatingReader.ValidationType property gets and sets the type of validation the reader will perform. This property must be set before execution of the first read operation; otherwise, an InvalidOperationException will be thrown.
The ValidationType property must be set to a value from the ValidationType enumeration; Table 11-12 summarizes the available values.
Table 11-12. The ValidationType Enumeration
Value | Comments |
---|---|
Auto | Validates based on the DTD or schema information the parser finds. This is the default value. |
DTD | Validates according to a DTD. |
None | Performs no validation. The only benefit of XmlValidatingReader in this mode is that general entity references can be resolved and default attributes are reported. |
Schema | Validates according to an XSD schema. |
XDR | Validates according to an XDR schema. |
If the ValidationType is set to Auto, DTD, Schema, or XDR and validation errors occur when parsing an XML document, an XmlSchemaException is thrown and parsing of the current node stops. Parsing cannot be resumed once an error has occurred.
Alternatively, the ValidationEventHandler member of XmlValidatingReader allows the programmer to specify a delegate that is called to handle validation errors, suppressing the exception that would be raised. The arguments of the delegate provide access to information about the severity of the validation error, the exception that would have occurred, and a textual message associated with the error.
Use of the ValidationEventHandler member allows the programmer to determine whether to resume or terminate the parser.
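The handler mechanism can be sketched as follows, using DTD validation against an internal subset so that the example needs no external files. The DtdValidationDemo class, the CountErrors method, and the sample DTD are assumptions for this illustration; note also that later versions of .NET mark XmlValidatingReader as obsolete, so a compiler may emit warnings for this code.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Schema;

public static class DtdValidationDemo
{
    private static int errorCount;

    // Called for each validation problem instead of an exception being thrown.
    private static void OnValidationError(object sender,
                                          ValidationEventArgs args)
    {
        errorCount++;
        Console.WriteLine("{0}: {1}", args.Severity, args.Message);
    }

    // Parses the whole document and returns the number of reported problems.
    public static int CountErrors(string xml)
    {
        errorCount = 0;
        XmlTextReader rdr = new XmlTextReader(new StringReader(xml));
        XmlValidatingReader vRdr = new XmlValidatingReader(rdr);

        // Both settings must be in place before the first read operation.
        vRdr.ValidationType = ValidationType.DTD;
        vRdr.ValidationEventHandler +=
            new ValidationEventHandler(OnValidationError);

        while (vRdr.Read()) { }   // parsing continues past validation errors
        vRdr.Close();
        return errorCount;
    }

    public static void Main()
    {
        string dtd = "<!DOCTYPE root [<!ELEMENT root (child)>"
                   + "<!ELEMENT child (#PCDATA)>]>";
        Console.WriteLine("Valid   : {0} problem(s)",
            CountErrors(dtd + "<root><child>ok</child></root>"));
        Console.WriteLine("Invalid : {0} problem(s)",
            CountErrors(dtd + "<root><other/></root>"));
    }
}
```

Because the handler only counts and logs, the reader runs to the end of the input in both cases; a handler that should terminate parsing could throw an exception of its own instead.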
The read-only XmlValidatingReader.Schemas property can be used in conjunction with the XmlSchemaCollection class to cache XSD and XDR schemas in memory, saving the reader from having to reload schema files. However, XmlValidatingReader doesn’t automatically cache schemas; any caching must be explicitly performed by the programmer. Once cached, schemas cannot be removed from an XmlSchemaCollection.
The XmlValidatingReader maintains an XmlSchemaCollection that is accessed via the Schemas property; the most common way to add new schema files to the collection is by using the Add method. The XmlSchemaCollection class implements the ICollection and IEnumerable interfaces and provides indexer access to schemas based on a namespace URI.
An important feature of the XmlSchemaCollection is the ValidationEventHandler event; this member is unrelated to the ValidationEventHandler member of the XmlValidatingReader class. This event specifies a method called to handle errors that occur when validating a schema loaded into the collection. XmlSchemaCollection throws an XmlSchemaException if no event handler is specified.
The following example demonstrates the steps necessary to configure the XmlSchemaCollection validation event handler and to cache a schema.
    using System;
    using System.Xml;
    using System.Xml.Schema;

    public class schematest {
        public static void Main() {
            // Create the validating reader
            XmlTextReader rdr = new XmlTextReader("MyXmlDocument.xml");
            XmlValidatingReader valRdr = new XmlValidatingReader(rdr);

            // Get the schema collection from the validating reader
            XmlSchemaCollection sCol = valRdr.Schemas;

            // Set the validation event handler for the schema collection
            sCol.ValidationEventHandler +=
                new ValidationEventHandler(ValidationCallBack);

            // Cache a schema in the schema collection
            sCol.Add("urn:mynamespace","myschema.xsd");
        }

        // Create handler for validation events
        public static void ValidationCallBack(object sender,
            ValidationEventArgs args) {
            Console.WriteLine("Schema error : " + args.Exception.Message);
        }
    }
As mentioned at the start of this section, XmlValidatingReader has a number of new members or members with behavior different from that of the members in XmlTextReader; these members are summarized in Table 11-13.
Table 11-13. Differences Between XmlValidatingReader and XmlTextReader
Member | Comments |
---|---|
Different | |
CanResolveEntity | Always returns true. |
IsDefault | Returns true if the current node is an attribute whose value was generated from a default specified in a DTD or a schema. |
LineNumber | While XmlValidatingReader implements the IXmlLineInfo interface, explicit interface implementation has been used to implement the LineNumber property. The XmlValidatingReader must be explicitly cast to an IXmlLineInfo type before LineNumber can be called. |
LinePosition | Same as LineNumber. |
ResolveEntity() | This method resolves the entity reference if the current node is an EntityReference. |
New | |
Reader | Returns the XmlReader used to instantiate the XmlValidatingReader. |
The System.Xml.XmlNodeReader class is a concrete implementation of XmlReader that provides read-only, forward-only cursor style access to a Document Object Model (DOM) node or subtree.
The XmlNodeReader provides predominantly the same behavior and functionality as the XmlTextReader described in the "XmlTextReader" section earlier in this chapter. However, XmlNodeReader offers a single constructor with the following signature:
public XmlNodeReader(XmlNode node);
The node argument provides the root element of the XmlNodeReader. Given that XmlDocument derives from XmlNode, the XmlNodeReader can be used to navigate a partial or full DOM tree.
The following code fragment demonstrates the creation of an XmlNodeReader using an XmlDocument as the source. We then iterate through the nodes and display the names of any Element type nodes on the console.
    XmlDocument doc = new XmlDocument();
    doc.Load("SomeXmlFile.xml");

    // Read the whole DOM tree, starting from the document node
    XmlNodeReader rdr = new XmlNodeReader(doc);
    while (rdr.Read()) {
        if (rdr.NodeType == XmlNodeType.Element) {
            System.Console.WriteLine("Node name = {0}", rdr.LocalName);
        }
    }
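Because the constructor accepts any XmlNode, the reader can also be limited to a subtree. The following sketch assumes a small in-memory document loaded with LoadXml; the SubtreeDemo class, the ElementNames method, and the element names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class SubtreeDemo
{
    // Lists the element names reachable from the given starting node.
    public static List<string> ElementNames(XmlNode start)
    {
        List<string> names = new List<string>();
        XmlNodeReader rdr = new XmlNodeReader(start);
        while (rdr.Read())
        {
            if (rdr.NodeType == XmlNodeType.Element)
            {
                names.Add(rdr.LocalName);
            }
        }
        rdr.Close();
        return names;
    }

    public static void Main()
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml("<library><book><title/><author/></book>"
                  + "<magazine/></library>");

        // Full tree: start from the document itself.
        Console.WriteLine(String.Join(", ", ElementNames(doc).ToArray()));

        // Partial tree: start from the first <book> element.
        XmlNode book = doc.DocumentElement.FirstChild;
        Console.WriteLine(String.Join(", ", ElementNames(book).ToArray()));
    }
}
```

When started from the book node, the reader never visits the library or magazine elements, since they lie outside the subtree rooted at the node passed to the constructor.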