4.5. Performance Considerations

It is important to consider performance when processing XML documents. XML document processing—handling the document in a pre- or post-processing stage to an application's business logic—may adversely affect application performance because such processing is potentially very CPU, memory, and input/output or network intensive.

Why does XML document processing potentially impact performance so significantly? Recall that processing an incoming XML document consists of multiple steps, including parsing the document; optionally validating the document against a schema (this implies first parsing the schema); recognizing, extracting, and directly processing element contents and attribute values; or optionally mapping these components to other domain-specific objects for further processing. These steps must occur before an application can apply its business logic to the information retrieved from the XML document. Parsing an XML document often requires a great deal of encoding and decoding of character sets, along with string processing. Depending on the API that is used, recognition and extraction of content may consist of walking a tree data structure, or it may consist of intercepting events generated by the parser and then processing these events according to some context. An application that uses XSLT to preprocess an XML document adds more processing overhead before the real business logic work can take place. When the DOM API is used, it creates a representation of the document—a DOM tree—in memory. Large documents result in large DOM trees and corresponding consumption of large amounts of memory. The XML data-binding process has, to some extent, the same memory consumption drawback. Many of these constraints hold true when generating XML documents.

There are other factors with XML document processing that affect performance. Often, the physical and logical structures of an XML document may be different. An XML document may also contain references to external entities. These references are resolved and substituted into the document content during parsing, but prior to validation. Given that the document may originate on a system different from the application's system, and external entities—and even the schema itself—may be located on remote systems, there may be network overhead affecting performance. To perform the parsing and validation, external entities must first be loaded or downloaded to the processing system. This may be a network intensive operation, or require a great deal of input and output operations, when documents have a complex physical structure.

In summary, XML processing is potentially CPU, memory, and network intensive, for these reasons:

  • It may be CPU intensive. Incoming XML documents need not only to be parsed but also validated, and they may have to be processed using APIs which may themselves be CPU intensive. It is important to limit the cost of validation as much as possible without jeopardizing the application processing and to use the most appropriate API to process the document.

  • It may be memory intensive. XML processing may require creating large numbers of objects, especially when dealing with document object models.

  • It may be network intensive. A document may be the aggregation of different external entities that may need to be retrieved across the network during parsing. It is important to reduce the cost of referencing external entities as much as possible.

Following are some guidelines for improving performance when processing XML documents. In particular, these guidelines examine ways of reducing CPU, memory, and input/output or network consumption.

4.5.1. Limit Parsing of Incoming XML Documents

In general, it is best to parse incoming XML documents only when the request has been properly formulated. In the case of a Web service application, if a document is retrieved as a Source parameter from a request to an endpoint method, it is best first to enforce security and validate the meta information that may have been passed as additional parameters with the request.

In a more generic messaging scenario, when a document is wrapped inside another document (considered an envelope), and the envelope contains meta information about security and how to process the inner document, you may apply the same recommendation: Extract the meta information from the envelope, then enforce security and validate the meta information before proceeding with the parsing of the inner document. When implementing a SAX handler and assuming that the meta information is located at the beginning of the document, if either the security or the validation of the meta information fails, then the handler can throw a SAX exception to immediately abort the processing and minimize the overall impact on performance.
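
Assuming the meta information appears at the start of the envelope, such an early-abort handler might look like the following sketch. The element name and `secure` attribute are hypothetical; a real application would verify credentials and validate the meta information instead.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class EnvelopeScreener {

    // Hypothetical check: only accept envelopes whose meta element
    // carries secure="true".
    static class MetaHandler extends DefaultHandler {
        @Override
        public void startElement(String uri, String localName,
                String qName, Attributes attrs) throws SAXException {
            if ("meta".equals(qName)
                    && !"true".equals(attrs.getValue("secure"))) {
                // Throwing aborts the parse at once, so the
                // (potentially large) inner document is never parsed.
                throw new SAXException("meta information rejected");
            }
        }
    }

    // Returns true when the parse is aborted by the handler.
    public static boolean rejects(String xml) throws Exception {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)),
                       new MetaHandler());
            return false;
        } catch (SAXException expected) {
            return true;
        }
    }
}
```

Because the rejection happens during parsing, the cost incurred for a bad request is limited to the few events emitted before the meta element is seen.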

4.5.2. Use the Most Appropriate API

It's important to choose the most appropriate XML processing API for your particular task. In this section, we look at the different processing models in terms of the situations in which they perform best and where their performance is limited.

In general, without considering memory consumption, processing using the DOM API tends to be slower than processing using the SAX API. This is because DOM may have to load the entire document into memory so that the document can be edited or data retrieved, whereas SAX allows the document to be processed as it is parsed. However, despite its initial slowness, it is better to use the DOM model when the source document must be edited or processed multiple times.

You should also try to use JAXB whenever the document content has a direct representation, as domain-specific objects, in Java. If you don't use JAXB, then you must manually map document content to domain-specific objects, and this process often (when SAX is too cumbersome to apply—see page 166) requires an intermediate DOM representation of the document. Although this intermediate DOM representation is transient, it consumes memory resources and must be traversed when mapping to the domain-specific objects. With JAXB, you can generate the same code automatically, thus saving development time, and, depending on the JAXB implementation, it may not create an intermediate DOM representation of the source document. In any case, JAXB consumes less memory, as a JAXB content tree is by nature smaller than an equivalent DOM tree.

When using higher-level technologies such as XSLT, keep in mind that they may rely on lower-level technologies like SAX and DOM, which may affect performance, possibly adversely.

When building complex XML transformation pipelines, use the JAXP class SAXTransformerFactory to process the results of one style sheet transformation with another style sheet. You can optimize performance—by avoiding the creation of in-memory data structures such as DOM trees—by working with SAX events until the last stage in the pipeline.
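
For example, a two-stage pipeline can be wired so that the stages exchange SAX events and only the final stage materializes a result. This minimal sketch casts the default TransformerFactory to SAXTransformerFactory, which assumes (as is true of the JDK's built-in engine) that the feature is supported.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class Pipeline {
    // Runs xml through two style sheets; the intermediate result is
    // exchanged as SAX events, never built as a tree.
    public static String transform(String xml, String xslt1,
            String xslt2) throws Exception {
        SAXTransformerFactory stf =
            (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler stage1 = stf.newTransformerHandler(
            new StreamSource(new StringReader(xslt1)));
        TransformerHandler stage2 = stf.newTransformerHandler(
            new StreamSource(new StringReader(xslt2)));
        // Wire: parser -> stage1 -> stage2 -> stream result.
        stage1.setResult(new SAXResult(stage2));
        StringWriter out = new StringWriter();
        stage2.setResult(new StreamResult(out));
        XMLReader reader = SAXParserFactory.newInstance()
            .newSAXParser().getXMLReader();
        reader.setContentHandler(stage1);
        reader.parse(new InputSource(new StringReader(xml)));
        return out.toString();
    }
}
```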

As an alternative, you may consider using APIs other than the four discussed previously. JDOM and dom4j are particularly appropriate for applications that implement a document-centric processing model and that must manipulate a DOM representation of the documents.

JDOM, for example, achieves the same results as DOM but, because it is more generic, it can address any document model. Not only is it optimized for Java, but developers find JDOM easy to use because it relies on the Java Collection API. JDOM documents can be built directly from, and converted to, SAX events and DOM trees, allowing JDOM to be seamlessly integrated in XML processing pipelines and in particular as the source or result of XSLT transformations.

Another alternative API is dom4j, which is similar to JDOM. In addition to supporting tree-style processing, the dom4j API has built-in support for XPath. For example, the org.dom4j.Node interface defines methods to select nodes according to an XPath expression. dom4j also implements an event-based processing model so that it can efficiently process large XML documents. When XPath expressions are matched during parsing, registered handlers can be called back, thus allowing you to immediately process and dispose of parts of the document without waiting for the entire document to be parsed and loaded into memory.

When receiving documents through a service endpoint (either a JAX-RPC or EJB service endpoint), documents are passed as abstract Source objects. As already noted, do not assume a specific implementation—StreamSource, SAXSource, or DOMSource—for an incoming document. Instead, you should ensure that the optimal API is used to bridge between the specific Source implementation passed to the endpoint and the intended processing model. Keep in mind that the JAXP XSLT API does not guarantee that identity transformations are applied in the most effective way. For example, when applying an identity transformation from a DOM tree to a DOM tree, the most effective way is to return the source tree as the result tree without further processing; however, this behavior is not enforced by the JAXP specification.
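
A bridging helper might therefore check which Source implementation was actually received and fall back to an identity transformation only when necessary. This is a sketch; the DOMSource short-circuit is precisely the optimization that the JAXP specification does not mandate.

```java
import javax.xml.transform.Source;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.dom.DOMSource;
import org.w3c.dom.Node;

public class SourceBridge {
    // Obtains a DOM node from any Source implementation. When the
    // endpoint already received a DOMSource, its node is reused
    // directly instead of paying for an identity transformation.
    public static Node toNode(Source source) throws Exception {
        if (source instanceof DOMSource) {
            return ((DOMSource) source).getNode();
        }
        // StreamSource, SAXSource, etc.: build the tree once.
        DOMResult result = new DOMResult();
        TransformerFactory.newInstance().newTransformer()
            .transform(source, result);
        return result.getNode();
    }
}
```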

A developer may also want to implement stream processing for the application so that it can receive the processing requirements as part of the SOAP request and start processing the document before it is completely received. Document processing in this manner improves overall performance and is useful when passing very large documents. Take extreme caution with this approach: the underlying JAX-RPC implementation may wait to receive the complete document before passing the Source object to the endpoint, and it may not pass a Source implementation that allows for stream processing, such as StreamSource or SAXSource. The same holds true when implementing stream processing for outgoing documents: while you can pass a Source object that allows for stream processing, there is no guarantee on how the underlying JAX-RPC implementation will actually handle it.

4.5.3. Choose Effective Parser and Style Sheet Implementations

Each parser and style sheet engine implementation is different. For example, one might emphasize functionality, while another performance. A developer might want to use different implementations depending on the task to be accomplished. Consider using JAXP, which not only supports many parsers and style sheet engines, but also has a pluggability feature that allows a developer to swap between implementations and select the most effective implementation for an application's requirements. When you use JAXP, you can later change the underlying parser implementation without having to change application code.
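
With JAXP, the concrete factory class is discovered at run time, so the implementation can be selected externally rather than in code. A minimal sketch; the command line below shows one standard selection mechanism, with Xerces named purely as an example.

```java
import javax.xml.parsers.SAXParserFactory;

public class FactorySelection {
    // The implementation behind SAXParserFactory.newInstance() can be
    // swapped without touching application code, for example:
    //   java -Djavax.xml.parsers.SAXParserFactory=\
    //        org.apache.xerces.jaxp.SAXParserFactoryImpl MyApp
    public static String currentSaxFactory() {
        return SAXParserFactory.newInstance().getClass().getName();
    }
}
```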

4.5.3.1. Tune Underlying Parser and Style Sheet Engine Implementations

The JAXP API defines methods to set and get features and properties for configuring the underlying parser and style sheet engine implementations. A particular parser, document builder, or transformer implementation may define specific features and properties that switch on or off behaviors dedicated to performance improvement. These are separate from the standard properties and features, such as the http://xml.org/sax/features/validation feature used to turn validation on or off.

For example, Xerces defines a deferred expansion feature called http://apache.org/xml/features/dom/defer-node-expansion, which enables or disables a lazy DOM mode. In lazy mode (enabled by default), the DOM tree nodes are lazily evaluated: their creation is deferred until they are first accessed. As a result, DOM tree construction from an XML document returns faster, since only accessed nodes are expanded. This feature is particularly useful when processing only parts of the DOM tree. Grammar caching, another feature available in Xerces, improves performance by avoiding repeated parsing of the same XML schemas. This is especially useful when an application processes a limited number of schemas, which is typically the case with Web services.

Use care when setting specific features and properties to preserve the interchangeability of the underlying implementation. When the underlying implementation encounters a feature or a property that it does not support or recognize, the SAXParserFactory, the XMLReader, or the DocumentBuilderFactory may throw these exceptions: a SAXNotRecognizedException, a SAXNotSupportedException, or an IllegalArgumentException. Avoid grouping unrelated features and properties, especially standard versus specific ones, in a single try/catch block. Instead, handle exceptions independently so that optional specific features or properties do not prevent switching to a different implementation. You may design your application in such a way that features and properties specific to the underlying implementations may also be defined externally to the application, such as in a configuration file.
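
The pattern can be sketched as follows: the standard validation feature is treated as mandatory, while an implementation-specific feature (here the Xerces feature that disables loading of external DTDs, chosen as an example) is treated as optional so that the code still runs on other parsers.

```java
import javax.xml.parsers.SAXParserFactory;

public class ParserConfig {
    public static SAXParserFactory configure(boolean validate) {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        // Standard feature: its own try/catch, so a failure here is
        // treated as a real configuration error.
        try {
            factory.setFeature(
                "http://xml.org/sax/features/validation", validate);
        } catch (Exception e) {
            throw new IllegalStateException("validation unsupported", e);
        }
        // Xerces-specific feature: a failure is tolerated so that this
        // optional optimization does not prevent switching to another
        // implementation.
        try {
            factory.setFeature(
                "http://apache.org/xml/features/nonvalidating/load-external-dtd",
                false);
        } catch (Exception ignored) {
            // Optional feature not available; continue.
        }
        return factory;
    }
}
```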

4.5.3.2. Reuse and Pool Parsers and Style Sheets

An XML application may have to process different types of documents, such as documents conforming to different schemas. A single parser may be used (per thread of execution) to handle documents of different types successively, just by reassigning the handlers according to the source documents to be processed. Parsers, which are complex objects, may be pooled so that they can be reused by other threads of execution, reducing the burden on memory allocation and garbage collection. Additionally, if the number of different document types is large and if the handlers are expensive to create, handlers may be pooled as well. The same considerations apply to style sheets and transformers.
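
One way to keep a single parser per thread of execution is a ThreadLocal holder, reassigning the handler for each document; a sketch:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ParserPerThread {
    // One parser per thread; reused across documents by passing a
    // different handler on each parse.
    private static final ThreadLocal<SAXParser> PARSER =
        ThreadLocal.withInitial(() -> {
            try {
                return SAXParserFactory.newInstance().newSAXParser();
            } catch (Exception e) {
                throw new IllegalStateException(e);
            }
        });

    public static void parse(String xml, DefaultHandler handler)
            throws Exception {
        SAXParser parser = PARSER.get();
        parser.reset(); // clear state left by the previous document
        parser.parse(new InputSource(new StringReader(xml)), handler);
    }
}
```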

Parsers, document builders, and transformers, as well as style sheets, can be pooled using a custom pooling mechanism. Or, if the processing occurs in the EJB tier, you may leverage the EJB container's instance pooling mechanism by implementing stateless session beans or message-driven beans dedicated to these tasks. Since these beans are pooled by the EJB container, the parsers, document builders, transformers, and style sheets to which they hold a reference are pooled as well.

Style sheets can be compiled into javax.xml.transform.Templates objects to avoid repeated parsing of the same style sheets. Templates objects are thread safe and are therefore easily reusable.
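
For example, a style sheet can be parsed once into a shared Templates object, from which each use obtains its own cheap Transformer:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class CompiledStyleSheet {
    private final Templates templates; // thread safe, parsed once

    public CompiledStyleSheet(String xslt) throws Exception {
        this.templates = TransformerFactory.newInstance()
            .newTemplates(new StreamSource(new StringReader(xslt)));
    }

    // Transformers are not thread safe, so each call creates one from
    // the shared, already-compiled Templates object.
    public String apply(String xml) throws Exception {
        StringWriter out = new StringWriter();
        templates.newTransformer().transform(
            new StreamSource(new StringReader(xml)),
            new StreamResult(out));
        return out.toString();
    }
}
```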

4.5.4. Reduce Validation Cost

Validation is an important step of XML processing, and it may even be required to guarantee the reliability of an XML application: an application may legitimately rely on the parser's validation so that it can avoid double-checking the validity of document contents. Keep in mind, however, that validation may affect performance.

Consider the trusted and reliable system depicted in Figure 4.14. This system is composed of two loosely coupled applications. The front-end application receives XML documents as part of requests and forwards these documents to the reservation engine application, which is implemented as a document-centric workflow.

Figure 4.14. Validating Only When Necessary


Although you must validate external incoming XML documents, you can exchange freely—that is, without validation—internal XML documents or already-validated external XML documents. In short, you need to validate only at the system boundaries. Internally, you may use validation only as an assertion mechanism during development, and you may turn validation off in production when looking for optimal performance.

In other words, when you are both the producer and consumer of XML documents, you may use validation as an assertion mechanism during development, then turn it off in production. Additionally, in production, validation can be used as a diagnostic mechanism by setting it up so that it is triggered by fault occurrences.

4.5.5. Reduce the Cost of Referencing External Entities

Recall that an XML document may be the aggregation of assorted external entities, and that these entities may need to be retrieved across the network when parsing. In addition, the schema may also have to be retrieved from an external location. External entities, including schemas, must be loaded and parsed even when they are not being validated to ensure that the same information is delivered to the application regardless of any subsequent validation. This is especially true with respect to default values that may be specified in an incoming document schema.

There are two complementary ways to reduce the cost of referencing external entities:

  1. Caching using a proxy cache— You can improve the efficiency of locating references to external entities that are on a remote repository by setting up a proxy that caches retrieved, external entities. However, references to external entities must be URLs whose protocols the proxy can handle. (See Figure 4.15, which should be viewed in the context of Figure 4.14.)

    Figure 4.15. An Architecture for Caching External Entities

  2. Caching using a custom entity resolver— SAX parsers allow XML applications to handle external entities in a customized way. Such applications have to register their own implementation of the org.xml.sax.EntityResolver interface with the parser using the setEntityResolver method. The applications are then able to intercept external entities (including schemas) before they are parsed. Similarly, JAXP defines the javax.xml.transform.URIResolver interface. Implementing this interface enables you to retrieve the resources referred to in the style sheets by the xsl:import or xsl:include statements. For an application using a large set of componentized style sheets, this may be used to implement a cache in much the same way as the EntityResolver. You can use EntityResolver and URIResolver to implement:

    • A caching mechanism in the application itself, or

    • A custom URI lookup mechanism that may redirect system and public references to a local copy of a public repository.
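
A URIResolver that redirects style sheet references to local copies might be sketched as follows; here the "repository" is an in-memory map keyed by file name, purely for illustration.

```java
import java.io.StringReader;
import java.util.Map;
import javax.xml.transform.Source;
import javax.xml.transform.TransformerException;
import javax.xml.transform.URIResolver;
import javax.xml.transform.stream.StreamSource;

// Resolves xsl:import/xsl:include URIs from a local repository
// instead of the network.
public class CachingURIResolver implements URIResolver {
    private final Map<String, String> localCopies;

    public CachingURIResolver(Map<String, String> localCopies) {
        this.localCopies = localCopies;
    }

    @Override
    public Source resolve(String href, String base)
            throws TransformerException {
        // Look up by the last path segment, so both relative and
        // absolutized references find the local copy.
        String name = href.substring(href.lastIndexOf('/') + 1);
        String cached = localCopies.get(name);
        if (cached != null) {
            return new StreamSource(new StringReader(cached));
        }
        return null; // fall back to the engine's default lookup
    }
}
```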

You can use both caching approaches together to ensure even better performance. Use a proxy cache for static entities whose lifetime is greater than the application's lifetime. This works particularly well with public schemas, which include the version number in their public or system identifier, since they evolve through successive versions. A custom entity resolver may first map public identifiers (usually in the form of a URI) into system identifiers (usually in the form of a URL). Afterwards, it applies the same techniques as a regular cache proxy when dealing with system identifiers in the form of a URL, especially checking for updates and avoiding caching dynamic content. Using these caching approaches often results in a significant performance improvement, especially when external entities are located on the network. Code Example 4.13 illustrates how to implement a caching entity resolver using the SAX API.

Code Example 4.13. Using SAX API to Implement a Caching Entity Resolver
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.lang.ref.SoftReference;
import java.util.Map;
import java.util.WeakHashMap;
import org.xml.sax.*;

public class CachingEntityResolver implements EntityResolver {
   // Illustrative limit; entities larger than this are streamed
   // through without being cached.
   public static final int MAX_CACHE_ENTRY_SIZE = 512 * 1024;
   private EntityResolver parentResolver;
   private Map entities = new WeakHashMap();
   private byte[] buffer;

   static private class Entity {
      String name;
      byte[] content;
   }

   public CachingEntityResolver(EntityResolver parentResolver) {
      this.parentResolver = parentResolver;
      buffer = new byte[MAX_CACHE_ENTRY_SIZE];
   }

   public InputSource resolveEntity(String publicId,
      String systemId) throws IOException, SAXException {
      InputStream stream = getEntity(publicId, systemId);
      if (stream != null) {
         InputSource source = new InputSource(stream);
         source.setPublicId(publicId);
         source.setSystemId(systemId);
         return source;
      }
      return null;
   }

   private InputStream getEntity(String publicId, String systemId)
          throws IOException, SAXException {
      // Entities are cached under their public identifier when they
      // have one, under their system identifier otherwise.
      String name = publicId != null ? publicId : systemId;
      Entity entity = null;
      SoftReference reference = (SoftReference) entities.get(name);
      if (reference != null) {
         // Got a soft reference to the entity,
         // let's get the actual entity.
         entity = (Entity) reference.get();
      }
      if (entity == null) {
         // The entity has been reclaimed by the GC or was
         // never created, let's download it again! Delegate to
         // the parent resolver that implements the actual
         // resolution strategy.
         InputSource source
            = parentResolver.resolveEntity(publicId, systemId);
         if (source != null && source.getByteStream() != null) {
            return cacheEntity(publicId, systemId,
               source.getByteStream());
         }
         // No byte stream to cache; let the parser resolve it.
         return null;
      }
      return new ByteArrayInputStream(entity.content);
   }

   // Attempts to cache an entity; if it's too big just
   // return an input stream to it.
   private InputStream cacheEntity(String publicId,
         String systemId, InputStream stream) throws IOException {
      stream = new BufferedInputStream(stream);
      int count = 0;
      for (int i = 0; count < buffer.length; count += i) {
         if ((i = stream.read(buffer, count,
            buffer.length - count)) < 0) { break; }
      }
      byte[] content = new byte[count];
      System.arraycopy(buffer, 0, content, 0, count);
      if (count != buffer.length) {
         // Cache the entity for future use, using a soft reference
         // so that the GC may reclaim it if it's not referenced
         // anymore and memory is running low.
         Entity entity = new Entity();
         entity.name = publicId != null ? publicId : systemId;
         entity.content = content;
         entities.put(entity.name, new SoftReference(entity));
         return new ByteArrayInputStream(content);
      }
      // Entity too big to be cached.
      return new SequenceInputStream(
         new ByteArrayInputStream(content), stream);
   }
}

4.5.6. Cache Dynamically Generated Documents

Dynamically generated documents are typically assembled from values returned from calls to business logic. Generally, it is a good idea to cache dynamically generated XML documents to avoid having to refetch the document contents, which entails extra round trips to a business tier. This is a good rule to follow when the data is predominantly read only, such as catalog data. Furthermore, if applicable, you can cache document content (DOM tree or JAXB content tree) in the user's session on the interaction or presentation layer to avoid repeatedly invoking the business logic.

However, you quickly consume more memory when you cache the result of a user request to serve subsequent, related requests. When you take this approach, keep in mind that it must not be done to the detriment of other users. That is, be sure that the application does not fail because of a memory shortage caused by holding the cached results. To help with memory management, implement caches with soft references, which allow finer interaction with the garbage collector.

When caching a DOM tree in the context of a distributed Web container, the reference to the tree stored in an HTTP session may have to be declared as transient. This is because HttpSession requires objects that it stores to be Java serializable, and not all DOM implementations are Java serializable. Also, Java serialization of a DOM tree may be very expensive, thus countering the benefits of caching.
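
One way to reconcile session serializability with DOM caching is to keep the XML text in the session and rebuild the transient DOM tree lazily after deserialization. This is a sketch shown as a plain serializable holder rather than tied to HttpSession; the trade-off is re-parsing once per failover instead of serializing the tree.

```java
import java.io.Serializable;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Session attribute that survives serialization in a distributed Web
// container: the DOM tree itself is transient and rebuilt on demand
// from the (serializable) XML text.
public class CachedDocument implements Serializable {
    private final String xml;        // serialized with the session
    private transient Document tree; // dropped on serialization

    public CachedDocument(String xml) {
        this.xml = xml;
    }

    public synchronized Document getTree() throws Exception {
        if (tree == null) { // first access, or after deserialization
            tree = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        }
        return tree;
    }
}
```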

4.5.7. Use XML Judiciously

Using XML documents in a Web services environment has its pluses and minuses. XML documents can enhance Web service interoperability: Heterogeneous, loosely coupled systems can easily exchange XML documents because they are text documents. However, loosely coupled systems must pay the price for this ease of interoperability, since the parsing that these XML documents require is very expensive. This applies to systems that are loosely coupled in a technical and an enterprise sense.

Contrast this with tightly coupled systems. System components that are tightly coupled can use standard, nondocument-oriented techniques (such as RMI) that are far more efficient in terms of performance and require far less coding complexity. Fortunately, with technologies such as JAX-RPC and JAXB you can combine the best of both worlds. Systems can be developed that are internally tightly coupled and object oriented, and that can interact in a loosely coupled, document-oriented manner.

Generally, when using XML documents, follow these suggestions:

  • Rely on XML protocols, such as those implemented by JAX-RPC and others, to interoperate with heterogeneous systems and to provide loosely coupled integration points.

  • Avoid using XML for unexposed interfaces or for exchanges between components that should otherwise be tightly coupled.

Direct Java serialization of domain-specific objects is usually faster than XML serialization of an equivalent DOM tree or even the Java serialization of the DOM tree itself (when the DOM implementation supports such a Java serialization). Also, direct Java serialization of the domain-specific objects usually results in a serialized object form that is smaller than the serialized forms of the other two approaches. The Java serialization of the DOM tree is usually the most expensive in processing time as well as in memory footprint; therefore it should be used with extreme care (if ever), especially in an EJB context where serialization occurs when accessing remote enterprise beans. When accessing local enterprise beans, you can pass DOM trees or DOM tree fragments without incurring this processing expense. Table 4.2 summarizes the guidelines for using XML for component interactions.

Table 4.2. Guidelines for Using XML Judiciously for Component Interaction
Data Passing Between Components | Remote Components | Local Components
Java Objects | Efficient | Highly efficient (fine-grained access)
Document Object Model | Very expensive, nonstandard serialization | Only for document-centric architectures
Serialized XML | Expensive, but interoperable | No reason to serialize XML for local calls

To summarize, when implementing an application which spans multiple containers, keep the following points in mind (also see Table 4.2):

  • For remote component interaction, Java objects are efficient; serialized XML, although expensive, may be used for interoperability. DOM is very expensive, and Java serialization of DOM trees is not always supported.

  • For local component interaction, Java objects are the most efficient, and DOM may be used when required by the application's processing model. However, serialized XML is to be avoided.

  • Bind to domain-specific Java objects as soon as possible and process these objects rather than XML documents.
