If, How, and Where to Validate

This is a fundamental choice, and you'll benefit by making it as early as possible. We'll assume in this discussion that you're going to use a standard DOM or SAX API that offers validation against W3C schemas. A major choice you need to make is whether or not you are going to have the API validate documents against a schema. This is primarily of concern for documents that your application will consume, but it may also relate to documents that your application will produce. A related decision is the choice of schema, which we'll talk about shortly; you can pick a standard (if you can find an appropriate one) or you can create your own. If you do create your own, you have flexibility about how strictly you write it.
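
To make the on-or-off choice concrete, here is a minimal Java sketch using the standard JAXP classes. The class name, the tiny Order schema, and the validate switch are assumptions made up for illustration; your own schema would normally come from a file or URL rather than a string.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.w3c.dom.Document;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class ValidatingLoader {

    // Hypothetical schema for illustration: an Order with one required ItemNumber.
    static final String ORDER_XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='Order'><xs:complexType><xs:sequence>"
      + "<xs:element name='ItemNumber' type='xs:positiveInteger'/>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:schema>";

    /** Parses xml into a DOM; validates against ORDER_XSD only if validate is true. */
    public static Document load(String xml, boolean validate) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        if (validate) {
            SchemaFactory sf =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema =
                sf.newSchema(new StreamSource(new StringReader(ORDER_XSD)));
            dbf.setSchema(schema);
        }
        DocumentBuilder builder = dbf.newDocumentBuilder();
        // Rethrow validity errors so the caller sees them; what happens
        // without a handler depends on the parser implementation.
        builder.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { }
            public void error(SAXParseException e) throws SAXException { throw e; }
            public void fatalError(SAXParseException e) throws SAXException { throw e; }
        });
        return builder.parse(new InputSource(new StringReader(xml)));
    }
}
```

Calling load(xml, true) rejects an Order whose ItemNumber is not a positive integer, while load(xml, false) accepts any well-formed document. A single flag like this is one way to validate during testing and skip validation in production.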

Schema validation has its tradeoffs just like any other design decision. Here are a few.

  • Schema validation may incur performance penalties. We may be concerned about CPU and memory usage on a loaded system, delays in elapsed time, or both. Both can be considerations if you are validating against large schemas, especially if they are spread over several different files. Time delays due to network latency can be a factor if the schemas are standard schemas that must be retrieved from some organization's server on the Internet, though various schema caching mechanisms may ease this. The overall performance penalty may or may not be very great, but I know of at least a few organizations that don't use validation because of performance concerns. A compromise alternative to never validating might be to validate during testing but then to not validate after moving to production if schema validation errors have been eliminated from the normally anticipated production data.

  • The choice to forgo validating can make writing code harder. Particularly if you are using the DOM, being able to make reliable assumptions about the structure of a document you are reading can make writing the code much easier. If you aren't able to assume that an Element will be present, you must test for null pointer returns from methods such as getElementsByTagName or nextSibling before trying to retrieve an Attribute or child Elements from the Element. Users don't like null pointer exceptions. Also, in general terms the code can be trickier to write if you can't be sure that the structure of the document is what you expect. If you get a document that is well formed but has subtrees in the wrong places, what is your application going to do?

  • Schema validation is either on or off; there is no in-between. Some EDI management systems offer the ability to turn on or off the validation of coded values against a code list while still validating for required segments and data elements. It would be nice if the XML APIs offered strict or lax validation and various flavors of validation. However, the XML Schema Recommendation doesn't define levels of validation so the APIs don't either. Most of them stop cold at the first invalid component, no matter how seemingly trivial the noncompliance might be.

  • Schema validation removes the ability to correct errors from within the application (generally). Let me give you an example: Some systems that import sales orders received by EDI hold them in a review state, allowing a clerk to review them and correct for errors or missing data before accepting them into the system. If your application has a similar feature and you validate in the schema for an invalid item number, the data may never reach such a review state. (One way around this limitation would be to load the document without validation and somehow mark it as invalid.)

  • Schema validation removes many validation burdens from application code. This is perhaps one of the greatest benefits to validating against schemas. You let the API validate that your business rules are satisfied, and you don't have to write code for your application to do validation.

  • There are limits to schema validation. Schema validation is very good for determining the presence or absence of data and for ensuring the data type and contents of individual items of data. However, by itself it won't allow you to enforce constraints based on conditional relationships between different data items. Specifically, it won't flag one Element or Attribute as being invalid based on the contents of a different Element or Attribute. Some types of processing are rife with these types of requirements. Let's take an example in health care. A health care claim might have a coded field for the type of medical procedure performed. Some types of procedures might require information in other Elements that isn't relevant to other procedures. A tooth extraction would require that the tooth be identified, whereas fitting a bite guard would not. Some readers may be thinking that this is stupid; problems like this could easily be avoided by modeling the data correctly. You're probably right, but the fact of the matter is that data is not always modeled to prevent the need for this type of validation. It is a very real concern for some applications.
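
The conditional rule in the health care example has to live in application code. The following Java sketch shows one way to write it, along with the kind of null testing you need when you can't assume an Element is present. The Element names (ProcedureCode, ToothNumber) and the EXTRACTION code value are hypothetical; they don't come from any real claim format.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ClaimRules {

    // Hypothetical procedure code that requires a tooth to be identified.
    static final String EXTRACTION = "EXTRACTION";

    /** Returns the trimmed text of the first descendant Element with this
     *  name, or null if there is no such Element. */
    static String firstText(Element parent, String name) {
        NodeList matches = parent.getElementsByTagName(name);
        Node first = matches.item(0);  // item() returns null when nothing matched
        return (first == null) ? null : first.getTextContent().trim();
    }

    /** Returns null if the claim passes the rule, otherwise an error message. */
    public static String checkToothRule(Element claim) {
        String procedure = firstText(claim, "ProcedureCode");
        if (procedure == null) {
            return "ProcedureCode is missing";
        }
        String tooth = firstText(claim, "ToothNumber");
        if (EXTRACTION.equals(procedure) && (tooth == null || tooth.isEmpty())) {
            return "ToothNumber is required for " + EXTRACTION;
        }
        return null;
    }

    /** Parses a string without schema validation and returns the root Element. */
    public static Element parse(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement();
    }
}
```

Because the check returns a message instead of throwing, the application could also use it to mark a document as invalid and route it to a review state rather than rejecting it outright.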

Beyond these considerations, if you do validate you must determine where your schemas are going to be stored. This decision in itself may be relevant to whether or not you validate. For example, if your application is a desktop bookkeeping system that will process standard purchase orders received from customers, you may not want to enforce validation if a significant number of your users rely on dial-up Internet connections. You can't validate if you can't get to the schema, and if your bookkeeping system requires that imported orders be validated, an order that can't be validated can't be processed. If you validate against a remote schema, how reliable is the server that hosts the schema? If you provide your own schemas, like any other application component you must decide on a directory location. You must also take steps to ensure that schemas won't be modified or inadvertently deleted by users.

I have one final word on schema validation. Just because a document is schema valid doesn't mean in and of itself that the document is what you expect. The default behavior of most APIs is to validate against the schema referenced in the document's root Element. That schema could be the one you coded for, or it could be something completely unrelated. You could be processing a valid shipment notice when you were expecting a valid invoice. So, to properly use schema validation in your architecture, you'll need to do things beyond just validating against a schema. The first step is to check the root Element name. We routinely did this in several of the book's routines. The other thing is to retrieve and check the document namespace and the schema location Attributes from the document's root Element. There is no way you can be absolutely sure of the document you're processing unless you can match the schema location URI to a known schema. If the document uses a named namespace, verifying that it matches what you expect provides another level of assurance. (Add that to your list of reasons for using namespaces.) I haven't verified namespace names in the book's utilities, but it is something you should consider doing in your applications.
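
The checks described above might look something like this in Java. This is only a sketch: the expected root name, namespace, and schema URI are hypothetical placeholders, and it handles only the xsi:schemaLocation Attribute, not xsi:noNamespaceSchemaLocation.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class DocumentIdentity {

    // Hypothetical expected values; substitute the ones your application is coded for.
    static final String EXPECTED_ROOT = "Invoice";
    static final String EXPECTED_NS = "http://example.com/invoice";
    static final String EXPECTED_SCHEMA = "http://example.com/schemas/invoice.xsd";

    /** True only if root name, namespace, and schema location all match. */
    public static boolean isExpectedDocument(Document doc) {
        Element root = doc.getDocumentElement();
        if (!EXPECTED_ROOT.equals(root.getLocalName())) {
            return false;
        }
        if (!EXPECTED_NS.equals(root.getNamespaceURI())) {
            return false;
        }
        // xsi:schemaLocation holds whitespace-separated pairs:
        // a namespace URI followed by the schema URI for that namespace.
        String hint = root.getAttributeNS(
                XMLConstants.W3C_XML_SCHEMA_INSTANCE_NS_URI, "schemaLocation");
        String[] tokens = hint.trim().split("\\s+");
        for (int i = 0; i + 1 < tokens.length; i += 2) {
            if (EXPECTED_NS.equals(tokens[i]) && EXPECTED_SCHEMA.equals(tokens[i + 1])) {
                return true;
            }
        }
        return false;
    }

    /** Parses a string into a namespace-aware DOM Document. */
    public static Document parse(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);  // needed for getLocalName and getNamespaceURI
        return dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }
}
```

Running a check like this before (or in addition to) validation keeps a perfectly valid shipment notice from being processed as an invoice.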
