PDI offers different options for validating XML documents, including checking that a document is well-formed. The structure of an XML document is formed by tags that begin with the character < and end with the character >. In an XML document, you can find start-tags: <exampletag>, end-tags: </exampletag>, or empty-element tags: <exampletag/>, and these tags can be nested. An XML document is called well-formed when it follows the syntax rules of the XML specification; for example, it has exactly one root element, every start-tag has a matching end-tag, elements are properly nested, and attribute values are quoted.
In this recipe, you will learn to validate whether a document is well-formed, which is the simplest kind of XML validation. Assume that you want to extract data from several XML documents with museum information, but only want to process those files that are well-formed.
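To make the idea of well-formedness concrete outside PDI, here is a minimal Python sketch that uses the standard library parser. It is only an illustration, not what PDI runs internally; the function name is_well_formed and the sample strings are assumptions made up for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as well-formed XML (hypothetical helper)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        # The parser raises ParseError on any well-formedness violation.
        return False


# Made-up sample documents, not the recipe's real files.
good = '<museums><museum id="1"><name>Sample</name></museum></museums>'
# The start-tag of <museum> is never closed with ">".
unclosed_tag = '<museums><museum id="1"<name>Sample</name></museum></museums>'
# The end-tags appear in the wrong order, so the elements are not nested.
bad_nesting = "<museums><museum></museums></museum>"

if __name__ == "__main__":
    for label, text in [("good", good),
                        ("unclosed_tag", unclosed_tag),
                        ("bad_nesting", bad_nesting)]:
        print(label, is_well_formed(text))
```

Note that this check says nothing about whether the document matches a DTD or schema; it only confirms that the syntax can be parsed.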
To use this recipe, you need a set of XML files in a directory named museums. This recipe reads a directory containing three files, where the first one has an intentional tag mistake. You can download sample files from the book's site.
Carry out the following steps:
2010/11/01 11:56:43 - Check if XML file is well formed - ERROR (version 4.1.0, build 13820 from 2010-08-25 07.22.27 by tomcat) : Error while checking file [file:///C:/museums1.xml]. Exception :
2010/11/01 11:56:43 - Check if XML file is well formed - ERROR (version 4.1.0, build 13820 from 2010-08-25 07.22.27 by tomcat) : org.xml.sax.SAXParseException: Element type "museum" must be followed by either attribute specifications, ">" or "/>".
2010/11/01 11:56:43 - Get XMLs well-formed.0 - Finished processing (I=0, O=0, R=2, W=2, U=0, E=0)
You can use the Check if XML is well-formed job entry to check if one or more XML files are well-formed.
In the recipe, the job checks the XML files from the source directory and creates a list with only the well-formed XML files.
As you saw in the logging window, only two files were added to the list and used later in the transformation. The first file (C:/museums1.xml) had an error; it was not well-formed and because of that it was not added to the list of files.
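The filtering behavior described above can be sketched in Python as a rough analogy. This is not the job entry's implementation; the function name well_formed_files and the returned pair of lists are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def well_formed_files(directory):
    """Split the *.xml files of a directory into accepted and rejected lists.

    Hypothetical helper: 'accepted' plays the role of the result file list,
    'rejected' pairs each failing path with the parser's error message.
    """
    accepted, rejected = [], []
    for path in sorted(Path(directory).glob("*.xml")):
        try:
            ET.parse(path)  # raises ParseError if the file is not well-formed
            accepted.append(path)
        except ET.ParseError as exc:
            rejected.append((path, str(exc)))
    return accepted, rejected
```

Called on a directory like the recipe's museums folder, a function of this kind should return two accepted paths and one rejected path, which corresponds to the W=2 counter in the log.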
The Get files from result step in the transformation gets the list of well-formed XML documents created in the job. Then, a Get data from XML step reads the files for further processing. Note that in this case, you didn't set the names of the files explicitly, but used the field path coming from the previous step.
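The idea of reading files whose names arrive as data, instead of being typed in explicitly, can be sketched in Python as follows. The layout assumed here (each child of the root element is one record with flat sub-elements) is hypothetical, since the structure of the museum files is not shown in this section; the function name rows_from_files is also an assumption.

```python
import xml.etree.ElementTree as ET


def rows_from_files(paths):
    """Yield one dict per record found in each file named in 'paths'.

    Assumed (hypothetical) layout: every child of the root element is a
    record, and every child of a record is a simple text field.
    """
    for path in paths:
        root = ET.parse(path).getroot()
        for record in root:
            # Keep the source path in the row, like the incoming path field.
            row = {"path": str(path)}
            for field in record:
                row[field.tag] = (field.text or "").strip()
            yield row
```

Because the function receives the paths as input, the same code works unchanged for any list of files produced by an earlier check.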