Validating well-formed XML files

PDI offers several options for validating XML documents, including checking that a document is well-formed. The structure of an XML document is formed by tags that begin with the character < and end with the character >. In an XML document, you can find start-tags: <exampletag>, end-tags: </exampletag>, or empty-element tags: <exampletag/>, and these tags can be nested. An XML document is called well-formed when it follows these rules:

  • It must contain at least one element
  • It must contain a unique root element; this means a single opening and closing tag for the whole document
  • The tags are case sensitive
  • All of the tags must be nested properly, without overlapping
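
To see these rules in action outside PDI, you can try them against a generic XML parser. The following is a minimal sketch in Python, assuming the standard library's xml.etree.ElementTree parser; the function name is_well_formed and the sample strings are made up for illustration and are not part of PDI or of the book's sample files.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True when xml_text parses as a well-formed XML document."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# One hypothetical sample per rule listed above.
samples = {
    "single root, proper nesting": "<museums><museum/></museums>",
    "no element at all": "",
    "two root elements": "<museum/><museum/>",
    "case mismatch": "<Museum></museum>",
    "overlapping tags": "<museums><museum></museums></museum>",
}

results = {label: is_well_formed(text) for label, text in samples.items()}
for label, ok in results.items():
    print(f"{label}: {'well-formed' if ok else 'NOT well-formed'}")
```

Only the first sample passes; each of the others breaks exactly one of the rules.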

In this recipe, you will learn to validate whether a document is well-formed, which is the simplest kind of XML validation. Assume that you want to extract data from several XML documents with museum information, but only want to process those files that are well-formed.

Getting ready

To use this recipe, you need a set of XML files in a directory named museums. This recipe reads a directory containing three files, where the first one has an intentional tag mistake. You can download sample files from the book's site.

How to do it...

Carry out the following steps:

  1. Create a new job and add a Start entry.
  2. Drop a Check if XML is well formed entry from the XML category onto the canvas.
  3. Under the General tab, type the path to the museums directory in the File/Folder source textbox, and type .+.xml in the wildcard textbox, so that only the files with the .xml extension are used.
  4. Click on the Add button to populate the File/Folder grid.
  5. Under the Advanced tab, choose the following configuration:
    [Screenshot: Advanced tab configuration]
  6. Then, create a new transformation in order to process the well-formed XML files obtained from the previous job entry. Add this transformation as the last entry in the job.
  7. In the transformation, drop a Get files from result step from the Job category.
  8. Add the Get data from XML step.
  9. Under the File tab, set the following:
    [Screenshot: File tab settings]
  10. Under the Content tab, type /museums/museum in the Loop XPath textbox.
  11. Finally, the grid under the Fields tab must be completed manually, as shown in the following screenshot:
    [Screenshot: Fields grid]
  12. When you run the job, you will obtain a museums dataset with data coming only from the well-formed XML files. You can take a look at the Logging window to verify this. You will see something like the following:
    2010/11/01 11:56:43 - Check if XML file is well formed - ERROR (version 4.1.0, build 13820 from 2010-08-25 07.22.27 by tomcat) :
    Error while checking file [file:///C:/museums1.xml].
    Exception :
    2010/11/01 11:56:43 - Check if XML file is well formed - ERROR (version 4.1.0, build 13820 from 2010-08-25 07.22.27 by tomcat) : org.xml.sax.SAXParseException:
    Element type "museum" must be followed by either attribute specifications, ">" or "/>".
    
  13. Further, you can see in the Logging window that only two files out of three were read:
    2010/11/01 11:56:43 - Get XMLs well-formed.0 - Finished processing (I=0, O=0, R=2, W=2, U=0, E=0)
    

How it works...

You can use the Check if XML is well formed job entry to check whether one or more XML files are well-formed.

In the recipe, the job validates the XML files from the source directory and creates a list with only the valid XML files.
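
The filtering that the job entry performs can be pictured with a short sketch. The Python code below is an illustration only, not PDI's implementation: it assumes that the wildcard is a regular expression matched against the whole file name, and it uses the standard library parser as a stand-in for the check. The function name well_formed_files and the file contents are hypothetical.

```python
import re
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def well_formed_files(folder, wildcard=".+.xml"):
    """Return files in folder whose names match wildcard and that parse cleanly."""
    pattern = re.compile(wildcard)
    accepted = []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file() or not pattern.fullmatch(path.name):
            continue  # name does not match the wildcard
        try:
            ET.parse(path)  # raises ParseError when the file is not well-formed
        except ET.ParseError:
            continue  # rejected: not added to the result list
        accepted.append(path)
    return accepted


# Demo with throwaway files (made-up contents, not the book's samples).
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "museums1.xml").write_text(
        "<museums><museum <name>A</name></museum></museums>")  # broken start tag
    (folder / "museums2.xml").write_text(
        "<museums><museum><name>B</name></museum></museums>")
    (folder / "museums3.xml").write_text(
        "<museums><museum><name>C</name></museum></museums>")
    (folder / "notes.txt").write_text("not XML")
    accepted_names = [p.name for p in well_formed_files(folder)]
    print(accepted_names)
```

Note that in a regular expression the unescaped dot in .+.xml matches any character, so a stricter pattern would be .+\.xml.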

As you saw in the Logging window, only two files were added to the list and used later in the transformation. The first file (C:/museums1.xml) was not well-formed, so it was not added to the list of files.
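
If you want to find out why a file was rejected, a parser can usually tell you where it stopped. The following sketch assumes Python's standard library parser, whose messages differ from the Java SAX message shown in the PDI log; the function name explain_parse_failure and the broken document are hypothetical.

```python
import xml.etree.ElementTree as ET


def explain_parse_failure(xml_text):
    """Return None for well-formed input, else (message, line, column)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position  # where the parser gave up
        return (str(err), line, column)
    return None


# A made-up document whose <museum> start tag is never closed properly.
broken = "<museums>\n  <museum <name>A</name></museum>\n</museums>"
failure = explain_parse_failure(broken)
print(failure)
```

The reported line points at the malformed start tag, which is usually enough to locate the mistake in the file.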

The Get files from result step in the transformation gets the list of well-formed XML documents created in the job. Then, a Get data from XML step reads the files for further processing. Note that in this case, you didn't set the names of the files explicitly, but used the field path coming from the previous step.
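
The idea of looping over /museums/museum for every document in the list can be sketched as follows. This is a rough Python analogy, not what the PDI steps do internally; the function name rows_from_documents, the element names name and city, and the sample contents are assumptions made for the example, since the real fields are defined in the Fields grid.

```python
import xml.etree.ElementTree as ET


def rows_from_documents(documents, loop_tag="museum"):
    """Yield one dict per <museum> element, tagged with its source name."""
    for name, xml_text in documents:
        root = ET.fromstring(xml_text)
        # The recipe's Loop XPath is /museums/museum. Here the root element
        # is already <museums>, so the loop runs over its <museum> children.
        for node in root.findall(loop_tag):
            row = {child.tag: (child.text or "").strip() for child in node}
            row["source"] = name
            yield row


# Hypothetical well-formed documents standing in for the accepted files.
documents = [
    ("museums2.xml",
     "<museums>"
     "<museum><name>B</name><city>X</city></museum>"
     "<museum><name>C</name><city>Y</city></museum>"
     "</museums>"),
    ("museums3.xml",
     "<museums><museum><name>D</name><city>Z</city></museum></museums>"),
]

rows = list(rows_from_documents(documents))
for row in rows:
    print(row)
```

Each row carries the name of the document it came from, much as the transformation receives the file names from the previous step instead of having them typed in.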

See also

  • The recipe named Specifying fields by using XPath notation in this chapter. See this recipe to understand how to fill the Fields grid.
  • The recipe named Validating an XML file against DTD definitions in this chapter. If you want more intelligence in the XML validation process, see this recipe.
  • The recipe named Validating an XML file against an XSD schema. This recipe serves the same purpose as the previous one, but uses a different validation method.