Reading simple XML files

PDI has a step named Get XML Data used to read XML structures. This recipe shows how to read an XML file containing the information about museums using this step.

Getting ready

In this exercise, you will use a file named museum.xml with the following structure:

<museums>
<museum id_museum= '…'>
<name>…</name>
<city>…</city>
<country>…</country>
</museum>
</museums>

How to do it...

Carry out the following steps:

  1. Create a new transformation.
  2. Drop a Get XML Data step from the Input category into the canvas.
  3. Under the File tab, you must select the XML document. Browse for the file museums.xml and click on the Add button.
  4. Under the Content tab, type /museums/museum in the Loop XPath textbox. This will be the current node.

    Tip

    Alternatively, you can click on the Get XPath nodes and select it.

  5. Under the Fields tab, you need to specify the fields by using XPath notation. Use the Get Fields button to get them automatically. You should get a result similar to the following:
    How to do it...

    Note

    In the case of XML attributes, if you include the @ character in the XPath as a prefix (for example, @id_museum), then it is not necessary to select Attribute under the Element column.

  6. Doing a preview on this step, you will obtain the following results:
How to do it...

How it works...

The Get XML Data step allows reading data in XML format by using XPath specification. In this recipe, you read a single file. However, as in any input step, you have the option to read a whole directory, multiple files, or even use a regular expression to specify which files to read. Alternatively, you can use this step to read XML structures from other sources, such as fields or URLs. For more details, see the section named XML data in a field later in this recipe.

In order to tell Kettle where to get the fields from, the first thing you have to do is to fill the Loop XPath textbox. You can do that by typing it or by clicking on the Get XPath nodes button and selecting it from the list of available nodes. For generating the dataset, Kettle will loop over the selected node.

Note

For each element that matches the selected node, Kettle will generate a new row.

The XPath and Element columns in the Field grid are the fields used to define the origin of the fields. The XPath should be relative to the current node. The Element column simply tells Kettle if the element is a node or an attribute. The rest of the columns in the grid should be filled just as you would in any input step: providing the type, format, and so on. If you are using this step for reading a file, then you have the option to fill this grid automatically, by clicking on the Get fields button.

There's more...

Most of the time, the XML file includes the encoding type. If none is specified, you have the option of selecting the encoding under the Content tab. For more on encoding, follow the link at http://en.wikipedia.org/wiki/Character_encoding.

If you have large XML files, then see the recommendations at http://wiki.pentaho.com/display/EAI/Get+Data+from+XML+-+Handling+Large+Files.

XML data in a field

In some situations, you don't have the XML as a file, but as a field in your dataset. An example of this is a transformation, where you call a web service that returns the result in XML format. In these situations, instead of specifying the name of the file, you must complete the section XML source from field under the File tab of the Get data from XML step. Checking the option XML source is defined in a field? will enable the drop-down list named get XML source from a field. From that list you have to select the field that contains the data in XML format.

The rest of the tabs should be filled exactly as when you read a file. The main difference is that the Get fields button will not be enabled. Consequently, you will have to fill the grid manually, or follow this tip:

Tip

Copy the content of the field that contains the XML structure and save it in a file. Read that file by using the Get data from XML step, and use the Get fields button to fill the Fields grid automatically. Finally, change the settings under the File tab, in order to read the structure from the desired field.

XML file name in a field

It may happen that your XML structure is in a file, but you don't know its name in advance. If the name of the file is in a field, you still may read it by using the Get data from XML step. For reading the file, you must complete the section XML source from field under the File tab of the Get data from XML step. Check the two options: XML source is defined in a field?, XML source is a filename?. The get XML source from a field drop-down list will be filled with the names of the incoming fields. From that list, select the field that contains the name of the file. As in the previous case—XML data in a field—the Get fields button will be disabled. For advice on filling the Fields grid, read the preceding tip.

ECMAScript for XML

ECMAScript, more commonly known as E4X, is an extension to JavaScript whose main objective is to make it easier to work with XML.

This feature was included in Pentaho Data Integration 3.1. If you know E4X or if you are really confident with scripting, you can take advantage of it and use it to parse or to generate XML structures as a replacement for the XML-related steps.

You can learn more about E4X at the following URL:

http://en.wikipedia.org/wiki/E4X

You can find a tutorial at the following URL:

https://developer.mozilla.org/En/E4X/Processing_XML_with_E4X

See also

The recipe named Specifying fields by using XPath notation. See this recipe for understanding how to fill the fields grid.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset