Specifying fields by using XPath notation

If you intend to read or write XML structures, it's mandatory that you understand at least the basics of XPath, the language for finding information in an XML document, or defining parts of an XML document. In this recipe, you will be introduced to the XPath notation, so that you will find it easier to work with the rest of the recipes in the chapter.

Suppose you have an XML structure such as the following:

<w_cond>
<data>
<request>
<type>City</type>
<query>Buenos Aires, Argentina</query>
</request>
<current_condition>
<observation_time>08:12 PM</observation_time>
<temp scale="C">19</temp>
<temp scale="F">66</temp>
<weatherDesc>Sunny</weatherDesc>
<windspeed unit="Miles">8</windspeed>
<windspeed unit="Kmph">13</windspeed>
<dirDegree>70</dirDegree>
<dir16Point>ENE</dir16Point>
...
</current_condition>
<weather>
<date>2010-10-24</date>
<tempMaxC>23</tempMaxC>
...
</weather>
<weather>
<date>2010-10-25</date>
...
</weather>
...
</data>
<data>
<request>
<type>City</type>
<query>Montevideo, Uruguay</query>
</request>
...
</data>
<data>
...
</data>
...
</w_cond>

This structure contains the weather forecast for a group of cities. For each city, you have the current weather and the forecast for the next three days.

Note

The sample XML was obtained by using a free local weather API. To learn how to use that API, visit www.worldweatheronline.com. Note that the sample is a slightly modified version of the original result.

Now, you want to specify the XPath for the following data (highlighted in the sample structure):

  • City
  • Observation time
  • Temperature (scale and degrees)
  • Weather description

Getting ready

This recipe is theoretical and has the purpose of helping you when it's time to enter an XPath notation. You will not develop a transformation here. However, for a better understanding of what's being explained, you can do the following:

  1. Download the sample XML file.
  2. Read it by using the Get data from XML step.
  3. Try introducing the different XPath notations in the Fields grid, as they are explained.
  4. To check if you are entering the correct notations, do a preview and check it for yourself.

How to do it...

Carry out the following steps:

  1. Pick the node that will be the base for specifying your fields. In the sample data, the node will be /w_cond/data.

    For each desired element, repeat steps 2 and 3:

  2. Look at the XML structure to see if it is a node or an attribute. In the sample data, the temperature scale and the units for the windspeed are attributes. The rest of the fields are nodes.
  3. Identify the absolute location of the element, that is, the complete path from the root element to the desired element. For example, for city the absolute location would be /w_cond/data/request/query. If the element is an attribute, prepend @ to the name.
  4. Identify the location relative to the base node identified in step 1. For example, for the city the relative location would be request/query. If the element is an attribute, prepend @ to the name.

The following table shows the absolute and relative locations for the sample data:

data

absolute location

relative location

city

/w_cond/data/request/query

request/query

observation time

/w_cond/data/current_cond/observation_time

current_cond/observation_time

Temperature (degrees)

/w_cond/data/current_cond/temp

current_cond/temp

Temperature (scale)

/w_cond/data/current_cond/temp/@scale

current_cond/temp/@scale

Weather description

/w_cond/data/current_cond/weatherDesc

current_cond/weatherDesc

The preceding locations are the XPath notations for the selected data in the sample XML structure.

How it works...

XPath is a set of rules used for getting information from an XML document. XPath treats an XML structure as a tree of nodes. The tree can be compared to a directory tree in your system. The way you specify relative or absolute locations in that tree is much the same in both cases.

In Kettle you use XPath both for getting data from XML structures and for generating XML structures.

The reason for specifying both absolute and relative locations in the recipe is that in Kettle you need one or the other depending of what you are doing. For example when you read an XML structure you have to select a node, and define the fields as locations relative to that node. When you join two XML structures, the XPath statement that you need to specify is an absolute location.

There's more...

The XPath notations in the recipe are the simplest XPath notations you will find, but XPath allows you to write really complex expressions. The next sections provide you with more detail about specifying nodes with XPath notation. For more information on XPath, you can follow this link: http://www.w3schools.com/XPath/ or see the W3C recommendation: http://www.w3.org/TR/xpath.

Getting data from a different path

When you read an XML structure, you don't specify absolute paths, but paths relative to a node selected as the current node. In the sample recipe, the current node was /w_cond/data. If the fields are inside that node, you get the relative location just by cutting the root part from the absolute location. For example, the absolute path for the weather description is /w_cond/data/current_cond/weatherDesc.

Then, for getting the location relative to the current node, just cut /w_cond/data/ and you get current_cond/weatherDesc.

If the data you need is not in the tree below the selected node, you have to use the .. notation, which is used to specify the parent of the current node. For example, suppose that the current node is /w_cond/data/current_cond and you want to know the name of the city to which this condition belongs. The city element is not inside the selected node. To reach it, you have to type ../request/city.

Getting data selectively

If you are reading a structure where there might be more than one element with the same XPath notation, you have the option to select just the one that interests you. Look for example at the temperature elements in the sample structure:

<temp scale="C">19</temp>
<temp scale="F">66</temp>

These lines belong to the Celsius and the Fahrenheit scales respectively. Both lines share the same XPath notation. Suppose that you are interested in the Celsius line. To get that element, you have to use a predicate. A predicate is an expression used to find a specific node or a node that contains a specific value. In this case, you need a predicate to find a node that contains an attribute named scale with value C. The notation for getting that node is temp[@scale='C']. In general, for getting a node that contains a specific value, the notation is XPath[condition], that is, the XPath expression followed by the condition within brackets.

Now, let's make it a bit more complicated. Suppose that you don't even know which scale to return, because the scale is part of the XML structure, as shown in the following example:

<request>
<type>City</type>
<query>Buenos Aires, Argentina</query>
<preferredScale>C</preferredScale>
</request>

Each city will have its own preferred scale and you should return the temperature in Celsius or Fahrenheit depending on the city's preferred scale.

What you need is a dynamic predicate. The way to implement this is through the use of a non standard extension named Tokens. Let's explain it based on our example:

  1. The first thing you have to do is to add the field in which the token is based: preferredScale. So, add a field named preferredScale and for XPath, type:

    ../request/preferred_scale.

  2. Then, add a new field for the temperature in the desired scale. For Name, type temperature and as XPath type:

    ../temp[@scale =@_preferredScale-]/text()

  3. Finally, under the Content tab, check Use tokens. If you don't, this will not work!

Assuming that you defined the fields: city, preferredScale, temperature_C, and temperature_F for the temperature in Celsius and Fahrenheit degrees respectively, and temperature, if you do a preview you should see something like the following:

Getting data selectively

In general, the expression for a token is @_<tokenized_field>-, where<tokenized_field> is the field in which the token is based and has to be previously defined.

PDI will build a dynamic predicate by replacing each<tokenized_field> by its current value and then returning the proper node value.

Getting more than one node when the nodes share their XPath notation

Look at the weather nodes in the sample XML structure:

<weather>
<date>2010-10-24</date>
<tempMaxC>23</tempMaxC>
...
</weather>
<weather>
<date>2010-10-25</date>
<tempMaxC>23</tempMaxC>
...
</weather>
<weather>
<date>2010-10-26</date>
<tempMaxC>24</tempMaxC>

For each node /w_cond/data (the current node in the example), there are three different weather nodes.

Suppose that you want to read all of them. In this case, you have to use a predicate just as explained in the previous section. In this case the predicate is not used to find a node that contains a specific value, but to find the nodes by position: You need the first, second, and third weather nodes. The notation that you have to use is XPath[position], that is, the XPath expression followed by the position of the desired element within brackets.

In the example, the notation for the first, second, and third weather nodes would be weather[1], weather[2], and weather[3] respectively. For getting nodes inside those, the notation is as usual. For example, for getting the date of the second node, you should write weather[2]/date.

Note that if you are reading an XML structure, each element that you get by using this notation may be used as a new column in the dataset. If, instead of that you want to generate a different row for each weather node, then you should take another approach: Instead of using this notation, simply change the current node (Loop XPath element) from /w_cond/data to /w_cond/data/weather.

Saving time when specifying XPath

In most of the Kettle steps where you have to provide an XPath, you must type it manually. That's why you need to understand this notation. However, when you read an XML structure by using the Get Data from XML step, you have the option to use the Get Fields button to get the nodes and attributes automatically. Note that Kettle will get only the trivial elements, that is:

  • It will get only the fields that are below the node you typed as Loop XPath.
  • It will bring all the elements. For getting elements selectively, as in the temperature example above, you'll have to modify the grid manually.
  • If there is more than one element with the same XPath notation, as in the weather example above, it will bring only the first element.

To summarize, if you fill the grid with the Get Fields button, you will save time but on most occasions, you will still have to adjust the data in the grid manually.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset