If you intend to read or write XML structures, it's mandatory that you understand at least the basics of XPath, the language for finding information in an XML document, or defining parts of an XML document. In this recipe, you will be introduced to the XPath notation, so that you will find it easier to work with the rest of the recipes in the chapter.
Suppose you have an XML structure such as the following:
<w_cond> <data> <request> <type>City</type> <query>Buenos Aires, Argentina</query> </request> <current_condition> <observation_time>08:12 PM</observation_time> <temp scale="C">19</temp> <temp scale="F">66</temp> <weatherDesc>Sunny</weatherDesc> <windspeed unit="Miles">8</windspeed> <windspeed unit="Kmph">13</windspeed> <dirDegree>70</dirDegree> <dir16Point>ENE</dir16Point> ... </current_condition> <weather> <date>2010-10-24</date> <tempMaxC>23</tempMaxC> ... </weather> <weather> <date>2010-10-25</date> ... </weather> ... </data> <data> <request> <type>City</type> <query>Montevideo, Uruguay</query> </request> ... </data> <data> ... </data> ... </w_cond>
This structure contains the weather forecast for a group of cities. For each city, you have the current weather and the forecast for the next three days.
The sample XML was obtained by using a free local weather API. To learn how to use that API, visit www.worldweatheronline.com. Note that the sample is a slightly modified version of the original result.
Now, you want to specify the XPath for the following data (highlighted in the sample structure):
This recipe is theoretical and has the purpose of helping you when it's time to enter an XPath notation. You will not develop a transformation here. However, for a better understanding of what's being explained, you can do the following:
Carry out the following steps:
/w_cond/data
.For each desired element, repeat steps 2 and 3:
city
the absolute location would be /w_cond/data/request/query
. If the element is an attribute, prepend @
to the name. request/query
. If the element is an attribute, prepend @
to the name.The following table shows the absolute and relative locations for the sample data:
data |
absolute location |
relative location |
---|---|---|
city |
|
|
observation time |
|
|
Temperature (degrees) |
|
|
Temperature (scale) |
|
|
Weather description |
|
|
The preceding locations are the XPath notations for the selected data in the sample XML structure.
XPath is a set of rules used for getting information from an XML document. XPath treats an XML structure as a tree of nodes. The tree can be compared to a directory tree in your system. The way you specify relative or absolute locations in that tree is much the same in both cases.
In Kettle you use XPath both for getting data from XML structures and for generating XML structures.
The reason for specifying both absolute and relative locations in the recipe is that in Kettle you need one or the other depending of what you are doing. For example when you read an XML structure you have to select a node, and define the fields as locations relative to that node. When you join two XML structures, the XPath statement that you need to specify is an absolute location.
The XPath notations in the recipe are the simplest XPath notations you will find, but XPath allows you to write really complex expressions. The next sections provide you with more detail about specifying nodes with XPath notation. For more information on XPath, you can follow this link: http://www.w3schools.com/XPath/ or see the W3C recommendation: http://www.w3.org/TR/xpath.
When you read an XML structure, you don't specify absolute paths, but paths relative to a node selected as the current node. In the sample recipe, the current node was /w_cond/data
. If the fields are inside that node, you get the relative location just by cutting the root part from the absolute location. For example, the absolute path for the weather description is /w_cond/data/current_cond/weatherDesc
.
Then, for getting the location relative to the current node, just cut /w_cond/data/
and you get current_cond/weatherDesc
.
If the data you need is not in the tree below the selected node, you have to use the .
. notation, which is used to specify the parent of the current node. For example, suppose that the current node is /w_cond/data/current_cond
and you want to know the name of the city to which this condition belongs. The city
element is not inside the selected node. To reach it, you have to type ../request/city
.
If you are reading a structure where there might be more than one element with the same XPath notation, you have the option to select just the one that interests you. Look for example at the temperature elements in the sample structure:
<temp scale="C">19</temp> <temp scale="F">66</temp>
These lines belong to the Celsius and the Fahrenheit scales respectively. Both lines share the same XPath notation. Suppose that you are interested in the Celsius line. To get that element, you have to use a predicate. A predicate is an expression used to find a specific node or a node that contains a specific value. In this case, you need a predicate to find a node that contains an attribute named scale
with value C
. The notation for getting that node is temp[@scale='C']
. In general, for getting a node that contains a specific value, the notation is XPath[condition]
, that is, the XPath expression followed by the condition within brackets.
Now, let's make it a bit more complicated. Suppose that you don't even know which scale to return, because the scale is part of the XML structure, as shown in the following example:
<request> <type>City</type> <query>Buenos Aires, Argentina</query> <preferredScale>C</preferredScale> </request>
Each city will have its own preferred scale and you should return the temperature in Celsius or Fahrenheit depending on the city's preferred scale.
What you need is a dynamic predicate. The way to implement this is through the use of a non standard extension named Tokens. Let's explain it based on our example:
preferredScale
. So, add a field named preferredScale
and for XPath, type:../request/preferred_scale.
temperature
and as XPath type:../temp[@scale =@_preferredScale-]/text()
Assuming that you defined the fields: city, preferredScale, temperature_C
, and temperature_F
for the temperature in Celsius and Fahrenheit degrees respectively, and temperature
, if you do a preview you should see something like the following:
In general, the expression for a token is @_<tokenized_field>-
, where<tokenized_field>
is the field in which the token is based and has to be previously defined.
PDI will build a dynamic predicate by replacing each<tokenized_field>
by its current value and then returning the proper node value.
Look at the weather
nodes in the sample XML structure:
<weather> <date>2010-10-24</date> <tempMaxC>23</tempMaxC> ... </weather> <weather> <date>2010-10-25</date> <tempMaxC>23</tempMaxC> ... </weather> <weather> <date>2010-10-26</date> <tempMaxC>24</tempMaxC>
For each node /w_cond/data
(the current node in the example), there are three different weather
nodes.
Suppose that you want to read all of them. In this case, you have to use a predicate just as explained in the previous section. In this case the predicate is not used to find a node that contains a specific value, but to find the nodes by position: You need the first, second, and third weather
nodes. The notation that you have to use is XPath[position]
, that is, the XPath expression followed by the position of the desired element within brackets.
In the example, the notation for the first, second, and third weather
nodes would be weather[1], weather[2]
, and weather[3]
respectively. For getting nodes inside those, the notation is as usual. For example, for getting the date of the second node, you should write weather[2]/date
.
Note that if you are reading an XML structure, each element that you get by using this notation may be used as a new column in the dataset. If, instead of that you want to generate a different row for each weather
node, then you should take another approach: Instead of using this notation, simply change the current node (Loop XPath element) from /w_cond/data
to /w_cond/data/weather
.
In most of the Kettle steps where you have to provide an XPath, you must type it manually. That's why you need to understand this notation. However, when you read an XML structure by using the Get Data from XML step, you have the option to use the Get Fields button to get the nodes and attributes automatically. Note that Kettle will get only the trivial elements, that is:
To summarize, if you fill the grid with the Get Fields button, you will save time but on most occasions, you will still have to adjust the data in the grid manually.