There are occasions where you will need to join two datasets. If you are working with databases, you could use SQL statements to perform this task, but for other kinds of input (XML, text, Excel), you will need another solution.
Kettle provides the Merge Join step to join data coming from any kind of source.
Let's assume that you are building a house and want to track and manage the costs of building it. Before starting, you prepared an Excel file with the estimated costs for the different parts of your house. Now, you are given a weekly file with the progress and the real costs. So, you want to compare both to see the progress.
To run this recipe, you will need two Excel files, one for the budget and another with the real costs. The budget.xls
has the estimated starting date, estimated end date, and cost for the planned tasks. The costs.xls
has the real starting date, end date, and cost for tasks that have already started.
You can download the sample files from the book's site.
Carry out the following steps:
dd/MM/yyyy
to the fields of type Date and $0.00
to the fields with costs. task, starting date (est.), starting date, end date (est.), end date, cost (est.), cost
.In the example, you saw how to use the Merge Join step to join data coming from two Excel files. You can use this step to join any other kind of input.
In the Merge Join step, you set the name of the incoming steps, and the fields to use as the keys for joining them. In the recipe, you joined the streams by just a single field: the task
field.
In the example, you set the Join Type to LEFT OUTER JOIN
. Let's see explanations of the possible join options:
Join |
Description |
In the example |
---|---|---|
|
The result contains only the rows with the same key in both sources |
You will obtain only the tasks that have estimated and real information. |
|
The result contains all the rows from the first source, and the correspondent values for second source (or empty values for non-matching keys) |
You will obtain all the tasks from the budget, and the real costs related to these tasks (with empty values for the tasks that still haven't any associated costs). |
|
The result contains all the rows from the second source, and the corresponding values for the first source (or empty values for non-matching keys) |
You will obtain all the real tasks' costs and their related information from the budget. If there is a cost for a task that hadn't been estimated, the estimated cost will be empty. |
|
The result contains all the rows from both sources (with empty values for non-matching keys) |
You will obtain all the tasks from the budget and the real costs. This was the case in the recipe. |
The recipe named Reading an Excel file in Chapter 2, Reading and Writing Files. In this recipe, you can learn the details of reading Excel files with Kettle.