Suppose that you have two streams with the same structure and want to find out the differences in the data. Kettle has a step meant specifically for that purpose: the Merge Rows (diff) step. In this recipe, you will see how it works.
Suppose that you have a file with information about the fastest roller coasters around the world. Now, you get an updated file and want to find out the differences between the files: There can be new roller coasters in the list; maybe some roller coasters are no longer among the fastest. Besides, you were told that in the old file, there were some errors about the location, country, and year information, so you are also interested in knowing if some of these have changed.
For this recipe, you will need two files with information about roller coasters. You can download them from the book's site.
Both files have the same structure and look like the following:
Roller_Coaster|Speed|park|location|country|Year Kingda Ka|128 mph|Six Flags Great Adventure|Jackson, New Jersey||2005 Top Thrill Dragster|120 mph|Cedar Point|Sandusky, Ohio||2003 Dodonpa|106.8 mph|Fuji-Q Highland|FujiYoshida-shi|Japan|2001 Steel Dragon 2000|95 mph|Nagashima Spa Land|Mie|Japan|2000 Millennium Force|93 mph|Cedar Point|Sandusky, Ohio||2000 ...
For the There's more section, you will also need a database with the first file already loaded in a table. You will find a script for creating and loading it also available for downloading.
Carry out the following steps:
top_roller_coasters.txt
. As a separator, type |
. roller_coaster
and park
. top_roller_coasters_updates.txt
and sort the rows also by roller_coaster
and park
. top_roller_coasters.txt
file. top_roller_coasters_updates.txt
file. flag
.You can save time by clicking on the Get key fields and Get value fields buttons to fill each grid respectively. Then just delete the fields that you don't need.
The Merge Rows (diff) step is used for comparing two streams and finding out the differences between them. The output of the step is a single stream. The output stream contains a new field that acts as a flag indicating the kind of difference found as explained next.
When you use the Merge Rows (diff) step, the two streams you are merging must have the same metadata, that is, the name, order, and type of the fields must be the same.
Let's call the streams being merged reference stream and compare stream. The first holds the old data while the second holds the new data. In the recipe, the old data is the data coming from the top_roller_coasters.txt
file and the new data is the data coming from the top_roller_coasters_update.txt
file.
In order to perform the comparison, you have to tell Kettle how to detect that a row is the same in both streams, that is, you have to specify the key fields. You do it by entering them in the first grid. In the recipe, the key was made up by the roller coaster name and the park name (roller_coaster and park
fields).
If your data comes from a database, instead of using a Sort rows step for sorting the rows, you can sort them in the Table input. That will give you better performance.
Given the two streams, Kettle tries to match rows of both streams based on the key fields provided. Depending on the result, it sets a different value for the flag as explained in the following table:
Result of the comparison |
Flag |
Example |
---|---|---|
The key was only found in the reference stream |
|
|
The key was only found in the compared stream |
|
|
The key was found in both streams and the fields typed in the Value to compare grid are equal |
|
|
The key was found in both streams but at least one of the fields typed in the Value to compare grid is different |
|
|
Note that if a row is found in both streams with identical key fields and compare fields, it is marked as identical even if there are differences in other fields. For example, the Dodonpa
roller coaster has a speed of 106.8 mph
in the reference stream, but a speed of 106 mph
in the compare stream. As you didn't put the speed
in the values to compare list, the rows are marked as identical.
As a final remark, note that for the rows marked as new
or changed
, the values that pass to the output stream are those coming from the compare stream.
For the rows marked as identical
or deleted
, the values that are passed are those coming from the reference stream.
The Merge Rows (diff) step is commonly used together with the Synchronize after merge step to keep a database table updated. The following section shows an example of how to do this.
Suppose that you have a table in a database with information about roller coasters, and that you have already inserted the data in the top_roller_coasters.txt
file in that table.
Now, you have the top_roller_coasters_updates.txt
file and want to update the table based on the differences.
Try the following:
rollercoasters
. Use the following statement:SELECT roller_coaster , speed , park , location , country , year FROM rollercoasters ORDER BY roller_coaster , park
rollercoasters
. Fill the grids as shown in the following screenshot: SELECT
statement to see the data in the rollercoaster
table. The roller coasters with the flag deleted
should have been deleted from the table. The rows with the flag new
should have been inserted in the table, and the rows with the flag changed
should have been updated.The recipe named Inserting, deleting, or updating a table depending on a field in Chapter 1, Working with Databases. This recipe will help you understand the use of the Synchronize after merge step.