Kettle allows you to compare files and folders through the following job entries: File Compare and Compare folder. In this recipe, you will use the first of those entries, which compares the content of two files. Assume that you periodically receive a file with new museum data to incorporate into your database. You will compare the new version of the file with the previous one. If the files are equal, you do nothing; if they are different, you read the new file.
To create and test this recipe, you will need two files: the older version of the museum file (LastMuseumsFileReceived.xml), and the new file (NewMuseumsFileReceived.xml).
On the book's website, you will find sample files to play with. In particular, NewMuseumsFileReceived(equal).xml is equal to the LastMuseumsFileReceived.xml file, and NewMuseumsFileReceived(different).xml, as implied by its name, is different. With these files, you will be able to test the different situations in the recipe.
Carry out the following steps:
Evaluation item and then select the Follow when result is false item.
Evaluation item, and this time select the Follow when result is true item.
${Internal.Transformation.Filename.Directory}\sampleFiles\NewMuseumsFileReceived.xml.
Use /museums/museum in the Loop XPath textbox under the Content tab, and use the Get fields button under the Fields tab to populate the list of fields automatically.
2010/11/05 10:08:46 - fileCompare - Finished job entry [DUMMY] (result=[true])
...
... - Read XML file - Loading transformation from XML file [file:///C:/readXMLFile.ktr]
... - readXMLFile - Dispatching started for transformation [readXMLFile]
... - readXMLFile - This transformation can be replayed with replay date: 2010/11/05 10:14:10
... - Read museum data.0 - Finished processing (I=4, O=0, R=0, W=4, U=0, E=0)
... - fileCompare - Finished job entry [Read XML file] (result=[true])
...
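To see what the Loop XPath /museums/museum selects, it can help to reproduce the idea outside Kettle. The following Python sketch is only an illustration, not how Kettle reads the file: it produces one row per museum element under the museums root and turns each child element into a field. The function name read_museums and the name and city fields in the usage example are hypothetical, since the structure of the sample files is not shown here.

```python
import xml.etree.ElementTree as ET


def read_museums(xml_text):
    """Return one dict per <museum> element found under the <museums> root."""
    root = ET.fromstring(xml_text)
    # /museums/museum only matches when the document root is <museums>.
    if root.tag != "museums":
        return []
    rows = []
    for museum in root.findall("museum"):
        # Each direct child element becomes a field (tag -> text);
        # attributes and nested elements are ignored in this sketch.
        rows.append({child.tag: (child.text or "").strip() for child in museum})
    return rows


if __name__ == "__main__":
    # Hypothetical sample data; the real files may use other fields.
    sample = (
        "<museums>"
        "<museum><name>Museum A</name><city>City A</city></museum>"
        "<museum><name>Museum B</name><city>City B</city></museum>"
        "</museums>"
    )
    for row in read_museums(sample):
        print(row)
```

Each dictionary plays the role of one row leaving the XML input step, and its keys correspond to the fields that the Get fields button would list.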
The File Compare job entry performs the comparison task: it verifies whether the two files have the same content. If they are different, the job entry fails, and the job proceeds with the execution of the transformation that reads the new file. If the files are the same, the job entry succeeds and the flow continues to the DUMMY entry.
In other words, the new file is processed if and only if the File Compare fails, that is, if the two files are different.
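The same decision logic can be sketched outside Kettle. The following Python example is a minimal illustration, not the implementation of the File Compare entry: it compares two files byte by byte with the standard filecmp module and calls a processing function only when they differ. The names new_file_needs_processing, run_job, and process are hypothetical.

```python
import filecmp


def new_file_needs_processing(last_path, new_path):
    """Return True when the two files differ in content."""
    # shallow=False forces a comparison of the file contents instead of
    # trusting the os.stat() signature (type, size, modification time).
    return not filecmp.cmp(last_path, new_path, shallow=False)


def run_job(last_path, new_path, process):
    """Mimic the job: process the new file only if it differs from the last one."""
    if new_file_needs_processing(last_path, new_path):
        # Corresponds to the branch followed when the comparison result is false.
        process(new_path)
        return "processed"
    # Corresponds to the branch that ends in the DUMMY entry.
    return "skipped"
```

For example, run_job("LastMuseumsFileReceived.xml", "NewMuseumsFileReceived.xml", print) would print the name of the new file only when its content differs from the previous version.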
Besides comparing files with Kettle, you can also compare directories; let's see how it works.
If you want to compare the contents of two folders, you can use the Compare folder job entry from the File management category.
In this job entry, you must browse to or type the complete paths of the two folders in the File / Folder name 1 and File / Folder name 2 textboxes respectively, and configure the comparison to be done. See the possible settings in the following screenshot:
The Compare option, set to All by default, can be changed to compare just files, just folders or just the files indicated by a regular expression. The usual requirement would be to compare the list of files and then their sizes.
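As an illustration of that usual requirement, the following Python sketch compares the list of files in two folders and then their sizes, optionally restricted to names matching a regular expression. It is a simplified stand-in under stated assumptions, not the logic of the Compare folder entry: it looks only at files directly inside each folder, it applies the pattern with re.search, and the names list_files and folders_match are hypothetical.

```python
import os
import re


def list_files(folder, pattern=None):
    """Map each file name directly inside folder to its size in bytes."""
    regex = re.compile(pattern) if pattern else None
    sizes = {}
    for entry in os.scandir(folder):
        # Subfolders are skipped; only regular files are compared here.
        if entry.is_file() and (regex is None or regex.search(entry.name)):
            sizes[entry.name] = entry.stat().st_size
    return sizes


def folders_match(folder1, folder2, pattern=None):
    """Return True when both folders hold the same file names with the same sizes."""
    return list_files(folder1, pattern) == list_files(folder2, pattern)
```

Calling folders_match(path1, path2, r"\.xml$") would restrict the comparison to XML files, which is similar in spirit to comparing just the files indicated by a regular expression.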