Comparing files and folders

Kettle allows you to compare files and folders through two job entries: File Compare and Compare folder. In this recipe, you will use the first of these entries, which compares the content of two files. Assume that you periodically receive a file with new museum data to incorporate into your database. You will compare the new version of the file with the previous one. If the files are equal, you do nothing; if they are different, you read the new file.

Getting ready

To create and test this recipe, you will need two files: the older version of the museum file (LastMuseumsFileReceived.xml), and the new file (NewMuseumsFileReceived.xml).

On the book's website, you will find sample files to play with. In particular, NewMuseumsFileReceived(equal).xml is equal to the LastMuseumsFileReceived.xml file, and NewMuseumsFileReceived(different).xml, as implied by its name, is different. With these files, you will be able to test the different situations in the recipe.

How to do it...

Carry out the following steps:

  1. Create a new job, and drop a Start entry into the work area.
  2. Add a File Compare job entry from the File management category. Here you must type or browse to the two files that must be compared, as shown in the following screenshot:
    [Screenshot: File Compare job entry settings]
  3. Add a Transformation job entry and a DUMMY job entry, both from the General category. Create a hop from the File Compare job entry to each of these entries.
  4. Right-click on the hop between the File Compare job entry and the Transformation job entry to show the options, choose the Evaluation item and then select the Follow when result is false item.
  5. Right-click on the hop between the File Compare job entry and the DUMMY job entry, choose the Evaluation item, and this time select the Follow when result is true item.
  6. The job should look like the one shown in the following diagram:
    [Diagram: the finished job]
  7. Then, create a new transformation in order to read the XML file. Drop a Get data from XML step from the Input category onto the canvas and type the complete path for the XML file in the File or directory textbox under the File tab. In this case, it is ${Internal.Transformation.Filename.Directory}\sampleFiles\NewMuseumsFileReceived.xml. Use /museums/museum in the Loop XPath textbox under the Content tab, and use the Get fields button under the Fields tab to populate the list of fields automatically.
  8. Save the transformation.
  9. Configure the Transformation job entry for the main job to run the transformation you just created.
  10. When you run the job, the two files are compared.
  11. Assuming that your files are equal, in the Logging window you will see a line similar to the following:
    2010/11/05 10:08:46 - fileCompare - Finished job entry [DUMMY] (result=[true])
    
  12. This line means that the flow went toward the DUMMY entry.
  13. If your files are different, in the Job metrics window you will see that the fileCompare entry fails, and under the Logging tab, you will see something similar to the following:
    ...
    ... - Read XML file - Loading transformation from XML file [file:///C:/readXMLFile.ktr]
    ... - readXMLFile - Dispatching started for transformation [readXMLFile]
    ... - readXMLFile - This transformation can be replayed with replay date: 2010/11/05 10:14:10
    ... - Read museum data.0 - Finished processing (I=4, O=0, R=0, W=4, U=0, E=0)
    ... - fileCompare - Finished job entry [Read XML file] (result=[true])
    ...
    
  14. This means that the transformation was executed.
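As a rough illustration of what the Loop XPath setting in step 7 does, the following Python sketch walks over every museum element under the museums root and turns each one into a row. It is not how Kettle implements the step. The sample XML and the field names (name, city) are hypothetical, because the real structure of the museum file is not shown here.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: the real museum file may contain other fields.
SAMPLE_XML = """<museums>
  <museum><name>Museum A</name><city>City A</city></museum>
  <museum><name>Museum B</name><city>City B</city></museum>
</museums>"""


def read_museums(xml_text):
    """Return one dict per element matched by /museums/museum."""
    root = ET.fromstring(xml_text)
    if root.tag != "museums":
        raise ValueError("unexpected root element: " + root.tag)
    rows = []
    # The root is <museums>, so "museum" selects its direct children,
    # which is the node set that /museums/museum addresses.
    for museum in root.findall("museum"):
        # Each child element becomes a field, much as Get fields
        # proposes one field per child node.
        rows.append({child.tag: (child.text or "").strip() for child in museum})
    return rows


if __name__ == "__main__":
    for row in read_museums(SAMPLE_XML):
        print(row)
```

Each element matched by the loop path yields one output row, which is why the row count in the log matches the number of museum elements in the file.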

How it works...

The File Compare job entry performs the comparison task. It verifies whether the two files have the same content. If they are different, the job entry fails, and the job follows the hop marked Follow when result is false, executing the transformation that reads the new file. If the files are the same, the job entry succeeds and the flow continues to the DUMMY entry.

In other words, the new file is processed if and only if the File Compare fails, that is, if the two files are different.
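The same decision can be sketched outside Kettle. The following Python fragment is a minimal illustration of the logic, assuming a byte-by-byte comparison. The function names are hypothetical and nothing here reflects how the job entry is implemented internally.

```python
import filecmp
import tempfile
from pathlib import Path


def files_differ(previous, new):
    """Return True when the two files do not have the same content."""
    # shallow=False forces a comparison of the file contents instead of
    # trusting the os.stat() signature (file type, size, modification time).
    return not filecmp.cmp(previous, new, shallow=False)


def run_job(previous, new):
    """Mimic the two hops that leave the File Compare entry."""
    if files_differ(previous, new):
        # Comparison "fails": follow the hop to the transformation.
        return "read new file"
    # Comparison "succeeds": follow the hop to the DUMMY entry.
    return "do nothing"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        last = Path(tmp) / "LastMuseumsFileReceived.xml"
        new = Path(tmp) / "NewMuseumsFileReceived.xml"
        last.write_text("<museums/>")
        new.write_text("<museums><museum/></museums>")
        print(run_job(last, new))
```

Note the inverted sense: as in the job, the interesting branch is the one taken when the comparison reports a difference.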

There's more...

Besides comparing files with Kettle, you can also compare directories; let's see how it works.

Comparing folders

If you want to compare the contents of two folders, you can use the Compare folder job entry from the File management category.

In this job entry, you must browse to or type the complete paths of the two folders in the File / Folder name 1 and File / Folder name 2 textboxes respectively, and configure the comparison to be done. See the possible settings in the following screenshot:

[Screenshot: Compare folder job entry settings]

The Compare option, set to All by default, can be changed to compare just files, just folders or just the files indicated by a regular expression. The usual requirement would be to compare the list of files and then their sizes.

Note

Note that you can even compare the content of the files, but that will affect performance considerably.
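To make the trade-off concrete, here is a minimal Python sketch of a folder comparison that checks the list of files and their sizes first, and reads the file contents only on request. It looks at top-level files only, and it is an illustration under those assumptions rather than a reproduction of the job entry's options or behavior. The function names are hypothetical.

```python
import filecmp
import tempfile
from pathlib import Path


def list_files(folder):
    """Map each top-level file name in folder to its size in bytes."""
    return {p.name: p.stat().st_size for p in Path(folder).iterdir() if p.is_file()}


def folders_match(folder1, folder2, compare_content=False):
    """Compare file names and sizes, and optionally the file contents."""
    files1 = list_files(folder1)
    files2 = list_files(folder2)
    # Cheap check: same set of file names with the same sizes.
    if files1 != files2:
        return False
    if compare_content:
        # Expensive check: every pair of files is read in full.
        return all(
            filecmp.cmp(Path(folder1) / name, Path(folder2) / name, shallow=False)
            for name in files1
        )
    return True
```

Two folders holding files with the same names and sizes but different bytes pass the cheap check and fail the content check, which is the case where the slower comparison pays for itself.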
