This recipe guides you through starting a PDI job packed in an archive file (.zip or .tar.gz) using Kitchen. We will assume that the PDI job to be launched (we will call it the main job) and all the related jobs and transformations called during its execution are stored inside a .zip or .tar.gz archive file on the computer's local filesystem; Kitchen accesses the files directly inside the archive, without unpacking them to a local directory. This practice is useful when we need to move our ETL processes between systems quickly: by packing everything in an archive file, we move just one file instead of a whole tree of files and directories.
To get ready for this recipe, you need to check that the JAVA_HOME environment variable is properly set and then configure your environment variables so that the Kitchen script can start from anywhere without specifying the complete path to your PDI home directory. For details about these checks, refer to the recipe Executing PDI jobs from a filesystem (Simple).
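As a quick sanity check before running the commands in this recipe, the two prerequisites can be verified from the shell. The following is only a minimal sketch for Linux or Mac: check_pdi_env is a hypothetical helper name and PDI_HOME is a hypothetical variable standing for your PDI home directory; neither is provided by PDI.

```shell
#!/bin/sh
# check_pdi_env is a hypothetical helper, not part of PDI.
# It returns 0 only if JAVA_HOME is set and kitchen.sh can be found on PATH.
check_pdi_env() {
    status=0
    if [ -z "${JAVA_HOME:-}" ]; then
        echo "JAVA_HOME is not set" >&2
        status=1
    fi
    if ! command -v kitchen.sh >/dev/null 2>&1; then
        # PDI_HOME is an assumed name for the PDI installation directory.
        echo "kitchen.sh is not on PATH; try: export PATH=\"\$PATH:\$PDI_HOME\"" >&2
        status=1
    fi
    return $status
}
```

Calling check_pdi_env && echo "environment looks ready" prints the confirmation only when both checks pass.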
To play with this recipe, you can use the samples in the directory <book_samples>/sample2; here, <book_samples> is the directory where you unpacked all the samples of the book.
For starting a PDI job from within a .zip or .tar.gz archive file in Linux or Mac, you can perform the following steps:
1. Open a command-line window and go to the <book_samples>/sample2 directory.

2. To access files contained in a .zip archive, the syntax to be used for the URI is as follows:

zip://arch-file-uri[!absolute-path]

On the other hand, if we want to access files contained in a .tar.gz file, we need to use the following syntax:

tgz://arch-file-uri[!absolute-path]

3. To start the job, we use the -file argument; its value must now be a URI with the syntax we saw in the preceding step:

-file:<complete_URI_to_job_file>

Remember that because we are talking about a URI to the file, we always need to give the absolute path to the archive file, followed by the path (inside the archive) to the file we are going to start as a job.

4. To start the sample job export-job1.kjb from within a .zip archive, use the following syntax:

$ kitchen.sh -file:'zip:///home/sramazzina/tmp/samples/samples.zip!export-job1.kjb'

The following syntax is for the same job started with parameters:

$ kitchen.sh -param:p_country=USA -file:'zip:///home/sramazzina/tmp/samples/samples.zip!export-job1.kjb'

5. The same applies to a .tar.gz archive, but with a different filesystem type in the URI. Here is a sample of a simple job that has been started without parameters:

$ kitchen.sh -file:'tgz:///home/sramazzina/tmp/samples/samples.tar.gz!export-job1.kjb'

The following syntax is for a simple job that has been started with parameters:

$ kitchen.sh -param:p_country=USA -file:'tgz:///home/sramazzina/tmp/samples/samples.tar.gz!export-job1.kjb'

6. In this case, the -dir argument does not make any sense because we always need to access the archive through its complete URI and the -file argument.

For starting a PDI job from within a .zip archive file in Windows, perform the following steps:
1. In Windows, arguments are introduced by the / character instead of the - character we used in Linux or Mac. Therefore, the -file argument will change from:

-file:<complete_URI_to_job_file>

To:

/file:<complete_URI_to_job_file>

2. Go to the directory <book_samples>/sample2; to start your sample job from within the ZIP archive, you can start the Kitchen script using the following syntax:

C:\temp\samples>Kitchen.bat /file:'zip:///home/sramazzina/tmp/samples/samples.zip!export-job1.kjb'

3. From the directory <book_samples>/sample2, to start the job by extracting all the customers for the country USA, you can use the following syntax:

C:\temp\samples>Kitchen.bat /param:p_country=USA /file:'zip:///home/sramazzina/tmp/samples/samples.zip!export-job1.kjb'
For starting PDI transformations from within archive files, perform the following steps:
1. In Linux or Mac, use the pan.sh script in the PDI home directory. To start a simple transformation from within an archive file, go to the <book_samples>/sample2 directory and type the following command:

$ pan.sh -file:'tgz:///home/sramazzina/tmp/samples/samples.tar.gz!read-customers1.ktr'

Or, if you need to specify some parameters, type the following command:

$ pan.sh -param:p_country=USA -file:'tgz:///home/sramazzina/tmp/samples/samples.tar.gz!read-customers1.ktr'

2. In Windows, use the Pan.bat script; the sample commands to start our transformation are as follows:

C:\temp\samples>Pan.bat /file:'zip:///home/sramazzina/tmp/samples/samples.zip!read-customers1.ktr'

Or, if you need to specify some parameters through the command line, type the following command:

C:\temp\samples>Pan.bat /param:p_country=USA /file:'zip:///home/sramazzina/tmp/samples/samples.zip!read-customers1.ktr'
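The rules used in the Linux or Mac commands of this recipe (zip:// for .zip archives, tgz:// for .tar.gz archives, an absolute path to the archive, and Kitchen for jobs versus Pan for transformations) can be collected in a small shell helper. This is only an illustrative sketch: archive_command is a hypothetical name, not part of PDI, and it merely prints the command line instead of running it.

```shell
#!/bin/sh
# archive_command is a hypothetical helper; it only prints the command line.
archive_command() {
    archive=$1   # absolute path to the archive file
    entry=$2     # job or transformation file inside the archive
    shift 2      # any remaining arguments are passed through unchanged

    # The URI needs the absolute path to the archive.
    case "$archive" in
        /*) ;;
        *) echo "archive path must be absolute: $archive" >&2; return 1 ;;
    esac
    # Pick the filesystem type from the archive extension.
    case "$archive" in
        *.zip)    scheme=zip ;;
        *.tar.gz) scheme=tgz ;;
        *) echo "unsupported archive type: $archive" >&2; return 1 ;;
    esac
    # Jobs (.kjb) are started with Kitchen, transformations (.ktr) with Pan.
    case "$entry" in
        *.kjb) tool=kitchen.sh ;;
        *.ktr) tool=pan.sh ;;
        *) echo "not a job or transformation: $entry" >&2; return 1 ;;
    esac

    uri="$scheme://$archive"'!'"$entry"
    printf '%s' "$tool"
    for arg in "$@"; do printf ' %s' "$arg"; done
    printf " -file:'%s'\n" "$uri"
}

archive_command /home/sramazzina/tmp/samples/samples.zip export-job1.kjb -param:p_country=USA
archive_command /home/sramazzina/tmp/samples/samples.tar.gz read-customers1.ktr
```

Printing the command rather than executing it makes it easy to inspect the URI before launching anything; once the output looks right, it can be pasted into the terminal.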
This way of starting jobs and transformations is possible because PDI uses the Apache VFS library to accomplish this task. The Apache VFS library is a piece of software that lets you directly access files from within several types of archive (zip and gzipped tar among them) by exposing them through a virtual filesystem using an appropriate set of APIs. You can find more details about the library and how it works on the Apache website at http://commons.apache.org/proper/commons-vfs.
Using jobs and transformations from within archive files slightly changes the way we design jobs and transformations. Another interesting consideration is that you can directly reference resource files packed together with your ETL process file. This lets you distribute configuration files or other kinds of resources in a uniform way. This approach could be an easy way to have a single file containing anything needed by our ETL process, making everything more portable and easier to manage. The following paragraph details the main changes applied to this new version of our sample.
When jobs or transformations are used from inside an archive, file paths are relative to the root of the archive, and the internal variable ${Internal.Job.Filename.Directory} does not make any sense. Because of this, we need to change the way our example process links any kind of file.
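Since everything is resolved from the root of the archive, it can help to list the archive's entries and confirm that the job and transformation files really sit at the top level. The sketch below uses the standard unzip and tar tools; list_archive_entries is a hypothetical helper name, not something shipped with PDI.

```shell
#!/bin/sh
# list_archive_entries is a hypothetical helper, not part of PDI.
# It prints the entries stored in a .zip or .tar.gz archive.
list_archive_entries() {
    case "$1" in
        *.zip)    unzip -l "$1" ;;    # table of entries with sizes and dates
        *.tar.gz) tar -tzf "$1" ;;    # one entry path per line
        *) echo "unsupported archive type: $1" >&2; return 1 ;;
    esac
}
```

For example, list_archive_entries /home/sramazzina/tmp/samples/samples.tar.gz should show export-job1.kjb without any leading directory if the file is at the archive root.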
Look at the samples located in the directory <book_samples>/sample2; this directory contains the same transformations and jobs, but they had to undergo some major changes to work in this case. The changes are as follows:
- The job no longer uses ${Internal.Job.Filename.Directory} to dynamically obtain the path to the job file. This is because, inside the archive file, the transformation is in the root of the virtual filesystem, so the filename alone is enough for this purpose.

- Instead of using the ${Internal.Job.Filename.Directory} variable to specify the input and output paths for the files, we added two new parameters, p_input_directory and p_target_directory, to let the user specify the input directory and the output directory. If no value is specified for these parameters, a default value is used that is local to the directory where the job starts.
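To illustrate how the two new parameters might be supplied, the following sketch prints a Kitchen command line that sets p_input_directory and p_target_directory alongside p_country. The input and output directories are hypothetical placeholders, and job_command is a made-up helper name that only prints the command so it can be reviewed first.

```shell
#!/bin/sh
# Hypothetical locations; replace them with paths that exist on your system.
ARCHIVE=/home/sramazzina/tmp/samples/samples.zip
INPUT_DIR=/home/sramazzina/tmp/input
TARGET_DIR=/home/sramazzina/tmp/output

# Print (without running) a Kitchen command that sets both directory parameters.
job_command() {
    country=$1
    uri='zip://'"$ARCHIVE"'!export-job1.kjb'
    printf '%s\n' "kitchen.sh -param:p_country=$country -param:p_input_directory=$INPUT_DIR -param:p_target_directory=$TARGET_DIR -file:'$uri'"
}

job_command USA
```

Leaving out either -param option lets the job fall back to its default directory, as described above.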