This recipe guides you through creating a simple PDI job using Spoon, the graphical development environment. In a PDI process, jobs orchestrate other jobs and transformations in a coordinated way to implement a business process. This simple job uses the transformation created in the previous recipe; we will reuse it throughout this book's recipes, wherever necessary, to play with the PDI command-line tools.
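As a first taste of that command-line use, the finished job can later be launched outside Spoon with Kitchen, PDI's command-line job runner. The sketch below only assembles the Kitchen command line; the parameter value Italy and the logging level are illustrative choices, not part of the recipe:

```python
# Sketch: the Kitchen command line that runs the job built in this recipe.
# kitchen.sh, -file, -param: and -level are standard PDI Kitchen options;
# the value "Italy" for p_country is just an example.
cmd = [
    "kitchen.sh",                 # use Kitchen.bat on Windows
    "-file=export-job.kjb",       # the job file saved in this recipe
    "-param:p_country=Italy",     # value for the job's named parameter
    "-level=Basic",               # logging verbosity
]
print(" ".join(cmd))
```

The -param:name=value syntax is how Kitchen (and Pan, its transformation counterpart) passes values to named parameters such as p_country.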
To get ready for this recipe, check that the JAVA_HOME environment variable is set properly and then start the Spoon script. For more information on this, see the first recipe, Designing a sample PDI transformation (Simple).
Create the job with the following key settings:
- A parameter named p_country, matching the parameter of the transformation.
- The Excel output file: ${Internal.Transformation.Filename.Directory}/selected_country_customers.xls
- The linked transformation: ${Internal.Job.Filename.Directory}/read-customers.ktr
- Save the job as export-job.kjb, in the same directory where you previously saved the transformation.

A Pentaho ETL process is generally created by a set of jobs and transformations.
Transformations are workflows whose role is to perform actions on a flow of data, typically by applying a set of basic action steps to it. A transformation is made up of three kinds of steps:
Input steps take data from external sources and bring it into the transformation. Examples are steps that read text files, database tables, or Excel spreadsheets.
Transformation steps apply elementary business rules to the flow of data; composing these elementary steps into an organized flow of operations represents a process. Examples are steps that filter rows, sort rows, look up values, or calculate new fields.
Output steps send the data from the flow to external targets, such as databases, files, or web services. Transformations therefore act as a sort of unit of work within an entire ETL process: the more atomic and focused a transformation is, the more easily we can reuse it in other ETL processes.
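The input, transformation, and output roles described above can be sketched in plain Python. This is a toy stand-in for PDI steps, not how PDI is implemented, and the customer rows are invented sample data:

```python
# Toy model of a transformation: an input step, a filter step, an output step.
customers = [                       # input step: rows read from a text file
    {"name": "Acme", "country": "Italy"},
    {"name": "Globex", "country": "France"},
    {"name": "Initech", "country": "Italy"},
]

def filter_by_country(rows, country):
    # transformation step: keep only rows matching the p_country parameter
    return [row for row in rows if row["country"] == country]

selected = filter_by_country(customers, "Italy")

# output step: here we just print; PDI would write an Excel file instead
for row in selected:
    print(row["name"])
```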
Jobs are workflows whose role is to orchestrate the execution of a set of tasks: they synchronize and prioritize tasks, and decide the order of execution based on the success or failure of the current task. These tasks either prepare the execution environment for the tasks that follow them in the workflow, or manage the artifacts produced by the tasks that precede them. For example, there are tasks that manipulate files and directories in the local filesystem, tasks that move files between remote servers through FTP or SSH, and tasks that check the availability or the content of a database table. Any job can call other jobs or transformations to build more complex processes. Generally speaking, then, jobs orchestrate jobs and transformations into larger ETL processes.
In our case, we have a very simple example: a job and a transformation that support the experiments in this book's recipes. The transformation reads a set of customers by country from a text file, filters the dataflow by country, and writes the result to an Excel file. The filter step uses a parameter whose value you set when you start the job. The job first checks whether the previous output file exists and, if so, deletes it; it then calls the transformation for a new extraction. The job also has failure paths to manage any error condition that could occur while processing the tasks; each failure path ends with a task that aborts the job, marking it as failed.
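The job logic just described (check for the old file, delete it, run the transformation, abort on failure) amounts to ordinary control flow. A minimal sketch, in which the transformation is any callable and the function name run_job is invented for illustration:

```python
import os

def run_job(output_file, transformation, country):
    """Toy sketch of the job. `transformation` is any callable that
    stands in for the linked .ktr file."""
    if os.path.exists(output_file):   # task: check if the previous file exists
        os.remove(output_file)        # task: delete it before a new extraction
    try:
        transformation(country)       # task: execute the transformation
        return "success"
    except Exception as err:          # failure path: abort, mark job as failed
        return f"failed: {err}"

# Example run with a do-nothing transformation; the file name is illustrative.
print(run_job("selected_country_customers.xls", lambda c: None, "Italy"))
```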
Every time we design a job or a transformation, a few basic rules help make things portable between different systems and, ideally, self-describing. Using internal variables and a proper naming scheme for your job tasks and transformation steps are good rules of thumb. Finally, a brief recap of the color and icon indicators implicitly present in the design of your process is also a good exercise: they help you quickly understand how information flows inside your transformations and how the set of operations executes inside a job.
Each task in a job and each step in a transformation has a set of properties that configure its expected behavior; one of these properties is its name. Giving tasks and steps appropriate names is important because it makes our transformations and jobs more readable, and this becomes more valuable as the process grows. Documentation is usually seen as an unpleasant chore, but documenting processes is the only way to remember, in the long term, what we built and why. Naming our components well can be considered documentation in itself, at least for insiders, and that is often good enough.
When writing our sample job and transformation, we used internal variables to set the paths of the files we read or write and the path of the transformation file we linked. This is important because it makes our transformations and jobs location-independent, so we can move them between servers without any pain.
Kettle has two internal variables for this that you can access whenever required. Pressing Ctrl + Space inside a field activates the variables inspector, which helps you find the right variable name without a struggle. Going back to our problem of building a location-independent path, the two variables are ${Internal.Transformation.Filename.Directory}, which resolves to the directory containing the current transformation file, and ${Internal.Job.Filename.Directory}, which resolves to the directory containing the current job file.
The important thing about these two variables is that PDI resolves them dynamically at runtime. So whenever you refer to a file, if you build its path relative to one of these two variables (depending on the case), you obtain location-independent processes that you can move around without any pain.
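The runtime substitution PDI performs can be imitated in a few lines of Python. This is only a sketch of the idea; the directory value below is an example, whereas PDI computes it from the file's actual location:

```python
import re

def resolve(path, variables):
    # Replace each ${Name} token with its value, as PDI does at runtime.
    return re.sub(r"\$\{([^}]+)\}", lambda m: variables[m.group(1)], path)

# The same job file resolves to a different path on each server it runs on.
vars_on_server = {"Internal.Job.Filename.Directory": "/opt/etl/jobs"}
print(resolve("${Internal.Job.Filename.Directory}/read-customers.ktr",
              vars_on_server))
# → /opt/etl/jobs/read-customers.ktr
```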