In Kettle, you have a lot of functionality provided by the built-in steps, but if that is not enough for you, there is a step named User Defined Java Class (UDJC for short) where you can program custom functionality with Java code. In this way, you can accomplish complex tasks, access Java libraries, and even access the Kettle API. The code you type into this step is compiled once and executed at runtime for each passing row.
Let's create a simple example of the use of the UDJC step. Assume that you have a text file containing sentences; you want to count the words in each row and split the flow of data into two streams depending on the number of words per sentence.
Note that in order to develop a more interesting exercise, we added some extra considerations, as follows:
You need a text file containing sentences, for example:
This is a sample text.
Another text with special characters, , ,
END OF FILE
hello,man
I wish you a happy new year:2011
The,last.but,not;the.least
You can download the sample file from the book's website.
Carry out the following steps:
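1. Create a transformation and add a Text file input step that reads the sample file, for example ${Internal.Transformation.Filename.Directory}samplefile.txt.
2. In the settings of that step, type ##### in the Separator textbox.
3. Define a single field named line of String type.
4. Add a UDJC step and two Dummy steps. Name the Dummy steps short sentences and long sentences, and link the steps with hops.
5. In the UDJC step, under the Fields tab, add a new field named qty. Select Integer in the Type listbox.
6. Under the Target steps tab, create two tags named shortOnes and longOnes. Then, select the steps as shown in the following screenshot:
7. In the editing area, type the following code, which contains the main function processRow():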
private RowSet shortSentences = null;
private RowSet longSentences = null;
public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException {
    Object[] r = getRow();
    if (r == null) {
        setOutputDone();
        return false;
    }
    if (first) {
        first = false;
        shortSentences = findTargetRowSet("shortOnes");
        longSentences = findTargetRowSet("longOnes");
    }
    r = createOutputRow(r, data.outputRowMeta.size());
    String linein;
    linein = get(Fields.In, "line").getString(r);
    long len = linein.length();
    long qty = 0;
    boolean currentSpecialChar = false;
    for (int i = 0; i < len; i++) {
        char ch = linein.charAt(i);
        switch (ch) {
            case ',':
            case '.':
            case ' ':
            case ';':
            case ':':
                if (!currentSpecialChar) qty++;
                currentSpecialChar = true;
                break;
            default:
                currentSpecialChar = false;
                break;
        }
    }
    if (!currentSpecialChar) qty++;
    get(Fields.Out, "qty").setValue(r, qty);
    if (qty < 7) {
        putRowTo(data.outputRowMeta, r, shortSentences);
    } else {
        putRowTo(data.outputRowMeta, r, longSentences);
    }
    return true;
}
The transformation adds a field named qty with the number of words in each sentence. The results for the file used as an example would be:
Each row is sent to the short sentences step or the long sentences step depending on the field qty. You can preview both steps and verify that the sentences with 7 or more words flow to the long sentences step, and those with fewer words flow to the short sentences step.
The first step in the recipe has the task of reading the text file. You have used ##### as the separator because that string is not present in the file. This ensures that the field line will contain the entire line of your file.
Now, you need to develop some custom code that counts the words in the line field. This task can't be accomplished using standard Kettle steps, so you have programmed the necessary functionality in Java code inside a UDJC step.
Let's explore the dialog for the UDJC step, which is shown in the following screenshot:
Most of the window is occupied by the editing area. Here you write the Java code using the standard syntax of that language. On the left, there is a panel with many code fragments ready to use (Code Snippets), and a section with sets and gets for the input and output fields. To add a code fragment to your script either double-click on it, drag it to the location in your script where you wish to use it, or just type it in the editing area.
The Input and Output fields appear automatically in the tree when the Java code compiles correctly; you can use the Test class button to verify that the code is properly written. If an error occurs, you will see an error window; otherwise, you will be able to preview the results of the transformation step.
The Fields tab at the bottom is where you declare the new fields added by the step. In this case, you declared the field qty of Integer type. This field is the output field that holds the word count.
Under the Target steps tab, you can declare the steps where the rows will be redirected. In the recipe, you pointed to the two target Dummy steps; one for the short sentences, and the other for the long ones.
Let's see the Java code in detail:
At the beginning, there is a section with the variable declarations:
private RowSet shortSentences = null;
private RowSet longSentences = null;
These variables represent the two possible target steps. They are declared at the beginning, outside processRow(), so that they keep their values from one row to the next.
Then, you have the main function:
public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException {
    Object[] r = getRow();
    if (r == null) {
        setOutputDone();
        return false;
    }
The processRow() function processes a new row.
The getRow() function gets the next row from the input steps. It returns an object array with the incoming row. A null value means that there are no more rows for processing.
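The way processRow() and getRow() cooperate can be pictured with a small stand-alone simulation. This is only a sketch in plain Java, not Kettle code: the class MiniStep, its input queue, and the run() method are hypothetical names that stand in for the step, its incoming rows, and the engine loop that keeps calling processRow() until it returns false.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical stand-in for a step; it is NOT part of the Kettle API.
final class MiniStep {
    private final Deque<Object[]> input = new ArrayDeque<>();
    final List<Object[]> output = new ArrayList<>();
    boolean outputDone = false;

    MiniStep(List<Object[]> rows) {
        input.addAll(rows);
    }

    // Mimics getRow(): returns the next row, or null when the input is exhausted.
    private Object[] getRow() {
        return input.poll();
    }

    // Mimics processRow(): handles one row per call and returns false to stop.
    boolean processRow() {
        Object[] r = getRow();
        if (r == null) {
            outputDone = true; // plays the role of setOutputDone()
            return false;
        }
        output.add(r);
        return true;
    }

    // Plays the role of the engine: call processRow() until it returns false.
    static MiniStep run(List<Object[]> rows) {
        MiniStep step = new MiniStep(rows);
        while (step.processRow()) {
            // each call consumes exactly one row
        }
        return step;
    }

    public static void main(String[] args) {
        List<Object[]> rows = new ArrayList<>();
        rows.add(new Object[] {"hello,man"});
        rows.add(new Object[] {"END OF FILE"});
        MiniStep step = run(rows);
        System.out.println(step.output.size() + " rows processed, done=" + step.outputDone);
    }
}
```

The point of the sketch is the return value: true means "call me again", and false, returned after signalling that the output is done, ends the loop.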
The following code only executes for the first row:
    if (first) {
        first = false;
        shortSentences = findTargetRowSet("shortOnes");
        longSentences = findTargetRowSet("longOnes");
    }
You can use the flag first to prepare a proper environment before processing the rows. In this case, you set the target steps into two variables for further use.
The next code uses the get() method to set the internal variable linein with the value of the line field.
    r = createOutputRow(r, data.outputRowMeta.size());
    String linein;
    linein = get(Fields.In, "line").getString(r);
Here is the main cycle:
    long len = linein.length();
    long qty = 0;
    boolean currentSpecialChar = false;
    for (int i = 0; i < len; i++) {
        char ch = linein.charAt(i);
        switch (ch) {
            case ',':
            case '.':
            case ' ':
            case ';':
            case ':':
                if (!currentSpecialChar) qty++;
                currentSpecialChar = true;
                break;
            default:
                currentSpecialChar = false;
                break;
        }
    }
It scans the entire sentence looking for characters used as separators. When it finds a separator that does not immediately follow another separator, it increments the qty variable by one. In either case, it sets the flag currentSpecialChar to true, so that the value is not incremented again if the next character is also a separator.
The next line is to count the last word of the sentence only if it wasn't counted in the main cycle:
if (!currentSpecialChar) qty++;
Here we set the new field named qty with the value of the internal variable qty, which has the word count:
get(Fields.Out, "qty").setValue(r, qty);
Finally, if the word count is lower than 7 (arbitrary value), then the row will be passed to the short sentences step; otherwise the target will be the long sentences step.
    if (qty < 7) {
        putRowTo(data.outputRowMeta, r, shortSentences);
    } else {
        putRowTo(data.outputRowMeta, r, longSentences);
    }
    return true;
}
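To experiment with the counting logic outside Kettle, the same loop can be lifted into a plain Java class. The following is a minimal sketch: the class name WordCounter and the helpers countWords() and isShort() are illustrative names that do not appear in the recipe, and the threshold of 7 is the same arbitrary value used above.

```java
// Stand-alone sketch of the recipe's counting logic; names are illustrative.
final class WordCounter {
    // Same arbitrary threshold as in the recipe.
    static final long THRESHOLD = 7;

    // Counts words, treating ',', '.', ' ', ';' and ':' as separators.
    static long countWords(String line) {
        long qty = 0;
        boolean currentSpecialChar = false;
        for (int i = 0; i < line.length(); i++) {
            switch (line.charAt(i)) {
                case ',':
                case '.':
                case ' ':
                case ';':
                case ':':
                    // Count only on the first separator of a run.
                    if (!currentSpecialChar) qty++;
                    currentSpecialChar = true;
                    break;
                default:
                    currentSpecialChar = false;
                    break;
            }
        }
        // Count the trailing word when the line does not end with a separator.
        if (!currentSpecialChar) qty++;
        return qty;
    }

    // Mirrors the routing decision: true means the "short sentences" target.
    static boolean isShort(String line) {
        return countWords(line) < THRESHOLD;
    }

    public static void main(String[] args) {
        String[] lines = {
            "This is a sample text.",
            "hello,man",
            "I wish you a happy new year:2011"
        };
        for (String line : lines) {
            String target = isShort(line) ? "short sentences" : "long sentences";
            System.out.println(countWords(line) + " -> " + target + ": " + line);
        }
    }
}
```

One consequence of this logic worth knowing: a line that ends without a separator always gets a final increment, so an empty line is counted as one word.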
To learn about the syntax of the Java language, visit the following URL:
http://download.oracle.com/javase/tutorial/
As mentioned earlier, you can access the Kettle API from inside the UDJC code. To learn the details of the API, you should check the source. For instructions on getting the code, follow this link: http://community.pentaho.com/getthecode/.
Let's see some more information to take advantage of this very useful step:
The code you type inside the UDJC step is pure Java. Therefore, the fields of your transformation will be seen as Java objects according to the following equivalence table:
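Data type in Kettle | Java Class |
---|---|
String | java.lang.String |
Integer | java.lang.Long |
Number | java.lang.Double |
Date | java.util.Date |
BigNumber | java.math.BigDecimal |
Binary | byte[] |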
The opposite occurs when you create an object inside the Java code and want to expose it as a new field to your transformation. For example, in the Java code, you defined the variable qty as long, but under the Fields tab, you defined the new output field as Integer.
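The practical consequence is that a value read from a Kettle Integer field has to be handled as a java.lang.Long, not a java.lang.Integer. The sketch below illustrates this with a hand-built Object[] standing in for a row; the class RowTypesDemo and the row layout are hypothetical, and inside the real step you would normally read values through the get(Fields.In, ...) helpers rather than by casting.

```java
import java.math.BigDecimal;
import java.util.Date;

// Illustrative only: a hand-built row standing in for what a step receives.
final class RowTypesDemo {
    // Each position holds the Java class that corresponds to a Kettle data type.
    static Object[] sampleRow() {
        return new Object[] {
            "hello,man",                 // String    -> java.lang.String
            Long.valueOf(2L),            // Integer   -> java.lang.Long
            Double.valueOf(2.5),         // Number    -> java.lang.Double
            new Date(0L),                // Date      -> java.util.Date
            new BigDecimal("12345.678"), // BigNumber -> java.math.BigDecimal
            new byte[] {1, 2, 3}         // Binary    -> byte[]
        };
    }

    public static void main(String[] args) {
        Object[] row = sampleRow();
        // A Kettle Integer is read as a Long; a cast to Integer would fail at runtime.
        long qty = (Long) row[1];
        System.out.println("qty = " + qty);
        for (Object value : row) {
            System.out.println(value.getClass().getSimpleName());
        }
    }
}
```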
You can generalize your code by using parameters. You can add parameters and their values using the grid located under the Parameters tab at the bottom of the UDJC window.
In our example, you could have defined a parameter named qtyParameter with the value 7. Then, in the Java code, you would have obtained this value with the following line of code. Because getParameter() returns the value as text, it is converted to a number:
long qty = Long.parseLong(getParameter("qtyParameter"));
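Since parameter values are typed into a grid, a defensive conversion may be worth considering. The following is a minimal sketch, assuming the raw value arrives as a String that could be missing or malformed; the class ThresholdParam and the helper parseThreshold() are hypothetical and not part of Kettle.

```java
// Hypothetical helper for turning a raw parameter value into a number.
final class ThresholdParam {
    // Returns the parsed value, or defaultValue when raw is null or not a valid long.
    static long parseThreshold(String raw, long defaultValue) {
        if (raw == null) {
            return defaultValue;
        }
        try {
            return Long.parseLong(raw.trim());
        } catch (NumberFormatException e) {
            return defaultValue;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseThreshold("7", 10));
        System.out.println(parseThreshold(" 12 ", 10));
        System.out.println(parseThreshold("seven", 10));
    }
}
```

Inside the step, the first argument would be whatever the parameter lookup returns, and the second the value you want to fall back on.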
You can also have additional steps that provide information to be read inside your Java code. They are called Info steps. You declare them in the grid located under the Info step tab at the bottom of the UDJC window.
In our recipe, suppose that you have the list of separators defined in a Data Grid step. In order to pick the separators from that list, you have to create a hop from the Data Grid towards the UDJC step and fill the Info step grid. You must provide a Tag name (for example, charsList) and select the name of the incoming step. Then, in the Java code, you can use the findInfoRowSet() method to reference the info step, and the getRowFrom() method to read the rows in a cycle. Check the code:
// A name other than data is used so that the step's own data object is not hidden
RowSet charsRowSet = findInfoRowSet("charsList");
Object[] dataRow = null;
while ((dataRow = getRowFrom(charsRowSet)) != null) {
    // Do something
}
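What the "Do something" part might look like can be sketched outside Kettle. In this stand-alone example, an Iterator over Object[] rows stands in for the info row set, and each row is assumed to carry one separator as a String in its first position; the class SeparatorList and both helper methods are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Stand-alone sketch: collects separators from rows, then counts words with them.
final class SeparatorList {
    // Assumes the first field of each row is a String holding one separator character.
    static Set<Character> readSeparators(Iterator<Object[]> infoRows) {
        Set<Character> separators = new HashSet<>();
        while (infoRows.hasNext()) {
            Object[] row = infoRows.next();
            String value = (String) row[0];
            if (value != null && !value.isEmpty()) {
                separators.add(value.charAt(0));
            }
        }
        return separators;
    }

    // Same counting loop as the recipe, with a configurable separator set.
    static long countWords(String line, Set<Character> separators) {
        long qty = 0;
        boolean currentSpecialChar = false;
        for (int i = 0; i < line.length(); i++) {
            if (separators.contains(line.charAt(i))) {
                if (!currentSpecialChar) qty++;
                currentSpecialChar = true;
            } else {
                currentSpecialChar = false;
            }
        }
        if (!currentSpecialChar) qty++;
        return qty;
    }

    public static void main(String[] args) {
        List<Object[]> rows = new ArrayList<>();
        rows.add(new Object[] {","});
        rows.add(new Object[] {" "});
        Set<Character> separators = readSeparators(rows.iterator());
        System.out.println(countWords("hello,man", separators));
    }
}
```

In the step itself, the separator set would be filled once, typically inside the if (first) block, and then reused for every row.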
You can add your custom messages for different levels of logging very easily. You can select the necessary code fragment from the Step logging node in the Code Snippets folder, or just type the method in the editing area. For example:
if (qty > 10) logBasic("Long sentence found!");
As an alternative to the UDJC step, there is another step named User Defined Java Expression (UDJE), also in the Scripting category. This step allows you to create new fields in an easy way by typing Java expressions. It doesn't replace the functionality of the UDJC step, but it is more practical when the task you have to accomplish is simple.
For examples on how to use this step, browse the different recipes in the book. There are several examples that use the UDJE step.
If you are more familiar with the JavaScript language, you could use the Modified Java Script Value (MJSV) step instead of UDJC. Take into account that the code in the JavaScript step is interpreted, whereas the UDJC code is compiled; this means that a transformation that uses the UDJC step will have much better performance.
The UI for the MJSV step is very similar to the UI for the UDJC step; there is a main area to write the JavaScript code, a left panel with many functions as snippets, the input fields coming from the previous step, and the output fields.
You can learn more about JavaScript here: http://www.w3schools.com/js. As an additional resource, you can get Pentaho 3.2 Data Integration: Beginner's Guide, María Carina Roldán, Packt Publishing. There is a complete chapter devoted to the use of the Modified Java Script Value step.
Finally, if you prefer scripting to Java programming, there is a new Kettle plugin named Ruby Scripting developed by Slawomir Chodnicki, one of the most active Kettle contributors. As the name suggests, the step allows you to develop custom functionality by typing Ruby code. The UI for the Ruby plugin is very similar to the UI for the UDJC step. In this case, you don't have snippets, but you have many sample transformations that demonstrate the capabilities of the plugin. Along with the samples, you have a couple of links to Ruby resources on the web. The plugin is available at the following URL: