Programming custom functionality

In Kettle, you have a lot of functionality provided by the built-in steps, but if that is not enough for you, there is a step named User Defined Java Class (UDJC for short) where you can program custom functionality with Java code. In this way, you can accomplish complex tasks, access Java libraries, and even access the Kettle API. The code you type into this step is compiled once and executed at runtime for each passing row.

Let's create a simple example of the use of the UDJC step. Assume that you have a text file containing sentences; you want to count the words in each row and split the flow of data into two streams depending on the number of words per sentence.

Note that in order to develop a more interesting exercise, we added some extra considerations, as follows:

  • There are several characters as separators, not only the blank spaces
  • Sometimes, you can have a sequence of separators together
  • Some sentences have a special character at the end, and some don't

Getting ready

You need a text file containing sentences, for example:

This is a sample text.
Another text with special characters, , , END OF FILE
hello,man
I wish you a happy new year:2011
The,last.but,not;the.least

You can download the sample file from the book's website.

How to do it...

Carry out the following steps:

  1. Create a new transformation.
  2. Drop a Text file input step from the Input category. Browse to your file under the Files tab, and press the Add button to populate the selected files grid. For example: ${Internal.Transformation.Filename.Directory}/samplefile.txt.
  3. Under the Content tab, uncheck the Header checkbox, and type ##### in the Separator textbox.
  4. Under the Fields tab, add a field named line of String type.
  5. Add a UDJC step from the Scripting category. Also, drop two Dummy steps from the Flow category and name them: short sentences step and long sentences step.
  6. Create the hops between the steps as per the ones shown in the following diagram:
  7. Double-click on the UDJC step.
  8. Under the Fields tab of the bottom grid, add a new field named qty. Select Integer in the Type listbox.
  9. Under the Target steps tab, create two tags and name them: shortOnes and longOnes. Then, select the steps as shown in the following screenshot:
  10. In the Classes and code fragments section on the left, open the Code Snippets folder. Expand the Common use option and drop the Main item to the editing area on the right. A fragment of Java code will be written for the function processRow().
  11. Replace or complete this fragment with the following code:
    private RowSet shortSentences = null;
    private RowSet longSentences = null;

    public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
    {
        // Fetch the next incoming row; null means there is no more input.
        Object[] r = getRow();
        if (r == null) {
            setOutputDone();
            return false;
        }

        // Resolve the two target streams only once, on the first row.
        if (first) {
            first = false;
            shortSentences = findTargetRowSet("shortOnes");
            longSentences = findTargetRowSet("longOnes");
        }

        r = createOutputRow(r, data.outputRowMeta.size());
        String linein;
        linein = get(Fields.In, "line").getString(r);

        long len = linein.length();
        long qty = 0;
        boolean currentSpecialChar = false;
        for (int i = 0; i < len; i++) {
            char ch = linein.charAt(i);
            switch (ch) {
                case ',':
                case '.':
                case ' ':
                case ';':
                case ':':
                    // Count a word only at the first separator of a run.
                    if (!currentSpecialChar) qty++;
                    currentSpecialChar = true;
                    break;
                default:
                    currentSpecialChar = false;
                    break;
            }
        }
        // Count the last word if the line did not end with a separator.
        if (!currentSpecialChar) qty++;

        get(Fields.Out, "qty").setValue(r, qty);

        // Route the row according to the word count.
        if (qty < 7) {
            putRowTo(data.outputRowMeta, r, shortSentences);
        }
        else {
            putRowTo(data.outputRowMeta, r, longSentences);
        }
        return true;
    }
    

    Note

    The code snippet added with the Main item creates the output row with this line:

    r = createOutputRow(r, outputRowSize);
    

    This line must be replaced with the following code to compile correctly:

    r = createOutputRow(r, data.outputRowMeta.size());
    
  12. This code adds the desired functionality. If you preview this step, then you will obtain a new field named qty with the number of words in each sentence. The results for the file used as an example would be:
  13. This UDJC step also redirects the rows to the short sentences step or the long sentences step depending on the field qty. You can preview both steps and verify that the sentences with 7 or more words flow to the long sentences step, and those with fewer words flow to the short sentences step.

How it works...

The first step in the recipe has the task of reading the text file. You have used ##### as separator characters because that string is not present in the file. This assures you that the field line will contain the entire line of your file.

Now, you need to develop some custom code that counts the words in the line field. This task can't be accomplished using standard Kettle steps, so you have programmed the necessary functionality in Java code inside a UDJC step.

Let's explore the dialog for the UDJC step, which is shown in the following screenshot:


Most of the window is occupied by the editing area. Here you write the Java code using the standard syntax of that language. On the left, there is a panel with many code fragments ready to use (Code Snippets), and a section with sets and gets for the input and output fields. To add a code fragment to your script either double-click on it, drag it to the location in your script where you wish to use it, or just type it in the editing area.

The input and output fields appear automatically in the tree when the Java code compiles correctly; you can use the Test class button to verify that the code is properly written. If an error occurs, you will see an error window; otherwise, you will be able to preview the results of the transformation step.

The Fields tab on the bottom is to declare the new fields added by the step. In this case, you declared the field qty of Integer type. This field will be the output variable that will return the word count.

Under the Target steps tab, you can declare the steps where the rows will be redirected. In the recipe, you pointed to the two target Dummy steps; one for the short sentences, and the other for the long ones.

Let's see the Java code in detail:

At the beginning, there is a section with the variable declarations:

private RowSet shortSentences = null;
private RowSet longSentences = null;

These variables represent the two possible target steps. They are declared as class members, outside the processRow() function, so that they keep their values from one row to the next.

Then, you have the main function:

public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
{
    Object[] r = getRow();
    if (r == null) {
        setOutputDone();
        return false;
    }

The processRow() function processes a new row.

The getRow() function gets the next row from the input steps. It returns an object array with the incoming row. A null value means that there are no more rows for processing.

The following code only executes for the first row:

if (first) {
    first = false;
    shortSentences = findTargetRowSet("shortOnes");
    longSentences = findTargetRowSet("longOnes");
}

You can use the flag first to prepare a proper environment before processing the rows. In this case, you set the target steps into two variables for further use.

The next code creates the output row, sized to hold the new field, and then uses the get() method to set the internal variable linein with the value of the line field.

r = createOutputRow(r, data.outputRowMeta.size());
String linein;
linein = get(Fields.In, "line").getString(r);

Here is the main cycle:

long len = linein.length();
long qty = 0;
boolean currentSpecialChar = false;
for (int i = 0; i < len; i++) {
    char ch = linein.charAt(i);
    switch (ch) {
        case ',':
        case '.':
        case ' ':
        case ';':
        case ':':
            if (!currentSpecialChar) qty++;
            currentSpecialChar = true;
            break;
        default:
            currentSpecialChar = false;
            break;
    }
}

It parses the entire sentence looking for characters used as separators. If one of these separators is found, then it will increment the qty variable by one and set the flag currentSpecialChar to true, in order to not increment the value if the next character is also a separator.

The next line is to count the last word of the sentence only if it wasn't counted in the main cycle:

if (!currentSpecialChar) qty++;
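
To try the counting logic outside Kettle, the loop and the final check can be sketched as a small standalone class. This is only an illustration under assumptions: the class name WordCountSketch and the helper methods isSeparator() and countWords() are hypothetical names that are not part of the recipe or of the Kettle API.

```java
final class WordCountSketch {

    // Same separator set as the switch statement in the recipe.
    static boolean isSeparator(char ch) {
        return ch == ',' || ch == '.' || ch == ' ' || ch == ';' || ch == ':';
    }

    // Counts words the way the recipe's loop does: the first separator of a
    // run closes one word, and a trailing word is counted after the loop.
    static long countWords(String line) {
        long qty = 0;
        boolean currentSpecialChar = false;
        for (int i = 0; i < line.length(); i++) {
            if (isSeparator(line.charAt(i))) {
                if (!currentSpecialChar) qty++;
                currentSpecialChar = true;
            } else {
                currentSpecialChar = false;
            }
        }
        if (!currentSpecialChar) qty++;
        return qty;
    }

    public static void main(String[] args) {
        String[] samples = {
            "This is a sample text.",
            "Another text with special characters, , , END OF FILE",
            "hello,man",
            "I wish you a happy new year:2011",
            "The,last.but,not;the.least"
        };
        for (String sample : samples) {
            System.out.println(countWords(sample) + "\t" + sample);
        }
    }
}
```

For the five sample sentences, this logic gives 5, 8, 2, 8, and 6 words respectively. Note that an empty line is counted as one word, because the final check runs even when no character was read.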

Here we set the new field named qty with the value of the internal variable qty, which has the word count:

get(Fields.Out, "qty").setValue(r, qty);

Finally, if the word count is lower than 7 (arbitrary value), then the row will be passed to the short sentences step; otherwise the target will be the long sentences step.

    if (qty < 7) {
        putRowTo(data.outputRowMeta, r, shortSentences);
    }
    else {
        putRowTo(data.outputRowMeta, r, longSentences);
    }
    return true;
}

Note

If you only have one target step, then you can use the simpler putRow() method.
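
The routing rule is easy to check in isolation. The following sketch, which runs outside Kettle, maps a word count to the tag of the target step; the class RoutingSketch and the method targetTag() are hypothetical names used only for illustration, while the tag names and the threshold come from the recipe.

```java
final class RoutingSketch {

    // Arbitrary threshold, as in the recipe.
    static final long THRESHOLD = 7;

    // Returns the tag of the target step for a given word count.
    // Counts below the threshold are short; the threshold itself is long.
    static String targetTag(long qty) {
        return qty < THRESHOLD ? "shortOnes" : "longOnes";
    }

    public static void main(String[] args) {
        for (long qty = 5; qty <= 8; qty++) {
            System.out.println(qty + " -> " + targetTag(qty));
        }
    }
}
```

A sentence with exactly 7 words falls on the long side, because the comparison is strictly lower than.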

There's more...

To learn about the syntax of the Java language, visit the following URL:

http://download.oracle.com/javase/tutorial/

As mentioned earlier, you can access the Kettle API from inside the UDJC code. To learn the details of the API, you should check the source. For instructions on getting the code, follow this link: http://community.pentaho.com/getthecode/.

Let's see some more information to take advantage of this very useful step:

Data type equivalence

The code you type inside the UDJC step is pure Java. Therefore, the fields of your transformation will be seen as Java objects according to the following equivalence table:

Data type in Kettle    Java class
String                 java.lang.String
Integer                java.lang.Long
Number                 java.lang.Double
Date                   java.util.Date
BigNumber              java.math.BigDecimal
Binary                 byte[]

The opposite occurs when you create an object inside the Java code and want to expose it as a new field to your transformation. For example, in the Java code, you defined the variable qty as long but under the Fields tab, you defined the new output field as Integer.
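
A practical consequence of this table is that a Kettle Integer must be handled as a Long in your code, not as a java.lang.Integer. The following standalone sketch illustrates this with a hand-built Object[] that stands in for a row; the class TypeMappingSketch and the method sampleRow() are hypothetical, and the row is not produced by Kettle.

```java
import java.math.BigDecimal;
import java.util.Date;

final class TypeMappingSketch {

    // Hand-built stand-in for a row, one value per Kettle data type.
    static Object[] sampleRow() {
        return new Object[] {
            "hello,man",              // String
            Long.valueOf(2L),         // Integer
            Double.valueOf(1.5),      // Number
            new Date(0L),             // Date
            new BigDecimal("10.25"),  // BigNumber
            new byte[] {1, 2, 3}      // Binary
        };
    }

    public static void main(String[] args) {
        Object[] row = sampleRow();
        for (Object value : row) {
            System.out.println(value.getClass().getSimpleName());
        }
        // The Integer field is a Long, so this cast and unboxing are safe.
        long qty = (Long) row[1];
        System.out.println("qty = " + qty);
    }
}
```

Casting the second value to java.lang.Integer instead would fail with a ClassCastException at runtime.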

Generalizing your code

You can generalize your code by using parameters. You can add parameters and their values using the grid located under the Parameters tab at the bottom of the UDJC window.

In our example, you could have defined a parameter named qtyParameter with the value 7. Then, in the Java code, you would have obtained this value with the following line of code:

long threshold = Long.parseLong(getParameter("qtyParameter"));
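
Parameter values are typed as text in the grid, so it is worth converting them defensively. The following standalone sketch assumes that the value reaches your code as a String; the class ParameterSketch and the helper parseThreshold() are hypothetical names and are not part of the Kettle API.

```java
final class ParameterSketch {

    // Converts the raw text of a parameter to a number.
    // Falls back to a default when the text is missing or not numeric.
    static long parseThreshold(String raw, long fallback) {
        if (raw == null || raw.trim().length() == 0) {
            return fallback;
        }
        try {
            return Long.parseLong(raw.trim());
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseThreshold("7", 7L));
        System.out.println(parseThreshold(" 10 ", 7L));
        System.out.println(parseThreshold("seven", 7L));
    }
}
```

Inside the UDJC step, you would pass the text returned for qtyParameter to a helper like this one, and compare the word count against the result instead of the literal 7.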

Looking up information with additional steps

You can also have additional steps that provide information to be read inside your Java code. They are called Info steps. You declare them in the grid located under the Info step tab at the bottom of the UDJC window.

In our recipe, suppose that you have the list of separators defined in a Data Grid step. In order to pick the separators from that list, you have to create a hop from the Data Grid towards the UDJC step and fill the Info step grid. You must provide a Tag name (for example, charsList) and select the name of the incoming step. Then, in the Java code, you can use the findInfoRowSet() method to reference the info step, and the getRowFrom() method to read the rows in a cycle. Check the code:

// Named infoStream so that it does not hide the step's own data variable
RowSet infoStream = findInfoRowSet("charsList");
Object[] infoRow = null;
while ((infoRow = getRowFrom(infoStream)) != null) {
    // Do something
}
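
What you do inside that cycle depends on your needs. As an illustration, the following standalone sketch collects the separators into a set and then counts words against that set. It assumes that each info row carries one separator as a String in its first position; the class InfoRowsSketch and its methods are hypothetical, and a plain list of Object[] stands in for the rows read from the info step.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class InfoRowsSketch {

    // Builds the separator set from rows that hold one separator each.
    static Set<Character> loadSeparators(List<Object[]> infoRows) {
        Set<Character> separators = new HashSet<Character>();
        for (Object[] row : infoRows) {
            String value = (String) row[0];
            if (value != null && value.length() > 0) {
                separators.add(value.charAt(0));
            }
        }
        return separators;
    }

    // Same counting logic as the recipe, with a configurable separator set.
    static long countWords(String line, Set<Character> separators) {
        long qty = 0;
        boolean currentSpecialChar = false;
        for (int i = 0; i < line.length(); i++) {
            if (separators.contains(line.charAt(i))) {
                if (!currentSpecialChar) qty++;
                currentSpecialChar = true;
            } else {
                currentSpecialChar = false;
            }
        }
        if (!currentSpecialChar) qty++;
        return qty;
    }

    public static void main(String[] args) {
        List<Object[]> infoRows = new ArrayList<Object[]>();
        infoRows.add(new Object[] {","});
        infoRows.add(new Object[] {" "});
        Set<Character> separators = loadSeparators(infoRows);
        System.out.println(countWords("hello,man and more", separators));
    }
}
```

In the UDJC step, the natural place to load the set is the block guarded by the first flag, so that the info rows are read only once.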

Customizing logs

You can add your custom messages for different levels of logging very easily. You can select the necessary code fragment from the Step logging node in the Code Snippets folder, or just type the method in the editing area. For example:

if (qty > 10) logBasic("Long sentence found!");

Scripting alternatives to the UDJC step

As an alternative to the UDJC step, there is another step named User Defined Java Expression (UDJE), also in the Scripting category. This step allows you to create new fields in an easy way by typing Java expressions. It doesn't replace the UDJC step, but it is more practical when the task you have to accomplish is simple.

Tip

For examples on how to use this step, browse the different recipes in the book. There are several examples that use the UDJE step.
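
To give an idea of what a simple task looks like, the word count of this recipe can be approximated with a single Java expression based on String.split(). The following standalone sketch wraps that expression in a method so that you can test it; the class ExpressionSketch and the method countWordsByExpression() are hypothetical, and whether the expression can be pasted unchanged into a UDJE step is an assumption that was not verified here.

```java
final class ExpressionSketch {

    // One-expression approximation of the word count:
    // split on runs of the recipe's separators and count the pieces.
    static long countWordsByExpression(String line) {
        return line.split("[ ,.;:]+").length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "This is a sample text.",
            "Another text with special characters, , , END OF FILE",
            "hello,man",
            "I wish you a happy new year:2011",
            "The,last.but,not;the.least"
        };
        for (String sample : samples) {
            System.out.println(countWordsByExpression(sample) + "\t" + sample);
        }
    }
}
```

For the five sample sentences, the expression gives the same counts as the loop in the recipe: 5, 8, 2, 8, and 6. The results may differ in edge cases, such as a line made up only of separators.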

If you are more familiar with the JavaScript language, you could use the Modified Java Script Value (MJSV) step instead of UDJC. Take into account that the code in the JavaScript step is interpreted, whereas the UDJC code is compiled; this means that a transformation that uses the UDJC step will usually have much better performance.

The UI for the MJSV step is very similar to the UI for the UDJC step; there is a main area to write the JavaScript code, a left panel with many functions as snippets, the input fields coming from the previous step, and the output fields.

You can learn more about JavaScript here: http://www.w3schools.com/js. As an additional resource, you can get Pentaho 3.2 Data Integration: Beginner's Guide, María Carina Roldán, Packt Publishing. There is a complete chapter devoted to the use of the Modified Java Script Value step.

Finally, if you prefer scripting to Java programming, there is a new Kettle plugin named Ruby Scripting developed by Slawomir Chodnicki, one of the most active Kettle contributors. As the name suggests, the step allows you to develop custom functionality by typing Ruby code. The UI for the Ruby plugin is very similar to the UI for the UDJC step. In this case, you don't have snippets, but you have many sample transformations that demonstrate the capabilities of the plugin. Along with the samples, you have a couple of links to Ruby resources on the web. The plugin is available at the following URL:

https://github.com/type-exit/Ruby-Scripting-for-Kettle
