Generating all possible pairs formed from two datasets

This is a quick recipe that teaches you how to do a Cartesian product between datasets. By Cartesian product we mean taking all rows from one dataset, all rows from the other, and generate a new dataset with all the possible combinations of rows.

This particular recipe is in fact the implementation of the CAG or Community Acronym Generator as proposed by Nicholas Goodman (@nagoodman) on twitter:

@webdetails @pmalves @josvandongen How about CAG? Community Acronym Generator? A project to generate new acronyms for community projects?!

There are already several community projects around Pentaho such as CDF, CDE, or CDA. Why don't we follow Nicholas's suggestion and develop the CAG as follows:

Given two lists of words, the Kettle transformation will generate all combinations of words that lead to potential community projects.

How to do it...

Carry out the following steps:

  1. Create a new transformation and add two Data Grid steps from the Input category.
  2. Use one of the Data Grid steps to create a dataset with a single String field named middle_word. Under the Data tab, enter a set of names for the middle word of the acronym. Here you have a sample list: Dashboard, Data, Chart, Acronym, Cube, Report.
  3. Use the other Data Grid step to create a dataset with a single String field named last_word. Under the Data tab, enter a set of names for the last word of the acronym. Here you have a sample list: Framework, Editor, Translator, Access, Generator, Integrator, Component.
  4. From the Join category, add a Join Rows (Cartesian product) step.
  5. Create hops from the Data Grid steps toward this step. You will have something like the following:
    How to do it...
  6. From the Scripting category, add a User Defined Java Expression step (UDJE for short).
  7. Use the UDJE to add two String fields. Name the first new_component, and as Java Expression type"Community "+middle_word+" "+last_word. Name the second field acronym and as Java Expression type"C"+middle_word.substring(0,1)+last_word.substring(0,1).
  8. Do a preview on this last step. You will see a list of candidate community projects as shown in the following screenshot:
How to do it...

How it works...

The Join Rows (Cartesian product) step has the task of performing the Cartesian product of all streams coming to it. In this case, you had two streams but you could have had more. The step received those two streams and created all combinations of rows.

Then with the UDJE, you simply build the strings with the name of the candidate community projects and their acronyms, as for example, Community Chart Framework (CCF). The real purpose of the generated potential projects is up to your imagination and out of the scope of the book.

There's more...

In the recipe, you used the Join Rows (Cartesian product) step for joining two datasets. You could join more datasets if you want to; however that is not a common requirement.

There are a couple of settings in the step that you didn't use in the recipe. They are explained in the following subsections.

Getting variables in the middle of the stream

This section describes one of the most common situations in which you may see the Join Rows (Cartesian product) step in action. Back to the recipe. Suppose that you have a named parameter named VERSION, which can be CE (representing Community Edition) or EE (representing Enterprise Edition). After generating the names of the candidate projects and their acronyms, you want to add the version. You can add the version to your stream by using a Get Variable step from the Job category. However, instead of getting the variable for each row, it's recommended to get it outside the main stream and then join both streams, as shown in the following screenshot:

Getting variables in the middle of the stream

As the stream coming out of the Get Variable step has a single row, the Cartesian product will have all the possible combinations of N rows of the main stream with a single row, that is, N rows. In this case, it is important that in the Main step to read from option, you select the main stream, the stream coming from the UDJE. Doing so, you tell Kettle that most of the data will come from this step and Kettle will cache or spool to disk the data coming from the Get Variable step.

Limiting the number of output rows

With a Join Rows (Cartesian product) step, you can limit the number of output rows by entering a simple or complex condition in its setting window. The rows that don't meet the condition are discarded.

Back to the recipe. As you might have noticed, it is possible for the transformation to generate acronyms that already exist, for example, CDF. In the previous subsection, you added a second Join Rows (Cartesian product) step. In this step, you could add a condition to discard the rows with acronyms that already exist, excepting when the product is Enterprise Edition. The condition area in the setting window of the step would look like the one shown in the following screenshot (except for the exact list of acronyms which might have changed by the time you're reading this):

Limiting the number of output rows

If you do a preview on this step, you will see something like the following:

Limiting the number of output rows

If you take a look at the Step Metrics tab of the Execution Results window, you will notice that the number of written rows is less than Cartesian product of incoming rows. Note that the GUI for entering the condition is the same as the one in the Filter rows step.

Note

As you may pick fields from more than one stream in the condition of the Join Rows (Cartesian product) step, it is therefore required that the picked fields have unique names in the streams.

See also

  • The recipe named Joining two or more streams based on giving conditions in this chapter. This recipe shows you an alternative step to the Join Rows (Cartesian product) step, with better performance.
  • The recipe named Splitting a stream into two or more streams based on a condition in this chapter. This recipe helps you configure the filter area in the Join Rows (Cartesian product) setting window.
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset