This is a quick recipe that teaches you how to do a Cartesian product between datasets. By Cartesian product we mean taking all rows from one dataset, all rows from the other, and generate a new dataset with all the possible combinations of rows.
This particular recipe is in fact the implementation of the CAG or Community Acronym Generator as proposed by Nicholas Goodman (@nagoodman) on twitter:
@webdetails @pmalves @josvandongen How about CAG? Community Acronym Generator? A project to generate new acronyms for community projects?!
There are already several community projects around Pentaho such as CDF, CDE, or CDA. Why don't we follow Nicholas's suggestion and develop the CAG as follows:
Given two lists of words, the Kettle transformation will generate all combinations of words that lead to potential community projects.
Carry out the following steps:
middle_word
. Under the Data tab, enter a set of names for the middle word of the acronym. Here you have a sample list: Dashboard, Data, Chart, Acronym, Cube, Report
. last_word
. Under the Data tab, enter a set of names for the last word of the acronym. Here you have a sample list: Framework, Editor, Translator, Access, Generator, Integrator, Component
. String
fields. Name the first new_component
, and as Java Expression type"Community "+middle_word+" "+last_word
. Name the second field acronym
and as Java Expression type"C"+middle_word.substring(0,1)+last_word.substring(0,1)
.The Join Rows (Cartesian product) step has the task of performing the Cartesian product of all streams coming to it. In this case, you had two streams but you could have had more. The step received those two streams and created all combinations of rows.
Then with the UDJE, you simply build the strings with the name of the candidate community projects and their acronyms, as for example, Community Chart Framework (CCF). The real purpose of the generated potential projects is up to your imagination and out of the scope of the book.
In the recipe, you used the Join Rows (Cartesian product) step for joining two datasets. You could join more datasets if you want to; however that is not a common requirement.
There are a couple of settings in the step that you didn't use in the recipe. They are explained in the following subsections.
This section describes one of the most common situations in which you may see the Join Rows (Cartesian product) step in action. Back to the recipe. Suppose that you have a named parameter named VERSION
, which can be CE
(representing Community Edition) or EE
(representing Enterprise Edition). After generating the names of the candidate projects and their acronyms, you want to add the version. You can add the version to your stream by using a Get Variable step from the Job category. However, instead of getting the variable for each row, it's recommended to get it outside the main stream and then join both streams, as shown in the following screenshot:
As the stream coming out of the Get Variable step has a single row, the Cartesian product will have all the possible combinations of N
rows of the main stream with a single row, that is, N
rows. In this case, it is important that in the Main step to read from option, you select the main stream, the stream coming from the UDJE. Doing so, you tell Kettle that most of the data will come from this step and Kettle will cache or spool to disk the data coming from the Get Variable step.
With a Join Rows (Cartesian product) step, you can limit the number of output rows by entering a simple or complex condition in its setting window. The rows that don't meet the condition are discarded.
Back to the recipe. As you might have noticed, it is possible for the transformation to generate acronyms that already exist, for example, CDF. In the previous subsection, you added a second Join Rows (Cartesian product) step. In this step, you could add a condition to discard the rows with acronyms that already exist, excepting when the product is Enterprise Edition. The condition area in the setting window of the step would look like the one shown in the following screenshot (except for the exact list of acronyms which might have changed by the time you're reading this):
If you do a preview on this step, you will see something like the following:
If you take a look at the Step Metrics tab of the Execution Results window, you will notice that the number of written rows is less than Cartesian product of incoming rows. Note that the GUI for entering the condition is the same as the one in the Filter rows step.