Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Generating all possible pairs formed from two datasets

This is a quick recipe that teaches you how to do a Cartesian product between datasets. By Cartesian product we mean taking all rows from one dataset, all rows from the other, and generate a new dataset with all the possible combinations of rows.

This particular recipe is in fact the implementation of the CAG or Community Acronym Generator as proposed by Nicholas Goodman (@nagoodman) on twitter:

@webdetails @pmalves @josvandongen How about CAG? Community Acronym Generator? A project to generate new acronyms for community projects?!

There are already several community projects around Pentaho such as CDF, CDE, or CDA. Why don't we follow Nicholas's suggestion and develop the CAG as follows:

Given two lists of words, the Kettle transformation will generate all combinations of words that lead to potential community projects.

How to do it...

Carry out the following steps:

Create a new transformation and add two Data Grid steps from the Input category.
Use one of the Data Grid steps to create a dataset with a single String field named middle_word. Under the Data tab, enter a set of names for the middle word of the acronym. Here you have a sample list: Dashboard, Data, Chart, Acronym, Cube, Report.
Use the other Data Grid step to create a dataset with a single String field named last_word. Under the Data tab, enter a set of names for the last word of the acronym. Here you have a sample list: Framework, Editor, Translator, Access, Generator, Integrator, Component.
From the Join category, add a Join Rows (Cartesian product) step.
Create hops from the Data Grid steps toward this step. You will have something like the following:
From the Scripting category, add a User Defined Java Expression step (UDJE for short).
Use the UDJE to add two String fields. Name the first new_component, and as Java Expression type"Community "+middle_word+" "+last_word. Name the second field acronym and as Java Expression type"C"+middle_word.substring(0,1)+last_word.substring(0,1).
Do a preview on this last step. You will see a list of candidate community projects as shown in the following screenshot:

How it works...

The Join Rows (Cartesian product) step has the task of performing the Cartesian product of all streams coming to it. In this case, you had two streams but you could have had more. The step received those two streams and created all combinations of rows.

Then with the UDJE, you simply build the strings with the name of the candidate community projects and their acronyms, as for example, Community Chart Framework (CCF). The real purpose of the generated potential projects is up to your imagination and out of the scope of the book.

There's more...

In the recipe, you used the Join Rows (Cartesian product) step for joining two datasets. You could join more datasets if you want to; however that is not a common requirement.

There are a couple of settings in the step that you didn't use in the recipe. They are explained in the following subsections.

Getting variables in the middle of the stream

This section describes one of the most common situations in which you may see the Join Rows (Cartesian product) step in action. Back to the recipe. Suppose that you have a named parameter named VERSION, which can be CE (representing Community Edition) or EE (representing Enterprise Edition). After generating the names of the candidate projects and their acronyms, you want to add the version. You can add the version to your stream by using a Get Variable step from the Job category. However, instead of getting the variable for each row, it's recommended to get it outside the main stream and then join both streams, as shown in the following screenshot:

As the stream coming out of the Get Variable step has a single row, the Cartesian product will have all the possible combinations of N rows of the main stream with a single row, that is, N rows. In this case, it is important that in the Main step to read from option, you select the main stream, the stream coming from the UDJE. Doing so, you tell Kettle that most of the data will come from this step and Kettle will cache or spool to disk the data coming from the Get Variable step.

Limiting the number of output rows

With a Join Rows (Cartesian product) step, you can limit the number of output rows by entering a simple or complex condition in its setting window. The rows that don't meet the condition are discarded.

Back to the recipe. As you might have noticed, it is possible for the transformation to generate acronyms that already exist, for example, CDF. In the previous subsection, you added a second Join Rows (Cartesian product) step. In this step, you could add a condition to discard the rows with acronyms that already exist, excepting when the product is Enterprise Edition. The condition area in the setting window of the step would look like the one shown in the following screenshot (except for the exact list of acronyms which might have changed by the time you're reading this):

If you do a preview on this step, you will see something like the following:

If you take a look at the Step Metrics tab of the Execution Results window, you will notice that the number of written rows is less than Cartesian product of incoming rows. Note that the GUI for entering the condition is the same as the one in the Filter rows step.

Note

As you may pick fields from more than one stream in the condition of the Join Rows (Cartesian product) step, it is therefore required that the picked fields have unique names in the streams.

Table of Contents for
Generating all possible pairs formed from two datasets

Generating all possible pairs formed from two datasets

How to do it...

How it works...

There's more...

Getting variables in the middle of the stream

Limiting the number of output rows

Note

See also

Table of Contents for Generating all possible pairs formed from two datasets

Create new playlist

Sign In

Sign Up

Generating all possible pairs formed from two datasets

How to do it...

How it works...

There's more...

Getting variables in the middle of the stream

Limiting the number of output rows

Note

See also

Table of Contents for
Generating all possible pairs formed from two datasets