In the previous recipe, you saw how to search for columns in a database table based on simple conditions. With Kettle, you can also search by providing complex conditions or involving more than one table. In this recipe, you will learn how to perform that kind of search by using the Database join step.
In order to let you compare the different options for searching data in a database with ease, we will work with the same example that you saw in the preceding recipe: the Steel Wheels sample data. You want to look for products that match a given search term and whose prices are below a given value.
Carry out the following steps:
sampledata
database.SELECT PRODUCTNAME , PRODUCTSCALE , BUYPRICE FROM PRODUCTS WHERE PRODUCTNAME LIKE concat(‹%›,?,›%›) AND BUYPRICE < ?
prod
and max_price
.The Database join step is a powerful step for looking for data in a database based on given conditions. The conditions usually involve comparisons between columns in tables and fields in your stream; therefore it's called a join.
Note that this is not really a database join. Instead of joining tables in a database, you are joining the result of a database query with a Kettle dataset.
The question marks you type in the SQL statement (those in the highlighted lines in the recipe) represent parameters. The purpose of these parameters is to be replaced with the fields you provide in the lower grid. For each row in your stream the Database join step replaces the parameters in the same order as they are in the grid, and executes the SQL statement.
If we take as an example the first row in the dataset, the SQL statement after the replacement of the parameters would look like this:
SELECT PRODUCTNAME , PRODUCTSCALE , BUYPRICE FROM PRODUCTS WHERE PRODUCTNAME LIKE concat('%','Aston Martin','%') AND BUYPRICE < 90
And that's exactly the search you wanted to do for that row: look for the records where the column BUYPRICE
was less than 90
and the column PRODUCTNAME
contained Aston Martin
.
As the result of the database join you can retrieve any number of columns. Each database column that you type in the SELECT
clause will become a new field in your dataset. In the recipe, those fields were PRODUCTNAME, PRODUCTSCALE
, and BUYPRICE
. In particular, if you had typed SELECT *
, you would have retrieved all columns in the tables involved in the statement.
In the recipe, you checked the Outer join? option. The effect of this is as follows: For the rows where the lookup fails, the new fields are retrieved with null values. That was the case for the second row. There were no products with Ford Falcon
as part of its description and price lower than 70
. Therefore, that row shows empty values for the PRODUCTNAME, PRODUCTSCALE
, and BUYPRICE
fields.
Note that in the recipe, you filled the grid with two fields. That is exactly the same number of question marks in the statement.
The number of fields in the grid must be exactly the same as the number of question marks in the query.
Also, note that in the grid the prod
field was in the first place and the max_price
in the second place. If you look at the highlighted lines in the recipe, you will see that the statement expected the parameters in exactly that order.
So far, the results are quite similar to those you got with a database lookup. There is a significant difference however. For the third row, the Corvette
product, you can see two results. This means that the Database join found two matching rows in the database, and retrieved them both. This is not possible with a Database lookup step.
The Database join step can be a little complicated to use or to understand compared to the Database lookup step. While the Database lookup step has a UI that makes the configuration of the step easy, in the Database join step, you have to write an SQL statement. That implies that you need a good knowledge of SQL. However, the Database join step has several advantages over the Database lookup one: