Looking for values in a variety of sources

The first recipes in the chapter showed you how to look for additional information in a database. There are still many other sources of information. You may need to look in property files, in Excel files, in text files, and so on. Kettle allows you to look for data coming from all those sources with the Stream lookup step.

In this example, you have information about books coming from an Excel file and you need to complete this dataset by looking up the author's data and genre description, which are in external sources. In this case, the author's information is inside a text file and the genres are in a fixed predefined list.

Getting ready

For doing this recipe, you will need the following:

  • A CSV file (authors.txt) with the author's data. The file should have the following columns: lastname, firstname, nationality, and id_author. The following are sample lines of this file:
    "lastname","firstname","nationality","id_author"
    "Larsson","Stieg","Swedish","A00001"
    "King","Stephen","American","A00002"
    "Hiaasen","Carl ","American","A00003"
    "Handler","Chelsea ","American","A00004"
    "Ingraham","Laura ","American","A00005"
    
  • An Excel file with the books information (books.xls). The sheet should have the following columns: title, id_author, price, id_title, and id_genre as shown in the following screenshot:
Getting ready

You can also download sample files from the book's website.

How to do it...

Carry out the following steps:

  1. Create a new transformation.
  2. Drop an Excel input step and a Text file input step into the canvas.
  3. In the Excel input step, browse for the books.xls file under the File tab and click on the Add button. Populate the grid under the Fields tab by clicking on the Get fields from header row button.
  4. In the Text file input step, browse for the authors.txt file and click on the Add button. Type, as the Separator under the Content tab and finally, populate the Fields tab grid by clicking on the Get Fields button.
  5. Add a Stream lookup step from the Lookup category.
  6. Create a hop from the Excel input step to the Stream lookup step and another from the Text file input also to the Stream lookup step.
  7. Double-click on the Stream lookup step and in the Lookup step listbox select the name of the Text file input step previously created.
  8. Complete the grids with the following information:
    How to do it...

    Tip

    To save time, you can click on the Get Fields button to automatically load the fields in the upper grid, and the Get lookup fields button to populate the lower grid.

  9. Previewing this step, you can verify that the dataset includes, for each book, the information for its author. Now, let's add the genre description.
  10. Drop a Data Grid step from the Input category. Under its Meta tab, add two String items: id_genre and genre. Then, complete the Data tab as shown in the following screenshot:
    How to do it...
  11. Add a new Stream lookup step and create a hop from the Data grid step toward this new step.
  12. Also, create a hop between both Stream Lookups steps. The transformation should look like the one shown in the following diagram:
    How to do it...
  13. Double-click on the last Stream lookup step. In the Lookup step listbox, type or select the name of the Data grid step created earlier.
  14. In the upper grid, add a row typing id_genre under the Field column and id under LookupField.
  15. In the lower grid, add a genre Field, of String Type. Add Unknown as the Default value.
  16. Doing a preview of this step, you will obtain a dataset of the books, their authors, and also the description of their genres. For example:
How to do it...

How it works...

The Stream Lookup step is the step that you should use to look for additional information in other sources. In this recipe, the main data comes from an Excel file with book titles and also, the identification for their author and genres. You used two Stream Lookup steps to look up for each title, the author's information, and the genre description respectively.

In the first Stream Lookup step the purpose is to look for the author's data; in this step, you configure the Lookup step listbox pointing to the Text file input step, which is where the author's data is coming from.

In the upper grid named The key(s) to lookup the values(s), you have to specify the fields used to join both data sources. In this case, those fields are id_author and id.

The second grid titled Specify the fields to retrieve is to declare the fields to add to the main dataset. You have typed the fields: firstname, lastname, and nationality.

The last Stream Lookup step was created in order to retrieve the genre description that matches the genre identification in the main dataset. Here the key values are the fields id_genre and id and the only data to retrieve is the field genre. Here you entered Unknown as the Default value. This means that if the Stream Lookup step doesn't find a matching genre for a row the genre description will be set to Unknown.

There's more...

The following sections provide you with some alternatives to the use of the Stream Lookup step.

Looking for alternatives when the Stream Lookup step doesn't meet your needs

The Stream Lookup step compares the fields with an equal operator. There are a couple of situations where this may not be what you need.

If you are not sure about the similarity of the values between the fields that you are comparing, a comparison by equal may fail. In that case, you could use the Fuzzy match step, which allows you to search for similar values.

If you need to compare using other operators, for example<=, then you should also look for an alternative step. One possible approach would be to use the Join Rows (Cartesian product) to retrieve all the data and filter the rows from that step with a Filter rows or a Java Filter step afterward.

Note

Keep in mind that this option might increase the number of records you are processing, slowing down your ETL process significantly.

Another option would be to transfer the source data to a database table and then lookup in the database. This approach takes a little more effort, but it has its advantages. You have more flexible ways for looking up data in a database compared with looking up in a stream. Besides, for big datasets, you can also gain performance as explained in the next subsection.

Speeding up your transformation

For big datasets, looking up in plain files with a Stream Lookup step will definitely slow down your transformation. As a workaround to speed things up, you should consider moving the data to a database table before looking up in it. The main advantage of this approach is that you can cache data, which makes the lookup task faster.

What if your ETL project doesn't involve database tasks? For these temporary lookup tables, you may want to use an in-memory database such as HSQLDB or H2.

For a practical example of this, take a look at the following blog post by Slawomir Chodnicki:

http://type-exit.org/adventures-with-open-source-bi/2011/01/using-an-on-demand-in-memory-sql-database-in-pdi/

Using the Value Mapper step for looking up from a short list of values

The second Stream Lookup step in the recipe only returns a simple description and has a short list of possible values. In these cases, you can replace this step with a Value Mapper from the Transform category.

You should complete the step, as shown in the following screenshot:

Using the Value Mapper step for looking up from a short list of values

In the Source value column, you define the possible values for the id_genre field, and in the Target value column, you define their descriptions. Also, in the Default upon non-matching textbox, you can enter a default value to be returned for the rows with invalid genre identification.

See also

  • The recipe named Looking up values by proximity in this chapter for learning how to look for similar values.
  • The recipe named Generating all possible pairs formed from two datasets from Chapter 6, Understanding Flows of Data. See this for understanding how to use the Join Rows (Cartesian product) step mentioned in the There's more section.
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset