When you use one of the Kettle steps meant for reading files, Kettle expects the data organized in rows, where the columns are the fields. Suppose that instead of having a file with that structure, your file has one attribute per row as in the following example:
Mastering Joomla! 1.5 Extension and Framework Development Published: November 2007 Our price: £30.99 CakePHP 1.3 Application Development Cookbook: RAW Expected: December 2010 Our price: £24.99 Firebug 1.5: Editing, Debugging, and Monitoring Web Pages Published: April 2010 Our price: £21.99 jQuery Reference Guide ...
This file contains book information. In the file, each book is described in three rows: one for the title, one for the published or expected publishing date, and one row for the price.
There is no direct way to tell Kettle how to interpret these rows, but a simple transformation can do the trick.
Carry out the following steps:
packt_books.txt
. Under the Content tab, uncheck the Header option and as Separator, type |
. Under the Fields tab, enter a single String field named text
. text
. Fill the grid with three rows with values title, publishing_date
, and price
.You already have the fields as columns! Now, you can go a little further and do some cleansing, as follows:
publishing_date
. Check the option Create fields for capture groups. As Regular expression: type (Published|Expected):(.+)
. status
. In the second, create a Date field named pub_date
with Format MMM yyy
. In both rows, under the Trim column, select both. price
under the column In stream field, and Our price:
under the column Search. Number
. As Format, type £#.00
In the sample file, the months are written in English. Therefore, you put the mask MMM yyy
when capturing groups. If you get an error because of the Date
format, there is a high possibility that you do not have English as the preferred language in your regional settings. Consequently, Kettle is not able to parse those dates.
The Row flattener step is a simple step intended to flatten consecutive rows and is perfect for reading files such as the one in the recipe. In this case, you had a file with book information, each book occupying three consecutive rows. The Row flattener flattened the field text
into three different new fields: title, publishing_date
, and price
.
This way, every three rows, it generated a single one.
Note that if one book has a different number of rows (for example, if it lacks the price row), then you get unexpected results.