Reading files having one field by row

When you use one of the Kettle steps meant for reading files, Kettle expects the data organized in rows, where the columns are the fields. Suppose that instead of having a file with that structure, your file has one attribute per row as in the following example:

Mastering Joomla! 1.5 Extension and Framework Development
Published: November 2007
Our price: £30.99
CakePHP 1.3 Application Development Cookbook: RAW
Expected: December 2010
Our price: £24.99
Firebug 1.5: Editing, Debugging, and Monitoring Web Pages
Published: April 2010
Our price: £21.99
jQuery Reference Guide
...

This file contains book information. In the file, each book is described in three rows: one for the title, one for the published or expected publishing date, and one row for the price.

There is no direct way to tell Kettle how to interpret these rows, but a simple transformation can do the trick.

Getting ready

Create a file containing the preceding text or download the sample file from the book's site.

How to do it...

Carry out the following steps:

  1. Create a transformation and use a Text file input step to read the file packt_books.txt. Under the Content tab, uncheck the Header option and as Separator, type |. Under the Fields tab, enter a single String field named text.
  2. From the Transform category, add a Row flattener step.
  3. Double-click the step. As The field to flatten type or select text. Fill the grid with three rows with values title, publishing_date, and price.
  4. Do a preview on the last step. You'll see the following:
How to do it...

You already have the fields as columns! Now, you can go a little further and do some cleansing, as follows:

  1. From the Scripting category add a Regexp Evaluation step.
  2. Configure the step as follows: As Field to evaluate type or select publishing_date. Check the option Create fields for capture groups. As Regular expression: type (Published|Expected):(.+).
  3. In the Capture Group Fields grid, add two rows. In the first row create a new String field named status. In the second, create a Date field named pub_date with Format MMM yyy. In both rows, under the Trim column, select both.
  4. From the Transform category, add a Replace in string step. In the grid, add a row with the value price under the column In stream field, and Our price: under the column Search.
  5. Finally, use a Select values step to change the metadata of the price field: Change it to Number. As Format, type £#.00
  6. Do a preview and you'll see the following:
How to do it...

Note

In the sample file, the months are written in English. Therefore, you put the mask MMM yyy when capturing groups. If you get an error because of the Date format, there is a high possibility that you do not have English as the preferred language in your regional settings. Consequently, Kettle is not able to parse those dates.

How it works...

The Row flattener step is a simple step intended to flatten consecutive rows and is perfect for reading files such as the one in the recipe. In this case, you had a file with book information, each book occupying three consecutive rows. The Row flattener flattened the field text into three different new fields: title, publishing_date, and price.

This way, every three rows, it generated a single one.

Note that if one book has a different number of rows (for example, if it lacks the price row), then you get unexpected results.

Note

The Row flattener flattens the rows as indicated in its setting window, no matter the content of the field being flattened.

There's more...

If you are not sure about the content of the file, you'd best avoid this simple solution and go for a more sophisticated one, for example, a solution that uses a Row denormalizer step.

See also

The sectioned named, Reading unstructured files in this chapter. Look at this recipe if you have to read a file, such as the one in the recipe, but you are not sure of the number of rows occupied by each element.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset