Reading a simple file

In this recipe, you will learn the use of the Text file input step. In the example, you have to read a simple file with a list of authors' information like the following:

"lastname","firstname","country","birthyear"
"Larsson","Stieg","Swedish",1954
"King","Stephen","American",1947
"Hiaasen","Carl ","American",1953
"Handler","Chelsea ","American",1975
"Ingraham","Laura ","American",1964

Getting ready

In order to continue with the exercise, you must have a file named authors.txt similar to the one shown in the introduction section of this recipe.

How to do it...

Carry out the following steps:

  1. Create a new transformation.
  2. Drop a Text file input step to the canvas.
  3. Now, you have to type the name of the file (authors.txt) with its complete path. You do it in the File or directory textbox.

    Tip

    Alternatively, you can select the file by clicking on the Browse button and looking for the file. The textbox will be populated with the complete path of the file.

  4. Click on the Add button. The complete text will be moved from the File or directory textbox to the grid.
  5. Select the Content tab and fill in the required fields, as shown in the following screenshot:
    How to do it...
  6. Select the Fields tab and click on the Get Fields button to get the definitions of the fields automatically. The grid will be populated, as shown in the following screenshot:
    How to do it...

    Tip

    Kettle doesn't always guess the data types, size, or format as expected. So, after getting the fields, you may change the data to what you consider more appropriate.

    Note

    When you read a file, it's not mandatory to keep the names of the columns as they are in the file. You are free to change the names of the fields as well.

  7. Click on the Preview button and you will see some sample rows built with the data in your file.

How it works...

You use the Text file input in order to read text files, in this case, the authors.txt file.

Looking at the content of the file, you can see that the first line contains the header of the columns. In order to recognize that header, you have to check the Header checkbox under the Content tab, and type 1 in the Number of header lines textbox. You also have to indicate the field's separator. The separator can be made of one or more characters the most used being the semicolon, colon, or a tab. Finally, you can indicate the Enclosure string, in this case,".

PDI takes all that information and uses it to parse the text file and fill the fields correctly.

There's more...

To work with these kinds of delimited text files, you could choose the CSV file input step. This step has a less powerful configuration, but it provides better performance.

If you explore the tabs of the Text file input setting window, you will see that there are more options to set, but the ones just explained are by far the most used. But there are a couple of additional features that may interest you:

Alternative notation for a separator

Instead of typing the separator for the fields, you can use the following notation:

$[H1, H2, ...]

Where the values H1, H2, ... are the hexadecimal codes for the separators. For example, for specifying a tilde (~) as the separator, instead of typing it, you could type $[7E]. However, this notation makes more sense when your separators are non printable characters.

For the enclosure string the hexadecimal notation is also allowed.

About file format and encoding

If you are trying to read a file without success, and you have already checked the most common settings, that is, the name of the file, the header, the separator and the fields, you should take a look at and try to fix the other available settings. Among those, you have Format and Encoding.

Format allows you to specify the format of your file(s): DOS (default value) or UNIX. If your file has a Unix format, you should change this setting. If you don't know the format, but you cannot guarantee that the format will be DOS, you can choose the mixed option.

Encoding allows you to specify the character encoding to use. If you leave it blank, Kettle will use the default encoding on your system. Alternatively, if you know the encoding and it is different from the default, you should select the proper option from the drop-down list.

About data types and formats

When you read a file and tell Kettle which fields to get from that file, you have to provide at least a name and a data type for those fields. In order to tell Kettle how to read and interpret the data, you have more options. Most of them are self-explanatory, but the format, length, and precision deserve an explanation:

If you are reading a number, and the numbers in your file have separators, dollar signs, and so on, you should specify a format to tell Kettle how to interpret that number. The format is a combination of patterns and symbols as explained in the Sun Java API documentation at the following URL:

http://java.sun.com/javase/6/docs/api/java/text/DecimalFormat.html

If you don't specify a format for your numbers, you may still provide a length and precision. Length is the total number of significant figures, while precision is the number of floating point digits.

Note

If you don't specify format, length, or precision, Kettle will do its best to interpret the number, but this could lead to unexpected results.

In the case of dates, the same thing happens. When your text file has a date, you have to select or type a format mask, so Kettle can recognize the different components of the date in the field. For a complete reference on date formats, check the Sun Java API documentation, located at the following URL:

http://java.sun.com/javase/6/docs/api/java/text/SimpleDateFormat.html

Altering the names, order, or metadata of the fields coming from the file

If you want to reorder or delete some of the columns you read, you have to add another step to the transformation. Suppose you want to move the country name to the end of the list of columns, changing it to a more suitable field name, such as nationality.

In this case, add a Select values step. The Select values step allows you to select, rename, reorder, and delete fields, or change the metadata of a field.

Under the Select & Alter tab, select all the fields and manipulate those according to your needs as shown in the following example:

Altering the names, order, or metadata of the fields coming from the file

Tip

If you just want to rename the columns, you don't need a Select values step. You can do it in the Text file input step by typing the names manually.

Reading files with fixed width fields

In the example, you read a CSV (Comma Separated Values) file type. This is the default value for the type of file, as you can see under the Content tab. You have another option here named Fixed for reading files with fixed-width columns. If you choose this option, a different helper GUI will appear when you click on the Get fields button. In the wizard, you can visually set the position for each of your fields.

There is also another step named Fixed file input in the Input category to apply in these cases. It provides better performance and has a simpler, but less flexible configuration.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset