In this recipe, you will learn the use of the Text file input step. In the example, you have to read a simple file with a list of authors' information like the following:
"lastname","firstname","country","birthyear" "Larsson","Stieg","Swedish",1954 "King","Stephen","American",1947 "Hiaasen","Carl ","American",1953 "Handler","Chelsea ","American",1975 "Ingraham","Laura ","American",1964
In order to continue with the exercise, you must have a file named authors.txt
similar to the one shown in the introduction section of this recipe.
Carry out the following steps:
You use the Text file input in order to read text files, in this case, the authors.txt
file.
Looking at the content of the file, you can see that the first line contains the header of the columns. In order to recognize that header, you have to check the Header checkbox under the Content tab, and type 1
in the Number of header lines textbox. You also have to indicate the field's separator. The separator can be made of one or more characters the most used being the semicolon, colon, or a tab. Finally, you can indicate the Enclosure string, in this case,"
.
PDI takes all that information and uses it to parse the text file and fill the fields correctly.
To work with these kinds of delimited text files, you could choose the CSV file input step. This step has a less powerful configuration, but it provides better performance.
If you explore the tabs of the Text file input setting window, you will see that there are more options to set, but the ones just explained are by far the most used. But there are a couple of additional features that may interest you:
Instead of typing the separator for the fields, you can use the following notation:
$[H1, H2, ...]
Where the values H1, H2, ..
. are the hexadecimal codes for the separators. For example, for specifying a tilde (~) as the separator, instead of typing it, you could type $[7E]. However, this notation makes more sense when your separators are non printable characters.
For the enclosure string the hexadecimal notation is also allowed.
If you are trying to read a file without success, and you have already checked the most common settings, that is, the name of the file, the header, the separator and the fields, you should take a look at and try to fix the other available settings. Among those, you have Format and Encoding.
Format allows you to specify the format of your file(s): DOS (default value) or UNIX. If your file has a Unix format, you should change this setting. If you don't know the format, but you cannot guarantee that the format will be DOS, you can choose the mixed option.
Encoding allows you to specify the character encoding to use. If you leave it blank, Kettle will use the default encoding on your system. Alternatively, if you know the encoding and it is different from the default, you should select the proper option from the drop-down list.
When you read a file and tell Kettle which fields to get from that file, you have to provide at least a name and a data type for those fields. In order to tell Kettle how to read and interpret the data, you have more options. Most of them are self-explanatory, but the format, length, and precision deserve an explanation:
If you are reading a number, and the numbers in your file have separators, dollar signs, and so on, you should specify a format to tell Kettle how to interpret that number. The format is a combination of patterns and symbols as explained in the Sun Java API documentation at the following URL:
http://java.sun.com/javase/6/docs/api/java/text/DecimalFormat.html
If you don't specify a format for your numbers, you may still provide a length and precision. Length is the total number of significant figures, while precision is the number of floating point digits.
If you don't specify format, length, or precision, Kettle will do its best to interpret the number, but this could lead to unexpected results.
In the case of dates, the same thing happens. When your text file has a date, you have to select or type a format mask, so Kettle can recognize the different components of the date in the field. For a complete reference on date formats, check the Sun Java API documentation, located at the following URL:
http://java.sun.com/javase/6/docs/api/java/text/SimpleDateFormat.html
If you want to reorder or delete some of the columns you read, you have to add another step to the transformation. Suppose you want to move the country
name to the end of the list of columns, changing it to a more suitable field name, such as nationality
.
In this case, add a Select values step. The Select values step allows you to select, rename, reorder, and delete fields, or change the metadata of a field.
Under the Select & Alter tab, select all the fields and manipulate those according to your needs as shown in the following example:
In the example, you read a CSV (Comma Separated Values) file type. This is the default value for the type of file, as you can see under the Content tab. You have another option here named Fixed for reading files with fixed-width columns. If you choose this option, a different helper GUI will appear when you click on the Get fields button. In the wizard, you can visually set the position for each of your fields.
There is also another step named Fixed file input in the Input category to apply in these cases. It provides better performance and has a simpler, but less flexible configuration.