There are some occasions where you need to include the name of a file as a column in your dataset for further processing. With Kettle, you can do it in a very simple way.
In this example, you have several text files about camping products. Each file belongs to a different category and you know the category from the filename. For example, tents.txt contains tent products. You want to obtain a single dataset with all the products from these files, including a field indicating the category of every product.
In order to run this exercise, you need a directory (campingProducts) with text files named kitchen.txt, lights.txt, sleeping_bags.txt, tents.txt, and tools.txt. Each file contains descriptions of the products and their prices, separated with a |. For example:
Swedish Firesteel - Army Model|$19.97
Mountain House #10 Can Freeze-Dried Food|$53.50
Coleman 70-Quart Xtreme Cooler (Blue)|$59.99
Kelsyus Floating Cooler|$26.99
Lodge LCC3 Logic Pre-Seasoned Combo Cooker|$41.99
Guyot Designs SplashGuard-Universal|$7.96
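To make the line layout concrete, here is a minimal Python sketch, outside of Kettle, that parses one such line into a description and a numeric price. The function name parse_product_line is hypothetical, and the sketch assumes that every line contains a | and that every price starts with a dollar sign, as in the sample above.

```python
from decimal import Decimal


def parse_product_line(line):
    """Split 'description|$price' into (description, Decimal price)."""
    # Split on the last pipe so the description is kept whole.
    description, price_text = line.rstrip("\n").rsplit("|", 1)
    # Assumption: prices look like '$19.97'; strip the currency sign.
    price = Decimal(price_text.strip().lstrip("$"))
    return description.strip(), price


print(parse_product_line("Swedish Firesteel - Army Model|$19.97"))
```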
Carry out the following steps:
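1. Create a transformation and add a Text file input step. Type the path to the campingProducts directory in the File or directory textbox, and use .*\.txt as Regular Expression. Click on the Add button.
2. Under the Content tab, type | as the Separator and complete the Fields tab with one field for the product description and one for the price.
3. Under the Additional output fields tab, type filename in the Short filename field.
4. Preview the step. You should see a new field named filename with the name of the file (for example: kitchen.txt).
5. Add a Split Fields step that splits the filename field using a period (.) as the Delimiter, and add a new field named category.
6. Preview the transformation. You should see the new field category with the proper product category.

This recipe showed you the way to convert the names of the files into a new field named category. The source directory you entered in the Text file input step contains several files whose names are the categories of the products. Under the Additional output fields tab, you incorporated the Short filename as a field (for example tents.txt); you could also have included the extension, size, or full path among other fields.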
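The effect of the Text file input step combined with the Short filename output field can be approximated outside Kettle with a short Python sketch. It is only an illustration under stated assumptions, not how Kettle implements the step: the function name read_products and the dictionary keys description and price are hypothetical, the files are assumed to be UTF-8, and every non-blank line is assumed to contain a |.

```python
import re
from pathlib import Path

# Pattern for any name ending in ".txt" (the dot is escaped).
TXT_PATTERN = re.compile(r".*\.txt")


def read_products(directory):
    """Return one dict per product line, tagged with its source file name."""
    rows = []
    for path in sorted(Path(directory).iterdir()):
        # Keep only regular files whose whole name matches the pattern.
        if not (path.is_file() and TXT_PATTERN.fullmatch(path.name)):
            continue
        for line in path.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue  # skip blank lines
            description, price = line.rsplit("|", 1)
            rows.append({
                "description": description,
                "price": price,
                "filename": path.name,  # short filename, e.g. tents.txt
            })
    return rows
```

Calling read_products("campingProducts") would then yield a single list in which every row carries the name of the file it came from.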
The next step in the transformation, a Split Fields step, uses a period (.) as the Delimiter to keep only the first part of the field, which is the category (tents in the example). It discards the second part, which is the extension of the filename (txt). If you don't want to discard the extension, you must add another field in the grid (for example, a field named fileExtension). Note that for this field, you set the type, but you can also specify a format, length, and so on.
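What the Split Fields step does to the filename field can be sketched in Python as follows. The helper split_filename is hypothetical; it assumes, as in this recipe, that each file name contains a single period separating the category from the extension.

```python
def split_filename(filename):
    """Split 'tents.txt' into its category and extension parts."""
    # partition splits at the first period only.
    category, _, file_extension = filename.partition(".")
    return category, file_extension


for name in ["kitchen.txt", "sleeping_bags.txt", "tents.txt"]:
    category, file_extension = split_filename(name)
    print(name, "->", category, file_extension)
```

Keeping only the first returned value corresponds to discarding the extension; keeping both corresponds to adding the extra fileExtension field in the grid.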