Using the name of a file (or part of it) as a field

There are some occasions where you need to include the name of a file as a column in your dataset for further processing. With Kettle, you can do it in a very simple way.

In this example, you have several text files about camping products. Each file belongs to a different category, and the filename tells you which one. For example, tents.txt contains tent products. You want to obtain a single dataset with all the products from these files, including a field that indicates the category of each product.

Getting ready

To run this exercise, you need a directory (campingProducts) with text files named kitchen.txt, lights.txt, sleeping_bags.txt, tents.txt, and tools.txt. Each file lists one product per line: its description and its price, separated by a pipe (|). For example:

Swedish Firesteel - Army Model|$19.97
Mountain House #10 Can Freeze-Dried Food|$53.50
Coleman 70-Quart Xtreme Cooler (Blue)|$59.99
Kelsyus Floating Cooler|$26.99
Lodge LCC3 Logic Pre-Seasoned Combo Cooker|$41.99
Guyot Designs SplashGuard-Universal|$7.96

How to do it...

Carry out the following steps:

  1. Create a new transformation.
  2. Drop a Text file input step into the work area and use it to read the files. Under the File tab, type or browse to the campingProducts directory in the File or directory textbox, and type .*\.txt in the Regular Expression textbox. Click on the Add button.
  3. Under the Content tab, type | as the Separator and complete the Fields tab as follows:
    [Screenshot: Fields tab of the Text file input step]
  4. Under the Additional output fields tab, type filename in the Short filename field textbox.
  5. Preview this step. You will see a new field named filename that holds the name of each file (for example, kitchen.txt).
  6. Now, you must split the filename text to get the category. Add a Split Fields step from the Transform category, double-click it, and fill in the settings window, as shown in the following screenshot:
    [Screenshot: settings window of the Split Fields step]
  7. Preview the last step of the transformation. You will see a dataset with the camping products, their prices, and a column named category that holds the category of each product.
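
For readers who want to check the expected result outside Kettle, the steps above can be approximated with a short Python sketch. This is only an illustration of the logic, not how Kettle implements it. The function name read_products and the field names description and price are assumptions made for this example; filename and category are the field names used in the recipe.

```python
import re
from pathlib import Path

# Same regular expression as the one typed in the Text file input step.
FILE_PATTERN = re.compile(r".*\.txt")


def read_products(directory):
    """Return one dict per product line, with filename and category added.

    The keys "description" and "price" are assumed names for the two
    pipe-separated values; the recipe does not spell them out.
    """
    rows = []
    for path in sorted(Path(directory).iterdir()):
        # Skip anything that is not a file matching the pattern.
        if not path.is_file() or not FILE_PATTERN.fullmatch(path.name):
            continue
        # Equivalent of the Split Fields step: keep the text before the first period.
        category = path.name.split(".")[0]
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.rstrip("\r\n")
                if not line:
                    continue  # ignore blank lines
                description, price = line.split("|", 1)
                rows.append({
                    "description": description,
                    "price": price,
                    "filename": path.name,  # short filename, as in step 4
                    "category": category,
                })
    return rows
```

Calling read_products("campingProducts") on the directory described in Getting ready would return one row per product, each carrying the short filename and the category derived from it.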

How it works...

This recipe showed you how to convert the names of the files into a new field named category. The source directory you entered in the Text file input step contains several files whose names are the categories of the products. Under the Additional output fields tab, you added the Short filename as a field (for example, tents.txt); you could also have added the extension, size, or full path, among other fields.

The next step in the transformation, a Split Fields step, uses a period (.) as the Delimiter and keeps only the first part of the field, which is the category (tents in the example). The second part, which is the extension of the filename (txt), is discarded. If you don't want to discard the extension, add another field to the grid (for example, a field named fileExtension). Note that for this field, you set the type, but you can also specify a format, length, and so on.
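
The effect of adding a second field to the grid can be sketched in Python as follows. The helper split_filename is hypothetical and only mimics the idea of assigning each delimited part to its own field; how Kettle treats names with more than one period is not covered by this sketch.

```python
def split_filename(filename, delimiter="."):
    """Split a short filename into (category, file_extension).

    Illustrative only: the first part becomes the category and the
    second part, if present, becomes the extension. Any further parts
    are ignored here.
    """
    parts = filename.split(delimiter)
    category = parts[0]
    file_extension = parts[1] if len(parts) > 1 else None
    return category, file_extension
```

For example, split_filename("tents.txt") yields the category tents and the extension txt, which correspond to the category and fileExtension fields mentioned above.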
