Processing rows differently based on the row number

There will be some situations where you will need to process the data differently depending on the position or number of each row.

Let's assume that you have a bookstore and want to know the top five bestsellers books, the following 10 bestsellers, and the rest of the books for different purposes (for example, to do a differentiated marketing promotion for each group). To do this, you will divide the list of books into different groups depending on their sales.

Getting ready

You need an Excel spreadsheet file containing a list of books with the following columns:

  • title
  • id_author
  • price, id_title
  • genre
  • sales

This last column represents the quantity of books sold in the last period.

You can download a sample file from the book's website.

How to do it...

Carry out the following steps:

  1. Create a new transformation and drag an Excel Input step from the Input category.
  2. Under the Files tab, browse to and select the sales_books.xls file.
  3. Complete the Fields tab with the following values:
    How to do it...
  4. Add a Sort rows step from the Transform category. Complete the step grid with the sales Fieldname. Type N in the Ascending column.
  5. Add an Add sequence step from the Transform category. Type rank in the Name of value textbox.
  6. By previewing this step, you will obtain a list of books ranked by their sales.
  7. Add two Filter rows steps and three Dummy steps (all from the Flow category) and create the hops, as depicted in the following diagram:
    How to do it...
  8. In the first Filter rows, set the following condition: rank <= 5.
  9. In the last Filter rows step add the condition rank <= 15.
  10. The Dummy 1 step represents the 5 best-selling books. For example:
    How to do it...
  11. The Dummy 2 step represents the next 10 best-selling books.
  12. The rest of the books can bee seen in the Dummy 3 step.
  13. You can do a preview of each of these Dummy steps and verify the results.

How it works...

This recipe reads the sales_books.xls file to create a dataset of the book titles along with their sales information. The Sort rows step is necessary to order the books by sales starting with the best seller.

Then, you dropped an Add sequence step to enumerate the rows. In this case, the field you added represents the ranking value. The best selling book will have the number one.

At this moment, you have the list of books ranked by their sales. Now, you only have to filter the books based on their ranks. You do it by using the Filter rows step.

The first Filter rows step uses the condition rank <= 5 to get the top five best-selling books. The rest of the books will be filtered again, now with the condition rank <= 15; this will bring the rows ranked from 6 to 15. The remaining books, those with a rank greater than 15, will go to the last Dummy step.

There's more...

In the recipe, you enumerated the rows and then you did different things based on the row number. There are also some specific use cases, which are explained in the following subsections.

Identifying specific rows

Suppose that you only want to keep the books with rank 15 to 20 and discard the rest. In this case, you don't have to add the Add sequence step and the Filter rows step afterward. There is a simpler way of doing that. There is also a step named Sample rows in the Statistics category that allows picking specific rows from a dataset. For example, filling the Lines range textbox with 1..5,9,15..20, you will get:

  • The rows 1 to 5
  • The row 9
  • The rows 15 to 20

The rest of the lines will be discarded. For the preceding example, you should just type 15..20.

Identifying the last row in the stream

Suppose that you want to know which book sold the least. In this case, you cannot filter by row number because you don't know how many books there are. In this case, instead of enumerating the rows, you can use the Identify last row in a stream step from the Flow category.

In this step, you only have to type a value for the Result fieldname textbox. When you execute the transformation, this new field will return Y for the last row and N otherwise. In the example, you can know which the least sold book was, by filtering the row where the field is equal to Y.

Avoiding using an Add sequence step to enumerate the rows

If you need to enumerate the rows just after reading the data, then you don't need to add an Add sequence step. In several of the input steps, such as Text file input or Get data from XML, you have a checkbox named Rownum in output? under the Content tab. This allows you to create a new field with a sequence for the rows. The name of this new field must be typed in the Rownum fieldname textbox.

This also applies when you need to rank the rows as in the recipe, and your input data is already ordered.

See also

The recipe named Splitting a stream into two or more streams based on a condition in this chapter. This recipe has all the details about the Filter row step.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset