Aggregating data

While QlikView shines in dealing with massive data volumes, sometimes we just do not need to load everything at an atomic level. Data aggregation can, for example, be used in deployments where document segmentation by detail is needed, in which case two documents are created to serve different user groups and analysis needs: one document will have all data with the highest level of detail and another one will have a similar data model but with aggregated (reduced) tables. This way, users are better served by keeping a balance between performance and analysis needs.

In this section, we will implement a document segmentation scenario by aggregating the Flight Data table to create a second document intended for executive users, who only require summary data.

Aggregating the Flight Data table

When aggregating data, the first step is always to define which dimension fields will be left out and which ones will be kept in the summarized table. We should analyze this question by looking at the data from the ground up, that is, by reviewing each dimension from the most granular to the most general. The following list shows the most important dimension fields in the Flight Data table, sorted by granularity:

  • Airport (Origin and Destination)
  • City
  • State
  • Country
  • Aircraft Type
  • Aircraft Group
  • Airline / Carrier
  • Carrier Group
  • Region
  • Month
  • Quarter
  • Year

If we analyze how removing each dimension would individually affect the result of the summarization process, we can find that the most impact would come from removing the Airport dimensions, both Origin and Destination, since those are the ones with the greatest granularity. At the same time, we can say that the Airport dimension does not add much value to the analyses we are looking to deliver in our document, so it's a good choice to leave it out.

Note

Dropping dimensions from the data directly impacts the analyses that can be made in the resulting QlikView document. Therefore, the decision to leave out certain fields for the sake of summarization should always be discussed with the end user.

We could remove additional dimensions, for example, Aircraft Type or Carrier, but as we move up the detail ladder to the most general dimensions, those dimensions become more and more important to accomplish different analyses.

We must add that leaving dimensions out should be a thorough decision process, thinking both in terms of analytical requirements and the aggregation rate we can achieve. For example, removing the Country dimension would not result in any substantial aggregation if we keep the State field. Also, what happens if we remove the Airport dimensions but keep Origin City and Destination City? What happens is, not surprisingly, that the table will not be significantly reduced since both fields keep a close relation and their granularity is almost the same (there is only one airport in most cities). Therefore, and for the sake of simplicity, we will also leave out all city, state, and country fields.

Finally, before proceeding, we should keep in mind how many records the original table has, in order to be able to measure how much reduction we achieved in the summarization. In our case, the Flight Data table originally contains 1,256,075 rows.

Moving on to the aggregation process, follow these steps:

  1. Create a new QlikView document and save it inside the 2.Workbooks folder with the name Transform - Flight Data.qvw.
  2. Go to the Script Editor window, click on the Table Files… button in the tool pane and navigate to the 3.QVDSource folder.
  3. Select the Flight Data.qvd file and click Finish on the File Wizard window.
  4. From the generated Load script, find the lines corresponding to those fields related to origin and destination airports and erase them. The fields we should remove are:
    • %Origin Airport ID
    • %Origin Airport Sequence ID
    • %Origin Airport Market ID
    • %Origin World Area Code
    • %Destination Airport ID
    • %Destination Airport Sequence ID
    • %Destination Airport Market ID
    • %Destination World Area Code Distance
    • Origin Airport Code
    • Origin City
    • Origin State Code
    • Origin State FIPS
    • Origin State
    • Origin Country Code
    • Origin Country
    • Destination Airport Code, Destination City
    • Destination State Code
    • Destination State FIPS
    • Destination State
    • Destination Country Code Destination Country
    • From - To Airport Code
    • From - To Airport ID
    • From - To City
    • From - To State Code
    • From - To State
  5. Next, from the list of fields we have kept, we need to identify those that are dimensions and those that are measures. Our measure fields are:
    • # Departures Scheduled
    • # Departures Performed
    • # Payload
    • # Available Seats
    • # Transported Passengers
    • # Transported Freight
    • # Transported Mail
    • # Ramp-To-Ramp Time
    • # Air Time
  6. The aggregation functions will be applied to these fields, that is, we will sum the # of Departures, or sum the # Transported Passengers. Identify where each of the listed fields are in the created load statement and replace the field name with the following expression:
    Sum(Field Name) as Field Name
    

    where Field Name represents each of the listed measures.

    Note

    Be careful not to remove the comma that separates each field definition and remove the comma from the last listed field, before the From keyword.

  7. Finally, we will add a Group By clause to the end of the Load statement, and list all dimension fields that have been kept in the script, separated by a comma.
  8. We will also add a table name preceding the Load keyword.
  9. In the end, the aggregation script will look like this:
    Flights:
    LOAD
         [%Airline ID],
         [%Carrier Group ID],
         [%Unique Carrier Code],
         [%Unique Carrier Entity Code],
         [%Region Code],
         [%Aircraft Group ID],
         [%Aircraft Type ID],
         [%Aircraft Configuration ID],
         [%Distance Group ID],
         [%Service Class ID],
         [%Datasource ID],
         [Unique Carrier],
         [Carrier Code],
         [Carrier Name],
         Year,
         Period,
         Quarter,
         [Month (#)],
         Month,
         Sum([# Departures Scheduled]) as [# Departures Scheduled],
         Sum([# Departures Performed]) as [# Departures Performed],
         Sum([# Payload]) as [# Payload],
         Sum([# Available Seats]) as [# Available Seats],
         Sum([# Transported Passengers]) as [# Transported Passengers],
         Sum([# Transported Freight]) as [# Transported Freight],
         Sum([# Transported Mail]) as [# Transported Mail],
         Sum([# Ramp-To-Ramp Time]) as [# Ramp-To-Ramp Time],
         Sum([# Air Time]) as [# Air Time]
    FROM
    [..3.QVDSourceFlight Data.qvd]
    (qvd)
    Group By 
    [%Airline ID], [%Carrier Group ID], [%Unique Carrier Code],
    [%Unique Carrier Entity Code], [%Region Code], [%Aircraft Group ID],
    [%Aircraft Type ID], [%Aircraft Configuration ID], [%Distance Group ID],
    [%Service Class ID], [%Datasource ID], [Unique Carrier], [Carrier Code],
    [Carrier Name], Year, Period, Quarter, [Month (#)], Month;
  10. Next, we will just save the changes and reload the script.

The resulting table will turn our 1,256,075 rows into only 100,091. A brief example of what just happened is shown in the following screenshot:

Aggregating the Flight Data table

Notice how the totals remain the same for both tables.

A smaller table will occupy fewer resources (RAM and CPU) and, therefore, calculations will be faster. If the performance gain attained with data aggregation doesn't mean reducing business value and/or functionality for the end user, then it's a winning approach any day.

The Transformation output

We have loaded the base QVD containing flight data and transformed it by applying aggregations, now what? Well, the next steps would be to store the transformed table, using the store command, into a new QVD file that will reside in the 3.QVDsTransformed folder.

After that, a new data model could be created in the Presentation Layer based on the Airline Operations document, but using the newly aggregated QVD and without the Origin and Destination dimensions. This new QlikView document is intended to serve the users who only need summarized information about the Airline Operations document.

Aggregation functions

Of course, QlikView offers more aggregation options than summing. The most commonly used options are shown in the following table:

Function

Explanation

Example

Sum()

Sums numeric expressions. Optionally a DISTINCT qualifier can be added, this will cause the function to ignore duplicate values.

Sum(DISTANCE)

Sum(DISTINCT AIR_TIME)

Min()

Returns the lowest value within a numeric range. Optionally a rank can be specified, this will return the nth lowest number. So 2 returns the second lowest number.

Min(DISTANCE)

Min(DISTANCE, 2)

Max()

Returns the highest value within a numeric range. Optionally a rank can be specified, this will return the nth highest number. So 2 returns the second highest number.

Max(PASSENGERS)

Max(PASSENGERS, 2)

Only()

If the aggregation of a value returns only a single value, that value is returned, otherwise the function returns null. For example, when an expression contains the values {1, 1, 1} then the Only() function will return 1. If an expression contains the values {1, 2, 3} then the Only() function returns null.

Only(SEATS)

MinString()

Similar to the Min() function, but applied to text strings. Also, it does not have the optional rank parameter.

MinString(MANUFACTURER)

MaxString()

Similar to the Max() function, but applied to text strings. Also, it does not have the optional rank parameter.

MaxString(MANUFACTURER)

Concat()

Concatenates all the values of an expression into a single string, which is separated by a delimiter given as a function's parameter. Has an optional DISTINCT qualifier which will set the function to ignore duplicate values.

Concat(AIRPORT_NAME, ';')

Concat(DISTINCT MANUFACTURER, ',')

Count()

Counts the number of items in the input expression. Has an optional DISTINCT qualifier that sets the function to ignore duplicate values. Instead of an expression an * (asterisk) can also be used to count the number of rows.

Count(AIRCRAFT_NAME)

Count(DISTINCT AIRCRAFT_NAME)

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset