Using an ETL approach to create QVD data layers

We now know that there are very good reasons for adopting an ETL approach to loading data in QlikView. Now we need to learn how we should go about implementing the approach.

Each part—Extract, Transform, and Load—has its own set of recommendations because each part has a very different function.

Essentially, the approach looks like this:

[Figure: Using an ETL approach to create QVD data layers]

The approach can be explained as follows:

  1. Extract the data from data sources into QVDs.
  2. Transform the data from the initial QVDs into transformed fact tables and conformed dimensions.
  3. Load the transformed QVDs into the final applications.

The final two layers, the transformed QVDs and the final applications, become potential sources for a user's self-service. We can have confidence that users who load data from these layers will be getting access to clean, governed data.

Creating a StoreAndDrop subroutine

When we are loading data to create QVDs, we will end up calling the Store statement quite frequently. We also tend to drop tables once we have stored them, as we no longer need that data in the QVW that created them, so we will call the Drop statement quite often as well.

Any time we do something frequently in the script, it is a good idea to put it into a subroutine that we can call. Here is an example of a subroutine that performs both the Store and the Drop operations:

Sub StoreAndDrop(vTableName, vPrefix, vQVDFolder)
  Store [$(vTableName)] into [$(vQVDFolder)$(vPrefix)$(vTableName).qvd];
  Drop Table [$(vTableName)];
End Sub

The subroutine is passed the name of the table that we want to store, a prefix that we might want to add to the QVD filename, and the folder that we want to put the file in; again, this can be an absolute, relative, or UNC path. Note that the folder value should end with a path separator, because the subroutine simply concatenates the folder, prefix, and table name.

Here are some examples of calling this subroutine:

Call StoreAndDrop('Table1', 'E_', 'C:\Temp\');
Call StoreAndDrop('Table2', 'EX_', '.\');

This is an example of a subroutine that you might want to keep in an external text file that can be included in all of your scripts; the advantage is that we have a central place for the maintenance and support of shared routines.

To include an external file, you would have a statement like this (this one can be created by using the menu in the script editor—Insert | Include Statement):

$(Include=..\scripts\storeanddrop.qvs);

Now, there is a slight problem with this directive: if the file doesn't exist, or there is some other problem reading it, QlikView will simply ignore the directive and fail silently. Therefore, we should probably modify the statement to read as follows:

$(Must_Include=..\scripts\storeanddrop.qvs);

This will throw an error in the script if there is a problem reading the file—which we probably want to have happen. The script failure will throw an error on the desktop or cause an automated task to fail on the server—unless we are handling the error using the ErrorMode and ScriptError variables.
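
As a minimal sketch of that error handling, assuming that with ErrorMode set to 0 a failing Must_Include records an error in ScriptError rather than aborting the script (the trace message and file path are illustrative):

Set ErrorMode = 0;    // do not abort the script when an error occurs

$(Must_Include=..\scripts\storeanddrop.qvs);

If ScriptError > 0 Then
  // ScriptError holds the code of the last script error
  Trace Failed to include storeanddrop.qvs - stopping here;
  Exit Script;
End If

Set ErrorMode = 1;    // restore the default fail-on-error behaviour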

Extracting data

The goal of extracting data is to connect to our database or other sources, and move the required data from source to QVD as quickly as possible. To this end, we will do basically no transformation of the data at all.

Creating an extractor folder structure

To keep things well organized, we should adopt a practice of keeping to a folder structure for our extraction files and the QVDs that they generate.

Within our Extractors folder, there should be a subfolder for each data source. For example, we will have a subfolder for our Sales Database and HR System. We might also have a subfolder for a set of budget files that are stored in Excel somewhere. We will very often have a Shared subfolder that will contain useful scripts and QVDs that will be shared across multiple sources. For example, we might store our StoreAndDrop script in the Shared subfolder structure. Our folder structure may look like the following screenshot:

[Screenshot: Creating an extractor folder structure]

Note

It is worth noting that if there is only going to be one budget Excel file and it relates to sales, it is perfectly correct to do the practical thing and handle it alongside the other sales data instead of creating a separate set of folders.

Unless an Excel file requires a load process such as CrossTable, I probably wouldn't create a QVD from it at all. The overhead counteracts any benefits.

Within each subfolder, there will be three new subfolders:

Includes: This folder will hold include text files containing QlikView script. A common use of such files is to store connection strings or variable declarations that might be shared across multiple files. By keeping such information in a separate include file, we can quickly change values without having to edit multiple QVWs. (A small example of such a file follows this list.)

QScript: This folder will hold either QVW files that will be executed by a server/publisher reload task, or text files (usually with a QVS extension) containing script that we will run via a publisher task. In either case, the purpose of the script is to connect to the data source, load the data, and store the data into QVD files.

QVD: The destination folder for the data generated by the scripts.
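
As a small sketch of what such an include file might contain, assuming a hypothetical file named connect_sales.qvs, an ODBC DSN called SalesDB, and illustrative variable names:

// ..\Includes\connect_sales.qvs (hypothetical file)

// Connection to the sales database; a server or credential change
// only requires editing this one file
ODBC CONNECT TO SalesDB;

// Shared values used by the extractor scripts for this data source
Set vQVDFolder = '..\QVD\';
Set vExtractPrefix = 'E_';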

Differentiating types of scripts

While all extractor scripts will connect to a data source, load the data, and then store to QVD, there are some logical differences based on the way that they operate and the frequency with which they will be executed. The main script types are as follows:

Low frequency: The data that is being loaded changes infrequently or not at all, so there is little point in refreshing the QVD on a very frequent (for example, daily) basis. A good example might be a calendar table, which we can calculate many years into the past and many years into the future. Another example might be a department structure that rarely changes. We can refresh the QVD every so often, automatically or manually, but not frequently. The complexity of the script is largely irrelevant because it runs so infrequently.

Simple, high frequency: Common for dimensional data; we connect to the data source, load the data, and store straight to QVD with little or no additional calculation. We do this frequently because we need to make sure that any changes in such data are reflected in the models. However, these dimension tables are relatively small compared to the fact tables, so loading the entire table every time is feasible.

Complex, high frequency: Usually applied to fact tables, where loading the entire table every time is not feasible. We need to apply additional logic so that we only load from the database those records that are new or changed, and then combine them with the records that we already have in a QVD to create the final extract QVD (a minimal incremental-load sketch follows this list).
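
As a minimal sketch of the complex, high frequency pattern, assuming a connection has already been made via an include file, a fact table called Sales with a LastModified timestamp and a SalesID primary key, and an existing E_Sales.qvd (all of these names, and the date formatting, are assumptions that will need adjusting):

// Find how far the existing extract QVD goes
MaxDate:
LOAD Max(LastModified) As MaxLastModified
FROM [..\QVD\E_Sales.qvd] (qvd);

Let vMaxDate = Peek('MaxLastModified', 0, 'MaxDate');
Drop Table MaxDate;

// Load only new or changed rows from the database
Sales:
SQL SELECT *
FROM Sales
WHERE LastModified > '$(vMaxDate)';

// Append the rows we already hold, skipping any that were just reloaded
Concatenate (Sales)
LOAD *
FROM [..\QVD\E_Sales.qvd] (qvd)
Where Not Exists(SalesID);

// Write the combined result back to the extract QVD
Call StoreAndDrop('Sales', 'E_', '..\QVD\');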

It is important to analyze your loads for these characteristics because you need to combine or split scripts appropriately, based on when you should be performing reloads. It is pointless, for example, to include a low frequency script along with a high frequency script in the same script module. It is also good practice to keep your simple scripts in a separate module from your complex scripts.

In an environment where there are data batch jobs running—for example, data warehouse ETL processes or financial account processing—we are often limited in our Qlik reloads to a certain time window. In those circumstances, we need to be even more certain that we are not loading unnecessary data.

Executing the extractors

Execution of the extractors should be very straightforward. Each of the scripts will connect to the data source, load the data, and write the data to a QVD in the QVD folder. At the end of execution, you should have an up-to-date set of QVDs for that data source, ready for transformations.

As a best practice, it is also a good idea to adopt a naming convention for the QVD files that are produced. Prefixing the files with a letter or abbreviation, such as E_ or EX_, makes it quick to distinguish an extractor QVD from any other. Including the table name in the filename is mandatory, and adding the data source name, or an abbreviation of it, is also a good step, for example:

E_SalesData_Customer.qvd
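
Putting this together, a simple, high frequency extractor might look like the following sketch; the include file paths, the Customer table, and the E_SalesData_ prefix are assumptions for illustration:

// Shared subroutine and connection details (paths assumed)
$(Must_Include=..\..\Shared\Includes\storeanddrop.qvs);
$(Must_Include=..\Includes\connect_sales.qvs);

// Straight pull of the Customer table, no transformation
Customer:
SQL SELECT *
FROM Customer;

// Stores ..\QVD\E_SalesData_Customer.qvd and drops the table from memory
Call StoreAndDrop('Customer', 'E_SalesData_', '..\QVD\');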

Transforming data

The transformation step is where all the magic happens. We will take QVD data (and possibly simple Excel data) and transform that data into conformed dimensions and fact tables.

Creating a transformer and model folder structure

When transforming, we are going to make use of two folder structures. One will hold the transformation scripts and include files that will actually perform the transformations. The other folder structure will hold the QVDs that are output from those transformations. The reason we split into Transformers and Models is that, in theory, we should only have one transformer that creates a QVD, such as a conformed dimension, but that QVD may need to be written out to more than one Model subfolder.

The subfolders under Transformers and Models should be based on the modeling that you have performed in advance—either process or line-of-business based. Have a look at the following screenshot:

[Screenshot: Creating a transformer and model folder structure]

Note

It is worth remembering that when we are using a structured folder arrangement like this, we should use relative paths in our scripts so that we can move files from development servers to test or production servers, where we will have established identical paths, without having to change the script.

Executing transformers

The only rule that we can specify about execution of transformers is that they need to be appropriately scheduled after the extractors that create the QVDs that the transformers depend on. Other than that, we will be applying different business rules to that data and those rules are context-specific.
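
As an illustrative sketch only, a transformer might load extractor QVDs, apply a business rule via a mapping, and store a conformed dimension; the table names, field names, and relative paths here are all assumptions about your structure:

// Map raw region codes to reporting region names (illustrative rule)
Region_Map:
Mapping LOAD RegionCode, RegionName
FROM [..\..\Extractors\Sales Database\QVD\E_SalesData_Region.qvd] (qvd);

// Build the conformed customer dimension
Customer:
LOAD
  CustomerID,
  Capitalize(Trim(CustomerName)) As [Customer Name],
  ApplyMap('Region_Map', RegionCode, 'Other') As Region
FROM [..\..\Extractors\Sales Database\QVD\E_SalesData_Customer.qvd] (qvd);

// Write the dimension out as DIM_Customer.qvd in the relevant model folder
Call StoreAndDrop('Customer', 'DIM_', '..\..\Models\Campaign Management\');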

Note

This is a good point to add a reminder that when creating surrogate keys using the AutoNumber function, the keys generated can only be relied upon within the same script. We can't create surrogate keys in a QVD in one script and expect them to match surrogate keys created in a different script, even if the original keys were identical. We can, however, use a function such as Hash256 to create consistent surrogate keys between different loads, remembering to apply AutoNumber on them when loading data into the final application.
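
As a brief sketch of the Hash256 approach, with illustrative field names, a transformer might build the deterministic key like this (the matching AutoNumber step in the final application is shown later, in the load step):

// Deterministic surrogate key that is identical across separate reloads
// because it depends only on the source key values; the same expression
// in the dimension transformer yields the same key
FACT_Sales:
LOAD
  *,
  Hash256(CustomerID, ProductID, OrderDate) As SalesKey
FROM [..\..\Extractors\Sales Database\QVD\E_SalesData_Sales.qvd] (qvd);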

It is a good practice to apply a naming convention to the files that are generated in the Models folders. A common convention is to apply a prefix of FACT_ to a fact table and DIM_ to a dimension table. A source name would not be appropriate here as there may be multiple sources, so just the prefix plus the table name will suffice, for example:

FACT_Sales.qvd
DIM_Organization.qvd
DIM_Calendar.qvd

For operational reasons, you may wish to partition your fact tables, so a partition indicator would be appropriate:

FACT_Sales_2012.qvd
FACT_Sales_2013.qvd
FACT_Sales_2014.qvd
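
A minimal sketch of producing such partitions at store time, assuming the in-memory fact table has a Year field and using a hypothetical vYear loop variable (the target path is also an assumption):

// Write one QVD per year and then drop the full table
For Each vYear in FieldValueList('Year')

  [FACT_Sales_$(vYear)]:
  NoConcatenate
  LOAD * Resident FACT_Sales
  Where Year = $(vYear);

  Store [FACT_Sales_$(vYear)]
    into [..\..\Models\Campaign Management\FACT_Sales_$(vYear).qvd];

  Drop Table [FACT_Sales_$(vYear)];

Next vYear

Drop Table FACT_Sales;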

Loading data

If the transformation step has been carried out correctly, there should be very little to do in the UserApp folder other than to load the QVDs.

Creating a UserApp folder structure

As with the other operations, it is a best practice to create a UserApp folder structure with a subfolder structure that represents either the business process or line-of-business for the apps within it.

This whole UserApp folder can be mounted on a QlikView server, or each subfolder could be mounted separately.

Executing the load step

The load step could be as simple as the following:

LOAD * FROM [..\Models\Campaign Management\FACT_Sales.qvd] (QVD);
LOAD * FROM [..\Models\Campaign Management\DIM_Organization.qvd] (QVD);
LOAD * FROM [..\Models\Campaign Management\DIM_Calendar.qvd] (QVD);

If the transformation step has been correctly implemented, then the tables should load (optimized load) and all the tables should be associated correctly with no synthetic keys.

The one allowable transformation (which does cause an unoptimized load) that might be performed in this step is the use of the AutoNumber function to generate surrogate keys. Using it at this stage will ensure that the generated keys will associate correctly as they are all being generated within the same script.
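
As a brief sketch, assuming the key field in the fact and dimension QVDs was built consistently (for example, with Hash256 as described earlier) and is called CustomerKey, the load might become:

// Renumber the wide keys into compact integers; safe because both
// AutoNumber calls run within this one script
Sales:
LOAD *, AutoNumber(CustomerKey) As %CustomerKey
FROM [..\Models\Campaign Management\FACT_Sales.qvd] (QVD);

Organization:
LOAD *, AutoNumber(CustomerKey) As %CustomerKey
FROM [..\Models\Campaign Management\DIM_Organization.qvd] (QVD);

// Remove the original wide key so the tables associate only on the
// new surrogate key
Drop Field CustomerKey;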
