We now know that there are very good reasons for adopting an ETL approach to loading data in QlikView. Now we need to learn how we should go about implementing the approach.
Each part—Extract, Transform, and Load—has its own set of recommendations because each part has a very different function.
Essentially, the approach can be explained as follows:
The final two layers, the transformed QVDs and the final applications, become potential sources for a user's self-service. We can have confidence that users who load data from these layers will be getting access to clean, governed data.
When we are loading data to create QVDs, we will end up calling the Store statement quite frequently. Also, we tend to drop tables once we have stored them, as we don't need that data in the QVW file that created them, so we will also call the Drop statement quite often.
Any time that we do something quite frequently in the script, it is a good idea to put that into a subroutine that we can call. Here is an example of a subroutine that will perform the Store and Drop operations:
Sub StoreAndDrop(vTableName, vPrefix, vQVDFolder)
   Store [$(vTableName)] into [$(vQVDFolder)$(vPrefix)$(vTableName).qvd];
   Drop Table [$(vTableName)];
End Sub
The subroutine gets passed the name of the table that we want to store, a prefix that we might want to add to the QVD files, and the folder that we want to put the files in (again, this path can be absolute, relative, or UNC).
Here are some examples of calling this subroutine:
Call StoreAndDrop('Table1', 'E_', 'C:\Temp\');
Call StoreAndDrop('Table2', 'EX_', '.\');
This subroutine is an example of one that you might want to keep in an external text file that can be included in all of your scripts, the advantage being that we have a central place for the maintenance and support of shared functions.
To include an external file, you would have a statement like this (this one can be created by using the menu in the script editor—Insert | Include Statement):
$(Include=..\scripts\storeanddrop.qvs);
Now, there is a slight problem with this directive: if the file doesn't exist, or there is some other problem reading it, QlikView will simply ignore the directive and carry on (a silent failure). Therefore, we should probably modify the statement to read as follows:
$(Must_Include=..\scripts\storeanddrop.qvs);
This will throw an error in the script if there is a problem reading the file, which is probably what we want to happen. The script failure will raise an error on the desktop or cause an automated task to fail on the server, unless we are handling the error using the ErrorMode and ScriptError variables.
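As a sketch of how those variables can fit together (reusing the include path from the example above), the pattern looks something like this:

```
// Tell QlikView to continue on error instead of halting the script
Set ErrorMode = 0;

$(Must_Include=..\scripts\storeanddrop.qvs);

// If the include failed, ScriptError will hold a non-zero error code
If ScriptError > 0 Then
   Trace Include failed: $(ScriptErrorDetails);
   // Take remedial action, or stop deliberately
   Exit Script;
End If

// Restore the default behavior of halting on error
Set ErrorMode = 1;
```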
The goal of extracting data is to connect to our database or other sources, and move the required data from source to QVD as quickly as possible. To this end, we will do basically no transformation of the data at all.
To keep things well organized, we should adopt a practice of keeping to a folder structure for our extraction files and the QVDs that they generate.
Within our Extractors folder, there should be a subfolder for each data source. For example, we will have a subfolder for our Sales Database and one for our HR System. We might also have a subfolder for a set of budget files that are stored in Excel somewhere. We will very often have a Shared subfolder that contains useful scripts and QVDs that are shared across multiple sources. For example, we might store our StoreAndDrop script in the Shared subfolder structure. Our folder structure may look like the following screenshot:
It is worth noting that if there was only going to be one budget Excel file and it is related to sales, it is perfectly correct to do the practical thing and handle it alongside the other sales data instead of creating a separate set of folders.
Unless an Excel file requires a load process such as CrossTable, I probably wouldn't create a QVD from it at all; the overhead counteracts any benefits.
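A budget spreadsheet is a typical case that does need such a process: it usually has one column per month that must be unpivoted into rows. A minimal sketch, assuming a hypothetical SalesBudget.xlsx file with Region and Product as the two qualifier columns, might look like this:

```
// Unpivot a hypothetical monthly budget spreadsheet with CrossTable.
// File, sheet, and field names here are illustrative assumptions.
Budget:
CrossTable (Month, BudgetAmount, 2)
LOAD
   Region,
   Product,
   Jan, Feb, Mar, Apr, May, Jun,
   Jul, Aug, Sep, Oct, Nov, Dec
FROM [..\Budget\SalesBudget.xlsx]
(ooxml, embedded labels, table is Budget);
```

The `2` tells CrossTable that the first two columns (Region and Product) are qualifiers; every remaining column becomes a Month/BudgetAmount row pair.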
Within each subfolder, there will be three new subfolders:
While all extractor scripts will connect to a data source, load the data, and then store it to QVD, there are some logical differences based on the way that they operate and the frequency with which they will be executed. The following table describes this:
It is important to analyze your loads for these characteristics because you need to combine or split scripts appropriately based on when reloads should be performed. It is pointless, for example, to include a low-frequency script along with a high-frequency script in the same script module. It is also good practice to keep your simple scripts in a separate module from your complex scripts.
In an environment where there are data batch jobs running—for example, data warehouse ETL processes or financial account processing—we are often limited in our Qlik reloads to a certain time window. In those circumstances, we need to be even more certain that we are not loading unnecessary data.
Execution of the extractors should be very straightforward. Each of the scripts will connect to the data source, load the data, and write the data to a QVD in the QVD folder. At the end of execution, you should have an up-to-date set of QVDs for that data source, ready for transformations.
As a best practice, it is a good idea to also adopt a naming convention for the QVD files that are produced. Prefixing the files with a letter or abbreviation, such as E_ or EX_, lets us quickly distinguish an extractor QVD from any other. Including the table name in the filename is mandatory. Adding the data source name, or an abbreviation of it, is also a good step, for example:
E_SalesData_Customer.qvd
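Putting the extractor pattern together, a minimal script might look like the following sketch; the connection name, table, and folder paths are illustrative assumptions, not prescribed names:

```
// Illustrative extractor: connection, table, and folder names
// are assumptions for the purpose of the example
$(Must_Include=..\Shared\storeanddrop.qvs);

ODBC CONNECT TO SalesDatabase;

Customer:
LOAD *;
SQL SELECT * FROM Customer;

// Produces E_SalesData_Customer.qvd in the QVD subfolder
Call StoreAndDrop('Customer', 'E_SalesData_', '..\QVD\');
```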
The transformation step is where all the magic happens. We will take QVD data (and possibly simple Excel data) and transform that data into conformed dimensions and fact tables.
When transforming, we are going to make use of two folder structures. One will hold the transformation scripts and include files that actually perform the transformations; the other will hold the QVDs that are output from those transformations. The reason we split into Transformers and Models is that, in theory, we should have only one transformer that creates a given QVD, such as a conformed dimension, but that QVD may need to be written out to more than one Models subfolder.
The subfolders under Transformers and Models should be based on the modeling that you have performed in advance, either process-based or line-of-business-based. Have a look at the following screenshot:
It is worth remembering that when we use a structured folder arrangement like this, we should use relative paths in our scripts so that we can move files from development servers to test or production servers, where identical relative paths will have been established, without having to change the script.
The only rule that we can specify about execution of transformers is that they need to be appropriately scheduled after the extractors that create the QVDs that the transformers depend on. Other than that, we will be applying different business rules to that data and those rules are context-specific.
This is a good point to add a reminder that when creating surrogate keys using the AutoNumber function, the keys generated can only be relied upon within the same script. We can't create surrogate keys in a QVD in one script and expect them to match surrogate keys created in a different script, even if the original keys were identical. We can, however, use a function such as Hash256 to create consistent surrogate keys between different loads, remembering to apply AutoNumber on them when loading data into the final application.
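As a sketch of that technique, with illustrative field and file names, a transformer might build the key like this:

```
// Transformer: build a stable surrogate key with Hash256.
// Field and file names are illustrative assumptions.
Fact_Sales:
LOAD
   Hash256(CustomerID, ProductID) as SalesKey,
   SalesAmount,
   SalesDate
FROM [..\Extractors\Sales Database\QVD\E_SalesData_Sales.qvd] (qvd);

// Because Hash256 depends only on the input values, a different
// transformer hashing the same CustomerID and ProductID will produce
// the same SalesKey. In the final application, apply
// AutoNumber(SalesKey) to compress the hash to a compact integer.
```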
It is a good practice to apply a naming convention to the files that are generated in the Models folders. A common convention is to apply a prefix of FACT_ to a fact table and DIM_ to a dimension table. A source name would not be appropriate here, as there may be multiple sources, so the prefix plus the table name will suffice, for example:
FACT_Sales.qvd
DIM_Organization.qvd
DIM_Calendar.qvd
For operational reasons, you may wish to partition your fact tables, so a partition indicator would be appropriate:
FACT_Sales_2012.qvd
FACT_Sales_2013.qvd
FACT_Sales_2014.qvd
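One way to produce such partitions is a simple loop over the years present in an already loaded fact table. This is a sketch, with illustrative field and folder names:

```
// Sketch: write one QVD per year from a loaded FACT_Sales table.
// The Year field and Models subfolder are illustrative assumptions.
For Each vYear in FieldValueList('Year')

   [FACT_Sales_$(vYear)]:
   NoConcatenate           // prevent auto-concatenation back into FACT_Sales
   LOAD * Resident FACT_Sales
   Where Year = $(vYear);

   Store [FACT_Sales_$(vYear)]
      into [..\Models\Sales\FACT_Sales_$(vYear).qvd];

   Drop Table [FACT_Sales_$(vYear)];

Next vYear
```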
If the transformation step has been carried out correctly, there should be very little to do in the UserApp folder other than to load the QVDs.
As with the other operations, it is a best practice to create a UserApp folder structure with subfolders that represent either the business process or the line of business for the apps within it.
This whole UserApp folder can be mounted on a QlikView server, or each subfolder can be mounted separately.
The load step could be as simple as the following:
LOAD * FROM [..\Models\Campaign Management\FACT_Sales.qvd] (QVD);
LOAD * FROM [..\Models\Campaign Management\DIM_Organization.qvd] (QVD);
LOAD * FROM [..\Models\Campaign Management\DIM_Calendar.qvd] (QVD);
If the transformation step has been correctly implemented, then the tables should load (optimized load) and all the tables should be associated correctly with no synthetic keys.
The one allowable transformation that might be performed in this step, even though it does cause an unoptimized load, is the use of the AutoNumber function to generate surrogate keys. Using it at this stage ensures that the generated keys will associate correctly, as they are all generated within the same script.
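As an illustrative sketch (the key field names are assumptions), the load might then become:

```
// Final application: apply AutoNumber while loading the model QVDs.
// The second argument keeps a separate counter per key type, so the
// same OrgKey value maps to the same integer in both tables.
// Field names are illustrative assumptions.
Fact:
LOAD
   AutoNumber(OrgKey, 'Org') as OrgKey,
   SalesAmount
FROM [..\Models\Campaign Management\FACT_Sales.qvd] (QVD);

Organization:
LOAD
   AutoNumber(OrgKey, 'Org') as OrgKey,
   OrganizationName
FROM [..\Models\Campaign Management\DIM_Organization.qvd] (QVD);
```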