"Data! Data! Data!" he cried impatiently. "I can't make bricks without clay."
— Sherlock Holmes (Arthur Conan Doyle), The Adventure of the Copper Beeches
Beginners to QlikView development are shown how to connect to different data sources with the QlikView script, load tables of data, transform that data, and create charts and other objects. However, in the real world of QlikView application development, it is very rare that you will create an application that contains the whole process, from data source to final visualizations, within one QlikView document.
Extract, transform, and load (ETL) is a standard process within data warehousing: moving and transforming data from different source locations into the final dimensional model tables.
In this chapter, we will be looking at creating best practice ETL techniques using QlikView tools. Initially, we will look at how to do this using the QlikView script. At the end of this chapter, we will look at using QlikView's graphical ETL tool—Expressor—to provision data for QlikView.
These are the topics that will be covered in this chapter:
There are a few things that we need to remind ourselves of before we can fully grasp the concepts covered in this chapter.
QlikView is data-agnostic. It doesn't care where the data comes from; all QlikView cares about is whether the data is numeric or alphanumeric and, if it is numeric, whether it has an alphanumeric representation that needs to be stored alongside it. Hence, for practical discussion purposes, there are only two datatypes in QlikView: numeric and dual.
QlikView does actually recognize both integer and float values and stores them accordingly, with floats taking up more storage bytes. If the numeric values have a format, then they are stored as duals, with the number and the formatted string stored together. The Floor function will not only remove the decimals from a number, leaving just an integer, but it will also remove any formatting, so it reduces the amount of space needed to store the values.
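As a minimal sketch of this behavior (the table and field names here are illustrative, not from the text), a formatted number is stored as a dual, while Floor yields a plain integer with no format attached:

```
Amounts:
Load
    Amount,
    Num(Amount, '#,##0.00') As FormattedAmount,  // stored as a dual: number plus format string
    Floor(Amount) As WholeAmount                 // plain integer: decimals and formatting removed
Inline [
Amount
1234.567
89.5
];
```

WholeAmount needs only the compact integer storage, whereas FormattedAmount carries the display string with it.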
This is sometimes difficult for people coming from a database world, where there can be great difficulty in moving data from one place to another. Database ETL designers have to worry about whether a source string will fit within the length of its target column. In QlikView, we need not worry; it is all just going to be text.
There are sometimes issues due to this, such as when there is an ambiguity about what the data is, but it does save a lot of time. This is especially true when we need to bring data together from multiple data sources. We may have sales information coming from an ERP system, user information coming from an HR system, and customer contact information coming from a CRM system. Then, add to that budget information from Excel. Because we don't care about the strict datatypes, QlikView can handle all of this data easily. We can start building our applications and delivering results very quickly.
One of the reasons that QlikView can be better than traditional reporting solutions is that QlikView takes a snapshot of the data into the memory and users will query that snapshot. This takes a load off the core systems because the users' queries are not continually running against a production database. However, to make this an even better situation, we need to make sure that QlikView plays nicely with the database and we are not attempting to load 20 million transaction records every 30 minutes. That is behavior that makes us very unpopular with DBAs very quickly.
The data-from-anywhere ability of QlikView is also a great advantage over many other systems, where you might be limited to only connecting to one data source at a time and are forced to write ETL to move other data sources into the common source. Some other systems have the ability to combine data from multiple sources, but often not in such a straightforward way. One of the reasons ETL has developed as a software category is the ability to report on data from multiple sources. Companies had no option but to move the data into a central warehouse where reports could be run. There are, of course, some very good techniques and practices that have come out of ETL processing that we can apply to QlikView implementations—techniques that will save us from the wrath of the DBA!
One technique that is often quickly forgotten by QlikView developers, if they ever knew about it in the first place, is the Binary load. This statement will load all of the data of a QlikView file (.qvw) into another: the data tables, symbol tables, and so forth. Once they have been loaded into the new file, you can use it as is, add additional data, remove and reload tables, or perform any other processing that you want.
Because you are loading another file's data tables, symbol tables, and other tables into a new file, there is one restriction: the Binary statement must be the very first statement in the script, as shown in the following screenshot:
Using this technique, you might have a chain of binary loading documents, each one loading from the one before, but then adding some new data or even removing rows or whole tables, to make it more unique. Another use case is to have completely different documents from a frontend visualization point of view, with different audiences, that share the same data model—one document can load the data while the other documents simply binary load from the original.
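A sketch of such a chained document (the filename and table name here are hypothetical) might look like this:

```
// Must be the very first statement in the script
Binary SalesModel.qvw;

// From here on it is a normal script: we can add data,
// or drop tables that the target document does not need
Drop Table TempStaging;
```

The document running this script ends up with the complete data model of SalesModel.qvw, minus whatever we choose to remove.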
We already talked about automatic and manual concatenation in the Joining data section of Chapter 2, QlikView Data Modeling. We will recall that if two tables with the same number of identically named fields are loaded, then QlikView will automatically concatenate those tables.
If we load data from file-based sources using wildcards in the filenames, QlikView will attempt to load each of the matching files. As long as the files contain the same set of fields, the rows in each file will be automatically concatenated, for example:
Load Field1, Field2, Field3 From File*.txt (txt, utf8, embedded labels, delimiter is ',', msq);
As long as every file that matches the wildcard File*.txt contains the three fields listed, they will all be concatenated.
So, if similar files can be loaded using a simple wildcard, what if there are differences, perhaps even just a field or two, but you would still like to concatenate the fields? This might be a common use case if you are loading files that have been generated over time but have had new fields added to them during that period. The older files won't have the fields, so rather than try and retro-fit those files, we can handle them like this:
// Assign a variable with blank text
Set vConcatenateOrders='';

FOR Each vFilename in FileList('c:\Data\Filter*.txt')

   Orders:
   $(vConcatenateOrders)
   LOAD * FROM $(vFilename)
   (txt, utf8, embedded labels, delimiter is ',', msq);

   // Update the variable with a concatenate statement
   Set vConcatenateOrders='Concatenate (Orders)';

Next
The For Each statement combined with the FileList function will loop through all of the values that are returned by the file specification. The full absolute path (for example, C:\Data\OrderExport2.csv) will be assigned to the vFilename variable.
A QlikView Data (QVD) file is a file format that QlikView uses to store a table of data to disk. Only one table can be stored in each QVD file.
A QVD contains three parts:

- An XML header that describes the fields in the table, the layout, and other metadata
- Symbol tables, stored in a byte-stuffed format
- The actual table data, stored in a bit-stuffed format
So, basically the QVD file is an on-disk representation of how that data is stored in memory. For this reason, loading data from a QVD file back into memory is very fast. In fact, if you do no transformation to the data in the load script, then the load is essentially straight from disk into memory. This is the fastest way of getting a single table of data into QlikView.
Even if you need to transform the data, or use Where clauses, the data load is still very fast, as fast as from any other table files. There are a couple of operations that can be performed on a QVD that do not interfere with the fastest, most optimized load:

- Renaming of fields using As
- A Where clause using a single Exists function
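A minimal sketch of an optimized QVD load using both of these operations (the file, table, and field names are illustrative):

```
// Build the list of values that Exists will test against
Countries:
Load * Inline [
Country
Germany
France
];

Sales:
Load
    OrderID As ID,       // renaming with As keeps the load optimized
    Country,
    SalesValue
From Sales.qvd (qvd)
Where Exists(Country);   // a single-parameter Exists also keeps it optimized
```

Adding any calculation to the field list, or a second parameter to Exists, would drop the load back to standard (non-optimized) speed.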
When we have loaded data into a table, we can store that table to an external file using the Store command. The basic syntax of the command is like this:
Store TableName into path_to_file (format);
This is the syntax that would be used 99 times out of 100. There is a slightly more advanced syntax, where we can specify the fields to be stored:
Store Field1, Field2, Field3 from TableName into path_to_file (format);
The path will be any valid absolute, relative, or UNC path to the file that you wish to create. The format is one of three values:
| Format | Description |
| --- | --- |
| qvd | This creates a file of type QVD, as described previously. |
| txt | This creates a comma-separated Unicode text file. |
| qvx | An XML-based table format that QlikView can read. Because this is an open format, it is often used by third-party organizations to export data to be loaded into QlikView. |
If the format is omitted, qvd will be used. Because of that, you will usually see Store statements without the format specified. A best practice would be to always include the format, even if it is qvd.
Some examples of valid Store statements are:

Store Sales into D:\QlikView\QVD\Sales.qvd;
Store OrderID, OrderDate, CustomerID, ProductID, SalesValue From Sales into ..\QVD\Sales.qvd (qvd);
Store Customer into \\qvserver\data\Customer.qvx (qvx);
Store Product into ..\csv\Product.csv (txt);
One of the things that new developers often ask about QVDs is, "Why?" They wonder why they need to use QVDs at all. They know that they can connect to a database and read data, they feel that they can do that again and again, and they don't see any reason to bother writing the data to a QVD file first. There are, however, several very good reasons to store data in QVDs:
Just from a development point of view, you will find that you are performing reloads again and again as you are perfecting your data model. If you are reloading from a slow database connection, this can be painful. If you create local QVDs, then your development efforts will proceed a lot faster.
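A sketch of this local-extract pattern (the connection is assumed to have been established earlier in the script, and the table and path names are illustrative): run the slow database query once, store the result as a QVD, and then develop against the fast local copy:

```
// Extract step: hit the database once
Orders_Extract:
SQL SELECT * FROM Orders;

Store Orders_Extract into ..\QVD\Orders.qvd (qvd);
Drop Table Orders_Extract;

// Development reloads read the fast local copy instead
Orders:
Load * From ..\QVD\Orders.qvd (qvd);
```

While you iterate on the data model, you only rerun the Load from the QVD; the SQL extract runs again only when you need fresh data.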