Chapter 3. Best Practices for Loading Data

"Data! Data! Data!" he cried impatiently. "I can't make bricks without clay."

— Sherlock Holmes (Arthur Conan Doyle), The Adventure of the Copper Beeches

Beginners to QlikView development are shown how to connect to different data sources with the QlikView script, load tables of data, transform that data, and create charts and other objects. However, in the real world of QlikView application development, it is very rare that you will create an application that contains the whole process, from data source to final visualizations, within one QlikView document.

Extract, transform, and load (ETL) is a standard process within data warehousing: moving and transforming data from different source locations into the final dimensional model tables.

In this chapter, we will be looking at creating best practice ETL techniques using QlikView tools. Initially, we will look at how to do this using the QlikView script. At the end of this chapter, we will look at using QlikView's graphical ETL tool—Expressor—to provision data for QlikView.

These are the topics that will be covered in this chapter:

  • Reviewing data load and storage concepts
  • Understanding why to use an ETL approach
  • Using an ETL approach to create QVD data layers
  • Mastering loading techniques:
    • Incremental load
    • Partial load
    • Binary load
  • Using QlikView Expressor for ETL

Reviewing data loading concepts

There are a few things that we need to remind ourselves of before we can fully grasp the concepts covered in this chapter.

Getting data from anywhere

QlikView is data-agnostic. It doesn't care where the data comes from; all QlikView cares about is whether the data is numeric or alphanumeric and, if it is numeric, whether it has an alphanumeric representation that needs to be stored. Hence, for practical discussion purposes, there are only two datatypes in QlikView: numeric and dual.

Note

QlikView does actually recognize both integer and floating-point values and stores them accordingly, with floats taking up more storage bytes. If a numeric value has a format, it is stored as a dual, with the number and the formatted string stored together. The Floor function will not only remove the decimals from a number, leaving just an integer, but will also remove any formatting, so it reduces the amount of space needed to store the values.
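As a quick illustration, this hypothetical script fragment (not from the book's examples) shows a formatted dual alongside the bare integer that Floor produces:

// Hypothetical sketch: Money() produces a dual (number plus format string),
// while Floor() leaves a bare, unformatted integer that takes less storage.
Values:
LOAD
  Money(12345.678) As FormattedValue,  // dual: 12345.678 plus its display text
  Floor(12345.678) As IntegerValue     // plain integer 12345, no format stored
AutoGenerate 1;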

This is sometimes difficult for people coming from a database world, where there can be great difficulty in moving data from one place to another. Database ETL designers have to worry about whether a source string field is longer than its target column. In QlikView, we need not worry; it is all just text.

There are sometimes issues due to this, such as when there is an ambiguity about what the data is, but it does save a lot of time. This is especially true when we need to bring data together from multiple data sources. We may have sales information coming from an ERP system, user information coming from an HR system, and customer contact information coming from a CRM system. Then, add to that budget information from Excel. Because we don't care about the strict datatypes, QlikView can handle all of this data easily. We can start building our applications and delivering results very quickly.

One of the reasons that QlikView can be better than traditional reporting solutions is that QlikView takes a snapshot of the data into memory and users query that snapshot. This takes a load off the core systems because the users' queries are not continually running against a production database. However, to make this an even better situation, we need to make sure that QlikView plays nicely with the database and that we are not attempting to load 20 million transaction records every 30 minutes. That kind of behavior makes us very unpopular with DBAs very quickly.

The data-from-anywhere ability of QlikView is also a great advantage over many other systems, where you might be limited to only connecting to one data source at a time and are forced to write ETL to move other data sources into the common source. Some other systems have the ability to combine data from multiple sources, but often not in such a straightforward way. One of the reasons ETL has developed as a software category is the ability to report on data from multiple sources. Companies had no option but to move the data into a central warehouse where reports could be run. There are, of course, some very good techniques and practices that have come out of ETL processing that we can apply to QlikView implementations—techniques that will save us from the wrath of the DBA!

Loading data from QlikView

One technique that is often quickly forgotten by QlikView developers, if they ever knew about it in the first place, is the BINARY load. This statement will load all of the data of a QlikView file (.qvw) into another—the data tables, symbol tables, and so forth. Once they have been loaded into the new file, you can use it as is, add additional data, remove and reload tables, or perform any other processing that you want.

Because you are loading another file's data tables, symbol tables, and other tables into a new file, there is one restriction: the BINARY statement must be the very first statement in the script.

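A minimal sketch of such a script (the document and file names here are hypothetical) looks like this:

// BINARY must be the very first statement in the script.
Binary SalesModel.qvw;

// After the binary load, we can add to or modify the inherited data model:
Budget:
LOAD * FROM Budget.xlsx (ooxml, embedded labels, table is Budget);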

Using this technique, you might have a chain of binary loading documents, each one loading from the one before, but then adding some new data or even removing rows or whole tables, to make it more unique. Another use case is to have completely different documents from a frontend visualization point of view, with different audiences, that share the same data model—one document can load the data while the other documents simply binary load from the original.

Loading similar files with concatenation

We already talked about automatic and manual concatenation in the Joining data section of Chapter 2, QlikView Data Modeling. Recall that if two tables with the same number of identically named fields are loaded, QlikView will automatically concatenate those tables.

If we load data from file-based sources using wildcards in the filenames, QlikView will attempt to load each of the matching files. As long as the files contain the same set of fields, the rows in each file will be automatically concatenated, for example:

Load Field1, Field2, Field3
From File*.txt (txt, utf8, embedded labels, delimiter is ',', msq);

As long as every file that matches the wildcard File*.txt contains the three fields listed, they will all be concatenated.

Note

The wildcards available are the standard Windows ones—* to represent zero or many characters and ? to represent just one character.

Loading dissimilar files with Concatenate and For Each

So, if similar files can be loaded using a simple wildcard, what happens if there are differences between the files, perhaps just a field or two, but you would still like to concatenate them? This is a common use case when loading files that have been generated over a period of time during which new fields were added. The older files won't have the new fields, so rather than try to retrofit those files, we can handle them like this:

// Assign a variable with blank text
Set vConcatenateOrders='';
FOR Each vFilename in FileList('C:\Data\Filter*.txt')
  Orders:
  $(vConcatenateOrders)
  LOAD *
  FROM
  [$(vFilename)]
  (txt, utf8, embedded labels, delimiter is ',', msq);
  // Update the variable with a concatenate statement
  Set vConcatenateOrders='Concatenate (Orders)';
Next

The For Each statement combined with the FileList function will loop through all of the files that match the file specification. The full absolute path (for example, C:\Data\Filter2.txt) will be assigned to the vFilename variable.

Note

There is also a function called DirList that will return a list of folders. Both FileList and DirList will return their values in dictionary order.

Understanding QlikView Data files

A QlikView Data (QVD) file is a file format that QlikView uses to store a table of data to disk. Only one table can be stored in each QVD file.

A QVD contains three parts:

  • An XML header, which describes the data contained in the QVD. This XML file also contains useful information about the date and time that the QVD was created, the name of the document that created the QVD file, the name of the table in QlikView, and lineage information about where the data originated from—which database queries or table files made up the table in QlikView before the data was stored to QVD.
  • The symbol tables for each field in the data in a byte-stuffed format. Byte stuffing helps remove potentially illegal characters from the data. Although this can increase the size of the stored data over the original data, it is usually not significant for symbol tables.
  • A bit-stuffed data table is a table of index pointers that points to the symbol table values (as we discussed in Chapter 1, Performance Tuning and Scalability).

So, basically the QVD file is an on-disk representation of how that data is stored in memory. For this reason, loading data from a QVD file back into memory is very fast. In fact, if you do no transformation to the data in the load script, then the load is essentially straight from disk into memory. This is the fastest way of getting a single table of data into QlikView.
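Because the XML header sits at the start of the file in plain text, QlikView can even read it as data. A hedged sketch (the filename is hypothetical; the fields come from the QvdTableHeader element of the header):

// Load metadata from a QVD's XML header rather than from its data table.
QvdHeader:
LOAD
  TableName,        // the name of the table when it was stored
  CreateUtcTime,    // when the QVD was created
  NoOfRecords       // how many rows it contains
FROM Sales.qvd (XmlSimple, table is QvdTableHeader);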

Even if you need to transform the data or use Where clauses, the data load is still very fast, as fast as loading from any other table file. There are a couple of operations that can be performed on a QVD load that do not interfere with the fastest, most optimized load:

  • Rename fields using As
  • A Where clause using a single Exists
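For example, both of the following (using hypothetical field and file names) keep the load optimized:

// An optimized QVD load: renaming with As and a single Exists are allowed.
// (Assumes OrderID values have already been loaded into the data model.)
Orders:
LOAD
  OrderID,
  OrderDate As Date,      // renaming a field does not break optimization
  SalesValue
FROM Orders.qvd (qvd)
Where Exists(OrderID);    // a single Exists also keeps the load optimized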

Storing tables to QVD

When we have loaded data into a table, we can store that table to an external file using the Store command. The basic syntax of the command is like this:

Store TableName into path_to_file (format);

This is the syntax that would be used 99 times out of 100. There is a slightly more advanced syntax, where we can specify the fields to be stored:

Store Field1, Field2, Field3 from TableName into path_to_file (format);

The path will be any valid absolute, relative, or UNC path to the file that you wish to create. The format is one of three values:

  • qvd: This creates a file of type QVD, as described previously.
  • txt: This creates a comma-separated Unicode text file.
  • qvx: This creates an XML-based table format that QlikView can read. Because it is an open format, it is often used by third-party organizations to export data to be loaded into QlikView.

If the format is omitted, qvd will be used. Because of that, you will usually see Store statements without the format specified. A best practice would be to always include the format, even if it is QVD.

Some examples of valid Store statements are:

Store Sales into D:\QlikView\QVD\Sales.qvd;
Store OrderID, OrderDate, CustomerID, ProductID, SalesValue
From Sales into ..\QVD\Sales.qvd (qvd);
Store Customer into \\qvserver\data\Customer.qvx (qvx);
Store Product into ..\csv\Product.csv (txt);

Using QVD files

One of the things that new developers often ask about QVDs is, "why?". They wonder why they need to use QVDs. They know that they can connect to a database and read data and they feel that they can do that again and again and don't see any reason why they need to bother writing the data to a QVD file first. There are, however, several very good reasons to store data in QVDs:

  • Speeding up loads by storing data that doesn't change, or doesn't change very frequently. Loading data from a database is much slower than loading data from a local QVD. For example, transactions that are 2 years old won't change; you could keep those in QVDs, then load only the newer transactions from the database and concatenate the two sets of data. Of course, this also reduces the load on the database server because we are requesting relatively few rows of data on each SQL call.
  • Combining data from multiple different sources. For example, we could have a new ERP system in place but we also want to add in sales information from an old system. If we keep the old data in QVD, we don't need to have the old database online, so it can be decommissioned.
  • Incremental load is the ultimate use of QVDs to load transactional information in the minimum amount of time possible. Basically, we load only the newest data from the database, combine with the older data from locally stored QVDs, and then update the QVDs.

    Note

    There is an excellent section on this in both the QlikView Reference Manual and in the QlikView Help file—search for Using QVD Files for Incremental Load. We will run through an example of this later in this chapter.

  • As discussed in Chapter 2, QlikView Data Modeling, dimensional modeling approaches say that we should use conformed dimensions where dimensions are shared across different models. This is an excellent use of QVDs—we create the QVD once and then can share it across many QlikView documents. Even if we are not following a strict dimensional modeling approach, we can still use QVDs to reuse data in more than one application.
  • Implementing data quality when preparing data for users. A cleaned set of QVD files, that are centrally created and controlled, can be provisioned for users with confidence that data is correct.
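The incremental load pattern mentioned above can be sketched in bare-bones form as follows (all table, field, and file names are hypothetical, and the vLastExecTime variable is assumed to have been set on a previous run):

// 1. Load only the new or changed rows from the database:
Transactions:
SQL SELECT * FROM Transactions
WHERE ModifiedDate >= '$(vLastExecTime)';

// 2. Add the historical rows already stored locally, skipping any
//    rows that were just reloaded from the database:
Concatenate (Transactions)
LOAD * FROM Transactions.qvd (qvd)
Where Not Exists(TransactionID);

// 3. Store the combined table back to QVD for the next run:
Store Transactions into Transactions.qvd (qvd);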

Just from a development point of view, you will find that you are performing reloads again and again as you are perfecting your data model. If you are reloading from a slow database connection, this can be painful. If you create local QVDs, then your development efforts will proceed a lot faster.
