Reducing storage requirements

While it is often tempting to include as much data as possible, this can also make your data model more complex than it needs to be. Additionally, as QlikView uses an in-memory database, it is also a good idea to not waste resources. RAM is still a much scarcer resource than disk-based storage.

Using number-based key fields

When linking between tables using key fields, it is advisable to use numbers as key values instead of text. The AutoNumber() script function can be used to generate a unique integer value for an expression or compound key, thus compacting the occupied RAM space.

Consider, for example, the following list of colors and their corresponding value assigned by the AutoNumber() function. As each new value appears in the list, a consecutive number is assigned. All subsequent appearances of the same value will take the value assigned to the first instance. For example, all appearances of the Blue color have been assigned the number 2, as it is the second value we encounter on the list:

Color

AutoNumber(Color)

Red

1

Blue

2

Green

3

Blue

2

Green

3

Yellow

4

A second parameter can be specified to handle more than one counter in the same script and indicate which one should be used to assign the values. For example, AutoNumber(Color, 'Color') would use a counter called Color. As QlikView is very efficient at compressing sequential numbers, it is advisable to use a different counter for each separate key.

It is important to note that the AutoNumber() function returns a number solely based on the load order. Encoding the same value in different QVW files might return different numbers. Therefore, it is not possible to use results of the AutoNumber() function sourced from multiple QlikView documents.

Removing unused fields

Removing unused fields from the data model is a quick win in most QlikView applications. It can save anywhere between a few to hundreds of MB on bigger documents. Text fields, in particular, can take up a lot of space.

As the developer, you will probably have an idea of fields that are definitely not being used. You can either remove those from the script, or comment them out if you want to play it safe.

An automated tool that can help you spot unused files is Rob Wunderlich's Document Analyzer. This QlikView document is used to process another QlikView document and indicate which fields are not being used anywhere in the layout or expressions. Since this tool does not always correctly identify unused fields, it is advisable to always perform a sanity check before deleting fields. However, as a starting point, this is an excellent tool. The QVW file can be downloaded from http://robwunderlich.com/downloads/.

Splitting high-cardinality fields

QlikView utilizes various algorithms to compress the data to a fraction of its original size. One way it does this is by storing only the unique values of a field. For example, a table containing a list of colors, in which some of the values appear more than once, would be compacted in the following manner:

Splitting high-cardinality fields

You can imagine that columns containing few distinct values (or, low cardinality) will be compressed much better than those with many distinct values (high cardinality).

If possible, it is worth considering if a high cardinality field can be split into multiple low cardinality fields.

Consider the example of a timestamp field, which contains a date and a time. If we were to load data corresponding to a single year, it could potentially lead to 31,536,000 unique timestamp values: 365 days x 24 hours x 60 minutes x 60 seconds.

However, we could also decide to split the timestamp into two fields: a date field and a time field. In this scenario the maximum number of unique values would be reduced to 86,765, that is, 365 days for the date field and 86,400 for the time field (24 hours x 60 minutes x 60 seconds). This is only 0.28 percent of the original volume, and can therefore have a tremendous impact on the document size, especially on larger data volumes.

Note

In some cases, the time component is not needed at all, so it's a good practice to simply truncate the timestamp value and keep only the date component. The Floor() function can be used to accomplish this, which removes the decimal part of a given numeric value.

Two more cases where this technique is applicable are phone numbers and big numbers. For example, if we have a measure that contains integer numbers ranging from 0 to 100,000,000, we can potentially end up with 100 million unique values. As an alternative, we could split the number into two 10,000 (the square root of 100 million) number ranges using the Div() and Mod() functions, shown here:

LOAD
     Div(BigNumber, 10000) as BigNumber1,
     Mod(BigNumber, 10000) as BigNumber2
FROM BigNumber;

The first expression in the preceding code performs an integer division of the value of BigNumber by 10,000. This means that only the resulting whole number is stored. The second expression performs a modulo operation, storing only the remainder of BigNumber divided by 10,000. Both of these fields have a potential of 10,000 unique values. That's 20,000 possible unique values when combined, or only 0.2 percent of the original list.

In the final application, we can then restore the original number by multiplying BigNumber1 with 10,000 and adding the remainder from BigNumber2:

(BigNumber1 * 10000) + BigNumber2

Of course, this calculation will be more processor-intense than a straightforward calculation on a single value. What works best "depends," so it is best to always perform a thorough test before implementing a solution like this.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset