Structuring your data

In a database, there are rules about where different entities are stored. For instance, everything about customers should be stored in the Customers table. Each customer has an identifier stored in this table, which means that all the data about a customer can be retrieved by a simple lookup in the Customers table. So, if you need to refer to a customer from another table, you just store this identifier in the other table.

Normalization

The identifier must be unique in the Customers table, so that every record can be addressed. In this table, the field is called a primary key. In other tables, duplicates may exist; for example, several records in the Orders table might refer to the same customer. In such a table, the key is called a foreign key.
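
To make this concrete, here is a minimal sketch of what these two tables could look like when loaded in a Qlik Sense script. The file paths and field names (CustomerID, Customers.qvd, and so on) are assumptions made for the example:

    // Customers is the master table; CustomerID is its primary key.
    Customers:
    LOAD
        CustomerID,     // primary key: unique per customer
        CustomerName,
        Address
    FROM [lib://Data/Customers.qvd] (qvd);

    // In Orders, CustomerID is a foreign key: several orders may refer
    // to the same customer.
    Orders:
    LOAD
        OrderID,
        CustomerID,
        OrderDate
    FROM [lib://Data/Orders.qvd] (qvd);

Since both tables contain a field named CustomerID, the QIX engine automatically associates them on this field.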

A consequence of this setup is that a database can have many tables, often thousands. One table for customers, one for orders, one for order lines, one for products, one for product groups, one for shippers, one for invoices, and so on. Each table contains one entity type, and each instance of this entity has exactly one row in the table—one record.

In the customer example, this means that each customer is stored in one row only, and all the attributes of this customer are stored in the different columns of this row. This is called normalization.

The main goal of normalization is to avoid redundancy. In a transactional database, it is important that each piece of information is stored in only one place. Not only does this take less space, it also speeds up the update process and minimizes errors. You can imagine the confusion if, for instance, a customer address were stored in several places and the different occurrences contained different addresses. Which version should you trust?

So, the source data is often highly normalized. But does this mean that the Qlik Sense data model should be normalized too?

Yes and no.

The Qlik Sense data model does not need to be as normalized as the source data. Redundancy is not a problem, and if duplicating data improves the user experience, you should definitely consider doing so. The data model should be optimized for user experience and performance, not for minimal size.

But some normalization has great advantages. Structuring the data in entities (tables) simplifies the data modeling work. It also makes maintenance simpler, since the data model and script can be understood by a developer who has never seen them before. Finally, the QIX engine works better with a normalized model: it is easier to make it calculate numbers correctly and avoid double counting, which becomes a real problem when you start to de-normalize.

So the bottom line is that you should have some normalization, but it does not need to be as strict as in the source data.

The main case in which you need to de-normalize is if you use the same entity in different places in the data model.

For example, you may use a table listing external organizations in the roles of supplier, shipper, and customer. This means that the Organizations table is used in different roles. In such a case, you should load the Organizations table three times: first as a supplier, then as a shipper, and finally as a customer, linking each copy to a different foreign key.
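
In a load script, this could look as follows. It is only a sketch; the source file and the field names (OrgID, OrgName) are assumptions:

    // One source, three roles: each load renames the key so that the copy
    // links to the right foreign key in the transaction tables.
    Suppliers:
    LOAD
        OrgID   AS SupplierID,
        OrgName AS SupplierName
    FROM [lib://Data/Organizations.qvd] (qvd);

    Shippers:
    LOAD
        OrgID   AS ShipperID,
        OrgName AS ShipperName
    FROM [lib://Data/Organizations.qvd] (qvd);

    Customers:
    LOAD
        OrgID   AS CustomerID,
        OrgName AS CustomerName
    FROM [lib://Data/Organizations.qvd] (qvd);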

Another common situation is that you have several dates in your data model: OrderDate, RequiredDate, ShippingDate, and InvoiceDate. In other words, the date is used in different roles. Here too, you should load the dimension, in this case the calendar, several times: once per date field.
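
The same renaming technique works for the calendar. The following sketch loads a master calendar once and then creates one copy per date role; the file path and field names are assumptions:

    // Temporary master calendar.
    TempCalendar:
    LOAD
        Date,
        Year(Date)  AS Year,
        Month(Date) AS Month
    FROM [lib://Data/Calendar.qvd] (qvd);

    // One copy per date role, each linking to its own date field.
    OrderCalendar:
    LOAD
        Date  AS OrderDate,
        Year  AS OrderYear,
        Month AS OrderMonth
    RESIDENT TempCalendar;

    InvoiceCalendar:
    LOAD
        Date  AS InvoiceDate,
        Year  AS InvoiceYear,
        Month AS InvoiceMonth
    RESIDENT TempCalendar;

    DROP TABLE TempCalendar;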

Another reason to de-normalize is optimization. One such case is when you have several very large tables linked to each other; for example, an order headers table as well as an order lines table, both large (millions of records). From a performance perspective, this is not optimal: the QIX engine will need more CPU cycles than if the information in the two tables had been stored in one single table. So, you might want to join the two tables for performance reasons.

A small word of warning though: joining tables is not always safe. The operation may cause one record to be spread out over several records in the joined table, and if the original record holds a number, the summation made by the QIX engine will produce an erroneous result: the number will be counted twice or more. In the case of order headers and order lines, you know that an order line can belong to one order header only, so the numbers in the order lines table will not be duplicated. Hence, it is safe to join here.

However, if you have a number in the order headers table, it will be duplicated by the join. But luckily, this table rarely contains any numbers.
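
The following sketch shows what such a join could look like in the script, under the assumption that the line-level numbers live in OrderLines and the header holds only attributes; the file paths and field names are illustrative:

    // Load the large order lines table first...
    OrderLines:
    LOAD
        OrderID,
        LineNo,
        ProductID,
        Quantity,
        Amount          // the numbers live on the line level
    FROM [lib://Data/OrderLines.qvd] (qvd);

    // ...then join the header attributes onto it. Each line belongs to
    // exactly one header, so no line-level number is duplicated.
    Join (OrderLines)
    LOAD
        OrderID,
        CustomerID,
        OrderDate
    FROM [lib://Data/OrderHeaders.qvd] (qvd);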

Star schema and snowflake schema

The degree of normalization is usually quite different in the source data model and in the analytical model. For the source data, one often talks about transaction tables and master tables. The transaction tables are the ones that store orders, invoices, and other transactions. In these, new records are typically added every hour, minute, or even second.

This is very different from the master tables, where new records are added much more rarely: Products, Customers, and the Calendar are typical master tables.

Master tables are often used for many purposes and are usually linked to several transaction tables, which makes the data model look as if it has circular references. This is, however, not a problem, since every link means a separate lookup in the master table.

For example, in the following relational model, the Organizations table is linked to the transactional data through three keys: Customer, Shipped by, and Supplied by. This means that a specific shipment item can be linked to several organizations—one customer, one shipper, and one supplier:

[Figure: A relational model where the Organizations table links to the transaction data through the Customer, Shipped by, and Supplied by keys]

In an analytical model, the tables are used in a different way. The transactions are joined into one single fact table, and the master tables are loaded as dimensions. The reason for this is partly historical. Older hypercube tools could not work unless all metrics were collected into a fact table. In addition, they could use hierarchical dimensions surrounding the fact table. The model will then look like a star; hence the name star schema:

[Figure: A star schema, with one fact table surrounded by dimension tables]

This model has only one layer of dimensions—all the tables are directly linked to the fact table. But if the model instead uses dimensions in two or more levels, the model is called a snowflake schema:

[Figure: A snowflake schema, with dimensions in two or more levels]

For Qlik Sense, the difference is minimal. All three data models can be used—provided that master tables used in several roles are also loaded several times. It is also possible to have metrics in any table, not just the fact table.

A star schema is, however, both simple and efficient, so we strongly recommend using this as an initial model for your data. It is then easy to build further and make it more complex.

Pitfalls in the data model

When you create your data model, you should look out for two potential problems: synthetic keys and circular references.

Synthetic keys are created automatically if you have multiple keys between two tables. They are not bad in themselves, but they can be a warning sign of a bad data model. If you have made a mistake when loading the data, the first sign is often one or several synthetic keys. You then need to go back and investigate why this has happened:

[Figure: A data model containing a synthetic key, shown the way the data was loaded]

The preceding diagram shows a synthetic key displayed the way you loaded the data. This particular key is correct, and you do not need to change it. Internally, however, the key is stored differently, as shown in the following diagram. In the data model viewer, you can toggle between these two views:

[Figure: The internal representation of the same synthetic key]

Qlik Sense will, at the end of the script run, warn you about these potential problems, as shown in the following screenshot:

[Screenshot: The synthetic key warning shown at the end of the script run]

Usually, you do not want synthetic keys—you want one single key in each table link. However, if you know that you have multiple keys linking two tables, and that this is the way you want your data model, then there is no problem in having synthetic keys.
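
If you instead want one single key, a common technique is to build an explicit composite key and drop the individual fields from one of the tables. The following is a sketch with illustrative table and field names:

    // Budget and Sales share two fields, Region and Year, which would
    // create a synthetic key. Instead, build one combined key.
    Budget:
    LOAD
        Region & '|' & Year AS %RegionYearKey,
        Region,
        Year,
        BudgetAmount
    FROM [lib://Data/Budget.qvd] (qvd);

    Sales:
    LOAD
        Region & '|' & Year AS %RegionYearKey,   // only the combined key
        SalesAmount
    FROM [lib://Data/Sales.qvd] (qvd);

Since Sales now shares only the single field %RegionYearKey with Budget, no synthetic key is created.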

The second potential problem is circular references. This happens if you load data in such a way that the links between the tables form a loop. The following diagram is a typical example:

[Figure: A data model where the links between the tables form a loop, creating a circular reference]

From a data modeling perspective, a circular reference is an error and not just a warning, and you will get an error message at the end of the script run:

[Screenshot: The circular reference error message shown at the end of the script run]

If you have a circular reference, you must rethink your data model with the goal of removing the loop.

It can sometimes be difficult to figure out how to remove a circular reference, but good advice is to look at every link in your data model and ask, "Are these two fields really the same thing, or do the fields have different roles?" In the preceding diagram, you have a circular reference where the Date field has two different roles: one is the date when the order arrived, and the other is the date when the invoice was sent. These two dates are not necessarily the same. Hence, they should not be linked, but should instead be loaded as two different fields.
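
In script terms, the fix is simply to rename the field according to its role, which breaks the loop. A sketch with assumed source names:

    // Give each role of the Date field its own name, so that the two
    // tables no longer link on a field they do not truly share.
    Orders:
    LOAD
        OrderID,
        CustomerID,
        Date AS OrderDate
    FROM [lib://Data/Orders.qvd] (qvd);

    Invoices:
    LOAD
        InvoiceID,
        OrderID,
        Date AS InvoiceDate
    FROM [lib://Data/Invoices.qvd] (qvd);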
