Using QlikView Expressor for ETL

In June 2012, Qlik announced the purchase of Expressor Software. The press release talked about a metadata intelligence solution for data lineage and data governance, but what exactly is this product?

There are a couple of parts to the technology that are interesting. The main business of Expressor Software was the creation of an ETL tool that could connect to multiple different data sources, read data, and write it out to different locations. As part of this, they happened to create connectors that could connect to QlikView files—QVW, QVD, and QVX—and read both data and metadata from those files. They also created connectors to write out data in QVX format. Obviously, they felt that the QlikView market was worth going after.

Almost as a side effect, they created the genesis of what is today the QlikView Governance Dashboard. Using their technology, they could connect to QlikView files and read enough metadata to build a full governance picture of a QlikView implementation. This was a big deal because governance was an area where competitors regularly beat Qlik up. Now there was an effective solution, and Qlik liked it so much that they bought the company.

Introducing Expressor

Expressor is actually formed of three major components:

  • Studio: This is the desktop tool used to build the ETL packages.
  • Data integration engine: This is a GUI-free service that actually runs the packages, either on demand or on a schedule (it is a special version of this engine that is used by the Governance Dashboard).
  • Repository: This is a source repository based on the Subversion version control system. It allows multiple developers to work on the same project.

As an ETL tool, Expressor Studio is quite intuitive for those who have experience with other ETL tools. It has some differences but many similarities.

Most ETL tools will have some kind of scripting/development language to enable the building of business rules to be applied to data during the transformation stage. With Expressor, that language is Lua:

http://www.lua.org
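
If you have not used Lua before, it is a small, readable scripting language. As a flavor of the kind of business rule you might write, here is a minimal sketch; the function name and the fields are invented for illustration and are not part of Expressor's API:

    -- Classify an order line by value; a hypothetical business rule.
    function orderBand(quantity, salesPrice)
      local lineValue = quantity * salesPrice
      if lineValue >= 1000 then
        return "Large"
      elseif lineValue >= 100 then
        return "Medium"
      else
        return "Small"
      end
    end

    print(orderBand(5, 250))  -- Large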

One thing that Expressor has that makes it different is its ability to partition data on the fly during data loads and make the data loading process multithreaded. Most Qlik developers will be familiar with data being loaded and processed one row at a time. Expressor will intelligently partition the entire load into separate sets of rows and then load each of these partitions simultaneously. This can significantly reduce data load times.
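
As a rough mental model, the partitioning step looks something like the following Lua sketch. This is purely illustrative of the idea of splitting rows into chunks for parallel processing; it is not Expressor's internal implementation, and the function name is invented:

    -- Split a table of rows into n roughly equal partitions (round-robin).
    function partition(rows, n)
      local parts = {}
      for i = 1, n do
        parts[i] = {}
      end
      for i, row in ipairs(rows) do
        local p = ((i - 1) % n) + 1
        table.insert(parts[p], row)
      end
      return parts
    end

    -- Each partition could then be handed to its own worker thread.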

Understanding why to use Expressor for ETL

Why, when QlikView and Qlik Sense already have a rich and powerful scripted ETL ability (as we have seen already), would we consider using Expressor instead?

The very simple answer is control. By using Expressor, or any other ETL tool, to create the QVD or QVX data model layer, we take control of data provisioning and manage it centrally. Policies and security are put in place to make sure that QlikView users, no matter what their ability, cannot get at enterprise data other than via the QlikView data layer.

This could be seen as a downside by QlikView developers, who may be anxious to get applications built and deployed. However, granting such direct access to enterprise data systems is not seen as good practice in data governance. Instead, we extract the data, apply common business rules, and deploy the QlikView data layer from a central tool.

We could still make the argument that the data layer can be provisioned using QlikView itself. However, there are very good reasons to use Expressor instead:

  • It isn't QlikView: The people who will be responsible for provisioning the data layer may not be QlikView developers. If they are experienced database developers, then they will be much more comfortable with Expressor than with QlikView scripts.
  • Speed: The ability to automatically partition data loads and run them multithreaded makes Expressor extremely quick at getting data.
  • Repository: This helps in allowing multiple users to work on the same projects and gives versioning control to projects.

    Note

    As you read on, it will be useful to have a copy of Expressor installed on your workstation. The installation is very straightforward and the application will run, without a license, on Windows 7 and 8 desktops.

Understanding workspaces, libraries, projects, and artifacts

Within QlikView Expressor, we partition our own work into different organizational units to better manage what we are doing. The terminology is very different from QlikView, so we need to understand it now.

Creating a workspace

A workspace is a collection of related projects. There are no hard-and-fast rules about how many workspaces you need to create: some organizations may have just one, others have one for every project, and the norm is somewhere in between. Typically, we might have a workspace for each related area of ETL, perhaps per line of business or per data source.

There are two types of workspace: standalone and repository. A standalone workspace is stored locally on the developer's workstation, whereas a repository workspace is stored in the Expressor repository. A standalone workspace can be converted to a repository workspace.

When we first open QlikView Expressor, we are presented with some options for workspaces:

[Screenshot: Creating a workspace]

When we select the New Workspace… option, we are presented with the following screen:

[Screenshot: Creating a workspace]

We can choose whether our new workspace is a Standalone Workspace or a Repository Workspace. If we select Standalone Workspace, we specify the path where the workspace will be stored. If we select Repository Workspace, we provide the connection information for where the repository is stored.

The repository is an Expressor implementation of the Subversion versioning system. This will be available with an Expressor server and is useful for multideveloper environments.

Note

Anyone who has used Subversion may note that the default port that Expressor uses is 53690, whereas the default Subversion port is 3690. Note that you should not upgrade the version of svn that Expressor uses to the latest available version, as you will probably break the repository.

Managing extensions

In QlikView Expressor, extensions are code libraries that allow Expressor to read from and write to different file types and databases. Some extensions are installed out of the box (such as the QlikView extension), but we need to make sure that they are enabled. We do this via the Manage Extensions… option on the Desktop menu:

[Screenshot: Managing extensions]

When we select the menu option, the Manage Extensions dialog opens:

[Screenshot: Managing extensions]

Within this window, we can use the Current Workspace Settings tab to enable or disable extensions. We can use the Installs tab to install a new extension or uninstall an existing one.

Working with libraries and projects

Basically, a library and a project are the same thing. Both are storage locations for other objects, such as dataflows, connection information, and so forth. The only difference is that a library cannot contain a deployment package; only a project can produce one of these packages for the integration engine.

A library is used to hold objects, or artifacts as they are called in Expressor, which will be shared among other projects.

To add a new project or library, you can select the Project or Library buttons on the ribbon bar or you can right-click on the workspace name:

[Screenshot: Working with libraries and projects]

To add either, we just need to fill in a name and description in the dialog and click on the Create button:

[Screenshot: Working with libraries and projects]

Note

Note that the name of the project, as with all other artifacts, cannot have spaces in it. It can only consist of letters, numbers, and the underscore symbol. It must begin with a letter.
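
If you want to sanity-check a candidate name up front, the rule is easy to express as a Lua string pattern. This is a minimal sketch for illustration only; the function name is invented, and Expressor itself enforces the rule in its dialogs:

    -- True if the name starts with a letter and contains only
    -- letters, digits, and underscores thereafter.
    function isValidArtifactName(name)
      return string.match(name, "^%a[%w_]*$") ~= nil
    end

    print(isValidArtifactName("Sales_Orders1"))  -- true
    print(isValidArtifactName("1st Sales"))      -- false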

Understanding artifacts

When we create the project, we will see several folders—one for each of the types of artifact that may make up our project:

[Screenshot: Understanding artifacts]

The different artifacts are as follows:

  • Dataflows: A Dataflow artifact is the actual flow of data, in one or more steps; this is the actual ETL process. Dataflows are defined via a drag-and-drop visual interface.
  • Connections: The Connections artifacts tell Expressor how to connect to the data. We can have file connections (basically a path to a folder), database connections, or QVX Connector connections (a connection to a package that will generate a QVX).
  • Schemas: Schemas map the source or target data to the datatypes that are understood by Expressor. They may encapsulate transformations.
  • Types: The Types artifacts contain semantic type information about data. We have two kinds of semantic type: atomic (mapping the datatype and constraints for one piece of data) and composite (essentially mapping an entity of atomic types).
  • Datascript Modules: These artifacts contain Lua functions that can be called from transformation scripts or can be used to implement a custom data read (see the sketch after this list).
  • Lookup Tables: Not dissimilar to mapping tables in QlikView, these are locally stored tables that we can use to map values as part of a transformation.
  • Configurations: Configurations can contain multiple parameters that we can then use throughout the other artifacts. A large number of the settings can be parameterized. Having multiple configurations allows us to set up environments such as Dev/Test/UAT/Production without having to reconfigure every artifact.
  • Deployment Packages: Packages are compiled dataflows, along with their associated artifacts, that will be executed by the integration engine, either scheduled or on demand.
  • Operator Templates: Within a dataflow, we can configure Read, Write, and Transformation operators. Once configured, an operator can be saved as a template to be reused.
  • External Files: These are basically any type of file that might be used by data scripts.
  • Library References: When we add a reference to a library, all of that library's artifacts become available within the project as if they were part of the project.
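
To give the Datascript Modules item a little more substance, such a module is essentially a file of reusable Lua functions. The following is a minimal sketch with invented names; it does not claim to show Expressor's exact module format:

    -- A hypothetical shared helper that could be called from
    -- transformation scripts.
    local M = {}

    -- Trim surrounding whitespace and uppercase a code value.
    function M.cleanCode(value)
      local trimmed = string.gsub(value, "^%s*(.-)%s*$", "%1")
      return string.upper(trimmed)
    end

    return M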

Configuring connections

Before we can read or write data with QlikView Expressor, we need to configure a connection. We have a choice of three different connections. To add a connection, we can either right-click on the Connections folder under the project or we can click on the Connection button in the ribbon bar:

[Screenshot: Configuring connections]

Here is a description of the different connection types:

  • File: The File connection allows us to connect to a folder on the filesystem, either a local folder or a server share. We can use this folder to both read and write data. Typically though, read will be from one connection and write will be to another.
  • Database: The Database connection allows us to connect to different databases using an ODBC driver. Drivers are supplied for some of the more common databases, and you can use an existing DSN for others. The connection can be read and/or write. As with the File connection, the typical implementation will have different connections for read and for write, or you will read from a database connection and write to a file connection.
  • QVX Connector: The QVX Connector connection allows us to use the installed QlikView Expressor connector (the same one that you can use from within QlikView) to execute an existing package and read the QVX data. This is a read-only connection.

Configuring a File connection

Configuring a File connection is quite straightforward. We just need to know the path to the folder:

[Screenshot: Configuring a File connection]

After clicking on the Next button, we can enter a name (remember, no spaces) for the connection and a description:

[Screenshot: Configuring a File connection]

We continue adding File connections for every folder that we are going to read from or write to.

If we are going to have many projects reading from or writing to the same set of folders, the connections should be configured in a library.

Connecting to a database

The connection to a database is fairly straightforward. Expressor comes with drivers installed for the following:

  • Apache Hive
  • Cloudera Impala
  • IBM DB2
  • Microsoft SQL Server
  • MySQL Enterprise Edition
  • Oracle Database
  • PostgreSQL
  • Sybase ASE
  • Teradata

In addition, Expressor will natively support (can use its own property dialogs to configure the connection) the following drivers if they are installed from the vendor websites:

  • Informix
  • MySQL Community Edition
  • Netezza

Finally, Expressor will also support other ODBC drivers, but a DSN will need to be configured outside of Expressor.

To add the database connection, we first need to select the correct driver to use:

[Screenshot: Connecting to a database]

We then fill in the database specific connection information (Expressor will test the connection before allowing you to continue) and then give the connection a name.

Creating a QVX Connector Connection

The QVX Connector Connection uses the same connector that we would use in QlikView to connect to an on-demand package. The only packages that can be used are those that will generate a QVX output.

There is a slightly different approach here in that we name the connection before we enter the connection details:

[Screenshot: Creating a QVX Connector Connection]

When we click on the Create button, Expressor will go ahead and save the connection and open the properties for us to edit.

We select the QlikViewExpressorConnector.exe option as the connector to use (you may also see the governance connector in the list, if you have the Governance Dashboard installed on the same machine). Click on the Actions button and select Build Connection String:

[Screenshot: Creating a QVX Connector Connection]

The following instructions are seen in the dialog box:

[Screenshot: Creating a QVX Connector Connection]

The instructions are as follows:

  1. Add folders containing Deployment Packages: Click on the Add Folder button and browse for a folder that contains a workspace (the default folder for workspaces is C:\Users\<username>\Documents\expressor\Workspaces) and select a Project folder that contains a package (the dialog won't let you select a folder unless there is a package in it).
  2. Then select a Dataflow with a QlikView output:
    [Screenshot: Creating a QVX Connector Connection]

Configuring types and schemas

As we just mentioned, we have two kinds of types, Atomic and Composite.

Adding additional Atomic types

At its simplest, an Atomic type is a basic type such as string, date, or number. We can extend these basic types by adding constraints, such as length or value, to implement business rules.

We add a new Atomic type by right-clicking on the Types folder under the project or by clicking on Type in the ribbon bar:

[Screenshot: Adding additional Atomic types]

We can now give a name to our Atomic type and Expressor will open the properties page for us to enter basic type and constraint information:

[Screenshot: Adding additional Atomic types]

Depending on the base datatype, we can set different levels of constraint. If the constraint rule fails, we can set different corrective actions to be performed. The default is that the type should Escalate the breach of the constraint, which would normally throw an error in the dataflow.

Once we have set our constraints, we can save the Atomic type.
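
Conceptually, an Atomic type boils down to a base datatype plus a constraint check and a corrective action. The following Lua fragment only illustrates that idea; the function, field, and action names are invented, and in Expressor you configure all of this declaratively in the properties page:

    -- A hypothetical "ProductCode" atomic type: string, maximum length 10.
    function checkProductCode(value, action)
      if #value <= 10 then
        return value
      elseif action == "truncate" then
        return string.sub(value, 1, 10)  -- corrective action
      else
        error("constraint breach: value longer than 10 characters")  -- Escalate
      end
    end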

Creating Composite types

A Composite type is a collection of types that we can map our data onto. So, for example, we can create an order Composite type that represents exactly how we think an order should look. When we import external data, we can map that external data to our Composite type. By mapping everything to a Composite type, which can also encapsulate data constraints, we ensure consistency.
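
In the same spirit, a Composite type can be pictured as a named record of atomic fields. Here is a rough Lua analogue, for illustration only; the attribute names anticipate the order data used later in this chapter:

    -- A rough analogue of an "Order" composite type: each attribute has
    -- a base datatype and could reference a shared atomic (semantic) type.
    Order = {
      OrderID    = { datatype = "integer" },
      OrderDate  = { datatype = "date" },
      Quantity   = { datatype = "integer" },
      SalesPrice = { datatype = "decimal" },
    }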

We create a Composite type by right-clicking on the Types folder under the project or by clicking on Type on the ribbon bar, as with the Atomic type. We name the artifact as usual and Expressor will open the properties window:

[Screenshot: Creating Composite types]

We can click on the Add button to add a new Atomic type to our Composite type:

[Screenshot: Creating Composite types]

For each attribute, we can assign a datatype and any constraints. We can also assign a Semantic Type (Atomic type) that we have previously configured (Shared), or create a new Atomic type (New (local)) that will be added to the current project:

[Screenshot: Creating Composite types]

If we do assign a Semantic Type to the field, the constraint options will become grayed out because the constraints will be inherited from the Atomic type.

Configuring a schema

A schema represents the data that is being read or written. We can have different types of schema for different types of data. For example, text data is handled differently from database data, which is handled differently from QlikView data. A schema can be configured either from the data itself, which is the usual way, or from a Composite type that we have already configured.

To configure a schema, we can either right-click the Schemas folder under the project or click on the Schema button in the ribbon bar:

[Screenshot: Configuring a schema]

A wizard will open that will allow us to configure the schema from the data. For file sources, we can either browse to the file or we can just paste some sample rows (browsing for the file will just load the first 10 rows):

[Screenshot: Configuring a schema]

We then specify the Field Delimiter value and the Record Delimiter value, any quote information, and the Encoding type. When we click on Next, the dialog will preview the data and show us the column headings. We can use the first row of the data for column headings by clicking on the Set All Names from Selected Row button:

[Screenshot: Configuring a schema]

After clicking on Next again, we can give a name to the schema (usual artifact name rules apply) and click on Finish to save the schema. Once we have saved the schema, we need to edit the details—we right-click on it and select Open:

[Screenshot: Configuring a schema]

We note that the schema has been assigned a generated Composite type (CompositeType1) and that the input fields are mapped to it. However, we want to assign our Composite type that we have already configured.

On clicking the Actions button to the right-hand side of CompositeType1, we can select to add a Shared type. When we select our Composite type, we will be prompted to generate a new mapping:

[Screenshot: Configuring a schema]

We would normally choose Yes to allow the system to make the mappings for us:

[Screenshot: Configuring a schema]

We do need to tweak a couple of the mappings. If we click on the link from OrderDate to OrderDate and then click on the pencil icon above it (or just double-click on the link), we can enter the correct format string for Expressor to interpret the text:

[Screenshot: Configuring a schema]

In this case, the data is in UK date format, so we need to specify DD/MM/CCYY, where D represents day, M represents month, C represents century, and Y represents year.
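
To make the format string concrete, this is roughly what the parse amounts to, expressed as a small Lua function. It is for illustration only; Expressor performs this conversion itself from the DD/MM/CCYY format string:

    -- Parse a UK-format date such as "25/12/2014" (DD/MM/CCYY).
    function parseUKDate(text)
      local d, m, y = string.match(text, "^(%d%d)/(%d%d)/(%d%d%d%d)$")
      if not d then
        error("not a DD/MM/CCYY date: " .. text)
      end
      return { day = tonumber(d), month = tonumber(m), year = tonumber(y) }
    end

    local dt = parseUKDate("25/12/2014")
    print(dt.year, dt.month, dt.day)  -- 2014  12  25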

We should also edit the mapping for the sales price field because the data contains a dollar symbol. Expressor allows us to take care of that:

[Screenshot: Configuring a schema]

We can now select the CompositeType1 type from the drop-down menu and use the Actions button to delete it. Hit the Save button to save the schema.

This schema that we have just created can be used to either read or write text files. In fact, it is a good idea to design your schemas based on the required output rather than the inputs.

Creating and packaging a basic dataflow

Now that we have configured connections, types, and schemas, we can create a simple dataflow to move data from a text file into a QVX file for consumption by QlikView.

Understanding the dataflow toolbox

When creating a dataflow, we have a toolbox of different operators available for use within the dataflow. There are four categories:

  • Inputs
  • Outputs
  • Transformers
  • Utility

Inputs

The Inputs toolbox contains eight options (depending on what extensions you have turned on), each used to read data:

[Screenshot: The Inputs toolbox]

Each of the read operators is described here:

  • Read File: The Read File operator connects to a file using one of the file connections and one of the schemas that we have built.
  • Read Table: The Read Table operator connects to a table using a database connection and reads the data.
  • SQL Query: The SQL Query operator executes a query that has been defined in a schema against a database connection.
  • Read Custom: Read Custom allows you to run a data script to read and generate data that can be passed to another operator; this is an extremely powerful option (see the sketch after this list).
  • Read Lookup Table: This operator reads a lookup table that has been populated by another dataflow step.
  • Read Excel: This is part of the Excel extension and allows us to read data from Excel files.
  • Read QlikView: This operator is part of the QlikView extension and can read from QVW, QVD, and QVX files.
  • Read QVX Connector: The Read QVX Connector operator can read data from a QVX connector.
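
To give a flavor of Read Custom, the data script is Lua that produces records one at a time. Treat the following as a hedged sketch of the idea only; the exact entry points that Expressor's datascript expects are not documented here, so the function shape is an assumption:

    -- Hypothetical custom read: generate ten rows of test data.
    local counter = 0

    function read()
      counter = counter + 1
      if counter > 10 then
        return nil  -- signal end of data
      end
      return { RowID = counter, Value = counter * 100 }
    end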

Outputs

The Outputs toolbox contains nine operators (depending on what extensions you have turned on), each used to write data in different ways:

[Screenshot: The Outputs toolbox]

Each of the write operators is described here:

  • Write File: The Write File operator writes data to a text file in a folder specified by a file connection.
  • Write Table: Write Table uses a database connection to write to a database table. We can specify that a table should be created in the database if it does not exist.
  • Write Teradata PT: This allows you to write data to Teradata using Parallel Transporter. Note that you will need to download additional client libraries (TTU v13.10 or later).
  • Write Custom: Write Custom allows you to write data out using a data script. This is a powerful feature.
  • Write Lookup Table: This is used to populate an Expressor lookup table, not unlike a QlikView mapping table.
  • Write Parameters: This allows us to generate a parameters file that can be used to pass parameters to other operators in the dataflow.
  • Trash: This is an interesting option: Trash takes an input and does nothing with it, as if you had thrown it away. It can be useful during development and troubleshooting.
  • Write Excel: This uses the Excel extension to create Excel output.
  • Write QlikView: This uses the QlikView extension to generate QVX output.

Transformers

Transformers are operators that allow us to transform data. As such, they will form a central part of almost any dataflow. There are six operators available:

[Screenshot: The Transformers toolbox]

Each of the transformer operators is described here:

  • Aggregate: This operator allows you to perform grouping and aggregation on data.
  • Join: The Join operator allows us to join data tables together, with inner, left, right, and outer joins.
  • Transform: This is the core transformation operator, where we apply many of our business rules.
  • Multi-Transform: This operator allows multiple transformations to be performed, with up to nine different output streams.
  • Pivot Row: This is similar to the CrossTable function in QlikView; it takes data in columns and generates a new row for each column.
  • Pivot Column: The Pivot Column operator is the opposite of Pivot Row; it takes multiple rows and creates one row with multiple columns.

Utility

The Utility toolbox contains several operators that operate on data in ways that are not transformative but are useful in an ETL system. There are six operators available:

[Screenshot: The Utility toolbox]

Each of the utility operators is described here:

  • Buffer: This is a useful operator where there may be issues with the timing of arrival of records at a multi-input operator (such as a Join). It will temporarily buffer data to disk until the next operator is ready to process it.
  • Copy: The Copy operator takes one input stream and allows us to split it into up to 10 output streams, each containing the same data.
  • Filter: The Filter operator allows us to create rules to filter data into multiple different output streams.
  • Sort: The Sort operator does what we expect: it sorts the data. We can assign a certain amount of memory for the operator to use, as well as disk storage if it needs it.
  • Funnel: Funnel is similar to QlikView's Concatenate but more like SQL's UNION; it accepts multiple input streams and returns their union in one output stream.
  • Unique: The Unique operator returns just one row for each distinct value of a key field (see the sketch after this list).
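
The semantics of Unique are easy to picture in code. Here is a minimal Lua sketch of keep-the-first-row-per-key deduplication, for illustration only (the names are invented):

    -- Keep the first row seen for each distinct key value.
    function unique(rows, keyField)
      local seen, result = {}, {}
      for _, row in ipairs(rows) do
        local k = row[keyField]
        if not seen[k] then
          seen[k] = true
          table.insert(result, row)
        end
      end
      return result
    end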

Creating the dataflow

We add a new dataflow in a similar manner to other artifacts—right-click or use the ribbon. When the dataflow is first added, a blank workspace appears:

[Screenshot: Creating the dataflow]

To the left-hand side, we have the operator panel. We can click-and-drag any operator from the panel onto the dataflow workspace:

[Screenshot: Creating the dataflow]

On the right-hand side is a properties panel that allows us to set the properties for the currently selected object.

Configuring a Read File operator

After we have dragged an operator such as Read File onto the dataflow, we need to modify its properties:

[Screenshot: Configuring a Read File operator]

The properties that you need to fill out are as follows:

  • Name: A free-text name that we want to apply to the operator.
  • Connection: The file connection from which you want to read the file. All available connections will be in the dropdown.
  • Schema: The name of the schema that you will use.
  • Type: The Composite type that the schema will map to.
  • Mapping: The mapping set that will be used.
  • File name: The name of the file.
  • Quotes: Choose May have quotes or No quotes, depending on whether the file will have quotes or not.
  • Skip rows: If your text file has a header row, you will want to set this to 1.
  • Error handling: Either abort the dataflow, skip the record, reject the record, skip the remaining records, or reject the remaining records. Rejected records are sent out on the rejected records stream.
  • Show errors: Set whether errors are shown or not.

Adding a Transformation operation

If we drag a Transform operator from the Transformers panel onto the dataflow, we can then click on the output of the Read File operator and drag the mouse to the input of the Transform operator:

[Screenshot: Adding a Transformation operation]

The Read File operator should now change to green because it is complete and doesn't need us to do anything else. If it doesn't, we need to look at the messages panel to find out what we have forgotten!

If we click on the Edit Rules button on the Transform operator's properties panel (or double-click on the Transform operator), the edit rules dialog opens:

[Screenshot: Adding a Transformation operation]

The list of fields on the left-hand side is the list from the incoming stream. The list on the right-hand side is the list of fields that will be output. With nothing else changed, we see that the lists match. Every input field will be output. The white arrow to the left of each output field indicates that they are being output because they are input fields.

We can block an input field from being output by selecting that field in the left-hand side list and clicking on the Block Inputs button on the ribbon bar. If we do that for the LineValue field, the lists will look like this:

[Screenshot: Adding a Transformation operation]

If we want to actually output a field that contains the line value, we can calculate it from the Quantity and SalesPrice fields. We need to first click on the Add button on the ribbon bar to add a new attribute to the Outputs list—for example, LineSalesAmount.

We then click on the New Rule button on the ribbon bar and select Expression Rule. We drag Quantity and SalesPrice onto the Add Input side of the rule and we drag our new LineSalesAmount field onto the Add Output side of the rule. In the expression area, we can replace nil with Input.Quantity*Input.SalesPrice:

[Screenshot: Adding a Transformation operation]
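
Expression rules are written in Lua, so the rule we just created amounts to the following. It is shown here as a standalone function purely for illustration; in Expressor, only the expression itself is typed into the rule editor:

    -- LineSalesAmount = Quantity * SalesPrice
    function lineSalesAmount(Input)
      return Input.Quantity * Input.SalesPrice
    end

    print(lineSalesAmount({ Quantity = 5, SalesPrice = 19.99 }))  -- 99.95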

Creating a QVX output

Once we have configured the Transform operator, we can drag a Write QlikView operator from the Outputs panel onto the dataflow and connect the output of the Transform operator to the input of the Write QlikView operator.

Note

QVX is an open format that allows any developer to extract data into a file or data stream that QlikView and Qlik Sense can read. Like QVD files, QVX files only contain one table of data. At the time of writing, the only Qlik format that Expressor can write is QVX.

We then set the properties of the output. Now, we haven't defined a schema for the output file, but Expressor has a nice facility where it can generate a new schema from the output of the previous operator. Clicking on the Actions button to the right-hand side of the Schema dropdown gives us a menu where we can select this option.

[Screenshot: Creating a QVX output]

Once all the operators have been configured, they should all turn green. Now we can save the dataflow and test it. If all goes well, we should have a QVX file in the output folder.

Packaging the dataflow

Now that we have a working dataflow, we can package it up.

We simply add a new deployment package to the project and we can drag our dataflow into the Compiled Dataflows section.

That's it! The package can now be used in Expressor connectors and also with the integration engine on a schedule.
