In June 2012, Qlik announced the purchase of Expressor Software. The press release talked about a metadata intelligence solution for data lineage and data governance, but what exactly is this product?
There are a couple of parts to the technology that are interesting. The main business of Expressor Software was the creation of an ETL tool that could connect to multiple different data sources, read data, and write it out to different locations. As part of this, they happened to create connectors that could connect to QlikView files—QVW, QVD, and QVX—and read both data and metadata from those files. They also created connectors to write out data in QVX format. Obviously, they felt that the QlikView market was worth going after.
Almost as a side effect, they created the genesis of what is today the QlikView Governance Dashboard. Using their technology, they could connect to QlikView files and read enough metadata to build a full governance solution for a QlikView implementation. This was a big deal because governance was an area in which competitors regularly attacked Qlik. Now there was an effective solution, and Qlik liked it so much that they bought the company.
Expressor actually comprises three major components:
As an ETL tool, Expressor Studio is quite intuitive for those who have experience with other ETL tools. It has some differences but many similarities.
Most ETL tools will have some kind of scripting/development language to enable the building of business rules to be applied to data during the transformation stage. With Expressor, that language is Lua:
One thing that makes Expressor different is its ability to partition data on the fly during data loads and make the data loading process multithreaded. Most Qlik developers will be familiar with data being loaded and processed one row at a time. Expressor intelligently partitions the entire load into separate sets of rows and then loads each of these partitions simultaneously. This can significantly reduce data load times.
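As an illustration only, the partition-and-load idea can be sketched in Python. Expressor's actual partitioning logic is internal to its engine; the function names here are invented for the sketch:

```python
# Illustrative sketch of partitioned, multithreaded loading.
# Expressor does this internally; this only demonstrates the concept
# of splitting a load into row sets and processing them concurrently.
from concurrent.futures import ThreadPoolExecutor

def load_partition(rows):
    # Stand-in for the per-partition work (parsing/transforming rows).
    return [row.upper() for row in rows]

def partitioned_load(rows, partitions=4):
    # Split the full load into roughly equal sets of rows...
    size = max(1, len(rows) // partitions)
    chunks = [rows[i:i + size] for i in range(0, len(rows), size)]
    # ...and process each partition simultaneously, preserving order.
    with ThreadPoolExecutor(max_workers=partitions) as pool:
        results = pool.map(load_partition, chunks)
    return [row for chunk in results for row in chunk]

print(partitioned_load(["a", "b", "c", "d", "e"]))
```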
Why, when QlikView and Qlik Sense already have a rich and powerful scripted ETL ability (as we have seen already), would we consider using Expressor instead?
The very simple answer is control. By using Expressor, or any other ETL tool, to create the QVD or QVX data model layer, we take control of data provisioning and manage it centrally. Policies and security are put in place to ensure that QlikView users, whatever their ability level, cannot access enterprise data other than via the QlikView data layer.
This could be seen as a downside by QlikView developers, who may be anxious to get applications built and deployed. However, granting such access to the enterprise data systems is not seen as a good practice in data governance. We extract the data, apply common business rules, and deploy the QlikView data layer from a central tool.
We can probably still argue that the data layer could be provisioned using QlikView. However, there are very good reasons to use Expressor instead:
Within QlikView Expressor, we partition our work into different organizational units to better manage what we are doing. The terminology is very different from QlikView's, so we need to understand it now.
A workspace is a collection of related projects. As with most things in this regard, there are no hard-and-fast rules about how many workspaces you need to create. Some organizations may have one. Others have one for every project. The norm is somewhere in between the two. We will probably have a workspace for related areas of ETL—perhaps by line-of-business or by data source.
There are two types of workspace: standalone and repository. A standalone workspace is stored locally on the developer's workstation, whereas a repository workspace is stored in the Expressor repository. A standalone workspace can be converted to a repository workspace.
When we first open QlikView Expressor, we are presented with some options for workspaces:
When we select the New Workspace… option, we are presented with the following screen:
We can choose whether our new workspace is a Standalone Workspace or a Repository Workspace. If we select Standalone Workspace, we specify the path where the workspace will be stored. If we select Repository Workspace, we provide the connection information for where the repository is stored.
The repository is an Expressor implementation of the Subversion versioning system. This will be available with an Expressor server and is useful for multideveloper environments.
In QlikView Expressor, extensions are code libraries that allow Expressor to read and write to different file types and databases. There are some extensions that are installed out-of-box (such as the QlikView Extension), but we need to make sure that they are enabled. We need to access the Manage Extensions… option from the Desktop menu:
When we select the menu option, the Manage Extensions dialog opens:
Within this window, we can use the Current Workspace Settings tab to enable or disable extensions. We can use the Installs tab to install a new extension or uninstall an existing one.
Basically, a library and a project are the same thing. Both of them are storage locations for other objects, such as data flows, connection information, and so forth. The only difference is that a library cannot contain a deployment package—only a project can produce one of these packages for the integration engine.
A library is used to hold objects, or artifacts as they are called in Expressor, which will be shared among other projects.
To add a new project or library, you can select the Project or Library buttons on the ribbon bar or you can right-click on the workspace name:
To add either, we just need to fill in a name and description in the dialog and click on the Create button:
When we create the project, we will see several folders—one for each of the types of artifact that may make up our project:
The different artifacts are as follows:
Before we can read or write data with QlikView Expressor, we need to configure a connection. We have a choice of three different connection types. To add a connection, we can either right-click on the Connections folder under the project or click on the Connection button in the ribbon bar:
Here is a description of the different connection types:
Configuring a File connection is quite straightforward. We just need to know the path to the folder:
After clicking on the Next button, we can enter a name (remember, no spaces) for the connection and a description:
We continue adding File connections for every folder that we are going to read from or write to.
If we are going to have many projects reading from or writing to the same set of folders, the connections should be configured in a library.
The connection to a database is fairly straightforward. Expressor comes with drivers installed for the following:
In addition, Expressor will natively support (can use its own property dialogs to configure the connection) the following drivers if they are installed from the vendor websites:
Finally, Expressor will also support other ODBC drivers, but a DSN will need to be configured outside of Expressor.
To add the database connection, we first need to select the correct driver to use:
We then fill in the database specific connection information (Expressor will test the connection before allowing you to continue) and then give the connection a name.
The QVX Connector Connection uses the same connector that we would use in QlikView to connect to an on-demand package. The only packages that can be used are those that will generate a QVX output.
There is a slightly different approach here in that we name the connection before we enter the connection details:
When we click on the Create button, Expressor will go ahead and save the connection and open the properties for us to edit.
We select the QlikViewExpressorConnector.exe option as the connector to use (you may also see the governance connector in the list, if you have the Governance Dashboard installed on the same machine). Click on the Actions button and select Build Connection String:
The instructions in the dialog box are as follows: browse to the workspace folder (for example, c:\users\username\documents\expressor\Workspaces) and select a Project folder that contains a package (the dialog won't let you select a folder unless there is a package in it).

As we just mentioned, there are two kinds of types, Atomic and Composite.
At its simplest, an Atomic type is simply a basic type such as string, date, number, and so forth. We can extend these basic types by adding constraints, such as length or value, to those basic types—implementing business rules.
We add a new Atomic type by right-clicking on the Types folder under the project or by clicking on Type in the ribbon bar:
We can now give a name to our Atomic type and Expressor will open the properties page for us to enter basic type and constraint information:
Depending on the base datatype, we can set different levels of constraint. If the constraint rule fails, we can set different corrective actions to be performed. The default is that the type should Escalate the breach of the constraint, which would normally throw an error in the dataflow.
Once we have set our constraints, we can save the Atomic type.
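To make the idea concrete, here is a rough Python sketch of what an Atomic type with a constraint and a corrective action amounts to. The class name, constraint, and action values are invented for illustration and are not Expressor's API:

```python
# Sketch of an Atomic type: a base datatype plus a constraint
# and a corrective action. "escalate" mimics Expressor's default
# of raising an error when the constraint is breached.
class AtomicString:
    def __init__(self, max_length, action="escalate"):
        self.max_length = max_length
        self.action = action  # "escalate" raises; "truncate" corrects

    def apply(self, value):
        if len(value) <= self.max_length:
            return value
        if self.action == "escalate":
            raise ValueError(f"constraint breached: {value!r} exceeds {self.max_length}")
        return value[: self.max_length]  # corrective action

product_code = AtomicString(max_length=3, action="truncate")
print(product_code.apply("ABCDEF"))  # ABC
```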
A Composite type is a collection of types that we can map our data onto. So, for example, we can create an order Composite type that represents exactly how we think an order should look. When we import external data, we can map that external data to our Composite type. By mapping everything to a Composite type, which can also encapsulate data constraints, we ensure consistency.
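The mapping idea can be sketched in a few lines of Python. The field names and coercions here are hypothetical; Expressor handles this through its own type system:

```python
# Sketch of a Composite type: a named collection of typed fields
# that external records are mapped onto, giving every source
# one consistent shape.
order_type = {
    "OrderID": int,
    "Quantity": int,
    "SalesPrice": float,
}

def map_to_composite(record, composite):
    # Coerce each incoming value to the type the composite demands.
    return {field: cast(record[field]) for field, cast in composite.items()}

raw = {"OrderID": "1001", "Quantity": "3", "SalesPrice": "19.99"}
print(map_to_composite(raw, order_type))
```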
We create a Composite type by right-clicking on the Types folder under the project or by clicking on Type on the ribbon bar, as with the Atomic type. We name the artifact as usual and Expressor will open the properties window:
We can click on the Add button to add a new Atomic type to our Composite type:
For each attribute, we can assign a datatype and any constraints. We can also assign a Semantic Type (an Atomic type) that we have previously configured (Shared), or create a new Atomic type (New (local)) that will be added to the current project:
If we do assign a Semantic Type to the field, the constraint options will become grayed out because the constraints will be inherited from the Atomic type.
A schema represents the data that is being read or being written. We can have different types of schema for different types of data. For example, text data is handled differently than database data, which is handled differently from QlikView data. A schema can be configured from either the data itself—which is the usual way—or from a Composite type that we have already configured.
To configure a schema, we can either right-click on the Schemas folder under the project or click on the Schema button in the ribbon bar:
A wizard will open that will allow us to configure the schema from the data. For file sources, we can either browse to the file or we can just paste some sample rows (browsing for the file will just load the first 10 rows):
We then specify the Field Delimiter value and the Record Delimiter value, any quote information, and the Encoding type. When we click on Next, the dialog will preview the data and show us the column headings. We can use the first row of the data for column headings by clicking on the Set All Names from Selected Row button:
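For readers who want to see the equivalent parsing decisions in code, the same delimiter and quote choices can be sketched with Python's csv module. The sample rows here are made up:

```python
# Sketch of the wizard's parsing decisions: field delimiter,
# quote character, and taking column names from the first row.
import csv
import io

sample = 'OrderID,OrderDate,SalesPrice\n1001,25/03/2012,"$19.99"\n'
reader = csv.reader(io.StringIO(sample), delimiter=",", quotechar='"')
rows = list(reader)
header, data = rows[0], rows[1:]  # first row supplies the column names
print(header)  # ['OrderID', 'OrderDate', 'SalesPrice']
```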
After clicking on Next again, we can give a name to the schema (usual artifact name rules apply) and click on Finish to save the schema. Once we have saved the schema, we need to edit the details—we right-click on it and select Open:
We note that the schema has been assigned a generated Composite type (CompositeType1) and that the input fields are mapped to it. However, we want to assign the Composite type that we have already configured.
On clicking the Actions button to the right-hand side of CompositeType1, we can select to add a Shared type. When we select our Composite type, we will be prompted to generate a new mapping:
We would normally choose Yes to allow the system to make the mappings for us:
We do need to tweak a couple of the mappings. If you click on the link from OrderDate to OrderDate and then click on the pencil icon above it (or just double-click on the link), we can enter the correct format string for Expressor to interpret the text:
In this case, the data is in UK date format, so we need to specify DD/MM/CCYY, where D represents day, M represents month, C represents century, and Y represents year.
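Expressor's DD/MM/CCYY mask corresponds to a day/month/four-digit-year pattern; in Python's strptime notation, for comparison, that would be %d/%m/%Y. The sample value is made up:

```python
# Parsing a UK-format date string equivalent to Expressor's
# DD/MM/CCYY mask (day, month, century+year).
from datetime import datetime

order_date = datetime.strptime("25/03/2012", "%d/%m/%Y")
print(order_date.date())  # 2012-03-25
```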
We should also edit the mapping for the sales price field because there is a dollar symbol. Expressor allows us to take care of that:
We can now select the CompositeType1 type from the drop-down menu and use the Actions button to delete it. Hit the Save button to save the schema.
This schema that we have just created can be used to either read or write text files. In fact, it is a good idea to design your schemas based on the required output rather than the inputs.
Now that we have configured connections, types, and schemas, we can create a simple dataflow to move data from a text object into a QVX file for consumption by QlikView.
When creating a dataflow, we have a toolbox available of different operators that we can use within the dataflow. There are four categories:
The Inputs toolbox contains eight options (depending on what extensions you have turned on), each used to read data:
This table describes each of the read operators:
The Outputs toolbox contains nine operators (depending on what extensions you have turned on), each used to write data in different ways:
This table describes each of the write operators:
Transformers are operators that allow us to transform data. As such, they will form a central part of almost any dataflow. There are six operators available:
This table describes each of the transformer operators:
The Utility category contains operators that act on data in ways that are not transformative but are useful in an ETL system. There are six operators available:
This table describes each of the utility operators:
We add a new dataflow in a similar manner to other artifacts—right-click or use the ribbon. When the dataflow is first added, a blank workspace appears:
To the left-hand side, we have the operator panel. We can click-and-drag any operator from the panel onto the dataflow workspace:
On the right-hand side is a properties panel that allows us to set the properties for the currently selected object.
After we have dragged an operator such as Read File onto the dataflow, we need to modify its properties:
The properties that you need to fill out are as follows:
If we drag a Transformation operator from the Transformation panel onto the dataflow, we can then click on the output of the Read File operator and drag the mouse to the input of the Transformation operator:
The Read File operator should now change to green because it is complete and doesn't need us to do anything else. If it doesn't, we need to look at the messages panel to find out what we have forgotten!
If we click on the Edit Rules button on the Transformation operator's properties panel (or double-click on the Transformation operator), then the edit rules dialog opens:
The list of fields on the left-hand side is the list from the incoming stream. The list on the right-hand side is the list of fields that will be output. With nothing else changed, we see that the lists match. Every input field will be output. The white arrow to the left of each output field indicates that they are being output because they are input fields.
We can block an input field from being output by selecting that field in the left-hand side list and clicking on the Block Inputs button on the ribbon bar. If we do that for the LineValue field, the lists will look like this:
If we want to actually output a field that contains the line value, we can calculate it from the Quantity and SalesPrice fields. We need to first click on the Add button on the ribbon bar to add a new attribute to the Outputs list—for example, LineSalesAmount.
We then click on the New Rule button on the ribbon bar and select Expression Rule. We drag Quantity and SalesPrice onto the Add Input side of the rule and drag our new LineSalesAmount field onto the Add Output side. In the expression area, we replace nil with Input.Quantity*Input.SalesPrice:
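Expressor evaluates that expression once per record flowing through the operator. The effect, sketched as a Python analogue with made-up sample values:

```python
# Python analogue of the expression rule
# LineSalesAmount = Input.Quantity * Input.SalesPrice,
# applied to each record in the stream.
def expression_rule(record):
    out = dict(record)
    out["LineSalesAmount"] = record["Quantity"] * record["SalesPrice"]
    return out

print(expression_rule({"Quantity": 3, "SalesPrice": 19.99}))
```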
Once we have configured the Transformation operator, we can drag a Write QlikView operator from the Outputs panel onto the dataflow and connect the output of the Transformation operator to the input of the Write QlikView operator.
We then set the properties of the output. Now, we haven't defined a schema for the output file, but Expressor has a nice facility where it can generate a new schema from the output of the previous operator. Clicking on the Actions button to the right-hand side of the Schema dropdown gives us a menu where we can select this option.
Once all the operators have been configured, they should all turn green. Now we can save it and test it. If all goes well, we should have a QVX file in the output folder.