Chapter 6.3

Introduction to Data Vault Architecture

Abstract

One of the most important components of the end-state architecture is that of the data vault. The data vault exists to satisfy the need for rock-solid data integrity. Like all other components of the end-state architecture, the data vault has gone through its own evolution. And like all components of the end-state architecture, data vault will continue to evolve.

Keywords

Data vault; Staging area; Landing zone; Data mart; Hadoop; NoSQL; SQL; Surrogate key; Primary key; Conformed state; Federated query

What Is a Data Vault 2.0 Architecture?

The Data Vault architecture is based on a three-tier data warehouse architecture. The tiers are commonly identified as the staging or landing zone, the data warehouse, and the information delivery layer (or data marts).

The multiple tiers allow implementers and designers to decouple the enterprise data warehouse from both the sourcing and acquisition functions and the information delivery and data provisioning functions. In turn, the team becomes more nimble, and the architecture is more resilient to failure and more flexible in responding to changes (Fig. 6.3.1).

Fig. 6.3.1 Data Vault 2.0 architecture overview.

The tiers are staging, the EDW, and the information marts or information delivery layer. Regardless of the platforms and technology utilized for implementation, these layers will continue to exist. However, as the system nears full real-time enablement, the need for and dependency on the staging area will decline. True real-time data will feed directly into the EDW layer.

In addition to the three tiers, the Data Vault 2.0 architecture dictates several different components:

  (a) The use of Hadoop or NoSQL to handle big data.
  (b) The nature of real-time information flowing both IN and OUT of the business intelligence ecosystem; in turn, this also evolves the EDW into an operational data warehouse over time.
  (c) The use of managed self-service BI through write-back and master data capabilities, enabling total quality management (TQM) as well.
  (d) The split of hard and soft business rules, making the enterprise data warehouse a system of record for raw facts that are loaded over time.

How Does NoSQL Fit Into the Architecture?

NoSQL platform implementations will vary. Some will contain SQL-like interfaces; some will contain relational database technology integrated with nonrelational technology. The line between the two (RDBMS and NoSQL) will continue to blur. Eventually, there will be a single “data management system” capable of housing both relational and nonrelational data by design.

The NoSQL platform today is, in most cases, based on Hadoop at its core, which is built around the Hadoop distributed file system (HDFS) and its metadata management for files across directories. Various implementations of SQL access layers and in-memory technologies sit on top of HDFS.

Once atomicity, consistency, isolation, and durability (ACID) compliance is achieved (which is available today from some NoSQL vendors), the differentiation between RDBMS and NoSQL will fade. Note that not all Hadoop or NoSQL platforms offer ACID compliance today, and not all NoSQL platforms offer in-place updates of records, which makes it impossible for them to completely supplant RDBMS technology.

This is changing quickly. Even as this section is written, the technology continues to advance. Eventually, the technology will be seamless, and what is purchased from the vendors in this space will be hybrid-based.

The current positioning of a platform like Hadoop is to leverage it as an ingestion and staging area for any and all data that might proceed to the warehouse. This includes structured data sets (delimited files and fixed-width columnar files); multistructured data sets such as XML and JSON files; and unstructured data such as Word documents, Excel workbooks, video, audio, and images.

Ingesting a file into Hadoop is quite simple: copy the file into a directory that is managed by Hadoop. From that point, Hadoop splits the file across the multiple nodes or machines that are registered as part of its cluster.
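As an illustration, the short Python sketch below copies a local extract file into an HDFS-managed landing directory. It assumes the standard hdfs dfs command-line client is on the PATH; the directory and file names are hypothetical.

# Minimal sketch: ingest a local file into an HDFS-managed landing directory.
# Assumes the `hdfs dfs` client is installed; /landing/sales is a hypothetical path.
import subprocess

def ingest_to_hdfs(local_path: str, hdfs_dir: str = "/landing/sales") -> None:
    # Copy the file into HDFS; Hadoop then splits it into blocks across the cluster.
    subprocess.run(["hdfs", "dfs", "-put", "-f", local_path, hdfs_dir], check=True)

if __name__ == "__main__":
    ingest_to_hdfs("customer_extract.csv")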

The second purpose for Hadoop (a best practice today) is to leverage it as a place to perform data mining, using tools such as SAS or R, or to perform textual mining. The results of these mining efforts are often structured data sets that can, and should, be copied into relational database engines, making them available for ad hoc querying.
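To make that last step concrete, the sketch below loads a structured mining result (a CSV scored on the Hadoop side) into a relational engine for ad hoc querying. SQLite stands in for the RDBMS here, and the table and column names are hypothetical.

# Minimal sketch: copy a structured mining result set into a relational engine.
import csv
import sqlite3

def load_results(csv_path: str, db_path: str = "warehouse.db") -> None:
    conn = sqlite3.connect(db_path)
    # Hypothetical result-set layout: one churn score per customer.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS mining_results "
        "(customer_key TEXT, churn_score REAL, scored_date TEXT)"
    )
    with open(csv_path, newline="") as f:
        rows = [(r["customer_key"], float(r["churn_score"]), r["scored_date"])
                for r in csv.DictReader(f)]
    conn.executemany("INSERT INTO mining_results VALUES (?, ?, ?)", rows)
    conn.commit()
    conn.close()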

What Are the Objectives of the Data Vault 2.0 Architecture?

There are several objectives of the Data Vault 2.0 architecture; they are listed below:

  (a) To seamlessly connect existing relational database systems with new NoSQL platforms
  (b) To engage business users and provide space for managed self-service BI (write-back and direct control over data in the data warehouse)
  (c) To provide for real-time arrival directly into the data warehouse environment without forcing a landing in the staging tables
  (d) To enable agile development by decoupling the ever-changing business rules from the static data alignment rules

The architecture plays a key role in the separation of responsibilities, isolating data acquisition from data provisioning. By separating responsibilities and pushing ever-changing business rules closer to the business user, implementation teams become more agile.

What Is the Objective of the Data Vault 2.0 Model?

The objective is to provide seamless platform integration, or at least to make such integration possible by design. The design leverages several basic elements. The first is found in the Data Vault 2.0 model: the use of hash keys (replacing surrogate sequences as primary keys). Hash keys allow the implementation of parallel, decoupled loading practices across heterogeneous platforms. The hash keys and loading process are introduced and discussed in the implementation and modeling sections of this chapter.
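A minimal sketch of a hash-key calculation follows. Data Vault 2.0 implementations commonly hash the business key (MD5 and SHA-1 are typical choices), but the exact standardization rules (trimming, casing, delimiter) vary by team, so the choices below are illustrative assumptions rather than a standard.

# Minimal sketch: derive a hash key from a business key.
import hashlib

def hash_key(*business_key_parts: str, delimiter: str = "||") -> str:
    # Illustrative standardization: trim and upper-case each part, then join and hash.
    normalized = delimiter.join(p.strip().upper() for p in business_key_parts)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# Because the same calculation can run on the RDBMS side and on the Hadoop side,
# both platforms arrive at the same key independently, which is what allows
# parallel, decoupled loads and later cross-system joins.
print(hash_key("CUST-00042"))            # single-part business key
print(hash_key("CUST-00042", "SALES"))   # composite business key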

That said, the hash keys provide the connection between the two environments, allowing cross-system joins to occur where possible. Performance of the cross-system join will vary depending on the NoSQL platform chosen and the hardware infrastructure underneath it. Fig. 6.3.2 shows an example data model that provides a logical foreign key between a relational DBMS and a Hadoop-stored satellite.

Fig. 6.3.2 Hadoop-based satellite.

In other words, the idea is to allow the business to augment their current infrastructure by adding a NoSQL platform to the mix while retaining the value and use of their currently existing RDBMS engines, not to mention all the historical data they already contain.

What Are Hard and Soft Business Rules?

Business rules are the requirements translated into code. The code manipulates the data and, in some cases, turns data into information. Part of the Data Vault 2.0 system of BI is to enable agility (which is covered a bit more in the methodology section of this chapter). Agility is enabled by first splitting the business rules into two distinct groups: hard rules and soft rules (Fig. 6.3.3).

Fig. 6.3.3 Hard and soft business rules.

The idea is to separate data interpretation from data storage and alignment rules. By decoupling these rules, the team can become increasingly agile, business users can be empowered, and the business intelligence solution can be moved toward managed self-service BI.
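To illustrate the split, the hypothetical Python sketch below shows one hard rule (type alignment applied on the way into the warehouse) and one soft rule (a business interpretation applied on the way out to an information mart). The rule names and the threshold are illustrative assumptions.

# Hard rule: aligns data types and storage format but never changes the meaning
# of the data; it is applied once, as data is loaded into the warehouse.
def hard_rule_cast_amount(raw_amount: str) -> float:
    return float(raw_amount.strip().replace(",", ""))

# Soft rule: interprets the data for the business (here, a hypothetical
# "large order" flag) and can change as often as the business changes its
# definition, without ever reloading the raw data in the warehouse.
def soft_rule_is_large_order(amount: float, threshold: float = 10_000.0) -> bool:
    return amount >= threshold

stored = hard_rule_cast_amount("12,500.00")   # raw fact, now typed and storable
print(stored, soft_rule_is_large_order(stored))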

Beyond that, the Data Vault 2.0-based data warehouse carries raw data, in a nonconformed state. That data are aligned with the business constructs known as business keys (which are defined in the data vault modeling section of this chapter).

The raw data, integrated by business keys, serve as a foundation for passing audits, especially since the data set is not in a conformed format. The idea of the Data Vault 2.0 model is to provide for data-warehousing-based storage of raw data so that, if necessary (due to an audit or other needs), the team can reconstruct or reassemble the source system data.

This, in turn, makes the Data Vault 2.0-based data warehouse a system of record, mostly because, after the data has been warehoused, the source systems are often shut down or replaced by newer sources. In other words, the Data Vault 2.0 data warehouse becomes the only place where one can find the raw history integrated by business key.

How Does Managed Self-Service BI Fit Into the Architecture?

First, understand that self-service BI in and of itself is a misnomer. It emerged in the market in the 1990s as federated query engines, also known as enterprise information integration. While it is a grand goal, it never truly overcame the technical challenges that vendors claimed it would. In the end, a data warehouse and business intelligence ecosystem are still needed in order to make accurate decisions. Hence, the term managed self-service BI is the feasible one, and it is readily applicable to the solution space discussed in this book.

That said, the Data Vault 2.0 architecture provides for managed SSBI capabilities through the injection of write-back data (reabsorbing data on multiple levels), either from direct applications (sitting on top of the data warehouse) or from external applications such as SAS, Tableau, QlikView, and Excel, where the data sets are physically “exported” from the tools after having been altered and are then fed back into the warehouse as another source.

The difference, then, is that the aggregations and the rest of the soft business rules rely on the new data in order to assemble the proper output for the business. The soft business rules (i.e., code layers) are managed by IT, while the processes are data-driven and the business manages the data. A simple example is allowing business users direct access to manage their own hierarchies.
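The hypothetical sketch below shows what such a data-driven soft rule looks like: the roll-up logic is fixed IT code, while the hierarchy it rolls up by is a mapping the business maintains through write-back (for example, from Excel or a master data screen). All table contents and names are illustrative.

# Data-driven soft rule: IT owns the roll-up code, the business owns the mapping.
from collections import defaultdict

# Business-managed hierarchy, written back into the warehouse as another source.
product_to_category = {"P1": "Hardware", "P2": "Hardware", "P3": "Software"}

# Raw facts held in the warehouse.
sales = [("P1", 100.0), ("P2", 250.0), ("P3", 75.0)]

def rollup(facts, hierarchy):
    totals = defaultdict(float)
    for product, amount in facts:
        totals[hierarchy.get(product, "Unassigned")] += amount
    return dict(totals)

# When the business edits its hierarchy, the next run of the same code
# produces the new roll-up; no IT change is required.
print(rollup(sales, product_to_category))   # {'Hardware': 350.0, 'Software': 75.0}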
