Chapter 6.5

Introduction to Data Vault Implementation

Abstract

One of the most important components of the end-state architecture is the data vault. The data vault exists to satisfy the need for rock-solid data integrity. Like all other components of the end-state architecture, the data vault has gone through its own evolution, and it will continue to evolve.

Keywords

Data vault; Big data; Patterns; Automation; Data warehousing; Business intelligence

Implementation Overview

The data vault system of BI provides implementation guidelines, rules, and recommendations as standards. As noted in previous sections of this chapter, well-defined standards and patterns are the key to success of agile, CMMI, Six Sigma, and TQM principles. These standards guide the implementation of

  •  the data model, finding business keys, designing entities, and applying key structures (a minimal hash-key sketch follows this list);
  •  the ETL/ELT load processes;
  •  the real-time messaging feeds;
  •  information mart delivery processes;
  •  virtualization of the information mart;
  •  automation best practices;
  •  business rules—hard and soft;
  •  write-back capabilities of managed self-service BI.
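
To make the first item in the list above concrete, here is a minimal sketch (in Python) of how a business key can become a hub key under Data Vault 2.0-style hashing. The table and column names (hub_customer_hkey, customer_bk, load_dts, record_source) are hypothetical, and MD5 is shown only as one common hashing choice rather than a mandate:

    import hashlib
    from datetime import datetime, timezone

    def hub_row(business_key: str, record_source: str) -> dict:
        """Build a hub-style row: a hash key derived from the business key,
        plus the standard load-date and record-source audit columns.
        All names here are illustrative, not prescribed."""
        normalized = business_key.strip().upper()        # normalize before hashing
        hash_key = hashlib.md5(normalized.encode("utf-8")).hexdigest()
        return {
            "hub_customer_hkey": hash_key,               # deterministic surrogate key
            "customer_bk": normalized,                   # the business key itself
            "load_dts": datetime.now(timezone.utc),      # load date/timestamp
            "record_source": record_source,              # originating system
        }

    # The same business key always yields the same hash key, which is what
    # makes the loading pattern repeatable and safe to run in parallel.
    print(hub_row("  cust-1001 ", "CRM"))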

Some of the objectives of managing implementation through working practices include meeting the needs of TQM, embracing master data, and assisting in alignment across business, source systems, and the enterprise data warehouse.

Before going any further, it is necessary to understand that the highest level of optimization can only be reached if the process, design, and implementation are pattern-based and data-driven.

What's So Important About Patterns?

Patterns make life easier. In the enterprise BI world, patterns enable automation and generation while reducing errors and error potential. Patterns are the heartbeat of the Data Vault 2.0 system of business intelligence. Once the team has accepted the principles that building a data warehouse or BI system is the same as building software, it is possible to extend that thought to pattern-driven design.

A pattern is a recurring solution to a problem within a context.

Christopher Alexander

Think about it: how often have IT teams said they “need one pattern for loading history, one pattern for loading current data, and yet another pattern for loading data in real time”? Other teams have made the statement that “this part of the data model works for these reasons, and this other part of the data model was constructed differently because of exceptions to the design rules.” Much of this contributes to what is commonly called conditional architecture.

Conditional architecture is defined as a pattern that only works for a specific case, often based on an IF condition. When the case changes (i.e., volume, velocity, or variety crosses the boundaries), the architecture needs to change. Thus, conditional architecture is born.

Conditional architecture is a horrible way to construct or design an enterprise BI solution. The reason is that when volume grows and timelines shrink (velocity changes), reengineering must take place in order to correct the design. This leads to a solution that continues to cost more and more money and takes longer and longer to change. In other words, it leads to a brittle architecture over time. This (especially in a big data solution) is a very bad construct.

At some point, the business can’t or won’t pay for the reengineering costs. This is typically when the solution is torn down and rebuilt (green-field approach). With the patterns of Data Vault 2.0 (both architecture and implementation), rearchitecture and reengineering are avoided for 98% of the cases where volume grows, velocity changes, and variety increases.

Having the right pattern/design based on mathematical principles means that the team no longer suffers reengineering because of changing requirements.

Why Does Reengineering Happen Because of Big Data?

Reengineering/redesign/rearchitecture happens because big data pushes three of the four axes in the following diagram. The more processing that must happen in smaller and smaller time frames, the more highly optimized the design must be. The more variety that must be processed in smaller time frames, the more highly optimized the design must be. Finally, the more volume that must be processed in smaller time frames (you guessed it), the more highly optimized the design must be.

Fortunately for the community, there is a finite set of process designs that have been proved to work at scale, and by leveraging MPP, scale-free mathematics, and set logic, these designs work both for small volumes and extremely large volumes without redesign.

Fig. 6.5.1 contains four axis labels: velocity, volume, time, and variety. In this figure, velocity is the speed of arrival of the data (i.e., latency of arrival); volume is the overall size of the data (on arrival at the warehouse); variety is the structured, semistructured, multistructured, or nonstructured classification of the data; and time is the allotted time frame in which to accomplish the given task (e.g., loading to the data warehouse). Let's examine a case study for how this impacts reengineering or even conditional architecture.

Scenario #1: Ten rows of data arrive every 24 hours, highly structured (tab delimited and a fixed number of columns). The requirement is to load the data to the data warehouse within a 6-hour window. The question is as follows: how many different architectures or process designs can be put together in order to accomplish this task? For the sake of argument, let's say there are 100 possibilities (even typing the data in by hand or typing it into Excel and then loading it to the database).

Fig. 6.5.1 Architectural changes and reengineering.

The design chosen by this team is to type the data in by hand at a SQL prompt as “insert into” statements.

Now, the parameters change:

Scenario #2: 1,000,000 rows of data, arriving every 24 hours, highly structured (tab delimited, fixed number of columns). The requirement is to load the data warehouse in a 4-hour window. The question is as follows: can the team use the same “process design” in order to accomplish the task?

Chances are the answer is no. The team must redesign, reengineer, and rearchitect the process design in order to accomplish the task in the allotted time frame. So, the redesign is complete. The team now deploys an ETL tool and introduces load logic for the data set.

Scenario #3: One billion rows of data, arriving every 45 minutes, highly structured. The requirement is to load the data warehouse in a 40-minute time frame (otherwise the queue of incoming data backs up). The question again is as follows: can the team use the same “process design” they just applied, in order to accomplish this task? Can the team execute without redesign?

Again, most likely the answer is no. The team must once again redesign the process because it doesn’t meet the service level agreement (requirements). This type of redesign occurs again and again until the team reaches a CMMI level 5 optimized state for the pattern.
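
The pressure behind each redesign is easier to see as raw throughput. The following back-of-the-envelope calculation (a small Python sketch using only the numbers stated in the three scenarios above) shows why the design that comfortably handles Scenario #1 has no chance at Scenario #3:

    # Sustained throughput each scenario demands, from the stated volumes and windows.
    scenarios = {
        "Scenario #1": (10, 6 * 3600),              # 10 rows in a 6-hour window
        "Scenario #2": (1_000_000, 4 * 3600),       # 1M rows in a 4-hour window
        "Scenario #3": (1_000_000_000, 40 * 60),    # 1B rows in a 40-minute window
    }
    for name, (rows, seconds) in scenarios.items():
        print(f"{name}: {rows / seconds:,.2f} rows per second")
    # Roughly 0, 69, and 416,667 rows per second, respectively: hand-typed inserts,
    # then a simple ETL job, then only a set-based, parallel design will do.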

The problem is that any significant change to any of the axes on the pyramid causes a redesign to occur. Left unchecked, this cycle leads to unsustainable systems that try (unsuccessfully) to deal with big data problems. The only solution is to find, mathematically, the correct design that will scale regardless of time, volume, velocity, or variety.

The Data Vault 2.0 implementation standards hand these designs to the BI solution team, regardless of the technology underneath. The implementation patterns applied to these designs scale with the data sets. They are based on mathematical principles of scale and simplicity, including some of the foundations of set logic, parallelism, and partitioning.

Teams that engage with the Data Vault 2.0 implementation best practices inherit these designs as an artifact for big data systems. By leveraging these patterns, the team no longer suffers from rearchitecture or redesign just because one or more of the axes/parameters change.
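
To make the notion of a single scale-free pattern more concrete, here is a minimal sketch, assuming tab-delimited partition files whose first column is the business key. The file layout, column positions, worker count, and the stubbed-out bulk insert are assumptions for illustration only; the point is that the unit of work (one partition, processed set-wise) never changes, and scale is handled purely by running more units in parallel:

    import csv
    import hashlib
    from concurrent.futures import ProcessPoolExecutor

    def load_partition(path: str) -> int:
        """Process one partition file as a set: derive hash keys for the whole
        batch, then hand it off in a single bulk operation (stubbed out here)."""
        with open(path, newline="") as f:
            rows = list(csv.reader(f, delimiter="\t"))
        batch = [
            (hashlib.md5(r[0].strip().upper().encode()).hexdigest(), *r)
            for r in rows
        ]
        # bulk_insert(batch)  # in practice: COPY, external tables, or a bulk loader
        return len(batch)

    def load_all(partition_files, workers=4):
        """The same pattern for one file or one thousand; only `workers` changes."""
        with ProcessPoolExecutor(max_workers=workers) as pool:
            return sum(pool.map(load_partition, partition_files))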

Why Do We Need to Virtualize Our Data Marts?

They should no longer be called data marts—they provide information to the business—therefore, they should be called information marts. There is a split between data, information, knowledge, and wisdom that should be recognized by the business intelligence community.

Virtualization means many things to many people. In this context, virtualized information marts are defined to be view-driven, whether they are implemented in a relational or nonrelational technology. Views are logical collections of data (mostly structured) on top of physical data storage. Note that the underlying storage may not be a relational table anymore; it might be a key-value store or a nonrelational file sitting in Hadoop.

The more virtualization (or views) that can be applied, the quicker and more responsive the IT team is to change. In other words, less physical storage means less physical management and maintenance costs. It also means faster reaction time for IT to implement, test, and release changes back to the business.
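
As a minimal sketch of this idea, assume a customer hub and satellite with hypothetical names and columns; the “mart” below is nothing but a view, so reshaping it for the business means re-issuing one statement rather than moving or reloading any data. SQLite is used here only so the example is self-contained:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE hub_customer (customer_hkey TEXT, customer_bk TEXT,
                                   load_dts TEXT, record_source TEXT);
        CREATE TABLE sat_customer (customer_hkey TEXT, load_dts TEXT,
                                   name TEXT, city TEXT);

        -- The information mart is virtual: simply a view joining hub and satellite.
        CREATE VIEW im_customer AS
        SELECT h.customer_bk, s.name, s.city, s.load_dts
        FROM hub_customer h
        JOIN sat_customer s ON s.customer_hkey = h.customer_hkey;
    """)

    # Business users query im_customer; IT can redefine the view at any time
    # without touching the physical storage underneath.
    for row in conn.execute("SELECT * FROM im_customer"):
        print(row)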

What Is Managed Self-Service BI?

Unfortunately, there is a term called self-service BI being thrown about in the marketplace. In the 1990s, the idea was applied to federated query engines, otherwise known as enterprise information integration (EII). The purpose and use of this type of engine has since morphed into the cloud and virtualization space.

One of the marketing statements in the 1990s (by these vendors) was as follows: “You don’t need a data warehouse…” The industry and the vendors learned that this simply isn’t a true statement. It wasn’t true then, and it certainly isn’t true now. Data warehouses (and business intelligence systems) are as important to the enterprise as the operational systems are, because the enterprise warehouse captures an integrated view of historical information, allowing gap analysis to occur across multiple systems.

If you give a child a bunch of finger paint (with no training and no instruction), will it make them a master artist, or will they simply make a big mess?

If a child is taught what to do with finger paint and where to paint—then provided some paper and paints—chances are they will paint on the paper instead of themselves. IT wants the business to succeed; IT should be an enabler, helping to integrate the proper paints for the right colors and providing the paper along with basic instruction on how to get at the information (Fig. 6.5.2).

Fig. 6.5.2 Illustrating managed self-service BI.

What the market realized is that IT is still needed in order to prepare the data, turn them into information, and make them usable by the business. IT is also needed to secure the data and offer access paths and encrypted information where necessary. Finally, IT is needed to assemble the data and integrate the historical data in an enterprise data warehouse. At the end of the day, managed self-service BI is necessary, because IT must manage the information and the systems being utilized by the business users.

Data Vault 2.0 provides the groundwork for understanding how to properly implement managed SSBI in enterprise projects. It covers the standards and best practices for achieving optimal goals.
