Chapter 6.1

Introduction to Data Vault 2.0

Abstract

One of the most important components of the end-state architecture is that of the data vault. The data vault exists to satisfy the need for rock-solid data integrity. Like all other components of the end-state architecture, the data vault has gone through its own evolution. And like all components of the end-state architecture, data vault will continue to evolve.

Keywords

Big data; Data vault; Lockheed Martin; NoSQL; Link; Hub; Satellite

“Data Vault 2.0” is a system of business intelligence that includes modeling, methodology, architecture, and implementation best practices. The components, also known as the pillars of Data Vault 2.0, are as follows:

  •  Data Vault 2.0 modeling—focused on process and data models
  •  Data Vault 2.0 methodology—following Scrum and agile ways of working
  •  Data Vault 2.0 architecture—includes NoSQL and big data systems
  •  Data Vault 2.0 implementation—pattern-based automation and generation

The term “data vault” is merely a marketing term chosen in 2001 to represent the system to the market. The true name for the data vault system of business intelligence (BI) is common foundational warehouse modeling, methodology, architecture, and implementation.

The system includes aspects relating to the business of designing, implementing, and managing an enterprise data warehouse. With the Data Vault 2.0 (DV2) system, the organization can build incrementally, distributed or centralized, in the cloud or on-premise with disciplined agile teams.

Each of these components plays a key role in the overall success of an enterprise data warehousing program. These components are combined with industry-accepted practices rooted in Capability Maturity Model Integration (CMMI), Six Sigma, total quality management (TQM), Project Management Professional (PMP), and disciplined agile delivery.

Data Vault Origins and Background

Data vault was originally designed for use within Lockheed Martin, the US Department of Defense, the National Security Agency, and NASA. The process started in 1990 and was completed circa 2000. The entire system represents 10 years of research and development and over 30,000 test cases. The system was built to meet the following requirements:

  •  Integrate data from 250+ source systems, from ADABAS, to PeopleSoft, to Windchill, to Oracle Financials, to mainframes and midranges, to SAP
  •  Provide an auditable and accountable data store and process engine
  •  Ingest and query-parse tagged image drawings (unstructured data)
  •  Ingest rocket data fed in real time from the NASA launch pads
  •  Provide multilayered security, including classified data sets
  •  Deliver subsecond query response times over 15 terabytes of live data
  •  Provide a four-hour turnaround from requirement to “hands-on” data in development for the report writers

These requirements may not sound like much, but in 1997, we were dealing with 10BaseT networking as the “fastest” and best network, and a 15 TB disk store cost $250,000. Joining servers across the globe with subsecond query response times was imperative and challenging work. Flexibility to change and adapt was paramount.

Our team met the goals of the NSA and exceeded the expectations of all corporate management involved. Our team of five people ingested 150 source systems in under 6 months, built over 1500 reports, and delivered over 60,000 data attributes with 100% accountability and auditability. Today, with better technology, this can be accomplished far more easily, especially with the proper automation tooling. This global enterprise data warehouse is still there, still going strong, and of course much larger.

The “Old” Data Vault 1.0

Stepping back in time—in 2001—the Data Vault 1.0 standards were released. As of circa 2018, Data Vault 1.0 is now 17 years old; it is time to innovate. These standards were targeted at traditional relational database solutions on a small scale. In addition, the only standards released to the public were the Data Vault 1.0 model standards.

Data Vault 1.0 modeling utilizes sequence numbering schemes that fail to perform properly under large-volume load cycles. Furthermore, sequence numbering techniques limit the team's ability to distribute the data vault model onto hybrid platforms (on-premise/in-cloud) or onto geographically distributed platforms.

Enter: Data Vault 2.0

The New and Updated Data Vault 2.0

Since 2001, the technology, platforms, capabilities, and hardware have all changed and shifted. Today's focus is on much larger big data systems, NoSQL platforms, and better processing of unstructured/semistructured data. The methodology has been brought up to date to include disciplined agile delivery (from Mark Lines and Scott Ambler). The architecture includes landing zones, data lakes, and hybrid solution designs.

The data vault has evolved—just like the web, just like automobiles, or just like any system. Data vault is now considered to be at a stable 2.0 release and includes (as mentioned previously) model, methodology, implementation, and architecture. Data Vault 2.0 (DV2) is a foundational system that provides programs and projects with the knowledge and foresight to implement successful enterprise data warehouses.

The issues Data Vault 2.0 is built to solve include the following:

  •  Global distributed teams
  •  Global distributed physical data warehouse components
  •  “Lazy” joining during query time across multicountry servers
  •  Ingestion and query parsing of images, video, audio, and documents (unstructured data)
  •  Ingestion of real-time streaming (IoT) data
  •  Seamless cloud and on-premise integration
  •  Agile team delivery
  •  Incorporation of data virtualization and NoSQL platforms
  •  Extremely large data sets (into the petabyte ranges and beyond)
  •  Automation and generation of 80% of the work products

From a business perspective, DV2 brings the entire solution to the table—not only the data model but also the workflow, processes, automation, standardization, adaptability, architectural flexibility, agility, and more. These components can no longer be ignored. Cobbling together multiple different methodologies and hoping for success rarely works.

DV2 brings tried-and-tested successes: empirical evidence that will not suffer the consequences of reengineering. DV2 also brings confidence from customers, based on solid, reliable engineering implementations. Data Vault 2.0 offers all of this, including customer references (some of the largest commercial organizations and government data stores in the world).

This might sound like overkill; however, a properly trained team can deliver sprint work products in one- to two-day cycles. The solution is foundational and offers building-block components that fit together easily in a standardized fashion. Accelerating the team's progress by leveraging automation and workflow process tooling (specifically with Data Vault 2.0 authorized tools) becomes a must.

Today, there are customers around the world who have implemented petabyte-level distributed Data Vault 2.0 solutions with some of the latest big data technology. More information, from business to technical and from tooling to data platforms, can be found in the data vault community: http://DataVaultAlliance.com (free to join).

What Is Data Vault 2.0 Modeling?

A Business View

The data vault model is based on a business concept model. Concepts or elements of the business are unified at a logical level and then mapped to the raw data level and the business process levels. The concept model starts with an individual business concept such as a customer, product, or service. These concepts are then uniquely identified by business keys that travel across the lines of business (from data inception to data “death”).

The model separates relationships or associations (links) from identifiers (hubs) from the descriptive data that change over time (satellites). This allows the model to store commonly defined data sets mapped to a concept level and ties that data to multiple business process levels. These business processes are the ones that execute within the source systems.
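To make this separation concrete, the sketch below models the three entity types as plain Python data classes. The entity names (customer, order) and attributes are illustrative assumptions, not prescribed structures:

    from dataclasses import dataclass
    from datetime import datetime

    # Hub: holds only the unique business key plus load metadata.
    @dataclass
    class HubCustomer:
        hub_customer_hash_key: str   # surrogate derived from the business key
        customer_number: str         # the business key itself
        load_date: datetime
        record_source: str

    # Link: holds only the association between two (or more) hubs.
    @dataclass
    class LinkCustomerOrder:
        link_hash_key: str           # derived from both parent business keys
        hub_customer_hash_key: str
        hub_order_hash_key: str
        load_date: datetime
        record_source: str

    # Satellite: holds descriptive attributes that change over time;
    # one row is stored per change, keyed by parent key and load date.
    @dataclass
    class SatCustomerDetails:
        hub_customer_hash_key: str
        load_date: datetime
        customer_name: str
        customer_address: str
        record_source: str

Because descriptive attributes live only in satellites, a change in the source system adds a satellite row without touching the hub or link structures.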

By capturing the data set in this manner, the model can easily represent multiple logical units of work, along with shifting business hierarchies and shifting processes. Furthermore, the conceptual model can be applied in automation and generation tools, data virtualization tools, and query tools to better meet the needs of the enterprise.

Because this model (at a build process level) is focused on concepts, it can be split into parallel work streams. The model can be built incrementally over time with little to no reengineering effort when change arrives. The model can also be automatically generated (with human input around the concepts and business keys) to accelerate the process.

A Technical View

Data vault modeling is a hybrid approach, based on third normal form and dimensional modeling, aimed at the logical enterprise data warehouse. The data vault model is built as a ground-up, incremental, and modular model that can be applied to big data, structured, and unstructured data sets.

DV2 modeling is focused on providing flexible, scalable patterns that work together to integrate raw data by business key for the enterprise data warehouse. DV2 modeling includes minor changes to ensure the modeling paradigms can work within the constructs of big data, unstructured data, multistructured data, and NoSQL.

Data Vault 2.0 modeling replaces sequence numbers with hash keys. Hash keys provide stability, parallel loading methods, and decoupled computation of parent key values for records. For engines that hash business key values internally, there is an alternative: utilizing the true business keys as they are, without sequences or hash surrogates. The pros and cons of each technique are detailed in the data vault modeling section of this chapter.
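As a brief illustration, a hash key can be computed deterministically from the business key alone, so every load process on every platform derives the same parent key without consulting a central sequence generator. MD5 is a common choice in data vault practice, but the normalization rules below (trim, upper-case, semicolon delimiter) are illustrative assumptions:

    import hashlib

    def hash_key(*business_keys: str, delimiter: str = ";") -> str:
        # Normalize so the same logical key always hashes identically,
        # regardless of source-system formatting.
        normalized = delimiter.join(k.strip().upper() for k in business_keys)
        return hashlib.md5(normalized.encode("utf-8")).hexdigest()

    # The parent key is computable from the staged record itself; no
    # lookup against an already-loaded table is required.
    customer_hk = hash_key("cust-1001")
    order_link_hk = hash_key("cust-1001", "ord-778")  # link key from both parents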

How Is Data Vault 2.0 Methodology Defined?

A Business View

The methodology utilizes best practices from software development, such as CMMI, Six Sigma, TQM, Lean initiatives, and cycle-time reduction, and applies these notions to achieve repeatability, consistency, automation, and error reduction.

DV2 methodology focuses on rapid sprint cycles (iterations) with adaptations and optimizations for repeatable data warehousing tasks. The idea of DV2 methodology is to enable the team with agile data warehousing and business intelligence best practices. DV2 encompasses methodology as a pillar or key component to achieve the next level of maturity in the data warehousing platform.

Other methodologies are available for use; however, the DV2 methodology is uniquely geared to leverage the benefits of the DV2 model, process designs, and much more.

A Technical View

The methodology (like the modeling components) is based on solid repeatable process designs. These designs require little to no reengineering and can handle scale-out, scale-up, parallelism, and real time with ease. The methodology is also geared around the people. From a technical standpoint, there is nothing better than having an agile team, capable of implementing and rapidly scaling a solution.

Tooling offered by AnalytiX DS and WhereScape assists the team from the process perspective. Automation and generation tooling is beneficial, increasing delivery speed by a factor of four (minimum).

Why Do We Need a Data Vault 2.0 Architecture?

Data Vault 2.0 architecture is designed to include NoSQL (think: big data, unstructured, multistructured, and structured data sets). Seamless integration points in the model and well-defined standards for implementation offer guidance to project teams.

DV2 architecture includes NoSQL, real-time feeds, and big data systems for unstructured data handling and big data integration. The DV2 architecture also provides a basis for defining what components fit where and how they should integrate. In addition, the architecture provides a guideline for incorporating aspects such as managed self-service BI, business write back, natural language processing (NLP) result set integration, and direction for where to handle unstructured and multistructured data sets.

Where Does Data Vault 2.0 Implementation Fit?

DV2 implementation focuses on automation and generation patterns for time savings, error reduction, and rapid productivity of the data warehousing team. The DV2 implementation standards provide rules and working guidelines for high-speed, reliable build-out with little to no error in the process. The standards also dictate where and how specific business rules are to execute in the process chain, indicating how to decouple business changes and data provisioning from data acquisition.
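As a sketch of what pattern-based generation means in practice, the snippet below generates an insert-only hub load statement from a small metadata description. All table and column names are hypothetical, and a real tool would read this metadata from a model repository rather than a literal:

    # Hypothetical metadata describing one hub; a real tool would read
    # this from a modeling repository rather than a literal.
    hub_meta = {
        "hub_table": "hub_customer",
        "hash_key_column": "hub_customer_hash_key",
        "business_key_columns": ["customer_number"],
        "staging_table": "stg_crm_customer",
        "record_source": "CRM",
    }

    def generate_hub_load(meta: dict) -> str:
        # Every hub load follows the same pattern: insert business keys
        # not yet present. Identical patterns are what make generation
        # (rather than hand-coding) possible.
        bk_cols = ", ".join(meta["business_key_columns"])
        staged_bks = ", ".join("s." + c for c in meta["business_key_columns"])
        hk = meta["hash_key_column"]
        return (
            f"INSERT INTO {meta['hub_table']} ({hk}, {bk_cols}, load_date, record_source)\n"
            f"SELECT DISTINCT s.{hk}, {staged_bks}, CURRENT_TIMESTAMP, '{meta['record_source']}'\n"
            f"FROM {meta['staging_table']} s\n"
            f"LEFT JOIN {meta['hub_table']} h ON h.{hk} = s.{hk}\n"
            f"WHERE h.{hk} IS NULL;"
        )

    print(generate_hub_load(hub_meta))

The same template serves every hub in the model; links and satellites get their own templates, which is how the bulk of the work products can be generated rather than hand-written.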

What Are the Business Benefits of Data Vault 2.0?

There are hundreds of benefits, far too many to list—all of which are drawn from the existing best practices of CMMI, Six Sigma, TQM, PMP, Agile/Scrum, automation, and so on. However, the reason for the Data Vault 2.0 system of business intelligence can be nicely summed up in one word: maturity.

Maturity of business intelligence and data warehousing systems requires the following key elements:

  •  Repeatable patterns
  •  Redundant architecture/fault-tolerant systems
  •  High scalability
  •  Extreme flexibility
  •  Managed consistent costs for absorbing changes
  •  Measurable key process areas (KPAs)
  •  Gap analysis (for the business of building data warehouses)
  •  Incorporation of big data and unstructured data

From a business perspective, Data Vault 2.0 addresses the needs of big data, unstructured data, multistructured data, NoSQL, and managed self-service BI. Data Vault 2.0 is targeted at the evolution of enterprise data warehousing (EDW) and business intelligence (BI). Data Vault 2.0’s goal is to mature the processes of building BI systems for the enterprise in a repeatable, consistent, and scalable fashion while providing seamless integration with new technologies (e.g., NoSQL environments).

The resulting business benefits include (but are not limited to) the following:

  •  Lowering total cost of ownership (TCO) for EDW/BI programs
  •  Increasing agility of the entire team (including delivery)
  •  Increasing transparency across the program

The resulting business benefits can be found in the following categories:

Data Vault 2.0 agile methodology benefits:

  •  Drives agile deliveries (2-3 weeks)
  •  Includes CMMI, Six Sigma, and TQM
  •  Manages risk, governance, and versioning
  •  Defines automation and generation
  •  Designs repeatable optimized processes
  •  Combines best practices for BI

Data Vault 2.0 model benefits:

  •  Follows scale-free architecture
  •  Based on hub-and-spoke design
  •  Backed by set logic and massively parallel processing (MPP) math
  •  Includes seamless integration of NoSQL data sets
  •  Enables 100% parallel heterogeneous loading environments
  •  Limits impacts of changes to localized areas

Data Vault 2.0 architecture benefits:

  •  Enhances decoupling
  •  Ensures low-impact changes
  •  Provides managed self-service BI
  •  Includes seamless NoSQL platforms
  •  Enables team agility

Data Vault 2.0 implementation benefits:

  •  Enhances automation
  •  Ensures scalability
  •  Provides consistency
  •  Includes fault tolerance
  •  Provides proven standards

What Is Data Vault 1.0?

Data Vault 1.0 (DV1) is highly focused on the data vault modeling components and relational database technology. A DV1 data model attaches surrogate sequence keys as its primary key selection for each of the entity types. Unfortunately, surrogate sequences exhibit the following problems:

  •  Introduce dependencies on the ETL/ELT loading paradigm
  •  Contain an upper bound/upper limit that, when reached, can cause issues
  •  Are meaningless numbers (they mean absolutely nothing to the business)
  •  Cause performance problems (due to dependencies) when loading big data sets
  •  Reduce parallelism (again, due to dependencies) of loading processes
  •  Cannot be utilized as MPP partition keys for data placement; doing so would potentially cause hot spots in the MPP platform
  •  Cannot be reliably rebuilt or reassigned (reattached to old values) during recovery loads
  •  Are disparate across multiple source applications that house the same data sets

DV1 does not meet the needs of big data, unstructured data, semistructured data, or very large relational data sets. DV1 is highly focused on just the data modeling section and relational databases.

Are surrogate sequences a bad thing to utilize? No, not if the data set is small (less than 100M records per table) or if the platform is capable of scaling compute power beyond traditional methods (reducing the cost of a lookup on load). Sequences work very well for high-performance queries, and most traditional relational engines use them to their advantage when data are partitioned by range.

There are platforms where sequences are discouraged and, in fact, not even available. On those platforms, alternative key structures are needed. The alternative key structure proposed is a hash key, which is discussed in detail later in this chapter. A third alternative is to utilize the natural business key directly from the source system. This too has its pros and cons and is also addressed later in this chapter.
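To make the load-time difference concrete, the sketch below contrasts the two surrogate approaches. With sequences, a child load must look up the parent's generated number, so the hub must finish loading first; with a hash (the hash_key() function from the earlier sketch) or a natural business key, the parent key is derived from the staged record itself, so all entity types can load in parallel. Both function names here are hypothetical:

    # Sequence approach: resolving the parent key requires a lookup into
    # the already-loaded hub, serializing the load order.
    def parent_key_by_sequence(hub_sequence_map: dict, business_key: str) -> int:
        return hub_sequence_map[business_key]  # hub load must complete first

    # Hash approach: the parent key is computed from the record alone,
    # so hub, link, and satellite loads can run fully in parallel.
    def parent_key_by_hash(business_key: str) -> str:
        return hash_key(business_key)  # hash_key() as sketched earlier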
