Chapter 6.2

Introduction to Data Vault Modeling

Abstract

One of the most important components of the end-state architecture is the data vault. The data vault exists to satisfy the need for rock-solid data integrity. Like all other components of the end-state architecture, the data vault has gone through its own evolution, and like all components of the end-state architecture, it will continue to evolve.

Keywords

Business context; Source system restructuring; Data vault model; Dimensional modeling; Data warehouse; Lean initiative; Satellite; Relationship table; Precaching; Hash keys; Business keys

What Is a Data Vault Model Concept?

From a conceptual level, the data vault model (DVM) is a hub-and-spoke-based model, designed to focus its integration patterns around business keys. These business keys are the keys to the information stored across multiple systems (hopefully the master keys), utilized to locate and uniquely identify records or data. At a conceptual level, these business keys are stand-alone, meaning they don’t rely on other information to exist.

The concepts are derived from business context (or business ontologies), elements that make sense to the business from a master data perspective like customer, product, and service. These concepts are business drivers at the lowest level of grain. The DVM is built to house data at the level of granularity of the source systems.

The DVM should never be simply designed as a “source system restructuring.” If there is no integration by business keys, then there is no point in building a DVM. Business keys need to reflect the concepts as they are defined within the business taxonomy. These taxonomy hierarchies define the context where the business keys live, along with their granularity.

An example of this might be “customer account number” (CAN). In a perfect world, the CAN would be assigned once and never change—it would also be assigned to the same customer, forever. This would be the ultimate in master data management. Regardless of the business process or source application that is processing the CAN, the value would remain the same. Once this occurs, tracing the CAN through the life cycle of the business is an easy undertaking.

Data Vault Model Defined

A DVM is a detail-oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business. In DV2, the model entities are keyed by hashes, while in DV1, the model entities are keyed by sequences.

The modeling style is a hybrid of third normal form and dimensional modeling techniques—uniquely combined to meet the needs of the enterprise. The DVM is also based on patterns found in hub-and-spoke-type diagramming, otherwise known as scale-free network design (Fig. 6.2.1).

Fig. 6.2.1 Data vault conceptual model.

These design patterns enable the data vault model to inherit scale-free attributes that impose no known inherent limitation on the size of the model or the size of data that the model can represent—other than those limitations introduced by the infrastructure.

Components of a Data Vault Model

There are three basic entities or structures in a data vault model: hub, link, and satellite. In business terms, the hubs represent the actual business keys or master key sets that exist across the enterprise in a horizontal fashion. The links represent the relationships and associations that exist across the business keys in the enterprise. The real data warehousing components are the satellites that store nonvolatile data over time.

The data vault model is based on normalization and the separation of classes of data. In this particular case, the business keys (hubs) are considered a different class than the relationships (links). Both of these classes are separated from the context or descriptive information (satellites), which has a tendency to change over time.
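
To make these three classes concrete, below is a minimal sketch of a hub, a link, and a satellite expressed as Python dataclasses. The table and column names (hash key, business key, load date, record source) follow common Data Vault 2.0 conventions, but the specific names and types here are illustrative assumptions rather than a prescribed physical design.

    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class HubCustomer:
        """Hub: one row per unique business key."""
        hub_customer_hk: str    # hash of the business key (the DV2 primary key)
        customer_bk: str        # the business key itself, e.g., a customer account number
        load_dts: datetime      # when the key first arrived in the warehouse
        rec_src: str            # source application that supplied the key

    @dataclass
    class LinkCustomerOrder:
        """Link: one row per unique relationship between business keys."""
        link_customer_order_hk: str   # hash of the combined parent business keys
        hub_customer_hk: str          # reference to the customer hub
        hub_order_hk: str             # reference to the order hub
        load_dts: datetime
        rec_src: str

    @dataclass
    class SatCustomerDetails:
        """Satellite: descriptive context that changes over time."""
        hub_customer_hk: str    # parent hub (or link) hash key
        load_dts: datetime      # together with the parent key, tracks history over time
        customer_name: str
        customer_phone: str
        rec_src: str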

What Makes Business Keys So Interesting?

Business keys are the drivers in business. They tie the data set to the business processes and the business processes to the business requirements. Without business keys, the data set has no value. Business keys are the sole source of tracking the data through the business processes and across lines of business.

Fig. 6.2.2 represents several concepts that exist within a full-scale enterprise. The blue boxes represent hundreds of individual business processes that occur within the business; they are tied together through a business process life cycle. The objective of organizing the business processes in this manner is to identify the critical path, so that these processes can be engaged in cycle time reduction and lean initiatives. The critical path of the business processes is depicted by the red dotted line.

Fig. 6.2.2 Business keys across lines of business.

Note that the critical path is an important part of overhead cost reduction, time-to-market delivery, and quality improvements. By identifying the critical path in the organization, the company can achieve "better, faster, and cheaper" throughout the enterprise. Identifying and tracking the data through the business processes is necessary to trace the critical path through the company.

Within each of the business processes, the business keys are identified. Business keys are how the source systems and the individuals in the organization track and manage the data/contracts underneath. The business keys (in this example, SLS123) originate in the sales systems. As illustrated in this example, when these keys cross the process boundaries from sales to procurement, a manual process takes place. The result of the manual process is a change to the example business key from SLS123 to P123MFG. Unfortunately, the manual change to the business keys in this example is not recorded anywhere except in an external Excel spreadsheet.

What Does This Have to Do With Data Vault and Data Warehousing?

As stated earlier, the objective is to lower total cost of ownership (TCO) in the organization. TCO translates to reducing overhead costs, increasing quality of the deliverable, and decreasing the amount of time to deliver the product or service. A properly designed and implemented data vault data warehouse can help with these tasks, including the discovery and tracing activities needed to identify the critical path.

The ability to track and trace the data set across multiple lines of business is part of creating value or establishing data as an asset on the books. Without traceability back to the business processes, data become nearly valueless.

Introducing critical path analysis in the business and establishing traceability across multiple lines of business mean the organization can engage in cycle time reduction (or lean initiatives); these initiatives aid the organization in identifying its critical path and eliminating business processes that add no value and are apt to slow down the production and delivery of the product or service. Understanding the path of the data (identified by business keys) across the multiple lines of business can show the critical path and the long-standing business processes that need to be addressed in cycle time reduction efforts.

By linking the business processes to the data through business keys, it is not only easier to assign value but also easier to understand the gaps in the business perception (i.e., the requirements the business provides to the EDW team). The gaps expose the reality of what the multiple source systems are actually capturing and executing on.

One of the end results of this process is to help (hopefully) understand where the business may be hemorrhaging money. When businesses close the gap through TQM (total quality management) best practices, they stop the money loss and potentially increase revenue and the quality of the product or service at the same time.

How Does This Translate to Data Vault Modeling?

The data vault model, more specifically the hub table, shows how many different keys there are across the entire business. The hub table tracks when each key was inserted into the warehouse and from which source application it arrived. The hub table does not track anything else. To understand the "changing of the business keys" from one line of business to another, the data warehouse needs another table structure and another source feed.

The next table structure that the data vault uses is called a link. The link structure houses the feed from the manual process, mapping FROM SLS123 TO P123MFG; this is otherwise known as a same-as link structure.

An example Data Vault 2.0 model for this case is shown in Fig. 6.2.3.

Fig. 6.2.3 Data Vault 2.0 data model.

The hub customer represents the two keys found in the business process. The link same-as shows their connection. The record source (Rec Src) shows “Joes Excel,” meaning the data vault load processed an Excel spreadsheet based on the manual process. The link satellite effectivity provides the time line for when the relationship between the two keys starts and stops. Satellites are where the descriptive data live and breathe.

Satellites carry more than just effectivity. In the case of customer (not shown here), the satellite may carry additional descriptive details such as the customer name, address, and phone number. Other examples of satellites and satellite data are provided further in this section.
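
A minimal sketch of the rows implied by Fig. 6.2.3, using plain Python dictionaries, may help. The hash values are truncated placeholders and the column names mirror the structures sketched earlier; both are assumptions made for illustration.

    from datetime import datetime

    # Hub customer: both keys found across the business, loaded independently.
    hub_customer = [
        {"hub_customer_hk": "a1f...", "customer_bk": "SLS123",
         "load_dts": datetime(2024, 1, 5), "rec_src": "Sales App"},
        {"hub_customer_hk": "9c2...", "customer_bk": "P123MFG",
         "load_dts": datetime(2024, 1, 9), "rec_src": "Procurement App"},
    ]

    # Same-as link: records the manual mapping FROM SLS123 TO P123MFG.
    link_same_as_customer = [
        {"link_sa_customer_hk": "47d...",
         "from_hub_customer_hk": "a1f...",   # SLS123
         "to_hub_customer_hk": "9c2...",     # P123MFG
         "load_dts": datetime(2024, 1, 9), "rec_src": "Joes Excel"},
    ]

    # Effectivity satellite on the link: when the mapping starts (and later stops).
    sat_same_as_effectivity = [
        {"link_sa_customer_hk": "47d...",
         "load_dts": datetime(2024, 1, 9),
         "effective_from": datetime(2024, 1, 9),
         "effective_to": None,               # still active
         "rec_src": "Joes Excel"},
    ]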

The link in this case carries the key matches (from and to). This type of link structure can be utilized to connect master key selections or to explain the key mapping/changing from one source system to another. It can also be utilized to represent multilevel hierarchies (not shown here).

Note that importing the Excel spreadsheet is the first step toward managed self-service BI (managed SSBI). Managed SSBI is the next step in the evolution of data warehousing: it allows the business users to interact with the raw data sets in the warehouse and affect their own information marts by changing the data.

The data vault model not only provides immediate business value but also is capable of tracking all relationships over time. It demonstrates the different hierarchies of data (even though this is highly focused on two particular business keys at the moment) that are possible when loading into the warehouse.

By tracking the changes to business keys and exposing the relationships across and between them, the business can then begin to ask and answer the following questions:

  •  How long does my customer account stay in sales before it is passed to procurement?
  •  Can I compare an AS-SOLD image with an AS-CONTRACTED image and an AS-MANUFACTURED image with an AS-FINANCED image?
  •  How many customers do I actually have?
  •  How long does it take for a customer/product/service to make it from initial sale to final delivery in my business?

Many of these questions cannot be answered without a consistent business key that spans the different lines of business.

Why Restructure the Data From the Staging Area?

Restructuring allows integration across multiple systems into a single place in the target data warehouse without changing the data set itself (i.e., no conformity). This is called passive integration. Data are considered passively integrated by business key because there is no change to the raw data. It is integrated according to the location (i.e., all individual customer account numbers will exist in the same hub, while all corporate customer account numbers exist in a different hub).

In the age of big data, staging areas are also known as landing zones, data dumps, or data junkyards. Staging areas are a logical concept that can manifest themselves physically in multiple environments. A “staging area” may be a file store on Amazon S3 or an Azure Cloud, or it may be a Hadoop distributed file system (HDFS). It may also be a relational database table structure. Staging areas focus the data in a single concept in preparation for moving the data downstream.

What Are the Basic Rules of the Data Vault Model?

There are some fundamental rules in data vault modeling that must be followed, or the model itself no longer qualifies to be a data vault model. These rules are documented in a classroom environment in full. However, some of the rules are listed below:

  1. Business keys are separated by GRAIN and semantic meaning. That means customer corporation and customer individual must exist or be recorded in two separate hub structures.
  2. Relationships, events, and intersections across two or more business keys are placed into link structures.
  3. Link structures have no begin or end dates; they are merely an expression of the relationship at the time the data arrived in the warehouse.
  4. Satellites are separated by type of data/classification and rate of change. The type of data typically corresponds to a single source system.

Raw data vault modeling does not allow or provide for concepts such as conformity, nor does it deal with supertypes. Those concepts live within the business vault models (another form of data vault modeling that is used as an information delivery layer).

Why Do We Need Many-to-Many Link Structures?

Many-to-many link structures allow the data vault model to be future-proof and extendable. The relationships expressed in source systems are often a reflection of the business rules or business execution of today. The relationship definitions have changed over time and will continue to change. To represent both historical and future data (without reengineering the model and the load routines), many-to-many relationship tables are necessary.

This is how the Data Vault 2.0 data warehouse can expose the patterns of relationship changes over time, answering questions such as: where is the gap between the "current requirements" and the relationships recorded in history? The many-to-many table (link) in the raw data vault provides metrics around what percentage of the data are "broken" and when those data break the relationship requirement.

For example, let's say that in the past, for a sample customer, it was common to have one portfolio manager. Today, however, the company has changed the business rule so that there might be three or more portfolio managers assigned to a customer. If the data warehouse model enforces the “past” relationship (many customers to one portfolio manager), then to support today's relationship the data model and the ELT/ETL loading routines would have to be reengineered.
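
The sketch below, under the same assumed column naming as before, shows how a single many-to-many link table absorbs both the historical one-manager relationship and today's several-managers relationship without any change to the structure or the load routine.

    from datetime import datetime

    # Link customer <-> portfolio manager: a many-to-many structure.
    # Historical rows (one manager per customer) and current rows (several
    # managers per customer) coexist in one table with no remodeling.
    link_customer_pm = [
        # past: customer C100 had a single portfolio manager
        {"hub_customer_hk": "c100...", "hub_pm_hk": "pm01...",
         "load_dts": datetime(2015, 3, 1), "rec_src": "CRM"},
        # today: the same customer relates to three portfolio managers
        {"hub_customer_hk": "c100...", "hub_pm_hk": "pm01...",
         "load_dts": datetime(2024, 6, 1), "rec_src": "CRM"},
        {"hub_customer_hk": "c100...", "hub_pm_hk": "pm07...",
         "load_dts": datetime(2024, 6, 1), "rec_src": "CRM"},
        {"hub_customer_hk": "c100...", "hub_pm_hk": "pm12...",
         "load_dts": datetime(2024, 6, 1), "rec_src": "CRM"},
    ]

    # A downstream query can then measure what percentage of customers
    # "break" the old one-manager rule, and when they started to break it.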

There is a reason for having a many-to-many relationship implemented in a link table without the descriptive attributes attached: to catch discrepancies across multiple source systems. A link table (for purposes of understanding) may be thought of as a relationship table. There are several forms of link structures, including nonhistorized links, hierarchical links, and same-as links. These forms are functionally defined; their names indicate the type of function or role the data in these structures play.

Reengineering costs ever-increasing amounts of money because, as the data set grows and the model grows, the time, complexity, and cost of modifications also grow. Eventually, this increase in cost and time to maintain the data outgrows the business' ability to pay.

The only way to represent both relationships (historical and future) over time is to place the data in a many-to-many link table; then, based on query requirements provided by data marts downstream, the warehouse can tell the business users exactly what they have and when it “breaks” the current rule.

Primary Key Options for Data Vault 2.0

There are three main alternatives for selecting primary key values in a Data Vault 2.0 model:

  •  Sequence numbers
  •  Hash keys
  •  Business keys

Sequence Numbers

Sequence numbers have been around since the beginning of machines. They are system-generated, unique numeric values that are incremental (sequential) in nature. Sequence numbers have the following issues:

  •  Upper limit (the size of the numeric field for nondecimal values).
  •  Introduce a process issue when utilizing sequences during the load because they require any child entity to look up its corresponding parent record to inherit the parent value.
  •  Hold no business meaning.

The most critical of the issues above is the negative performance impact associated with lookup or join processes, particularly in heterogeneous environments or in environments where data are legally not allowed to "live" on or be replicated to other environments (geographically split or an on-premise and in-cloud mix). This process issue is exacerbated during high-speed IoT or real-time feeds. Consider what happens in an IoT or real-time feed when data flow quickly to billions of child records, and each record must then wait on a sequence "lookup" (one record at a time); the real-time stream may back up.

Lookups also cause “precaching” problems under volume loads. For example, suppose the parent table is invoice and the child table is order. If the invoice table has 500 million records and the order table has 5 billion records and each order has at least one matching parent row (most likely more), then each record that flows into order must “look up” at least one invoice. This lookup process will happen 5 billion times, once for each child record.

It doesn't matter if the technology is an ETL engine, a real-time process engine, or an SQL data management-enabled engine. This process must happen to avoid any potential orphan records. If referential integrity is shut off, the load processes can run in parallel for both tables. However, to populate the "parent sequence," it must still be searched/looked up on a row-by-row basis. Adding parallelism and partitioning will help with performance, but eventually, it will hit an upper-limit bottleneck.
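
To illustrate the row-by-row dependency in the simplest possible terms, the toy Python functions below mimic what a load process has to do when sequences are the keys. The table and function names are hypothetical; the point is the mandatory parent lookup per child row, not the specific tooling.

    # Toy illustration: loading child (order) rows keyed by parent sequences.
    invoice_seq_by_bk: dict[str, int] = {}   # business key -> assigned sequence
    next_invoice_seq = 1

    def load_invoice(invoice_bk: str) -> int:
        """Parent load: assign the next sequence to a new invoice business key."""
        global next_invoice_seq
        if invoice_bk not in invoice_seq_by_bk:
            invoice_seq_by_bk[invoice_bk] = next_invoice_seq
            next_invoice_seq += 1
        return invoice_seq_by_bk[invoice_bk]

    def load_order(order_bk: str, invoice_bk: str) -> dict:
        """Child load: every order must look up its parent's sequence first.
        With billions of orders, this lookup happens once per row, and the
        parent must always land before the child, serializing the load."""
        parent_seq = invoice_seq_by_bk[invoice_bk]   # the row-by-row dependency
        return {"order_bk": order_bk, "invoice_seq": parent_seq}

    load_invoice("INV-001")
    row = load_order("ORD-999", "INV-001")   # fails if the parent has not loaded yet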

In an MPP environment (MPP storage), the data will be redistributed to allow the join to occur, and it is not just the sequence that has to be shipped; it is the sequence PLUS the entire business key it is tied to. In an MPP engine with non-MPP storage (like Snowflake), the data don't have to be shipped, but the lookup process still must happen.

This act of a single-strung, one record at a time lookup can tremendously (and negatively) impact load performance. In large-scale solutions (think of 1000 “tables” or data sets each with 1 billion records or more), this performance problem is dramatically increased (load times are dramatically increased).

What if there is more than one child table? What if the data model design has parent->child->child->child tables, or relationships that are multiple levels deep? Then the problem escalates, and the length of the load cycles grows exponentially.

To be fair, let's now address some of the positive notions of utilizing sequence numbers. Sequence numbers have the following positive impacts once established:

  •  Small byte size (generally less than NUMBER(38), i.e., 38 "9"s, or 10^125).
  •  Process benefit: joins across tables can leverage small byte size comparisons.
  •  Process benefit: joins can leverage numeric comparisons (faster than character or binary comparisons).
  •  Always unique for each new record inserted.
  •  Some engines can further partition (group) in ascending order the numerical sequences and leverage subpartition (micropartition) pruning by leveraging range selection during the join process (in parallel).

Hash Keys

What is a hash key? A hash key is a business key (possibly composed of multiple fields) run through a computational function called a hash and then assigned as the primary key of the table. Hash functions are deterministic: every single time the hash function is given the same input X, it will produce the same output Y. Definitions of hash functions, what they are and how they work, can be found on Wikipedia.
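
As a minimal sketch, the function below computes a hash key from one or more business key fields using SHA-256 from Python's standard library (the text later notes that MD5 and SHA-1 are deprecated). Trimming and upper-casing the input before hashing is a common standardization step, but the exact rules, including the delimiter, are design-team decisions assumed here for illustration.

    import hashlib

    def hash_key(*business_key_parts: str, delimiter: str = "||") -> str:
        """Deterministic hash key: the same input always yields the same output,
        so it can be computed independently on any platform, even at the edge."""
        standardized = delimiter.join(p.strip().upper() for p in business_key_parts)
        return hashlib.sha256(standardized.encode("utf-8")).hexdigest()

    # The same business key always produces the same hash key, with no lookup:
    assert hash_key("SLS123") == hash_key("sls123 ")

    # A composite business key (e.g., geography code plus account ID) hashes just as easily:
    parent_hk = hash_key("CAN", "CUST-000001")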

Hash key benefits to any data model:

  •  100% parallel independent load processes (if referential integrity is shut off) even if these load processes are split on multiple platforms or multiple locations.
  •  Lazy joins, that is, the ability to join across multiple platforms utilizing technology like Apache Drill (or something similar), even without referential integrity. Note that, with sequence keys, lazy joins can't be accomplished across heterogeneous platform environments, and they aren't even supported in some NoSQL engines.
  •  Single field primary key attribute (same benefit here as the sequence numbering solution).
  •  Deterministic—it can even be precomputed on the source systems or at the edge for IOT devices/edge computing.
  •  Can represent unstructured and multistructured data sets—based on specific input hash keys can be calculated again and again (in parallel). In other words, a hash key can be constructed as a business key for audio, images, video, and documents. This is something sequences cannot do in a deterministic fashion.
  •  If there is a desire to build a smart hash function, then meaning can be assigned to bits of the hash (similar to what Teradata computes for its underlying storage and data access).

Hash keys are important to Data Vault 2.0 because of the efforts to connect heterogeneous data environments such as Hadoop and Oracle. Hash keys are also important because they remove dependencies when loading the Data Vault 2.0 structures. A hash key can be computed value by value. The "parent" key can also be computed, and this can be repeated for as many parent keys as there are values. There is no lookup dependency and no need to precache, use a temp area, or do anything else to calculate each parent value during load processing.

Big data system loads are nearly impossible to scale properly with sequence numbering dependencies in place. Sequences (whether they are in DV1 or dimensional models or any other data model) force the parent to be loaded and then the child structures. These dependencies on “parent first—then lookup parent value” cause a sequential row-by-row operation during the load cycles, thereby inhibiting the scale-out possibilities that parallelism offers.

This type of dependency not only slows the loading process down but also kills any potential for parallelism—even with referential integrity shutoff. Furthermore, it places a dependency into the loading stream in heterogeneous environments. For instance, when loading satellite data into Hadoop (perhaps a JavaScript Object Notation (JSON) document), the loading stream requires a lookup for the sequence number from the hub that may exist in a relational database. This dependency alone defeats the entire purpose of having a system like Hadoop in the first place.

Hash keys do have their issues:

  •  Length of the resulting computed value: the storage required for a hash is greater than that for a sequence.
  •  Possible collision (probabilities of collision are dependent on the hashing function chosen for utilization).

The first issue leads to slower SQL joins and slower queries, because it takes longer to match or compare longer fields than it does to compare numerics. Hashes (in Oracle and SQL Server) are typically stored in fixed binary form (yes, this works as a primary key). Hashes in Hive or other Hadoop-based technologies and some other relational engines must be stored as fixed-length character strings. For example, an MD5 hash result is BINARY(16), which becomes a CHAR(32) fixed-length hexadecimal-encoded string.

The flip side of using a hash is its unlimited scalability in parallel loading. All data can be loaded in complete parallel all the time across multiple platforms (even those that are geographically split or split on-premise and in-cloud). Hash keys (or business keys) are part of the success of Data Vault 2.0 in a big data and NoSQL world. Hashing is optional in DV2. There are a variety of hashing algorithms available for use that include the following:

  •  MD5 (deprecated circa 2018)
  •  SHA 0, 1, 2, and 3—SHA1 (deprecated circa 2018)
  •  Perfect hashes

The hash is based on the business keys that arrive in the staging areas. All lookup dependencies are hence removed, and the entire system can load in parallel across heterogeneous environments. The data set in the model can now be spread across MPP environments by selecting the hash value as the distribution key, which allows for a mostly random, mostly even distribution across the MPP nodes.
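
A small sketch of that distribution behavior: bucketing rows by their hash key modulo the node count. Real MPP engines use their own internal distribution logic; the node count and keys below are made up purely to show the mostly even spread.

    import hashlib

    NUM_NODES = 8   # hypothetical MPP node count

    def node_for(hash_key_hex: str) -> int:
        """Bucket a row onto a node by its hash key (simplified distribution)."""
        return int(hash_key_hex, 16) % NUM_NODES

    keys = [hashlib.sha256(f"CUST-{i}".encode()).hexdigest() for i in range(100_000)]
    counts = [0] * NUM_NODES
    for k in keys:
        counts[node_for(k)] += 1

    print(counts)   # roughly 12,500 per node: mostly random, mostly even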

“When testing a hash function, the uniformity of the distribution of hash values can be evaluated by the chi-squared test.” https://en.wikipedia.org/wiki/Hash_function

Luckily, the hash functions are already designed, and the designers have taken this bit of distribution mathematics into account. The hashing function chosen (if hashing is to be utilized) can be at the discretion of the design team. As of circa 2018, teams have chosen SHA-256.

One of the items discussed is that the longer the hashing output (in bits), the less probable a potential collision. This is something to take into consideration, especially if the data sets are large (e.g., big data, 1 billion records on input per load cycle per table).

If a hash key is chosen for implementation, then a hash collision strategy must also be designed. This is the responsibility of the team. There are several options available for addressing hash collisions. One of the recommended strategies is reverse hash.
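
The text names "reverse hash" as a recommended strategy without detailing it here. One commonly described interpretation is to store, alongside the primary hash, a second hash computed over the reversed business key string, so that a collision would have to occur on both values at once. The sketch below follows that interpretation and should be read as an assumption, not as the canonical Data Vault 2.0 definition.

    import hashlib

    def primary_hash(business_key: str) -> str:
        return hashlib.sha256(business_key.strip().upper().encode("utf-8")).hexdigest()

    def reverse_hash(business_key: str) -> str:
        """Secondary hash over the reversed key string; used only to confirm
        or rule out the (already rare) case of a primary hash collision."""
        return hashlib.sha256(business_key.strip().upper()[::-1].encode("utf-8")).hexdigest()

    def true_collision(key_a: str, key_b: str) -> bool:
        """Two different business keys are treated as colliding only if BOTH hashes match."""
        return (key_a != key_b
                and primary_hash(key_a) == primary_hash(key_b)
                and reverse_hash(key_a) == reverse_hash(key_b))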

This is just for the Data Vault 2.0 model that acts as the enterprise warehouse. It is still possible (and even advisable) to utilize or leverage sequence numbers in persisted information marts (data marts) downstream to engage fastest possible joins within a homogeneous environment.

The largest benefit isn’t from the modeling side of the house; it's from the loading and querying perspectives. For loading, it releases the dependencies and allows loads to Hadoop and other NoSQL environments in parallel with loads to RDBMS systems. For querying, it allows “late-join” or run-time binding of data across Java database connectivity (JDBC) and open database connectivity (ODBC) between Hadoop, NoSQL, and RDBMS engines on demand. It is not suggested that it will be fast, but rather that it can be easily accomplished.

Deeper analysis of this subject is covered in Data Vault 2.0 boot camp training courses and in Data Vault 2.0 published materials. It is beyond the scope of this book to dive deeper into this subject.

Business Keys

Business keys have been around for as long as there have been data in operational applications. Business keys should be smart or intelligent keys and should be mapped to business concepts. That said, most business keys today are source system surrogate IDs, and they exhibit the same problems that the sequences mentioned above exhibit.

A smart or intelligent key is generally defined as a sum of components where digits or pieces of a single field contain meaning to the business. At Lockheed Martin, for example, a part number consisted of several pieces (it was a superkey of sorts). The part key included the make, model, revision, and year of the part, like a vehicle identification number (VIN) found on automobiles today.
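
To make the smart key idea concrete, the sketch below composes and decomposes a part number from make, model, revision, and year segments. The segment layout and sample values are invented for illustration; the Lockheed Martin key and the VIN each follow their own formats.

    from typing import NamedTuple

    class PartKey(NamedTuple):
        make: str       # e.g., "LM"
        model: str      # e.g., "F16"
        revision: str   # e.g., "C"
        year: str       # e.g., "1998"

    def compose(key: PartKey) -> str:
        """Build the smart key; every segment carries meaning to the business."""
        return f"{key.make}-{key.model}-{key.revision}-{key.year}"

    def decompose(smart_key: str) -> PartKey:
        """Read the meaning back out of the key without consulting any other table."""
        make, model, revision, year = smart_key.split("-")
        return PartKey(make, model, revision, year)

    part = compose(PartKey("LM", "F16", "C", "1998"))   # "LM-F16-C-1998"
    assert decompose(part).model == "F16"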

The benefits of a smart or intelligent key stretch far beyond the simple surrogate or sequence business key. These business keys usually exhibit the following positive behavior at the business level:

  •  They hold the same value for the life of the data set.
  •  They do not change when the data are transferred between and across business OLTP applications.
  •  They are not editable by business (most of the time) in the source system application.
  •  They can be considered master data keys.
  •  They cross business processes and provide ultimate data traceability.
  •  Largest benefit: they can allow parallel loading (like hashes) and also work as keys for geographically distributed data sets, without needing recomputation or lookups.

They do have three downfalls: (a) length: generally, smart business keys can be longer than 40 characters; (b) meaning over time: the base definition can change every 5–15 years or so (just look at how the VIN has evolved over the last 100 years); (c) mutability: sometimes, source applications CAN change the business keys, which wreaks havoc on any analytics that need to be done.

If given the choice between surrogate sequences, hashes, and natural business keys, natural business keys would be the preference. The original definition (even today) states that a hub is defined as a unique list of business keys. The preference is to use natural business keys that have meaning to the business.

One of the functions of a properly built raw Data Vault 2.0 model is to provide traceability across the lines of business. To do this, the business keys must be stored in the hub structures according to a set of design standards.

Most of the business keys in the source system today are surrogate sequence numbers defined by the source application. The world is full of these “dumb” machine-generated numeric values. Examples include customer number, account number, invoice number, and order number, and the list goes on.

Source System Sequence Business Keys

Source system sequence-driven business keys make up 98% of the source data that any data warehouse or analytic system receives. Even transaction IDs, e-mail IDs, and identifiers in unstructured data sets, such as document IDs, are surrogates. The theory is that these sequences should never change and should represent the same data once established and assigned.

That said, the largest problem that exists in the operational systems is one the analytic solution is always asked to solve, that is, how to integrate (or master) the data set, to combine it across business processes and make sense of the data that have been assigned multiple sequence business keys throughout the business life cycle.

An example of this may be customer account. Customer account in SAP may mean the same thing as customer account in Oracle Financials or some other customer relationship management (CRM) or enterprise resource planning (ERP) solution. Generally, when the data are passed from SAP to Oracle Financials, the receiving OLTP application assigns a new "business key" or surrogate sequence ID. It's still the same customer account; however, the same representative data set now has a new key.

The issue becomes: how do you put the records back together again? This is a master data management (MDM) question; with an MDM solution in place (including good governance and good people), it can be solved, and it can be approximated with deep learning and neural networks. Even statistical analysis of "similar attributes" can detect, within a margin of error, the multiple records that "should" be the same but contain different keys.

This business problem perpetuates into the data warehouse and analytic solution typically because no master data management solution has been implemented upstream of the data warehouse. Therefore, to put together what appears to be “one version of the customer record” and not double or triple count, algorithms are applied to bridge the keys together.

In the data vault landscape, we call this a hierarchical or same-as link, hierarchical if it represents a multilevel hierarchy and same-as if it is a single hierarchy (parent to child remap) of terms.

Placing these sequence numbers as business keys in hubs has the following issues:

  •  They are meaningless—a human cannot determine what the key stands for (contextually) without examining the details for a moment in time.
  •  They can change—often they do, even with something as “simple” as a source system upgrade—this results in a serious loss of traceability to the historical artifacts. Without an “old-key” to “new-key” map, there is no definitive traceability.
  •  They can collide. Even though conceptually across the business there is one element called “customer account,” the same ID sequence may be assigned in different instances for different customer accounts. In this case, they should never be combined. An example of this would be two different implementations of SAP: one in Japan and one in Canada. Each assigns customer ID #1; however, in Japan's system, #1 represents “Joe Johnson,” whereas in Canada's system, #1 represents “Margarite Smith.” The last thing you want in analytics is to “combine” these two records for reporting just because they have the same surrogate ID.

An additional question arises if the choice is made to utilize data vault sequence numbers for hubs and the source system business keys are surrogates. The question is as follows: why “rekey” or “renumber” the original business key? Why not just use the original business key (which by the way is how the original hub is defined)?

To stop the collision (as put forward in the example above), whether a surrogate sequence, a hash key, or the source business key is chosen for the hub structure, another element must be added. This secondary element ensures the uniqueness of the surrogate business key. One of the best practices here is to assign geography codes, for example, JAP for any customer account IDs that originate from Japan's SAP instance and CAN for any customer account IDs that originate from Canada's SAP instance.
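
A short sketch of the collision and its fix: the same surrogate ID arrives from two SAP instances, so the hub business key is formed from the geography code plus the source ID. The JAP and CAN codes come from the example above; the helper function and tuple layout are illustrative assumptions.

    # Two different customers arrive with the same source surrogate ID (#1):
    incoming = [
        {"source_instance": "JAP", "customer_id": "1", "name": "Joe Johnson"},
        {"source_instance": "CAN", "customer_id": "1", "name": "Margarite Smith"},
    ]

    def hub_business_key(row: dict) -> tuple:
        """Multipart hub key: geography code plus source ID keeps the two apart."""
        return (row["source_instance"], row["customer_id"])

    keys = {hub_business_key(r) for r in incoming}
    assert len(keys) == 2   # ("JAP", "1") and ("CAN", "1") never collide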

Multipart Source Business Keys

Using a geographic code, as mentioned above, brings up another issue. If the hub is created based solely on source system business key (and not surrogate sequence or hash key), then with the choice above (to add a geography code split), the model must be designed and built with a multipart business key.

The issue with a multipart business key is the performance of a join. There are multiple mathematical tests and quantitative results that show, time and time again, that multifield join criteria are slower than single-field join criteria. The slowdown only becomes noticeable in large-volume or big data solutions. At that point, a hash key or surrogate sequence in the data vault may be faster than a multifield join because it reduces the join back to a single field value.

Another alternative is to concatenate the multifield values together, thus forming somewhat of an intelligent key, either with or without delimiters. This would depend on how the business wishes to define a set standard for concatenating the multifield values (i.e., the rules needed—just like the rules needed to define a smart key).
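
The sketch below shows why an agreed concatenation standard matters: without a delimiter, two different multipart keys can collapse into the same string. The "||" delimiter is an assumption; the only requirement is that the business settles on one rule and applies it everywhere.

    def concat_key(parts: list[str], delimiter: str = "||") -> str:
        """Concatenate multipart business key values under one agreed standard."""
        return delimiter.join(p.strip().upper() for p in parts)

    # Without a delimiter, two different keys become indistinguishable:
    assert "".join(["AB", "C"]) == "".join(["A", "BC"])        # both are "ABC"

    # With the agreed delimiter, they stay distinct and can be joined or hashed as one field:
    assert concat_key(["AB", "C"]) != concat_key(["A", "BC"])  # "AB||C" vs "A||BC"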

The last thing to watch when choosing a multipart business key is the length of the combined or concatenated field. If the length of the concatenated fields is longer than the length of a hash result or surrogate sequence ID, then the join will execute slower than a join on a shorter field. As a reminder, these differences in performance usually can only be seen in large data sets (500 million or 1 billion records or more). Hardware has advanced, and will continue to advance, to the point that small data sets exhibit good performance regardless. There is simply not enough of a difference in a small data set to make an informed decision about the choice of the "primary key" for the hubs.

The suggestion ultimately is to rekey the source data solutions and add a smart or intelligent key “up front” that can carry the data across instances, across business processes, across upgrades, through master data, across hybrid environments, and never change. Doing this would centralize and ease the pain and cost of “master data” and would lead to easier use of a virtualization engine. It may not require complex analytics, neural nets, or machine learning algorithms to tie the data sets back together later.

In fact, according to one estimate, it costs the business seven times as much to "fix" this rekeying problem in the warehouse as it would to address it in the source applications. Fixing the problem in the data warehouse is one form of technical debt (quote and metrics paraphrased from Nols Ebersohn).

If the source system cannot be rekeyed, or the source system cannot add an "intelligent" or "smart" contextual key, the recommendation is to implement master data management upstream. If MDM cannot be implemented, the next recommendation is to leverage the source system business keys (unless there are composite business keys), in which case a hash is the base-level default recommendation.
