Chapter 2. Moving Toward Scalable Data Unification

The early users of data management systems performed business data processing—mostly transactions (updates) and queries on the underlying datasets. These early applications enabled analytics on the current state of the enterprise. About two decades ago enterprises began keeping historical transactional data in what came to be called data warehouses. Such systems enabled the use of analytics to find trends over time; for example, pet rocks are out and Barbie dolls are in. Every large enterprise now has a data warehouse, on which business analysts run queries to find useful information.

The concept has been so successful that enterprises typically now have several-to-many analytical data stores. To perform cross-selling, obtain a single view of a customer, or find the best pricing from many supplier data stores, it is necessary to perform data unification across a collection of independently constructed data stores.

This chapter discusses the history of data unification and current issues.

A Brief History of Data Unification Systems

ETL systems were used early on to integrate data stores. Given the required amount of effort by a skilled programmer, ETL systems typically unified only a handful of data stores, fewer than two dozen in most cases. The bottleneck in these systems was the human time required to transform the data into a common format for the destination repository—it was necessary to write “merge rules” to combine the data sources, and additional rules to decide on the true value for each attribute in each entity. Although fine for small operations, like understanding sales and production data at a handful of retail stores or factories, ETL systems failed to scale to large numbers of data stores and/or large numbers of records per store.

The next generation of ETL tools offered increased functionality, such as data cleaning capabilities and adaptors for particular data sources. Like the first generation, these ETL tools were designed for use by computer programmers, who had specialized knowledge. Hence, they did not solve the fundamental scalability bottleneck: the time of a skilled software professional. These ETL tools form the bulk of the unification market today; however, most large enterprises still struggle to curate data from more than a couple dozen sources for any given data unification project. The present state of affairs is an increasing number of data sources that enterprises want to unify and a collection of traditional ETL tools that do not scale. The rest of this book discusses scalability issues in more detail.

Unifying Data

The benefits of unifying data sources are obvious. If a category manager at Airbus wants to get the best terms for a part that their line of business (LoB) is buying, that manager will typically have access only to purchasing data from their own LoB. The ability to see what other LoBs are paying for a given part can help that category manager optimize their spend. Added up across all of the parts and suppliers in all Airbus LoBs, these insights represent significant savings. However, realizing them requires integrating the supplier databases of every LoB. GE, for example, has 75 such databases, and many large enterprises have a comparable number because every acquisition comes with its own legacy purchasing system. Hence, data unification must be performed at scale, and ETL systems are not up to the challenge.

The best approach to integrating two data sources of 20 records each is probably a whiteboard or paper and pencil. The best approach for integrating 20 data sources of 20,000 records each might very well be an ETL system and rules-based integration approach. However, if GE wants to unify 75 data sources with 10 million total records, neither approach is likely to be successful. A more scalable strategy is required.

Unfortunately, enterprises are typically operating at a large scale, with orders of magnitude more data than ETL tools can manage. Everything from accounting software to factory applications is producing data that yields valuable operational insight for analysts working to improve enterprise efficiency. The easy availability and value of data sources on the web compounds the scalability challenge.

Moreover, enterprises are not static. For example, even if Airbus had unified all of its purchasing data, its acquisition of Bombardier adds the data of another enterprise to the unification problem. Scalable data unification systems must accommodate the reality of shifting data environments.

Let’s go over the core requirements for unifying data sources. There are seven required processes, illustrated with a short code sketch after the list:

  1. Extracting data from a data source into a central processing location

  2. Transforming data elements (e.g., WA to Washington)

  3. Cleaning data (e.g., –99 actually means a null value)

  4. Mapping schema to align attributes across source datasets (e.g., your “surname” is my “Last_Name”)

  5. Consolidating entities, or clustering all records thought to represent the same entity (e.g., are Ronald McDonald and R. MacDonald the same clown?)

  6. Selecting the “golden value” for each attribute for each clustered entity

  7. Exporting unified data to a destination repository
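To make the shape of such a pipeline concrete, here is a minimal sketch in Python, assuming each record is a plain dictionary with scalar values and each step is a function from records to records. All of the function names and lookup tables are illustrative stand-ins for the seven processes, not the API of any particular product.

    # Illustrative rule tables for the sketch below.
    STATE_NAMES = {"WA": "Washington"}          # step 2: transformations
    NULL_SENTINELS = {"-99", ""}                # step 3: cleaning
    ATTRIBUTE_MAP = {"surname": "Last_Name"}    # step 4: schema mapping

    def extract(sources):
        """1. Gather raw records from every source into one working list."""
        return [dict(rec) for src in sources for rec in src]

    def transform(records):
        """2. Normalize known code values, e.g. 'WA' -> 'Washington'."""
        return [{k: STATE_NAMES.get(v, v) for k, v in r.items()} for r in records]

    def clean(records):
        """3. Replace sentinel values such as -99 with real nulls."""
        return [{k: (None if str(v) in NULL_SENTINELS else v) for k, v in r.items()}
                for r in records]

    def map_schema(records):
        """4. Rename source attributes onto the unified schema."""
        return [{ATTRIBUTE_MAP.get(k, k): v for k, v in r.items()} for r in records]

    def consolidate(records):
        """5. Cluster records thought to represent the same entity; a naive
        exact-match grouping on Last_Name stands in for real entity resolution."""
        clusters = {}
        for r in records:
            clusters.setdefault(r.get("Last_Name"), []).append(r)
        return list(clusters.values())

    def select_golden(clusters):
        """6. Pick a 'golden value' per attribute: here, the first non-null seen."""
        golden = []
        for cluster in clusters:
            merged = {}
            for rec in cluster:
                for k, v in rec.items():
                    if v is not None and k not in merged:
                        merged[k] = v
            golden.append(merged)
        return golden

    def export(records):
        """7. Hand the unified records to the destination repository."""
        return records  # in practice: write to a warehouse, lake, or downstream app

    def unify(sources):
        records = extract(sources)
        for step in (transform, clean, map_schema, consolidate, select_golden, export):
            records = step(records)
        return records

    print(unify([
        [{"surname": "McDonald", "State": "WA"}],
        [{"Last_Name": "McDonald", "State": "Washington", "Tel.": "-99"}],
    ]))
    # -> [{'Last_Name': 'McDonald', 'State': 'Washington'}]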

Plainly, requirements 2 through 5 are all complicated by scale issues. As the number and variety of data sources grow, the number and variety of required transforms and cleaning routines will increase commensurately, as will the number of attributes and records that need to be processed. Consider, for example, the names and formats used for a single attribute, a phone number, as shown here:

Source     Attribute name      Record format
CRM-1      Tel.                (xxx) xxx-xxxx
CRM-2      Phone_Number        xxxxxxxxx
DataLake   Phone_Number        xxx-xxx-xxxx

To consolidate the two CRM sources into the DataLake schema, you need to write one mapping: Tel. maps to Phone_Number. To standardize the record format, you need to transform the two different source formats into the DataLake format.
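A rules-based version of those two steps might look like the following sketch, assuming plain Python regular expressions; the attribute names come from the table above, and everything else (including the sample numbers) is illustrative.

    import re

    # Schema-mapping rule: which source attribute feeds DataLake's Phone_Number.
    ATTRIBUTE_MAP = {"Tel.": "Phone_Number"}   # CRM-2 already uses Phone_Number

    # One transform rule per source format, each producing xxx-xxx-xxxx.
    TRANSFORM_RULES = [
        (re.compile(r"^\((\d{3})\) (\d{3})-(\d{4})$"), r"\1-\2-\3"),  # CRM-1: (xxx) xxx-xxxx
        (re.compile(r"^(\d{3})(\d{3})(\d{4})$"), r"\1-\2-\3"),        # CRM-2: bare digit string
    ]

    def to_datalake_format(raw: str) -> str:
        """Apply the first transform rule whose pattern matches the raw value."""
        for pattern, replacement in TRANSFORM_RULES:
            if pattern.match(raw):
                return pattern.sub(replacement, raw)
        raise ValueError(f"no transform rule for {raw!r}")

    print(to_datalake_format("(617) 555-0134"))   # CRM-1 style -> 617-555-0134
    print(to_datalake_format("6175550134"))       # CRM-2 style -> 617-555-0134

Every additional source tends to add entries to both tables: a new attribute name means another mapping, and a new record format means another transform rule, which is exactly the growth the next example shows.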

Now let’s do this for six data sources:

Source     Attribute name      Record format
CRM-1      Tel.                (xxx) xxx-xxxx
CRM-2      Phone_Number        xxxxxxxxx
Excel-1    Phone               (xxx) xxx-xxxx
Excel-2    Telephone Number    xxx.xxx.xxxx
POS-1      Cell                xxx xxx xxxx
DataLake   Phone_Number        xxx-xxx-xxxx

We now have five different names for the same attribute, and one of these names (Cell) might require some domain expertise to map correctly to Phone_Number. We also have four distinct source formats for phone numbers, requiring four different transformations into the DataLake format. In this simple example, we’ve gone from three rules for three data sources to eight rules when the number of data sources doubles. Hence, the complexity of the problem is increasing much faster than the number of data sources. Rules are problematic at scale because:

  • They are difficult to construct.

  • After a few hundred, they surpass the ability of a human to understand them.

  • At scale, they outstrip the ability of humans to verify them.

The first and second generations of ETL systems relied on rules. Creating and maintaining rules, in addition to verifying their results, constitutes the bulk of the human time required for rules-based ETL approaches. This is an example of why traditional ETL solutions do not scale. Any scalable data unification system must obey the tenets discussed in the next section.

Rules for Scalable Data Unification

A scalable approach therefore must perform the vast majority of its operations automatically (tenet 1). Suppose that it would take Airbus 10 years of labor to integrate all of its purchasing systems using a traditional, rules-based approach. If we could automate 95% of that work, we would reduce the time scale of the problem to six months. Automation, in this case, means using statistics and machine learning to make decisions wherever possible and involving a human only when an automatic decision is not possible. In effect, we must invert the traditional ETL architecture, in which a human controls the processing, into one in which a computer runs the process and asks for human help only when necessary.
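A minimal sketch of that inversion, assuming a model that scores each candidate decision with a confidence between 0 and 1: high-confidence decisions are accepted automatically, and only the uncertain remainder goes to a person. The threshold and function names are hypothetical.

    # Route each candidate decision (e.g., "are these two records the same
    # entity?"): accept it automatically when the model is confident,
    # otherwise queue it for a domain expert.
    CONFIDENCE_THRESHOLD = 0.95

    def route_decisions(candidates, model_score):
        """model_score: any callable returning a confidence in [0, 1]."""
        accepted, needs_review = [], []
        for candidate in candidates:
            if model_score(candidate) >= CONFIDENCE_THRESHOLD:
                accepted.append(candidate)
            else:
                needs_review.append(candidate)
        return accepted, needs_review

If the model clears 95 percent of decisions, human effort is spent only on the ambiguous 5 percent, which is the arithmetic behind the ten-years-to-six-months reduction above.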

For many organizations, the large number of data sources translates into a substantial number of attributes; thousands of data sources can mean tens or hundreds of thousands of attributes. We know from experience that defining a global schema upfront, although tempting, inevitably fails, because such schemas become invalid as soon as requirements change or new data sources are added. The unified schema should instead be discovered from the source attributes themselves rather than defined first. Therefore, scalable data unification must be schema-last (tenet 2).
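As an illustration of schema-last discovery, a toy matcher might score a new source attribute against the attributes already seen instead of forcing it into a predefined global schema. This sketch uses simple name similarity from Python’s standard library; production systems also weigh data values, descriptions, and expert feedback, and all names here are illustrative.

    import difflib

    # Attributes discovered so far in the unified schema (illustrative).
    known_attributes = ["Phone_Number", "Last_Name", "Supplier_ID"]

    def best_match(source_attribute: str, threshold: float = 0.6):
        """Return the best-scoring known attribute, or None if nothing is close
        enough, in which case a human expert would be asked."""
        scored = [(difflib.SequenceMatcher(None, source_attribute.lower(),
                                           known.lower()).ratio(), known)
                  for known in known_attributes]
        score, match = max(scored)
        return match if score >= threshold else None

    print(best_match("Telephone Number"))   # likely maps to Phone_Number
    print(best_match("Cell"))               # too dissimilar -> expert review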

As mentioned earlier, ETL systems require computer programmers to do the majority of the work. Business experts are sometimes involved in specifying requirements, but the people who build and maintain the data architecture are also responsible for interpreting the data they are working with. This would require, for example, a data architect to know whether “Merck KGaA” is the same customer as “Merck and Co.”, a judgment that plainly calls for a business expert. As a result, scalable data unification systems must be collaborative and use domain experts to resolve ambiguity, thereby assisting the computer professionals who run the unification pipeline (tenet 3).

Taken together, these three tenets lead us to a fourth: given the limitations outlined earlier, rules-based systems will not scale. Only machine learning can scale to the problem sizes found in large enterprises (tenet 4).

However, machine learning–based solutions have operational complexities of their own. Although a human can look at a small set of records and instantly decide whether they correspond to a single entity, a data unification system must make that decision automatically. Conventional wisdom is to place records in a multidimensional space formed by their attributes, with a heuristically specified distance function; records that are close together in this space probably represent the same entity. Clustering in this way runs into the classic N² problem: the computational resources required for operations of complexity N², where N is the number of records, are often too great. Scalable unification systems must scale out to multiple cores and processors (tenet 5) and must use a parallel algorithm with complexity lower than N² (tenet 6).
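One common way to get below all-pairs cost is blocking: group records by a cheap key and run the expensive pairwise comparison only within each block. The sketch below is only illustrative; real systems use far more robust keys and learned similarity functions.

    from collections import defaultdict
    from itertools import combinations

    def blocking_key(record: dict) -> str:
        """Cheap grouping key: first letter of the last name token, lowercased."""
        tokens = record.get("name", "").split()
        return tokens[-1][:1].lower() if tokens else ""

    def candidate_pairs(records):
        """Yield pairs to compare, restricted to records sharing a block,
        instead of all N*(N-1)/2 combinations."""
        blocks = defaultdict(list)
        for rec in records:
            blocks[blocking_key(rec)].append(rec)
        for block in blocks.values():
            yield from combinations(block, 2)

    records = [{"name": "Ronald McDonald"},
               {"name": "R. MacDonald"},
               {"name": "Merck KGaA"}]
    for a, b in candidate_pairs(records):
        print(a["name"], "<->", b["name"])   # only the two clowns are compared

Blocks are independent of one another, so they also parallelize naturally across cores and machines, which is how tenets 5 and 6 are typically addressed together.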

Given the realities of the enterprise data ecosystem, scalable unification systems need to accommodate data sources that change regularly. Rerunning the entire workflow on all of the data to pick up changes to a source can satisfy some business use cases, but applications with tighter latency requirements demand that a scalable unification system examine only the changed records and perform incremental unification (tenet 7).
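An illustrative shape for that incremental path: index the already-unified records by the same kind of blocking key, then compare only the changed records, and only against the blocks they land in. The function and parameter names below are hypothetical.

    def incremental_update(existing_blocks, changed_records, blocking_key, same_entity):
        """Unify only the changed records against the blocks they fall into.
        existing_blocks: dict mapping a blocking key to already-unified records.
        blocking_key:    the same cheap grouping function used for the full run.
        same_entity:     a (possibly learned) pairwise match function."""
        for rec in changed_records:
            block = existing_blocks.setdefault(blocking_key(rec), [])
            matches = [other for other in block if same_entity(rec, other)]
            # A real system would now re-cluster the affected entity and refresh
            # its golden values rather than reprocessing the entire dataset.
            block.append(rec)
            yield rec, matches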

Scalable data unification must be the goal of any enterprise, and it will not be accomplished using traditional ETL systems. It is obviously the foundational task for enterprises looking to gain “business intelligence gold” from the enormous troves of enterprise data.
