Foreword by Tom Davenport

Thomas H. Davenport
Distinguished Professor, Babson College
Research Fellow, MIT Initiative on the Digital Economy
Senior Advisor, Deloitte Analytics and Cognitive Practices
Member of Tamr’s Board of Advisors

My focus for the last several decades has been on how organizations get value from their data through analytics and artificial intelligence. But the dirty little secret of analytics and AI is that the people who do this work—many of them highly skilled in quantitative and technical fields—spend most of their time wrestling with dirty, poorly integrated data. They end up trying to fix the data by a variety of labor-intensive means, from writing special programs to using “global replace” functions in text editors. They don’t like doing this type of work, and it greatly diminishes their productivity as quantitative analysts or data scientists. Who knows how much they could accomplish if they could actually spend their time analyzing data?

This is particularly true within large companies and organizations, where data environments are especially problematic. They may have the resources to expend on data engineering, but their problems are often severe. Many have accumulated multiple systems and databases through business unit autonomy, mergers and acquisitions, and poor data management. For example, I recently worked with a large manufacturer that had over 200 instances of an ERP system. That means over 200 sources of key data on critical business entities like customers, products, and suppliers. Even where there are greater levels of data integration within companies, it is hardly unusual to find many versions of these data elements. I have heard the term “multiple versions of the truth” mentioned in almost every company of any size I have ever worked with.

Thus far, addressing this problem has been prohibitively time-consuming and expensive. As pointed out later in this report, companies have primarily attempted to solve it through the collection of techniques known as “master data management,” or MDM. One objective of MDM is to unite disparate data sources to achieve a single view of a critical business entity. But the ability to accomplish this objective is often limited.

As this report will describe, rule engines are one approach to uniting data sources. Most vendors of MDM technology offer them as a key component of their products. But just as in other areas of business, rule engines don’t scale well. This 1980s technology has some virtues—rules are easy to construct and are often interpretable by amateurs. However, the ability to handle large amounts of data and a variety of disparate systems—defining attributes of multiple-source data in large organizations—is not among them. In this, as in most aspects of enterprise artificial intelligence, rule engines have been superseded by other technologies such as machine learning.

Machine learning is, of course, a set of statistical approaches to using data to teach models how to predict or categorize. It has proven remarkably powerful in accomplishing a wide variety of analytical objectives—from predicting the likelihood that a customer will buy a specific product, to identifying potentially fraudulent credit transactions in real time, and even to recognizing objects in photos on the internet. Much of the enthusiasm in the current rebirth of artificial intelligence is being fueled by machine learning. It’s great that we can now apply this powerful tool to one of our most persistent problems—inconsistent, overlapping data across an organization.

Unifying diverse data may not be one of the most exciting applications of machine learning, but it is one of the most beneficial and financially valuable. The technology allows systems like Tamr’s to identify “probabilistic matches”—multiple data records that are likely to refer to the same entity, even if their attributes differ slightly. This recent development turns a very labor-intensive and expensive data mastering initiative into one that is much faster and more feasible. Projects that would have taken years without machine learning can be done in a few months.

Of course, as with other applications of AI, there is still an occasional need for human intervention in the process. If the probability of a match falls below a certain level, the system can refer the doubtful data records to a human expert using workflow technology. But it’s far better for those experts to deal with a small subset of weak matches than with an entire dataset of them.
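To make the idea concrete, the short Python sketch below shows how threshold-based probabilistic matching with human review routing can work in principle. The field weights, string-similarity measure, and thresholds here are hypothetical illustrations chosen for readability; they are not Tamr’s actual method, which learns its matching model from training data rather than relying on hand-tuned scores like these.

```python
# Illustrative sketch only: score candidate record pairs and route them to
# auto-merge, no-match, or a human review queue. All weights and thresholds
# below are hypothetical; real systems learn them from labeled examples.
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """Return a 0-1 similarity between two field values (simple string ratio)."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def match_score(rec1: dict, rec2: dict, weights: dict) -> float:
    """Weighted average of per-field similarities, used as a rough match probability."""
    total = sum(weights.values())
    return sum(
        w * field_similarity(rec1.get(field, ""), rec2.get(field, ""))
        for field, w in weights.items()
    ) / total


def route(score: float, auto_merge_at: float = 0.9, reject_below: float = 0.6) -> str:
    """Merge confident matches, drop clear non-matches, send the rest to a person."""
    if score >= auto_merge_at:
        return "merge"
    if score < reject_below:
        return "no match"
    return "human review"


a = {"name": "Acme Corp.", "city": "Boston", "phone": "617-555-0100"}
b = {"name": "ACME Corporation", "city": "Boston MA", "phone": "6175550100"}
weights = {"name": 0.5, "city": 0.2, "phone": 0.3}

score = match_score(a, b, weights)
print(f"score={score:.2f} -> {route(score)}")
```

The point of the middle band between the two thresholds is exactly the workflow described above: only the genuinely uncertain pairs reach a human expert, rather than the entire dataset.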

The benefits of this activity can be enormous. How valuable is it, for example, to avoid bothering a customer with multiple marketing messages, or to be able to focus marketing and sales activities on an organization’s best customers with speed and clarity? How important is it to know that many different functions and business units within your company are buying from the same supplier? And would it be useful to know that you have more of an expensive product component in inventory than you need? All of these business benefits are possible with agile data mastering fueled by machine learning. And a side benefit is that the employees of your organization won’t have to spend countless hours trying to figure out whose data is correct or creating a limitless number of rules.

Even with this powerful technology, unifying and mastering your data still requires resolve, effort, and resources. And after you’ve done it successfully, you still need effective governance to limit the ongoing proliferation of key data. But it is now a reasonable proposition to think about a set of “golden records” that can provide long-term benefits for your organization. One version of the truth is in sight, and that is an enormously valuable business resource.
