CHAPTER 2: What’s in it for me?

Bad data is an equal opportunity peril, adding cost, making it more difficult to run things, and compromising decisions. It angers customers, adds risk that those interesting insights discovered by an analytics effort aren’t so, and generally stymies any effort that depends on data. Still, most people must develop a deeper understanding of how they, their work teams, departments, and companies are directly affected before they get involved. This chapter helps do just that, outlining exercises to baseline current quality levels, estimate costs, and synthesize an overall motivation or justification. Just as important, the chapter explores why so many people attack data quality the wrong way, operating what I call “hidden data factories,” rather than getting in front.

Over the years there have been many high-level studies of the costs associated with bad data. Our work at Bell Labs in the 1990s led me to conclude that about half of the cost in a service department was spent accommodating bad data. A 2002 TDWI study put the cost of bad customer data alone at $600B/year; and a 2016 IBM graphic puts the cost at $3.1T/year in the U.S., which is about 20 percent of the Gross Domestic Product. Finally, bad data is behind the scenes in stories of national and international importance every day. While the examples and numbers stun, few people take them as a call to arms because they don’t see how they are personally impacted. Hence my focus here on “what’s in it for me.”8

Instructions:

  1. Estimate current data quality levels, using the Friday Afternoon Measurement.
  2. Recognize your hidden data factories for what they are—poor, expensive alternatives to getting in front on data quality.
  3. Estimate the non-value-added costs associated with these hidden data factories.
  4. Identify hard-to-quantify costs of special importance.
  5. Think through long-term implications.
  6. Assemble the motivation/justification/business case for getting in front.

Do I Have a Data Quality Problem?

Use the Friday Afternoon Measurement to estimate your error rate

Invest a few hours to make a Friday Afternoon Measurement (FAM)9 and answer the question “Do I have a data quality problem?” Figure 2.1, below, presents the protocol for doing so.

Figure 2.1 Protocol (i.e., steps) for the Friday Afternoon Measurement.

First, assemble a spreadsheet that looks much like Figure 2.2. Take the last 100 data records that you used, created, or processed, limiting yourself to the 10 to 15 most essential attributes (or fields, columns, or elements).

Figure 2.2 Step 1 of the Friday Afternoon Measurement Protocol.

Step 2 is to gather two to three people who understand the data intimately. Invite them to a two-hour meeting. Step 3 calls for those experts to mark the obviously erred data in red (the dark shaded cells), producing a spreadsheet like that in Figure 2.3.

Figure 2.3 Step 3 of the Friday Afternoon Measurement Protocol.

Step 4 is to summarize and interpret the results. To do so, rate each record as “perfect” if there is no red in its row and “not perfect” otherwise. Next, count the errors for each attribute and total the number of perfect records. The spreadsheet in Figure 2.4 shows the results.

The Friday Afternoon Measurement works because it is usually easy for those well-versed in the nuances of the data to spot most errors.

Figure 2.4 Step 4 of the Friday Afternoon Measurement Protocol.

In this case, the data “passed” 67 times out of 100. Said differently, Data Quality (DQ) = .67 (indicated in the lower-right shaded cell in the spreadsheet). This means that a full one-third of the records you need are not fit for use. You do indeed have a problem, and almost certainly a big one.
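The scoring in Step 4 is simple enough to sketch in a few lines of Python. This is an illustrative sketch, not part of the protocol itself; the record layout and the `fam_score` helper are conventions invented for this example.

```python
# Step 4 of the FAM, sketched: a record is "perfect" only if the
# experts flagged none of its cells in red.

def fam_score(records, flagged_cells):
    """Return (perfect_count, dq) for a FAM sample.

    flagged_cells: set of (record_index, attribute) pairs marked red.
    """
    flagged_records = {i for i, _ in flagged_cells}
    perfect = sum(1 for i in range(len(records)) if i not in flagged_records)
    return perfect, perfect / len(records)

# Toy sample of 3 records; in practice you'd score your last 100.
records = [
    {"name": "Acme Co",   "zip": "07901"},
    {"name": "Beta LLC",  "zip": "0000"},   # bad zip, flagged below
    {"name": "Gamma Inc", "zip": "10027"},
]
flagged = {(1, "zip")}

perfect, dq = fam_score(records, flagged)
print(f"{perfect} perfect records, DQ = {dq:.2f}")  # 2 perfect records, DQ = 0.67
```

Note that a single flagged cell makes the whole record imperfect, reflecting the all-attributes-correct rule discussed below.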

There are many subtleties built into the step-by-step instructions for the FAM. One involves requiring that all attributes be correct. It stems from the observation that if even one attribute is erred, the customer can’t use the data record without correction.

To calibrate, I’ve seen FAM and other initial results as low as DQ = .08 and as high as DQ = .95. In a recent Dun & Bradstreet report on business-to-business marketing data quality, scores came in between DQ = .13 and .23 (though these numbers are not fully comparable).10 For FAM, most scores fall between .30 and .80. So DQ = .67 is on the high end of typical.

Almost everyone is surprised, even shocked, by their low scores. Conversely, I’ve never heard anyone say, “Excellent. I feared things were much worse!” This emotional energy can help turn people into provocateurs!

A second FAM subtlety involves taking the last 100 data records, which sidesteps a tough inferential issue. In reporting results, don’t try to generalize. Simply state that “in the last 100 records, 33 failed.”

The Wrong Reaction in the Face of Bad Data

Almost everyone, from the shipping clerk trying to get a package to the right address, to the mid-level executive trying to manage his budget, to the data scientist trying to make sure her analyses don’t go awry, to the senior leader trying to set strategic direction, does their best to deal with bad data. Many, almost unconsciously, go to extraordinary lengths in doing so. But are they doing the right things?

Consider Samantha, a rising star executive, preparing for her first meeting with the Board. While reviewing her presentation, she and her assistant Steve notice something that looks strange in the sales numbers from the Widget Department. After some discussion, she asks him to research the numbers. An hour later he’s found the problem, explained the corrected number to her, and updated the presentation.

And off Samantha goes. She connects with the Board, and there is a good discussion around the very number that her assistant corrected. She returns to her office elated. She thanks Steve, gives him an on-the-spot bonus of $200, and tells him to take the rest of the day off.

As he’s leaving, she remarks, “You know, you should check the Widget Department’s numbers every month. We were lucky this time. I don’t want to risk it ever again!” Steve agrees and heads out the door.

The Hidden Data Factory

It is easy to cheer their dedication, hard work, and fortune. But let’s take a deeper look. In deciding to check the Widget Department’s numbers on an ongoing basis, Samantha and Steve have set up what I call a “hidden data factory.” They conduct extra work to search for and correct or otherwise accommodate bad data.

Note that Samantha didn’t attempt to get in front of the issue. She could have called her peer, the head of the Widget Department, to advise him of the issue. She could have simply explained her requirements. She could have offered to lead an improvement project to get to the bottom of the issue. She could have taken steps to correct the corporate numbers, rather than leaving others to be victimized. Subtly perhaps, in setting up her own little hidden data factory, she has assumed responsibility for the quality of Widget Department numbers, even though she doesn’t know the first thing about widgets. Who even knows if the number she presented to the Board was correct? And because the underlying business process is unchanged, the widget people continue to pump out more bad data, so the one-man hidden data factory (Steve) is doomed to go on forever.

One could also argue that our rising star should be shown the door. Leaving others within one’s company to be victimized is the height of managerial irresponsibility and is not acceptable. Ever.

Whatever side you take, the vignette clearly illustrates a cultural aspect of data quality.

You can’t blame Samantha for correcting numbers in advance of her important meeting. After all, it would be simply irresponsible to present bad numbers to the Board. At the same time, she elected, perhaps without thinking, not to get in front of future issues. Unfortunately, she is not alone. It is exactly this dynamic which leads to hidden data factories all over the company!

When you bust open a typical process, it looks something like Figure 2.5. Steps 1, 3, and 4 are the value-adding work–the process’s raison d’être. Steps 2 and 5 constitute the hidden factory.

Figure 2.5 A typical business process features value-added work (Steps 1, 3, and 4) and non-value-added work (Steps 2 and 5) solely to address data quality issues.

Hidden data factories abound at all levels and in all areas, and they attest to the great value people everywhere place on high-quality data. Many billing departments conceal hidden data factories. They also proliferate between departments, as sales wastes time dealing with error-filled prospect data received from marketing, and operations deals with flawed customer orders received from sales.

There’s usually an enormous hidden data factory in IT to resolve discrepancies between the various HR, finance, inventory management, production, and other systems because key data definitions don’t align. I’ve worked with companies that purchased the same data from several sources, compared the various versions, labeled the one they liked best the “golden copy,” and used it going forward. Imagine being so rich that you could purchase several copies of anything and throw all but one away!

Hidden data factories abound in knowledge work as well. A geologist can’t find the survey he needs and so orders another. Analysts spend inordinate time checking the facts before they do their analyses. When a finance manager can’t square results from two sources, she has to figure out which (if either) is right. People from two departments may fight over whose numbers are better, which produces an oft-combustible data factory. And data scientists complain that they spend more time cleaning up data than analyzing it.

There are hidden data factories at the executive level as well. I’ve already noted the one set up by Samantha. In the same vein, many senior managers don’t trust the numbers coming from the financial system and so maintain their own records. And some executives hedge monthly numbers in case something is missing.

Better, faster, cheaper

The great Dr. Armand Feigenbaum11 coined the term “hidden factory” during the quality revolution in manufacturing to describe work conducted by one person or group to accommodate the errors made by another. Hidden factories were shown to be responsible for up to 40 percent of a manufacturing plant’s costs, and they led to other problems as well.

So too with hidden data factories:

  • They add time. Identifying suspect data is time-consuming enough. Correcting it takes even longer.
  • They add cost. The simple calculator just below yields a rough estimate.
  • They don’t work very well. I’ve already noted that finding and fixing errors is tough work and, even under the best of circumstances, too many errors leak through. Indeed, most hidden data factories are run in an ad hoc manner and under enormous time pressure.
  • They provide a false sense of security, as the Samantha vignette attests.

Poor alternatives to getting in front indeed!

Use the Rule of Ten to Estimate Costs

Bad data does lots of damage, but it is incredibly difficult to pin down most of the associated costs. For example, no one knows how to calculate the cost of a bad decision. Even the most basic costs, those associated with the hidden data factory, are difficult to determine. After all, accounting systems are not designed to track them (one reason the factory remains hidden). Still, the so-called “Rule of Ten”12 yields a quick and often good-enough estimate. It reflects the high costs associated with the non-value-adding steps of Figure 2.5 and states:

It costs ten times as much to complete a unit of work when the data is erred as it does when it is perfect.

Now, suppose your work team must complete 100 units of work in a given period of time, and it costs a dollar to complete each unit when the data is perfect. Under these assumptions, the cost of the value-adding work is $100 (a dollar for each unit). The total cost in the face of errors, and the cost associated with non-value-adding work, depend on the fraction of erred units. If DQ = .67, as obtained from the FAM, then 67 units can be completed for a dollar apiece, while the remaining 33 cost ten dollars apiece, for a total of $67 + $330 = $397. Of that, $100 is value-added work; the remaining $297 is non-value-added.

The conclusion is as follows: “For every dollar we spend on value-added data work, we spend three performing non-value-added work so we can use the data!”

While this estimate can’t be defended in any scientific sense, it can help you start a conversation. Your colleagues may propose a multiplier other than ten that they consider more appropriate. If so, simply redo the calculation. I’ve never heard anyone propose a multiplier less than five. Even at that level, the cost of the hidden data factory proves enormous. In the example above, the cost of non-value-added work is $132, still greater than the cost of the value-added work.
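To make redoing the calculation easy, the Rule of Ten arithmetic can be captured in a minimal Python sketch. The function name and defaults here are assumptions for illustration; tailor the unit cost, DQ level, and multiplier to your own team.

```python
# Rule of Ten cost sketch: erred units cost `multiplier` times as much
# to complete as perfect ones. Defaults mirror the worked example
# (100 units, $1/unit, DQ = .67, multiplier = 10).

def hidden_factory_cost(units=100, unit_cost=1.0, dq=0.67, multiplier=10):
    """Return (value_added, non_value_added) costs for the period."""
    erred = round(units * (1 - dq))                  # units with bad data
    total = (units - erred) * unit_cost + erred * unit_cost * multiplier
    value_added = units * unit_cost                  # what the work should cost
    return value_added, total - value_added

va, nva = hidden_factory_cost()
print(va, nva)  # 100.0 297.0 -> roughly $3 of waste per $1 of real work

va5, nva5 = hidden_factory_cost(multiplier=5)
print(va5, nva5)  # 100.0 132.0
```

Rerunning with `multiplier=5` reproduces the $132 figure cited above, so colleagues who dispute the multiplier can see the result under their own assumption in seconds.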

Identify “Hard-to-Quantify” Costs of Special Importance

I wish to re-emphasize that the costs associated with hidden data factories do not capture all of the damage that stems from bad data. Indeed, they don’t even reflect all of the added costs. Quite frequently other things, such as the added difficulty of running your department or company, angered customers, and a lack of trust in analytic insights, are far more important!

Thus, look at the damage more holistically. The Cost of Poor Data Quality (CoPDQ) checklist in Figure 2.6 lists frequent “hard-to-quantify” costs associated with bad data. Work through it, simply noting which costs are of equal or greater importance than those of the hidden data factory. In many cases they are.

Figure 2.6 Other common costs associated with bad data.

Think Longer-Term

A few people in each department and company should also think about data quality longer-term. Despite the flurry of media attention data is receiving, not enough people are worrying specifically about quality. In particular, most companies are already, quite literally, competing with data. Not only do they use data internally, they expose enormous quantities of data to their prospects and customers (think product descriptions); to their suppliers (think specifications) and competitors; to financial markets and regulators (think annual reports); and so forth. Some of that data is bad and sometimes prospects, customers, and financial markets take their business elsewhere. No one who’s placed an online order for in-store pick-up only to be told at the pick-up counter, “Nope, we haven’t had any of those for days,” ever completely forgives the frustration.

And data is only growing more important. In my last book, Data Driven, I enumerated a long list of ways to “put data to work” and summarized each one. Based on research since then, I now recognize four basic strategies13 (with dozens of variants) for competing with data. And bad data stymies all.

  1. Be data-driven.14 The essence of this strategy is that everyone who comes to the decision-making table–whether they’re alone at the table or surrounded by dozens of others–brings more and more data along, with the result that people combine the data with their intuitive business sense and thereby make better decisions.

    But you can’t expect people to bring data they don’t trust to the decision-making table.

  2. Use big data and advanced analytics to innovate. This is a strategy of finding clues in the data and leveraging them to create better products and processes. The promise is enormous. Big data is all the rage right now, but deeper analysis of smaller data will likely prove just as effective.

    But bad data can send even the cleverest algorithm astray. Further, even if you find something interesting in the (dirty) data, it is more difficult to take full advantage.

  3. Content is king. Companies gain an edge by providing new and better content to customers. There are many variations of this strategy, from selling content directly (think Morningstar) to infomediation (Google) and informationalization (Uber).

    But exposing bad data in marketplaces is risky. For example, it is hard to imagine much success for Uber if rides don’t show up when promised.

  4. Become the low-cost data provider. Scaling up the direct costs described above, it is not hard to see that bad data may account for half of operational expenses and a significant fraction of other expenses. Companies seeking to compete by offering the lowest prices may find improving data quality is the best way to lower their cost structures.

    Quite obviously you can’t do so without a singular focus on data quality.

So What’s in It for Me?

The discussion above helps quickly and powerfully baseline the current state. The promise of getting in front involves reducing the error rate by an order of magnitude and taking advantage. Assemble a “before and after” picture as your motivation/justification/business case for data quality, such as the example in Table 2.1.

Table 2.1 An example justification/motivation for a department-level data quality effort.

Current State → Desired State

DQ Level
  Customer Data: DQ = 43 percent → DQ = 95 percent
  Technical Data: DQ = 57 percent → DQ = 96 percent
  Media Data: DQ = 72 percent → DQ = 97 percent

Cost
  Ops team spends two-thirds of its time on DQ → Get this to 20 percent (will require some investment)
  Data scientist team spends 75 percent of its time on DQ → Get this to 20 percent (will require some investment)

Other Near-Term Impacts
  Too many anecdotal decisions in running the department → Greater confidence in day-in, day-out decisions
  Data scientists are tough to find, and bad data contributes to dissatisfaction → Reduce turnover by 50 percent

Long-Term Impact
  We don’t have the needed base of support to leverage analytics → Discussions focus on our analyses, not problems in the data

In Summary

Virtually everyone needs high-quality data and they know it, but are blind to how good or bad it is and its impact. You need to make the current state visible and you need to clarify the problems you wish to resolve. There is no great mystery.

You also need to accept that in setting up a hidden data factory you have become part of the problem. Recognize this constitutes non-value-added work. To reduce it, you must get in front on data quality.
