CHAPTER 1: Data Quality: Why, How, Who, and When?

Why?

Recently a media executive asked me to put data quality in everyday language. “Did you trust the data you used to make your most recent decision of real importance?” I asked him.

“No,” he replied. “I did not. In fact, I can’t recall a time when I fully trusted the data. I view it as my job to make the best call, even recognizing there are problems with the quality.”

He’s not alone. In a recent survey conducted by Harvard Business Review, only 16 percent of managers expressed strong confidence in the accuracy of the data underlying their business decisions.2

Further, the media executive’s comments betray a simple reality for almost everyone: they must do their work in spite of bad data. A geologist in an oil company can’t trust reports about existing wells; a salesperson can’t trust the contact leads she has been given; a back office worker in a financial services company finds that details about a municipal bond are inaccurate; a shipping clerk spends more time dealing with returned parcels than sending out new ones. It’s a problem for individuals and work teams, up and down the organization; for departments and entire companies; in the private sector and the public sector.

Quite naturally, all do their best in the face of bad data—correcting errors, searching for other sources, sorting out what the data means, and otherwise working around the inaccuracies. But they don’t always succeed, and bad data goes on to the next person, process, or application.

Many errors cause only minor problems—a package is sent to the wrong address, an interest payment is too big, an employee gets more than his allotted vacation, and so on. A few lead to “train wrecks,” such as the aforementioned Martian lander or death following the transplant of an organ carrying the wrong blood type. Indeed, examples of bad data and its impact make national and international news with disturbing frequency.3

The damage doesn’t stop there. Bad decisions are made based on bad data. Customers get angry, bosses grow frustrated, and internal rancor is sown. One of the biggest political fights I’ve witnessed involved “whose data was better,” and the disputing parties didn’t speak to each other for months afterward.

Finally, bad data is like a virus. There is no telling where it will show up or the damage it will do.4 Falsified data on mortgage applications in the run-up to the financial crisis provides a perfect example. The immediate consequence was that people took out mortgages they couldn’t pay off. Bad as this was, the damage could have been contained; but these mortgages were repackaged into more complex financial products, amplifying the risk and eventually failing in a spectacular way.

Thus, the first and most immediate reason to improve data quality is to improve current business performance.

Now let’s look ahead. It is now trite to observe that data already plays vital and growing roles in all companies. Indeed, data dominates entire industries. Health care runs on data. Other than cash transactions, the finance sector consists of nothing but data. Logistics depend on data—indeed, half the value in delivering a shipping container around the world lies in the data.5

Further, many have opined that data is “the new oil.” Just as it took oil to power the machines of the Industrial Age, so it takes data to power tomorrow’s technologies, products, companies, and economies.

For example, some managers labor to create data-driven cultures, essentially bringing more and more data to the decision-making table, and so improving everything little by little. Other companies are using big data and advanced analytics to innovate, and find and leverage insights to make better products and processes. Uber has informationalized the taxi business—just one example of the new business opportunities in data. More and more, data is key to competing effectively.

At the same time, it is clear enough that you can’t compete using bad data: You can’t expect people to become data-driven when they don’t trust the data; even the most sophisticated analysis is no better than the data on which it is based; and exposing bad data in markets is risky. Departments and companies that hope to build a future in data simply must improve. Finally, though not the original objective, many find getting in front on data quality “transformative,” re-arranging the ways people think about their jobs, changing relationships, and helping create that future.

Thus, the second, and possibly more important, reason to improve quality is that you can’t build a future in data without doing so.

How and Who?

Data quality pioneers have made tremendous progress in the last 25 years or so, working out the basic philosophies, approaches, methods, and organizational structures needed to “get in front on data quality.” The essence is creating the most important data correctly, the first time, thus doing away with all that added work.

Getting in front works far better than one might reasonably expect. Companies across a range of industries, including telecom, financial services, oil and gas, and data providers, have made and sustained order-of-magnitude improvements.

The benefits have also proven enormous, again far greater than one might expect. Those whose goal was saving money have reduced expenses by hundreds of millions. Those less concerned with saving money and more concerned with better decision-making, improved customer satisfaction, or more competitive positioning, have achieved these goals.

Until you get the hang of it, getting in front seems odd. It represents new thinking, requires people to take new roles, and features some new management tools (and a few technical ones). But there is no deep mystery.

A database is like a lake

To fully appreciate the common sense of getting in front, consider Figure 1.1. In the analogy, the lake corresponds to a database, the lake water to the data, and the stream to a business process that feeds new data into the database. If the stream is polluted, it feeds polluted water into the lake, and people who drink from the lake get sick.

There are three ways to deal with a polluted lake:

  • Focus your efforts on those who get sick – get good at quickly identifying them and develop efficient systems for taking them to the hospital, pumping their stomachs, and supporting their recoveries.
  • Clean the water before people drink it.
  • Find and eliminate the sources of pollutants, so that only clean water will flow into the lake.

Figure 1.1 A database is like a lake.

For data quality, the choices are just as stark:

  • Let those who use the data fend for themselves and deal with the consequences.
  • Find and fix errors before people use the data.
  • Get in front by creating data correctly the first time. This means finding and eliminating the root causes of error.

Getting in front works, in part, because of two observations (a short sketch follows the list to illustrate the contrast):

  1. It is always easier to create data correctly than it is to find errors and correct them.
  2. It usually costs no more to create correct data than incorrect data.
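
To make the contrast concrete, here is a minimal sketch in Python. It is my own illustration, not taken from the book or any particular company; the field names and validation rules (customer_id, email, order_date) are hypothetical stand-ins for whichever rules matter for your most important data.

  from datetime import date

  def violations(record):
      """Return a list of rule violations for one record (hypothetical rules)."""
      problems = []
      if not record.get("customer_id"):
          problems.append("missing customer_id")
      if "@" not in record.get("email", ""):
          problems.append("malformed email")
      order_date = record.get("order_date")
      if order_date is not None and order_date > date.today():
          problems.append("order_date in the future")
      return problems

  # Getting in front: reject (or fix) the record at the moment of creation,
  # while the person or process that knows the facts is still at hand.
  def create_record(record, database):
      problems = violations(record)
      if problems:
          raise ValueError(f"fix before saving: {problems}")
      database.append(record)

  # The downstream alternative: let bad records into the lake, then filter out
  # the suspect ones later, long after the context needed to repair them is gone.
  def clean_downstream(database):
      return [record for record in database if not violations(record)]

The point of the sketch is not the particular rules but where they run: applied at creation, a violation is cheap to fix; applied downstream, the same violation must be found, diagnosed, and repaired by someone far removed from the facts.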

Getting in front depends on data creators, including anyone who fills in a form, populates a database, crafts a management report, or makes a decision, and those responsible for devices such as thermostats and blood monitoring equipment. In this respect, practically everyone is a data creator! Still, it is usually more appropriate to think of processes, work teams, departments, and companies, rather than people, as data creators and customers. This is certainly the case for complex data products, financial reports, and sophisticated analyses.

Connecting data customers and creators

But as powerful as the lake analogy is, it does not point out which data is most important. That’s where data customers come in – their needs drive what’s most important. As used here, a customer is anyone who uses data in any way. Again, practically everyone is a data customer! Finally, it is often most appropriate to view operational, analytic, decision-making, and planning processes as data customers.

Unfortunately, most data customers, under the pressures of day-in, day-out work, attempt to find and correct errors (i.e., clean the lake) rather than communicating their needs to data creators. It simply doesn’t occur to them that they should look upstream, clarify their needs, and help the creators do a better job. For their part, data creators see no reason to do anything different. So they continue to produce bad data.

I wish to be careful here. I am not suggesting that people shouldn’t correct erroneous data before they use it. Using data you know is bad or passing it on is simply irresponsible. At the same time, it is equally irresponsible to continue such corrections, day-in and day-out, without attempting to get in front of the problem.

To get in front, creators and customers must talk. When customers articulate their most important needs, almost all data creators strive to meet those needs and quality improves rapidly.

A watershed moment occurs when people realize that they are both data customers and data creators. And just as they must work with creators in their roles as customers, so too must they work with their customers, the next department, bosses, corporate customers, the public, regulators, and so on, in their roles as creators.

Of course, life is more complicated than this. While these basic roles are seemingly obvious, they don’t appear out of thin air. A data customer may need data created months before, and three departments removed. The data creator may not know who uses what he or she creates. Both may feel trapped in organizational silos that make communication difficult.

That’s where data quality management comes in. Its main job is to help connect data creators and customers, facilitating direct communications between the two. This may also mean providing measurements to help identify sources of errors, training creators on improvement and control, actively managing change, engaging more senior management, and lending expertise to help solve some particularly nettlesome problems.
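
To illustrate the sort of measurement such a team might provide, here is another small sketch in Python, again my own and purely illustrative: it computes the share of newly created records with at least one obvious defect, grouped by the process that created them. The rules and field names (customer_id, email, amount, created_by) are hypothetical.

  from collections import defaultdict

  # Hypothetical rules; each maps a defect name to a test of one record.
  RULES = {
      "missing customer_id": lambda r: not r.get("customer_id"),
      "malformed email": lambda r: "@" not in r.get("email", ""),
      "negative amount": lambda r: r.get("amount", 0) < 0,
  }

  def defect_rates_by_creator(records):
      """For each creating process, return the fraction of defective records."""
      totals = defaultdict(int)
      defects = defaultdict(int)
      for record in records:
          source = record.get("created_by", "unknown")
          totals[source] += 1
          if any(rule(record) for rule in RULES.values()):
              defects[source] += 1
      return {source: defects[source] / totals[source] for source in totals}

  # Example: point the conversation with data creators at the worst sources.
  sample = [
      {"created_by": "web_form", "customer_id": "C1", "email": "a@b.com", "amount": 10},
      {"created_by": "web_form", "customer_id": "", "email": "a@b.com", "amount": 10},
      {"created_by": "call_center", "customer_id": "C2", "email": "bad", "amount": -5},
  ]
  print(defect_rates_by_creator(sample))  # {'web_form': 0.5, 'call_center': 1.0}

Simple numbers like these give creators and customers something concrete to talk about: which sources produce the most errors, and which rules are violated most often.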

Powerful, professional data quality teams should include “embedded data managers,” placed in the line, as close to data creators and data customers as possible, to help both fulfill their responsibilities. Embedded data managers also help creators and customers see new opportunities in data and unlock its potential.

Two moments that matter6

The focus on customers and creators and the need for them to connect is so important that I wish to motivate it in a different way. To do so, follow a piece of data around. Only two moments in its entire lifetime truly matter. The first is its moment of creation. After being generated, often in the blink of an eye, it may be moved from place to place, stored in databases, combined with other data, transformed slightly and stored in a data warehouse. Most of the time, it just sits there, leading an incredibly boring existence.

If the data is lucky, it experiences another moment that matters: someone uses it – to complete an operation, as part of an analysis, to make a decision, to craft a plan, or to do something else. Quality is determined at that moment – the data is either fit for use, relevant, accurate, and well-defined, according to the customer’s needs, or it is not. Note, though, that how well the moment of use goes is set at the moment of creation. Created properly, the moment of use should go well. Created improperly, it will not.

Of course the vast majority of data never experiences that moment of use. At most a few percent of data qualify as “most important.”

Thus, from a quality perspective, the only two moments that truly matter are the moment of creation and the moment of use. It follows that we must direct as much attention as we can to these moments—that customers and creators must do all they can to ensure those moments go well! It also follows that we must connect these moments. While customers and creators sometimes find each other, active, engaged data quality management ensures it happens at scale, for all of the most important data.

Getting started

While the roles for data creators, data customers, and quality managers are simple, powerful, and almost obvious, they also require a huge change in thinking and approach. Most creators have no idea that their data causes such pain. Customers tend to fend for themselves, and quality management is sidelined and ignored.

It takes a special spark to transition from the status quo to getting in front. That spark occurs when someone becomes so dissatisfied that they scream to themselves, “There has to be a better way!” and displays the courage to give getting in front a try. I call these people provocateurs, and they usually aren’t data experts. They may be executives, middle managers, data scientists or industry specialists, clerks, even entry-level employees. The only requirements are an unwillingness to accept bad data and the courage to try something new.

To be clear, provocateurs are not usually motivated by data quality per se. Rather, they want to solve a business problem and bad data is getting in the way. Thus improved data is the means to achieve an objective, not the objective itself.

Spurred on by their frustration, provocateurs typically have an enormous impact within their immediate spans of control. But they cannot change entire companies on their own. If the good efforts are to go further, more senior management (dare I say leadership) must take over.

People frequently complain to me that their senior managers “don’t get data quality.” My own experience is somewhat different. True, they don’t think much about data quality, but after just a few minutes of discussion, they get it just fine. The larger problem is that they don’t know what they personally should do about it.

As I’ll make clear, the organization’s top leaders must insist that their departments and companies adopt the getting in front approach, build the roles described here into the structure of their organizations, focus the effort, and advance a culture that values data.

Let tech do tech!

I haven’t said a word, until now, about technologists. That’s deliberate. Too many data programs make the mistake of putting technology first. Although technologists can do many wondrous things in supporting data quality, scaling up data programs, and bringing down costs, they have proven to be ineffective leaders of data quality efforts.

Tech cannot set objectives, select and build support for the right approach, put the right people in place, or sort out what is most important. Nor can it save a data quality program that ignores these factors.

Instead, I find it extremely helpful to approach business problems and opportunities in the following left-to-right fashion, with tech last:7

1. Goal/Approach ⇒ 2. Organization ⇒ 3. Process ⇒ 4. Technology

This aligns with the way corporate leaders think about the business. They realize they bear ultimate responsibility for results and know they have to have the right people, structure, governance, and culture in place to achieve them. Interestingly, positioning tech last also increases its chance of success. Rather than selling an initiative, tech focuses on what it does best—using technology to increase scale and decrease unit cost.

When?

To oversimplify a bit, a data quality program typically unfolds in three phases (refer to Figure 1.2):

  • Getting traction, during which a provocateur tries the getting in front approach as a means to solve a specific problem;
  • Achieving real results, in which data creators, data customers, quality managers, and embedded data managers work in concert to make significant improvements and put in place a sustaining data quality program on a small scale; and
  • Going to the next level, in which the data quality program, rather than plateauing (as many do), grows beyond the provocateur’s reach, to an entire department or the company as a whole. Senior leadership is essential.

Transitioning from a “provocateur-led” to a “senior management-led” data program is a big deal. Just as data creators and customers had to learn new roles, so too does senior management. Teaching them falls to either provocateurs or (more usually) data quality management. Since data quality programs go as far and as fast as demanded by the senior leader perceived to be leading the effort, I cannot emphasize this enough!

Figure 1.2 A provocateur stimulates a data quality effort, gaining traction and achieving real results. But the effort plateaus and leadership is needed to move the program “to the next level.”

In Summary

Getting in front on data quality means creating the most important data correctly, the first time. For this to happen, people (and processes) must step up to their responsibilities as data creators and data customers. When both make reasonable efforts to connect, to focus on customers’ most important needs, and to find and eliminate root causes of error, data quality improves rapidly. Data creators and data customers thus play the most important roles. And all of us play these roles, every day!

Data quality management helps make creators and customers effective, by helping them connect, supporting embedded data managers, and being the day-in, day-out face of the effort. Technologists help make others more efficient. Once the basic processes work, automation increases scale and decreases unit cost.

Getting started is tough. It takes a special person, whom I call a provocateur, to challenge the status quo and give the getting in front approach a try. And not long thereafter, senior leaders must take up the mantle and extend the effort to everyone!
