CHAPTER 4, Part I: Must-Dos for All Data Creators

In Chapter 3, I presented two definitions of data quality, each emphasizing the moment of use. My third definition emphasizes the moment of creation:

Meeting the most important needs of the most important customers.21

I call this my day-in, day-out definition, and it serves as a beacon for all data creators.

Data creators include people. After all, people enter data and, in so doing, fit any reasonable definition of data creator. That goes for decision makers and planners as well: their decisions and plans can be likened to data that guide the organization going forward. Data scientists, who tease out insights from large quantities of data, certainly count as data creators.

A measurement device, such as a weather station, a thermometer, a Fitbit, or any other connected tool, also qualifies. A computer application that merely copies a data item does not qualify. But one that takes $US PROFIT, converts it to €US PROFIT, and adds it to €EUROZONE PROFIT to produce €TOTAL PROFIT surely qualifies. Since a data customer cannot discuss his or her requirements with a device or a bit of code, I find it more appropriate to view the person responsible for that device or code as a data creator as well.
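
To make the distinction concrete, here is a minimal sketch of such a derived-data computation in Python; the exchange rate and profit figures are invented for illustration:

    # Illustrative only: the exchange rate and profit figures are made up.
    USD_TO_EUR = 0.92              # assumed exchange rate

    us_profit_usd = 1_000_000      # $US PROFIT, created upstream
    eurozone_profit_eur = 750_000  # €EUROZONE PROFIT, created upstream

    # Convert $US PROFIT to €US PROFIT, then derive €TOTAL PROFIT.
    us_profit_eur = us_profit_usd * USD_TO_EUR
    total_profit_eur = us_profit_eur + eurozone_profit_eur
    print(f"€TOTAL PROFIT: {total_profit_eur:,.2f}")  # 1,670,000.00

Whoever is responsible for this bit of code effectively creates €TOTAL PROFIT, and should be treated as its data creator.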

Finally, I find it powerful to embed the act of data creation within a process and jointly consider both the process and its manager to be a data creator.

There are four must-do tasks for data creators: understanding customers and their needs, measuring against those needs, conducting improvement projects to close gaps, and implementing controls to keep errors from coming back. Data creators must do these things well, and these tasks are the subjects of the first part of this chapter.

The process management cycle helps data creators complete these must-dos and provides many other advantages. It is also my preferred means of uniting the roles of data creator and data customer. But not all companies adopt process management in a formal way. Still, anyone can, and in my view should, use the process management cycle to full advantage. I consider these topics so important that I've reserved the second part of this chapter to cover them.

A section at the end of Chapter 4, Part II provides special instructions for the creation of data definitions.

Instructions:

  1. Recognize that you are a data creator and that the data you create impacts others.
  2. Focus on the most important needs of your most important customers. Spread the Voice of the Customer far and wide. Quit doing things that customers don’t care about.
  3. Measure data quality levels as close to the points of data creation as possible.
  4. Complete improvement projects, identifying and eliminating root causes of error, to close the gaps between customer needs and measured quality levels.
  5. Implement a multi-layer program of control, in part to reduce the costs associated with the hidden data factory.
  6. Look for opportunities to innovate.

Recognize You Are a Data Creator and That Your Work Impacts Others

Just as everyone must recognize their roles as data customers, they must also recognize their roles as data creators, with responsibilities to those who use that data. It is a powerful insight, and one that, after a moment's reflection, is quite obvious.

It bears mention that data creators are also data customers and must work with their data suppliers to ensure they have the high-quality inputs they need to meet their customers’ expectations.

So who counts as a data customer? Certainly all of the following:

  • Bosses, including anyone in your management chain.
  • You, as a customer for some of the data you create.
  • Internal customers, others within the company but not in your management chain, who use the data you create or supply.
  • External company customers, who have a business relationship with your company, but don’t explicitly pay for the data they receive. A bank’s customers don’t explicitly pay for statements, for example.
  • Paying customers, who pay for the data in question. These customers are especially important to commercial data providers such as Morningstar and TeleTech.
  • Shareholders.
  • Regulators.
  • Others who may use the data you provide, even though there is no formal business relationship. People who use government statistics and other data are customers of the department creating them, for example.

I suppose one could cite computers or application programs as customers, especially since some data feeds directly into them. But, so far anyway, you can't talk to a computer about what's most important. View the person or work team responsible for the application as the customer instead.

You need to recognize that your work impacts customers, even if they are outside your reporting structure. Bad data may cause them extra work, result in a bad decision, or cost your company a customer. You don’t work in isolation. Respect these people and their needs.

Focus on the Most Important Needs of the Most Important Customers

In the previous chapter, I described how a customer should document his, her, or its needs and communicate them to data creators. From the creator’s perspective, the essence of that advice is exactly the same, but with the added discomfort of sorting out which customers are most important. Figure 4.1 captures this in a “step 0,” added to the customer needs analysis process.

Figure 4.1 The customer needs analysis process (creator version).

“Naming your most important customers” can be especially difficult because telling someone that another customer is more important can be uncomfortable. What should one do, for example, when the boss wants one thing and someone in a different department wants something else, even if the boss recognizes that department as a customer? There is no easy answer.

With some rather obvious modifications, steps 1 through 4 of the customer needs process unfold much the same whether the data customer or the data creator is leading the exercise. The result is a Voice of the Customer document. The last part of this instruction is to spread the VoC far and wide. In contrast, the instruction for customers calls for a more targeted communication to critical data suppliers. Getting everyone involved can lead to immediate improvements, as when somebody notices, "Ah, I see why that is important. Until now I had simply put in 999-NA because someone taught me years ago that the system allowed it. Now that I know this is important, I can do much better."

As noted earlier, data creators' failure to understand customer needs has been the major, if not dominant, contributor to every data quality issue I've worked on for nearly 30 years. Fortunately, it is the easiest issue to resolve: no great technological magic is required. Unfortunately, I find many data creators reluctant to take this step; they feel it is tantamount to publicly admitting they don't know what the customer wants. (In private conversations, many readily admit they don't understand customers or their needs; their issue is admitting it publicly.) My advice is simple: "Get over it!"

Quit doing work for which you have no customer

Most people and work teams perform lots of unimportant work. For example, a market research group I worked with developed and sent out over 160 reports every week. This production effort occupied most of their time, leaving them little time to actually research the markets. So they rank-ordered the reports as best they could, based on whom each report was sent to and when they had last fielded a question about it.

Next, they quit sending the bottom half, expecting a firestorm of complaints. But nary a peep! Ever.

So after a few weeks they did it again, taking a bit more care rank-ordering the remaining 80 or so reports and again not sending the bottom half. This time, they heard from one customer about one missing report. So they reinstated it and stopped the exercise. But note that they had eliminated three-quarters of their production work, freeing up valuable time for actual market research.

I've used the term "non-value-added work" in connection with the hidden data factory: work that no informed customer would pay extra for. You may still have to do such work; for example, you must correct an erroneous customer address before shipping an order. The "unimportant work" I mean here certainly qualifies as non-value-added, but it is distinct in that you don't have to do it at all. No one cares. There is no customer for unimportant work!

I find this again and again: people and work teams expend enormous effort on tasks for which they have no customer. So, to be clear, understanding the needs of your most important customers does double duty. It enables you to focus on what's most important and to ruthlessly cut work that is truly unimportant. It is NOT an instruction to cut staff. Rather, it is an instruction to redeploy staff to the most important work.

Measure Quality Against Those Needs, in the Eyes of the Customers

The next step is to measure the quality of the data that you, your work team, department, or process creates against customer needs. Measurement is simultaneously the most mysterious and the most technical work in data quality management. Mysterious because data has no physical properties, such as length, viscosity, or color, so there is no physical property to record. Said differently, there is no such thing as an "accurometer."22 Technical because there are often tough sampling issues to be resolved, complex measurement protocols to work out, and hard choices to be made about ways to report results. I've already discussed the Friday Afternoon Measurement as a means to gain traction. The next section looks at the "business rules" method, a popular means of scaling up. In the subsequent section I'll explore data quality measurement more generally.

Scale up with business rules

The Friday Afternoon Measurement is fit for purpose in that it is fast, cheap, and provides a simple answer to a basic and important question. It is often perfect for getting traction. It does not, however, scale up. Over time, data creators must measure every month, week, day, or an even briefer period of time, depending on the speed of the process. Automated measurements are needed. That’s where DQ Measurement with Business Rules comes in.

In this context, a business rule is nothing more than a constraint on data values. If the data values lie outside the specified domain, they can't be correct. Some examples of failed business rules, and the reasons they fail in these particular cases, include:

SUPPLIER NAME = NULL (a required attribute)

SEX = X (only M, F, or NA permitted)

REVENUE = $10,000; EXPENSE = $8,000; PROFIT = $4,000 (Profit must equal Revenue – Expense)

Thus the basic idea is to automate the checking: run the data created in the most recent time period against the rules and count the failed records. Figure 4.2 presents the protocol.
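
To make the protocol concrete, here is a minimal sketch in Python using the three rules above; the record layout and field names are assumptions for illustration, not a complete implementation:

    # Check each record against the business rules above and count the
    # records that fail at least one rule.

    def rule_supplier_name(record):
        # SUPPLIER NAME is a required attribute.
        return bool(record.get("supplier_name"))

    def rule_sex(record):
        # Only M, F, or NA are permitted.
        return record.get("sex") in {"M", "F", "NA"}

    def rule_profit(record):
        # Profit must equal Revenue - Expense.
        return record["profit"] == record["revenue"] - record["expense"]

    RULES = [rule_supplier_name, rule_sex, rule_profit]

    def measure(records):
        failed = sum(1 for r in records if not all(rule(r) for rule in RULES))
        return failed / len(records)  # fraction of failed records

    records = [
        {"supplier_name": "Acme", "sex": "F",
         "revenue": 10_000, "expense": 8_000, "profit": 2_000},
        {"supplier_name": None, "sex": "X",
         "revenue": 10_000, "expense": 8_000, "profit": 4_000},
    ]
    print(f"Failed-record rate: {measure(records):.0%}")  # 50%

Run each period's newly created records through such a routine and you have the raw material for the plots discussed next.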

Figure 4.2 Protocol for measuring data accuracy using business rules.

The power of automation allows frequent measurement, hence the time-series plot, such as the one in Figure 4.3. It and the Pareto chart (Figure 4.4) are the workhorses of quality management.

Figure 4.3 Time-series plot example.

Figure 4.4 Pareto chart example.
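
For readers who wish to produce such plots themselves, here is a minimal sketch using Python's matplotlib library; the error counts are invented, and a full Pareto chart would also add a cumulative-percentage line:

    import matplotlib.pyplot as plt

    # Invented data: weekly failed-record rates and error counts by attribute.
    weekly_error_rate = [0.12, 0.11, 0.13, 0.09, 0.08, 0.07]
    errors_by_attribute = {"D": 48, "B": 22, "A": 11, "C": 6}  # largest first

    fig, (ts, pareto) = plt.subplots(1, 2, figsize=(10, 4))

    # Time-series plot (in the spirit of Figure 4.3).
    ts.plot(range(1, len(weekly_error_rate) + 1), weekly_error_rate, marker="o")
    ts.set_xlabel("Week")
    ts.set_ylabel("Failed-record rate")

    # Pareto chart (in the spirit of Figure 4.4): bars in descending order.
    pareto.bar(list(errors_by_attribute), list(errors_by_attribute.values()))
    pareto.set_xlabel("Attribute")
    pareto.set_ylabel("Error count")

    plt.tight_layout()
    plt.show()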

A Broader Look at Data Quality Measurement

The measurement methods introduced so far (Friday Afternoon Measurement, business rules measurements, and the rule-of-ten) are useful, even powerful. But they are not perfect, nor do they cover all situations that come up. Indeed, there are enough situations and measurement options to fill an entire volume. So I wish to conclude this section by providing two further instructions.

Commercial data providers should, in my opinion, publish data quality statistics, using time-series and Pareto plots.

The first follows up on the earlier observation that there is no "accurometer." Thus, all methods for measuring data accuracy have strengths and weaknesses. The Friday Afternoon Measurement suffers because experts make mistakes and because it doesn't scale up. Measurements using business rules suffer because meeting a business rule doesn't imply that the data is correct. For example, for me, SEX = F is incorrect, but it will pass a simple business rule that SEX must be either M or F. Table 4.1 presents a list of measurement devices, a brief summary of how each works, and the pros and cons of each.23

Table 4.1 Candidate data accuracy measurement devices.

Device | Description | Strength | Weakness
Data Tracking | Track a sample of data records across an end-to-end process of data creation | Powerful insights; focus on interfaces between steps | Expensive
Expert Opinion (i.e., Friday Afternoon Measurement) | Experts identify errors by eye | Quick and easy | Doesn't scale
Business Rules | Compare data to business rules to identify "invalid" data | Scale; linkage to control | More difficult than it seems
Customer Complaint | Count errors customers complain about | Measurement in the eyes of the customer | Customers don't always (or even often) complain; deceptively hard
Real-World | Compare data to real-world counterparts | The best accurometer | Very expensive and not always feasible
Surveys | Ask customers | Can yield powerful insights | Often difficult to set up, administer, and interpret

Use this list to develop more appropriate accuracy measurements as your data quality program matures.

Second, so far I’ve concentrated only on accuracy in the process of data creation. Yet other aspects of data quality may also be important.

Find and Eliminate Root Causes of Error

Eliminating root causes of error is essential if you're going to create data correctly the first time. Sometimes this is easy: those who create data may, once they understand customer needs or quality levels (e.g., along the lines of Figures 4.3 and 4.4), act of their own accord to find and eliminate the root causes of error. You should encourage this!

Sometimes eliminating the root cause is more involved and then it is best to employ a structured method. If yours is a Six Sigma company (meaning it regularly employs DMAIC, lean, lean-sigma, or a variant) then use it. If not, follow the Quality Improvement Cycle (QIC),24 depicted in Figure 4.5. QIC was developed at AT&T in the 1980s and has proven itself time and again.

In keeping with the emphasis on "who," I prefer QIC to Six Sigma because it clarifies the "who" right up front, in step 1. That said, the last thing most organizations need is "dueling improvement methods." So if your company is comfortable with Six Sigma, by all means use it for data quality.

Except for the first two steps, which involve getting the right people in place, I’m not going to discuss QIC any further.

Step 1 of the QIC calls for the appropriate manager (work team, departmental, or process manager) to select improvement projects. He or she can identify potential projects using customer needs, measurements, perceived gaps, broken interfaces (see the next section), or the inputs of others, and can base the selection on any number of factors: anticipated benefits (e.g., cost reduction), progress toward targets, the entreaties of those who work on the process, his or her own preferences, perceived ease or difficulty, or political expediency. The point is that the choice of project falls to the responsible manager because he or she bears ultimate responsibility for results.

Figure 4.5 The QIC is a powerful method for identifying and eliminating root causes of data errors.

When getting started, I advise managers to select relatively simple projects, those that involve fewer data attributes, people, departments, and computer systems. Indeed, you need to complete at least one project to gain traction and several to claim real results. Further, there is a real sense of accomplishment, perhaps even euphoria, in completing an improvement project. Those participating are empowered and achieve a dynamic that leads to continuous improvement. So smart managers get those first few improvement projects under their belts.

Having selected the improvement project, the manager must then put the right improvement team in place and work with it to establish a charter. It’s best to keep project teams as small as possible, but collectively, members of the team must have a full understanding of the steps where the issue occurs (i.e., the team must be able to surround the problem). This means that a more complex problem will require more people. One person should be designated team leader and another “facilitator.”

My rule of thumb is that step 2 is complete only when the project team and process owner agree to a "charter," including a specific error reduction goal and time frame. Thus (from Figure 4.4), "reduce the error rate associated with attribute D by 50 percent in three months; reduce it by a further 50 percent, for a total 75 percent reduction, in three more months" is an example of a good charter.

Once chartered, the improvement team works the remaining steps of the QIC and is disbanded when it completes its agreed project.

Establish Control

Dr. Juran defines control as "the managerial act of comparing actual performance against standards and acting on the difference."25 It is a powerful synthesis, covering situations as diverse as managing your teenager's curfew and managing a complex organization. Figures 4.6 and 4.7 present the generic control flowchart and the flowchart for a common household thermostat, respectively. Note that a hidden data factory is a form of control: an expensive and not-so-effective one, but a control nonetheless.

The term “governance” refers to the overall system of controls in a data (or any other) program.

The long-term goals for an overall system of data quality controls include:

  • Preventing errors at the point of data creation.
  • Detecting and correcting errors as quickly as possible, before they do any harm.
  • Responding to customer-found errors quickly.
  • Ensuring that the overall quality program works as planned.
  • For some processes, establishing statistical control, or stability.
  • Error-proofing processes and measurement devices.
  • Doing all of this in a cost-effective manner.

Figure 4.6 The generic control process after the work of Dr. Juran.

Figure 4.7 A thermostat effects control.
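
A minimal sketch of the same compare-and-act loop of Figures 4.6 and 4.7 in Python; the standard, tolerance, and actions are of course illustrative:

    # Control per Dr. Juran: compare actual performance against a
    # standard and act on the difference. Here, a toy thermostat.
    STANDARD = 20.0   # desired temperature in degrees C (assumed)
    TOLERANCE = 0.5   # acceptable deviation (assumed)

    def control(actual):
        difference = actual - STANDARD
        if difference < -TOLERANCE:
            return "turn heat on"    # too cold: act on the difference
        if difference > TOLERANCE:
            return "turn heat off"   # too warm: act on the difference
        return "do nothing"          # performance meets the standard

    print(control(18.7))  # turn heat on

Every control in Table 4.2 follows this same loop; only the standard, the comparison, and the corrective action change.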

There are many types of data quality controls. See Table 4.2 for a list.

Table 4.2 Commonly used data creation controls.

Type | Brief Description | When Appropriate
Proofing | Take a hard look at the data and make corrections as appropriate. | To gain traction, when business rules are poorly developed.
Tail-end validation control | Identify and correct "invalid" data using business rules at the end of the day. | To scale up for real results. For sophisticated rules involving multiple attributes.
In-process validation control | Identify and correct "invalid" data using business rules as the data is created. (Many good web-based forms don't let you move on until the data you entered pass the rules.) | For many rules, these validation controls are superior to tail-end controls because they are more effective and cheaper.
Customer-found error control | Correct errors that customers find. | Always. If customers are kind enough to advise you of your errors, you should act quickly.
In-factory calibration | Ensure that measurement equipment measures correctly before the device leaves the factory. | To scale up for real results or to get to the next level. Always needed when there are many devices.
In-use calibration | Re-calibrate measurement equipment to ensure it measures correctly in real use. | To scale up for real results or to get to the next level. Always needed when there are many devices.
Statistical process control | Separate special causes from common causes to establish a basis for predicting future process performance. | For relatively fast-moving processes.
Quality assurance (audit) controls | Ensure that the overall data quality program is working as designed. | As you achieve real results and get to the next level.

I find that many data creators initially employ no controls whatsoever. A simple control or two helps gain traction, then points the way to real results and the next level. So consider the process that produced the data in Figure 4.3. A simple control that called for "eyeballing data created yesterday and making needed corrections at 8:00 AM today" would be a big step up from no control at all. Such a control is akin to asking someone to proofread an important report before you send it on, and it is perfectly appropriate for unstructured data. It is obviously not perfect, but people, even non-experts, are quite good at spotting errors, and it will catch most of the obvious errors and some of the subtle ones.

To solidify results and scale up, the next step is to use business rules (the same business rules used for measurement) to detect errors as the data is created. If an attribute fails a rule, the control alerts the data creator to correct the value before loading the data.
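
As a minimal sketch, the rules used for measurement can be redeployed this way; the load and alert hooks below stand in for whatever loading routine and notification mechanism a real process would have:

    # Reuse the measurement rules as an in-process control: refuse to
    # load a record until its creator corrects every failed rule.
    def load_with_control(record, rules, load, alert):
        failed = [rule.__name__ for rule in rules if not rule(record)]
        if failed:
            alert(record, failed)  # send the record back to its creator
            return False           # bad data never gets loaded
        load(record)
        return True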

Finally, statistical process controls help get to the next level.
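
For the curious, here is a minimal sketch of one common tool, the p-chart, in Python; the daily counts are invented:

    import math

    # p-chart: flag days whose failed-record fraction falls outside
    # 3-sigma control limits, signaling a special cause to investigate.
    failed = [12, 9, 14, 11, 30, 10]   # failed records per day (invented)
    n = 200                            # records created per day (invented)

    p_bar = sum(failed) / (n * len(failed))   # average failed fraction
    sigma = math.sqrt(p_bar * (1 - p_bar) / n)
    ucl, lcl = p_bar + 3 * sigma, max(0.0, p_bar - 3 * sigma)

    for day, f in enumerate(failed, start=1):
        p = f / n
        if not lcl <= p <= ucl:
            print(f"Day {day}: p = {p:.3f} outside [{lcl:.3f}, {ucl:.3f}]")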

To conclude this section, I'd like to contrast two controls applied to the same data, one by the data customer and the other by the data creator. In the example, in step 1 a customer order is created (perhaps the customer places the order over the Internet or phones it in), and in step 2, three days later, his or her products are taken from the warehouse and shipped. Of special interest here is the customer's address; suppose the following is entered:

City, State, Zip Code, Country = Rumson, NJ, 90210, USA

The obvious error is that the (city, state) pair and the zip code don't align: one or both are incorrect.

Now the shipping clerk, who has to deal with returns, may eyeball the address and notice this error. But it will likely take considerable effort to find the correct address (as the rule of ten suggests).

A better control, built into the data entry application, would signal the inputter of the error and require correction before moving on. The cost of correction is far, far less! Luckily, more and more web-based forms employ such controls. Move controls as close to the moments of data creation as you can. Doing so not only makes for better controls, it also enables data customers to reduce the size and complexity of their hidden data factories.
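
As a minimal sketch, the heart of such an entry-time check might look like the following; the two-entry lookup table stands in for the complete postal-code reference a real form would use:

    # Illustrative only: a real form would consult a complete
    # postal-code reference, not a hand-built two-entry table.
    ZIP_TO_CITY_STATE = {
        "07760": ("Rumson", "NJ"),
        "90210": ("Beverly Hills", "CA"),
    }

    def address_consistent(city, state, zip_code):
        return ZIP_TO_CITY_STATE.get(zip_code) == (city, state)

    # The entry form refuses to move on until the check passes.
    print(address_consistent("Rumson", "NJ", "90210"))  # False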

Innovate, Innovate, Innovate

Data creators should always be on the lookout for ways to innovate. Of particular interest here are:

  • Adding a new feature (i.e., a new data attribute, way of viewing the data, etc.) to data you already deliver to customers. Frankly, you should approach this the same way you approach improvement—always have at least a project or two in the works.
  • Developing wholly new data. Here is where a deep understanding of customer needs comes into play. For there are always unmet customer needs, for new data, deeper insights, a different look, and so forth, often needs that customers cannot fully articulate. Focus especially on leading-edge customers, those who are thinking further out into the future. Work with them to articulate those unmet needs. Then figure out new ways to meet them.

In Summary

You need high-quality data from others to do your job and it is only right that the data you create and pass on meet the needs of the next person or team. You must view them as customers and treat them as such. Frankly, in most cases this is not nearly as hard, nor as time consuming, as it might seem at first. You’ll make your customers and your company stronger. And if you’re first in your company to do so, you’ll set a real standard for others.

Following the instructions here enables you to create results as depicted in Figure 4.8. Each milestone bears a letter corresponding to a phase in the overall program: T for gaining traction, RR for achieving real results, and NL for getting to the next level. Consider that traction has been gained with the completion of the first improvement project, and the first real result is achieved when the initial target is met. Notice, in particular, the promised order-of-magnitude improvement. So get on with it!

Figure 4.8 “Milestones” in the advance of a data quality program.

Table 4.3 Indicators of success.

You've | When
Gained Traction | You've successfully understood the needs of one important customer, made a measurement against those needs, and made one real improvement.
Achieved Real Results | You've completed several improvement projects such that you've made a real difference for at least one customer.
Made It to the Next Level | You're actively managing data creation such that all your most important customers are enjoying increasingly higher-quality data.
