APPENDIX A: Special Instructions for Decision Makers and Data Scientists


This appendix is aimed at senior decision makers and data scientists whose needs for quality data are extremely stringent. A decision or analysis sent in the wrong direction due to bad data is far more costly than the operational expenses contemplated by the rule of ten. At the same time, you may have to deal with wholly new data that, when combined with existing data, could offer potentially game-changing insights. But there isn’t a clear indication whether this new information can be trusted. How should you proceed?

There is, of course, no simple answer. While many are skeptical of new data (perhaps having been burned) and others embrace it wholeheartedly (perhaps in a headlong rush to establish their credentials as data-driven), I recommend a nuanced approach. For it is highly likely that some data (maybe even most of it) is bad and can’t be used, and some is good and should be trusted implicitly. Finally some data is flawed but usable with caution. This data is intriguing as many game-changing insights reside there. So how should you separate the good data from bad?

First, evaluate the data’s origins. You can trust data when it is created in accordance with a first-rate data quality program (refer to Chapter 4). Such programs feature clear accountabilities, input controls, and efforts to find and eliminate the root causes of error. You won’t have to guess whether the data are good—data quality statistics will tell you. You’ll find a human being who’ll be happy to explain what you may expect and answer your questions. If the data quality stats look good and the conversation goes well, trust this data. Please note that this is the “gold standard” against which the other indicators described below should be calibrated.

Second, make your own assessment. Much, perhaps most, data will not meet the gold standard, so adopt a more cautious attitude. Make sure you know where the data was created and how it is defined, not just how your data scientist accessed it. It is easy to be misled by a casual, “We took them from our cloud-based data warehouse, which employs the latest technology,” and completely miss the fact that the data was created in a dubious public forum. Figure out which organization created the data. Then dig deeper: What do colleagues advise about this organization and data? Does it have a good or poor reputation for quality? What do others say on social media? Do some research both inside and outside your organization.

At the same time, develop your own data quality statistics, possibly using the Friday Afternoon Measurement described in Chapter 2. If you see only a little red, say less than 5 percent of records with an obvious error, you can use this data with caution. Look, too, at the patterns of the errors. If, for example, there are twenty-five total errors and twenty-four of them occur in one data attribute, eliminate that attribute going forward; if the rest of the data looks good, use it with caution.
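A simple error-rate check of this kind is easy to sketch. The following is illustrative only, loosely in the spirit of the Friday Afternoon Measurement: the records, validation rules, and attribute names are hypothetical, and a real assessment would use your own data and many more rules.

```python
# Count records with at least one obvious error, and tally errors per
# attribute so error patterns (e.g., one attribute causing most errors)
# become visible. All data and rules here are made up for illustration.

records = [
    {"name": "Courtney Smith", "units": 12, "price": 9.50},
    {"name": "", "units": 3, "price": 4.25},           # missing name
    {"name": "Pat Jones", "units": -1, "price": 2.00},  # impossible quantity
    {"name": "Lee Wong", "units": 7, "price": 8.75},
]

# One validation rule per attribute: True means the value looks plausible.
rules = {
    "name": lambda v: bool(v.strip()),
    "units": lambda v: v >= 0,
    "price": lambda v: v > 0,
}

errors_by_attribute = {attr: 0 for attr in rules}
flawed_records = 0
for rec in records:
    flawed = False
    for attr, ok in rules.items():
        if not ok(rec[attr]):
            errors_by_attribute[attr] += 1
            flawed = True
    if flawed:
        flawed_records += 1

error_rate = flawed_records / len(records)
print(f"{error_rate:.0%} of records have an obvious error")  # prints "50% ..."
print(errors_by_attribute)
```

Here half the records are flawed, far above the 5 percent comfort zone, and the per-attribute tallies show where the errors concentrate.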

Third, clean the data. In this context, I think of data cleaning at three levels: rinse, wash, and scrub. A rinse replaces obvious errors with “missing value” or corrects them when doing so is very easy; a scrub involves deep study, even making corrections one at a time, by hand, if necessary; and a wash occupies the middle ground. Even if time is short, scrub a small random sample (say 1,000 records), making it as pristine as you possibly can. Your goal is to arrive at a sample of data you know you can trust. Employ every possible means of scrubbing and be ruthless! Eliminate erroneous records and data elements that you cannot correct, and mark data as “uncertain” when applicable.
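To make the rinse level concrete, here is a minimal sketch on toy records. The fields, plausibility bounds, and corrections are assumptions for illustration, not a prescription: a rinse simply applies trivial fixes and marks everything else missing.

```python
# A minimal "rinse" pass: apply corrections that are very easy (here,
# normalizing state codes) and replace obvious, unfixable errors with a
# missing-value marker (None). Records and bounds are hypothetical.

records = [
    {"age": 34, "state": "ny"},
    {"age": -5, "state": "NY"},      # impossible age
    {"age": 212, "state": "  ca "},  # impossible age, sloppy whitespace
]

def rinse(rec):
    cleaned = dict(rec)
    # Easy correction: strip whitespace and standardize case.
    cleaned["state"] = cleaned["state"].strip().upper()
    # Obvious error with no easy fix: mark as missing.
    if not (0 <= cleaned["age"] <= 120):
        cleaned["age"] = None
    return cleaned

rinsed = [rinse(r) for r in records]
```

The scrub and wash levels build on the same idea, trading more effort (hand inspection, statistical correction) for deeper fixes.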

When you are done, take a hard look. When the scrubbing has gone really well (and you’ll know when it has), you’ve created a dataset that rates high on the trustworthiness scale. Use these data going forward.

Sometimes the scrubbing is less satisfying. If you’ve done the best you can, but still feel uncertain, put these data in the “use with caution” category. If the scrubbing goes poorly – for example, too much data just looks wrong and you can’t make corrections – you must rate this data, and all like it, as untrustworthy. The sample strongly suggests none of this data should be used to inform your decision.

After the initial scrub, move on to the second cleaning exercise: washing the remaining data that was not in the scrubbing sample. This step should be performed by a truly competent data scientist. Since scrubbing can be a time-consuming, manual process, the wash allows you to make corrections using more automatic processes. For example, one wash technique involves statistical imputation to replace missing values. Or data scientists may discover correction algorithms during the scrub that can then be applied automatically during the wash. Put the data where washing goes well in the “use with caution” category.
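As one illustrative wash technique, here is mean imputation in a few lines. This is a sketch under simplifying assumptions (the values are made up, and mean imputation is only one of many imputation methods, each with trade-offs a data scientist should weigh).

```python
# "Wash" the data automatically: fill values marked missing during the
# rinse with the mean of the observed values. Illustrative data only.

from statistics import mean

ages = [34, None, 29, None, 41]  # None marks values flagged earlier

observed = [a for a in ages if a is not None]
fill = mean(observed)  # (34 + 29 + 41) / 3

washed = [a if a is not None else fill for a in ages]
```

Note that imputation manufactures plausible values, not true ones, which is exactly why washed data belongs in the “use with caution” category.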

Figure A.1 summarizes. Once you’ve identified a set of data that you can trust or use with caution, move on to the next step of integration. Finally, ensure high-quality data integration. Align the data you can trust – or the data that you’re moving forward with cautiously – with your existing data. This is difficult, technical work, again to be performed only by a qualified data scientist. Three tasks must be completed:

  • Identification: Verify that the Courtney Smith in one dataset is the same Courtney Smith in others.
  • Alignment of units of measure and data definitions: Make sure Courtney’s purchases and prices paid, expressed in “pallets” and “dollars” in one set, are aligned with “units” and “euros” in another.
  • De-duplication: Check that the Courtney Smith record does not appear multiple times in different ways (say as C. Smith or Courtney E. Smith).
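The three tasks can be sketched on toy records as follows. Everything here is a stated assumption: the matching rule (first initial plus last name) is deliberately crude, and the pallet size and exchange rate are invented for the example; real identification and alignment require far more care.

```python
# Sketch of identification, alignment, and de-duplication on toy data.
# Conversion constants and the matching rule are hypothetical.

PALLET_UNITS = 48    # assumed units per pallet
USD_PER_EUR = 1.10   # assumed exchange rate

def match_key(name):
    # Identification: a crude key of first initial + last name, so that
    # "Courtney Smith", "C. Smith", and "Courtney E. Smith" all match.
    parts = name.lower().replace(".", "").split()
    return (parts[0][0], parts[-1])

dataset_a = [{"name": "Courtney Smith", "qty_pallets": 2, "usd": 960.00}]
dataset_b = [
    {"name": "C. Smith", "qty_units": 96, "eur": 872.73},
    {"name": "Courtney E. Smith", "qty_units": 96, "eur": 872.73},
]

# Alignment: express dataset_a in the units and currency of dataset_b.
aligned = [{"key": match_key(r["name"]),
            "qty_units": r["qty_pallets"] * PALLET_UNITS,
            "eur": round(r["usd"] / USD_PER_EUR, 2)} for r in dataset_a]
aligned += [{"key": match_key(r["name"]),
             "qty_units": r["qty_units"], "eur": r["eur"]} for r in dataset_b]

# De-duplication: collapse records sharing the same identification key.
deduped = {}
for rec in aligned:
    deduped.setdefault(rec["key"], rec)

print(len(deduped))  # the three Courtney Smith records collapse to one
```

Even this toy version shows why integration demands a qualified data scientist: a matching rule this loose would merge distinct people in any real dataset.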

Figure A.1 Steps a data scientist or senior decision maker should take to evaluate the trustworthiness of data.

At this point, you’re ready to perform whatever analyses you need to guide your decision. Pay particular attention when you get different results based on “use with caution” and “trusted” data. Great insights and, unfortunately, great traps lie here. When a result looks intriguing, isolate the data and repeat the steps above, making more detailed measurements, scrubbing the data, and improving wash routines. As you do so, develop a feel for how deeply you should trust this data.

Taking the steps above will gain you a lot. Understanding where you can trust the data allows you to push the data to its limits. Data doesn’t have to be perfect to yield new insights, but you must exercise caution by understanding where the flaws lie, working around errors, cleaning them up, and backing off when the data simply isn’t good enough.
