Data Quality Metrics

In Chapter 6, we briefly described the purpose of data quality metrics as part of a self-feeding process for continuous improvement. We also discussed establishing a data quality baseline to better understand the current state of the data, its alignment with the business, and its fitness for use.

This chapter expands that concept by defining the means for creating a scalable and sustainable process in which data quality metrics become the central point of data quality assessment and, consequently, a critical source for proactive data quality initiatives.

Data quality metrics fall into two main categories: (1) monitors and (2) scorecards or dashboards. Monitors are used to detect violations that usually require immediate corrective action. Scorecards or dashboards allow numbers to be associated with the quality of the data and are snapshot-in-time reports rather than real-time triggers. Notice that the results of monitor reports can be included in the overall calculation of scorecards and dashboards as well.

Data quality metrics need to be aligned with business key performance indicators (KPIs) throughout the company. Each line of business (LOB) will have a list of KPIs for its particular needs, which the data quality forum needs to collect and properly implement as a set of monitors and/or scorecards.

Associating KPIs to metrics is critical for two reasons:

1. As discussed earlier, all data quality activities need to serve a business purpose, and data quality metrics are no different.

2. KPIs are directly related to ROI. Metrics provide the underlying mechanism for associating numbers with KPIs and consequently with ROI. They become a powerful instrument for assessing the improvement achieved through a comprehensive, ongoing data quality effort, which is key to an overall MDM program.

The actual techniques for measuring the quality of the data for both monitors and scorecards are virtually the same. The difference is primarily related to the time necessary for the business to react. If a critical KPI is associated with a given metric, a monitor should be in place to quickly alert the business about any out-of-spec measurements.
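To make the distinction concrete, the following sketch in Python (all names are hypothetical, illustrating an assumed interface rather than any prescribed implementation) shows how a single measurement routine can feed both categories: the monitor path raises an immediate alert when the value is out of spec, while the scorecard path records the snapshot for later aggregation.

    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class Measurement:
        metric_id: str
        value: float              # e.g., percentage of violations
        taken_at: datetime

    def measure_duplicate_pct(customer_ids: list) -> float:
        # One measurement technique serving both monitors and scorecards:
        # the percentage of records whose key repeats an earlier record.
        seen, dupes = set(), 0
        for key in customer_ids:
            if key in seen:
                dupes += 1
            seen.add(key)
        return 100.0 * dupes / len(customer_ids) if customer_ids else 0.0

    def run_monitor(value: float, critical: float) -> None:
        # Monitor path: the business must react quickly, so alert at once.
        if value > critical:
            print(f"ALERT: {value:.1f}% exceeds the {critical:.1f}% threshold")

    def record_snapshot(history: list, metric_id: str, value: float) -> None:
        # Scorecard path: a snapshot in time, aggregated into reports later.
        history.append(Measurement(metric_id, value, datetime.now()))

    history = []
    pct = measure_duplicate_pct(["C1", "C1", "C2", "C3"])   # 25.0
    run_monitor(pct, critical=20.0)          # triggers an immediate alert
    record_snapshot(history, "DQ001", pct)   # stored for a later scorecard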

Data quality level agreements (DQLAs) are an effective method for capturing business requirements and establishing proper expectations related to the needed metrics. Well-documented requirements and well-communicated expectations can prevent undesirable situations and strained relationships between the data quality team and the business and/or IT, which can be devastating to an overall company-wide data quality program.

The next two sections describe typical DQLA and report components for monitors and scorecards.

Monitors

Bad data exists in the system and is constantly being introduced by seemingly innocuous business operations that are theoretically following proper processes. Furthermore, system bugs and limitations can contribute to data quality degradation as well.

But not all data quality issues are created equal. Some will impact the business more than others. Certain issues have very direct business implications and need to be avoided at all costs. Monitors should be established against these sensitive attributes to alert the business when violations occur so proper action can be taken.

A typical DQLA between the business and the data quality team will include the following information regarding each monitor to be implemented (a sketch of such a specification as a data structure follows the list):

  • ID. Data quality monitor identification.
  • Title. A unique title for the monitor.
  • Description. A detailed description that expresses what needs to be measured.
  • KPI. Key performance indicator associated with what is measured.
  • Data quality dimension. Helps organize and qualify the report into dimensions, such as completeness, accuracy, consistency, uniqueness, validity, timeliness, and so on.
  • LOB(s) impacted. List of business area(s) impacted by violations being monitored.
  • Measurement unit. Specifies the expected unit of measurement, such as number of occurrences or percentage.
  • Target value. Quality level expected.
  • Threshold. Specifications for lowest quality acceptable, potentially separated into ranges such as acceptable (green), warning (yellow), or critical (red).
  • Measurement frequency. How often the monitor runs (e.g., daily or weekly).
  • Point of contact. Primary person or group responsible for receiving the monitor report and taking any appropriate actions based on the results.
  • Root cause of the problem. When a monitor is requested for an out-of-spec condition, it is important to understand what is causing the incident to occur.
  • Has the root cause been addressed? Prevention is always the best solution for data quality problems. If a data issue can be avoided at reasonable costs, it should be pursued.
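
As referenced above, here is a minimal sketch of how such a DQLA specification might be captured as a data structure, with a helper that classifies a measured value against the green/yellow/red threshold ranges. Every field and function name is an assumption made for illustration, not a standard layout.

    from dataclasses import dataclass

    @dataclass
    class MonitorDQLA:
        # Fields mirror the DQLA list above; types are illustrative.
        id: str
        title: str
        description: str
        kpi: str
        dimension: str               # e.g., "Uniqueness", "Completeness"
        impacted_lobs: list
        measurement_unit: str        # e.g., "percentage"
        target_value: float
        thresholds: dict             # status -> upper bound for that status
        frequency: str               # e.g., "weekly"
        contact: str
        root_cause: str
        root_cause_addressed: str    # "Yes" / "No" / "Mitigation" / "N/A"

    def classify(value: float, thresholds: dict) -> str:
        # Map a measured value onto the DQLA threshold ranges.
        if value <= thresholds["green"]:
            return "green"
        if value <= thresholds["yellow"]:
            return "yellow"
        return "red"

    print(classify(12.5, {"green": 10.0, "yellow": 20.0}))   # yellow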

Table 8.1 describes a potential scenario where a monitor is applicable. Notice the explanation of the root cause of the problem and the measures being taken to minimize the issue. Sometimes it is possible to address the root cause of the problem and, over time, eliminate the need for a monitor altogether. In these cases, monitors should be retired when no longer needed.

Table 8.1 Sample Monitor DQLA.

ID: DQ001
Title: Number of duplicate accounts per customer
Description: Business rule requires a single account to exist for a given customer. When duplicate accounts exist, users receive an error when trying to create or update a service contract transaction associated with one of the duplicated accounts. The probability of users running into duplicate accounts is linearly proportional to the percentage of duplicates: a 1% increase in duplicates translates into a 1% increase in the probability of running into an account error. Each account error delays the completion of the transaction by 4 hours, which increases the cost by 200% per transaction. Keeping the number of duplicates at 5% helps lower the overall cost by 2%.
KPI: Lower the overall cost of completing service contract bookings by 5% this quarter.
Dimension: Uniqueness
Impacted LOB(s): Services
Unit of measurement: Percentage of duplicates
Target value: 5%
Threshold: ≤10% is green; between 10% and 20% is yellow; >20% is red
Frequency: Weekly
Contact: [email protected]
Root cause: Duplicate accounts are a result of incorrect business practices, which are being addressed through proper training, communication, and appropriate business process updates.
Fix in progress? ___ Yes ___ No _X_ Mitigation ___ N/A
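
As a concrete illustration of the DQ001 monitor, the sketch below computes the percentage of duplicate accounts and classifies the result against the thresholds from Table 8.1. The record layout and the customer_id column name are assumptions.

    from collections import Counter

    def duplicate_account_pct(accounts: list) -> float:
        # Percentage of accounts that are extra copies for a customer who,
        # per the business rule, should own exactly one account.
        counts = Counter(a["customer_id"] for a in accounts)
        extras = sum(n - 1 for n in counts.values() if n > 1)
        return 100.0 * extras / len(accounts) if accounts else 0.0

    # Thresholds from Table 8.1: <=10% green, 10-20% yellow, >20% red.
    value = duplicate_account_pct([
        {"customer_id": "C1"}, {"customer_id": "C1"},
        {"customer_id": "C2"}, {"customer_id": "C3"},
    ])
    status = "green" if value <= 10 else "yellow" if value <= 20 else "red"
    print(f"{value:.1f}% duplicates -> {status}")   # 25.0% duplicates -> red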

Monitor results are best presented graphically. The graph type should be chosen according to the metric being measured, but it is almost always worthwhile to include a trend analysis to signal whether the violation is getting better or worse over time.
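
One simple way to produce such a trend signal, sketched below under the assumption of evenly spaced weekly measurements, is to fit a least-squares line to the recent values: a negative slope means the violation rate is improving. The sample numbers are illustrative only.

    def trend_slope(weekly_values: list) -> float:
        # Least-squares slope of the measurements over time, in measurement
        # units per week; assumes at least two data points.
        n = len(weekly_values)
        xs = list(range(n))
        mean_x = sum(xs) / n
        mean_y = sum(weekly_values) / n
        cov = sum((x - mean_x) * (y - mean_y)
                  for x, y in zip(xs, weekly_values))
        var = sum((x - mean_x) ** 2 for x in xs)
        return cov / var

    # Four weekly duplicate percentages (illustrative values only).
    slope = trend_slope([22.0, 18.5, 16.0, 12.5])
    print("improving" if slope < 0 else "worsening")   # improving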

Scorecards

Scorecards are typically useful for measuring the aggregate quality of a given data set and classifying it along data quality dimensions.

Recall the data quality baseline in Chapter 6 and the sample shown in Table 6.1. In essence, the numbers for a scorecard can be obtained from regularly executed baseline assessments. The individual scores can be organized in whatever ways the business needs and presented in a dashboard format.

Table 8.2 shows a subset of Table 6.1, but it also adds a threshold, which will be discussed shortly. The objective is to obtain a score for a particular combination of context, entity(ies), attribute(s), and data quality dimension. Once the score is available, the scorecard report or dashboard can be organized in many different ways (a sketch of such roll-ups follows the list), such as:

  • The aggregate score for a given context and entity in a certain dimension, such as accuracy for address in the U.S. is 74 percent.
  • The aggregate score for a given entity in a certain dimension, such as completeness for all customer attributes is 62 percent.
  • An overall score for a given data quality dimension, such as consistency is 64 percent.
  • An overall score for all data quality dimensions, which represents the overall score of the entire data set being measured. This associates a single number to the quality of all data measured, which becomes a great thermometer regarding the data quality efforts within the company.
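
The sketch below shows how such roll-ups can be computed from foundation scores keyed by context, entity, attribute, and dimension. The key layout and the sample numbers (which simply echo the examples in the list above) are assumptions, not the actual contents of Table 8.2.

    from statistics import mean

    # Foundation scores keyed by (context, entity, attribute, dimension).
    scores = {
        ("US",   "customer", "address", "accuracy"):     74.0,
        ("US",   "customer", "name",    "completeness"): 62.0,
        ("EMEA", "customer", "address", "consistency"):  64.0,
    }

    FIELDS = ("context", "entity", "attribute", "dimension")

    def rollup(scores: dict, **filters) -> float:
        # Average every score whose key matches all the given filters;
        # with no filters, the result is the single overall score.
        selected = [
            score for key, score in scores.items()
            if all(key[FIELDS.index(f)] == want
                   for f, want in filters.items())
        ]
        return mean(selected)

    print(rollup(scores, dimension="accuracy"))             # 74.0
    print(rollup(scores, context="US", entity="customer"))  # 68.0
    print(rollup(scores))                                   # overall, ~66.7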

Table 8.2 Foundation Scores for the Data Quality Scorecard.

[Table 8.2 is an image in the original: foundation scores organized by context, entity, attribute, and data quality dimension, with a score and a threshold for each row.]

The threshold should be set according to business needs. Data quality issues represented by scores in the red or the yellow categories should be the targets of specific data quality projects. Furthermore, the scorecard itself will become an indicator of the improvements achieved.

The scorecard becomes a powerful tool for the following reasons:

  • It assigns a number to the quality of the data, which is critical to determining if the data is getting better or suffering degradation.
  • It can be used to assess the impact a newly migrated source has on the overall quality of the existing data.
  • It clearly identifies areas that need improvement.

Notice that the scorecard alone may not be sufficient to determine the root cause of a problem or to plan a data quality project in detail. The scorecard will highlight the areas that need improvement, as well as measure enhancement and deterioration, but it might still be necessary to profile the data and perform root cause analysis to determine the best way to solve the problem.

The DQLA for scorecards between the business and the data quality team can follow a format similar to Table 8.2.
