12 Data Quality Management

Data Quality Management (DQM) is the tenth Data Management Function in the data management framework shown in Figures 1.3 and 1.4. It is the ninth data management function that interacts with, and is influenced by, the Data Governance function. Chapter 12 defines the data quality management function and explains the concepts and activities involved in DQM.

12.1 Introduction

Data Quality Management (DQM) is a critical support process in organizational change management. Changing business focus, corporate business integration strategies, and mergers, acquisitions, and partnering can mandate that the IT function blend data sources, create gold data copies, retrospectively populate data, or integrate data. The goals of interoperability with legacy or B2B systems need the support of a DQM program.

Data quality is synonymous with information quality, since poor data quality results in inaccurate information and poor business performance. Data cleansing may result in short-term and costly improvements that do not address the root causes of data defects. A more rigorous data quality program is necessary to provide an economic solution to improved data quality and integrity.

In a program approach, these issues involve more than just correcting data. Instead, they involve managing the lifecycle for data creation, transformation, and transmission to ensure that the resulting information meets the needs of all the data consumers within the organization.

Institutionalizing processes for data quality oversight, management, and improvement hinges on identifying the business needs for quality data and determining the best ways to measure, monitor, control, and report on the quality of data. After identifying issues in the data processing streams, notify the appropriate data stewards to take corrective action that addresses the acute issue, while simultaneously enabling elimination of its root cause.

DQM is also a continuous process for defining the parameters for specifying acceptable levels of data quality to meet business needs, and for ensuring that data quality meets these levels. DQM involves analyzing the quality of data, identifying data anomalies, and defining business requirements and corresponding business rules for asserting the required data quality. DQM involves instituting inspection and control processes to monitor conformance with defined data quality rules, as well as instituting data parsing, standardization, cleansing, and consolidation, when necessary. Lastly, DQM incorporates issues tracking as a way of monitoring compliance with defined data quality Service Level Agreements.

The context for data quality management is shown in Figure 12.1.

Figure 12.1 Data Quality Management Context Diagram

12.2 Concepts and Activities

Data quality expectations provide the inputs necessary to define the data quality framework. The framework includes defining the requirements, inspection policies, measures, and monitors that reflect changes in data quality and performance. These requirements reflect three aspects of business data expectations: a manner to record the expectation in business rules, a way to measure the quality of data within that dimension, and an acceptability threshold.

12.2.1 Data Quality Management Approach

The general approach to DQM, shown in Figure 12.2, is a version of the Deming cycle. Deming, one of the seminal writers in quality management, proposes a problem-solving model known as ‘plan-do-study-act’ or ‘plan-do-check-act’ that is useful for data quality management. When applied to data quality within the constraints of defined data quality SLAs, it involves:

  • Planning for the assessment of the current state and identification of key metrics for measuring data quality.
  • Deploying processes for measuring and improving the quality of data.
  • Monitoring and measuring the levels in relation to the defined business expectations.
  • Acting to resolve any identified issues to improve data quality and better meet business expectations.


Figure 12.2 The Data Quality Management Cycle.

The DQM cycle begins by identifying the data issues that are critical to the achievement of business objectives, defining business requirements for data quality, identifying key data quality dimensions, and defining the business rules critical to ensuring high quality data.

In the plan stage, the data quality team assesses the scope of known issues, which involve determining the cost and impact of the issues and evaluating alternatives for addressing them.

In the deploy stage, profile the data and institute inspections and monitors to identify data issues when they occur. During this stage, the data quality team can arrange to fix flawed processes that are the root cause of data errors or, as a last resort, to correct errors downstream. When it is not possible to correct errors at their source, correct them at the earliest point in the data flow.

The monitor stage is for actively monitoring the quality of data as measured against the defined business rules. As long as data quality meets defined thresholds for acceptability, the processes are in control and the level of data quality meets the business requirements. However, if the data quality falls below acceptability thresholds, notify data stewards so they can take action during the next stage.

The act stage is for taking action to address and resolve emerging data quality issues.

New cycles begin as new data sets come under investigation, or as new data quality requirements are identified for existing data sets.

12.2.2 Develop and Promote Data Quality Awareness

Promoting data quality awareness means more than ensuring that the right people in the organization are aware of the existence of data quality issues. Promoting data quality awareness is essential to ensure buy-in of necessary stakeholders in the organization, thereby greatly increasing the chance of success of any DQM program.

Awareness includes relating material impacts to data issues, ensuring systematic approaches to regulatory compliance and oversight of the quality of organizational data, and socializing the concept that data quality problems cannot be solely addressed by technology solutions. As an initial step, some level of training on the core concepts of data quality may be necessary.

The next step includes establishing a data governance framework for data quality. Data governance is a collection of processes and procedures for assigning responsibility and accountability for all facets of data management, covered in detail in Chapter 3. DQM data governance tasks include:

  • Engaging business partners who will work with the data quality team and champion the DQM program.
  • Identifying data ownership roles and responsibilities, including data governance board members and data stewards.
  • Assigning accountability and responsibility for critical data elements and DQM.
  • Identifying key data quality areas to address and directives to the organization around these key areas.
  • Synchronizing data elements used across the lines of business and providing clear, unambiguous definitions, use of value domains, and data quality rules.
  • Continuously reporting on the measured levels of data quality.
  • Introducing the concepts of data requirements analysis as part of the overall system development life cycle.
  • Tying high quality data to individual performance objectives.

Ultimately, a Data Quality Oversight Board can be created that has a reporting hierarchy associated with the different data governance roles. Data stewards who align with business clients, lines of business, and even specific applications, will continue to promote awareness of data quality while monitoring their assigned data assets. The Data Quality Oversight Board is accountable for the policies and procedures for oversight of the data quality community. The guidance provided includes:

  • Setting priorities for data quality.
  • Developing and maintaining standards for data quality.
  • Reporting relevant measurements of enterprise-wide data quality.
  • Providing guidance that facilitates staff involvement.
  • Establishing communications mechanisms for knowledge sharing.
  • Developing and applying certification and compliance policies.
  • Monitoring and reporting on performance.
  • Identifying opportunities for improvements and building consensus for approval.
  • Resolving variations and conflicts.

The constituent participants work together to define and popularize a data quality strategy and framework; develop, formalize, and approve information policies, data quality standards and protocols; and certify line-of-business conformance to the desired level of business user expectations.

12.2.3 Define Data Quality Requirements

Quality of the data must be understood within the context of ‘fitness for use’. Most applications are dependent on the use of data that meets specific needs associated with the successful completion of a business process. Those business processes implement business policies imposed both through external means, such as regulatory compliance, observance of industry standards, or complying with data exchange formats, and through internal means, such as internal rules guiding marketing, sales, commissions, logistics, and so on. Data quality requirements are often hidden within defined business policies. Incremental detailed review and iterative refinement of the business policies helps to identify those information requirements which, in turn, become data quality rules.

Measuring conformance to ‘fitness for use’ requirements enables the reporting of meaningful metrics associated with well-defined data quality dimensions. The incremental detailed review steps include:

  1. Identifying key data components associated with business policies.
  2. Determining how identified data assertions affect the business.
  3. Evaluating how data errors are categorized within a set of data quality dimensions.
  4. Specifying the business rules that measure the occurrence of data errors.
  5. Providing a means for implementing measurement processes that assess conformance to those business rules.

Segment the business rules according to the dimensions of data quality that characterize the measurement of high-level indicators. Include details on the level of granularity of the measurement, such as data value, data element, data record, and data table, that are required for proper implementation. Dimensions of data quality include:

  • Accuracy: Data accuracy refers to the degree that data correctly represents the “real-life” entities they model. In many cases, measure accuracy by how the values agree with an identified reference source of correct information, such as comparing values against a database of record or a similar corroborative set of data values from another table, checking against dynamically computed values, or perhaps applying a manual process to check value accuracy.
  • Completeness: One expectation of completeness indicates that certain attributes always have assigned values in a data set. Another expectation of completeness is that all appropriate rows in a dataset are present. Assign completeness rules to a data set in varying levels of constraint–mandatory attributes that require a value, data elements with conditionally optional values, and inapplicable attribute values. See completeness as also encompassing usability and appropriateness of data values.
  • Consistency: Consistency refers to ensuring that data values in one data set are consistent with values in another data set. The concept of consistency is relatively broad; it can include an expectation that two data values drawn from separate data sets must not conflict with each other, or define consistency with a set of predefined constraints. Encapsulate more formal consistency constraints as a set of rules that specify consistency relationships between values of attributes, either across a record or message, or along all values of a single attribute. However, care must be taken not to confuse consistency with accuracy or correctness. Consistency may be defined between one set of attribute values and another attribute set within the same record (record-level consistency), between one set of attribute values and another attribute set in different records (cross-record consistency), or between one set of attribute values and the same attribute set within the same record at different points in time (temporal consistency).
  • Currency: Data currency refers to the degree to which information is current with the world that it models. Data currency measures how “fresh” the data is, as well as correctness in the face of possible time-related changes. Measure data currency as a function of the expected frequency rate at which different data elements refresh, as well as verify that the data is up to date. Data currency rules define the “lifetime” of a data value before it expires or needs updating.
  • Precision: Precision refers to the level of detail of the data element. Numeric data may need accuracy to several significant digits. For example, rounding and truncating may introduce errors where exact precision is necessary.
  • Privacy: Privacy refers to the need for access control and usage monitoring. Some data elements require limits of usage or access.
  • Reasonableness: Use reasonableness to consider consistency expectations relevant within specific operational contexts. For example, one might expect that the number of transactions each day does not exceed 105% of the running average number of transactions for the previous 30 days.
  • Referential Integrity: Referential integrity is the condition that exists when all intended references from data in one column of a table to data in another column of the same or a different table are valid. Referential integrity expectations include specifying that when a unique identifier appears as a foreign key, the record to which that key refers actually exists. Referential integrity rules also manifest as constraints against duplication, to ensure that each entity occurs once, and only once.
  • Timeliness: Timeliness refers to the time expectation for accessibility and availability of information. As an example, measure one aspect of timeliness as the time between when information is expected and when it is readily available for use.
  • Uniqueness: Essentially, uniqueness states that no entity exists more than once within the data set. Asserting uniqueness of the entities within a data set implies that no entity exists more than once within the data set and that a key value relates to each unique entity, and only that specific entity, within the data set. Many organizations prefer a level of controlled redundancy in their data as a more achievable target.
  • Validity: Validity refers to whether data instances are stored, exchanged, or presented in a format that is consistent with the domain of values, as well as consistent with other similar attribute values. Validity ensures that data values conform to numerous attributes associated with the data element: its data type, precision, format patterns, use of a predefined enumeration of values, domain ranges, underlying storage formats, and so on. Validating to determine possible values is not the same as verifying to determine accurate values.
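To make these dimensions more concrete, the following is a minimal sketch in Python, with invented field names, value domains, and thresholds, showing how a few of them (completeness, validity against a value domain, and currency) might be expressed as simple record-level checks. It illustrates the idea rather than any particular tool.

```python
from datetime import datetime, timedelta

# Hypothetical customer record; field names and values are invented for illustration.
record = {
    "customer_id": "C-1001",
    "state": "AL",
    "email": None,
    "last_updated": datetime(2024, 1, 15),
}

VALID_STATES = {"AL", "AK", "AZ"}             # abbreviated value domain (assumption)
MANDATORY_FIELDS = ["customer_id", "state"]   # completeness: must have assigned values
MAX_AGE = timedelta(days=90)                  # currency: value "expires" after 90 days (assumption)

def check_completeness(rec):
    """Completeness: mandatory attributes must be populated."""
    return [f for f in MANDATORY_FIELDS if rec.get(f) in (None, "")]

def check_validity(rec):
    """Validity: the state code must come from the defined value domain."""
    return rec.get("state") in VALID_STATES

def check_currency(rec, as_of):
    """Currency: the record must have been refreshed within MAX_AGE."""
    return (as_of - rec["last_updated"]) <= MAX_AGE

print("Missing mandatory fields:", check_completeness(record))
print("State code valid:", check_validity(record))
print("Record current:", check_currency(record, datetime(2024, 2, 1)))
```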

12.2.4 Profile, Analyze and Assess Data Quality

Prior to defining data quality metrics, it is crucial to perform an assessment of the data using two different approaches, bottom-up and top-down.

The bottom-up assessment of existing data quality issues involves inspection and evaluation of the data sets themselves. Direct data analysis will reveal potential data anomalies that should be brought to the attention of subject matter experts for validation and analysis. Bottom-up approaches highlight potential issues based on the results of automated processes, such as frequency analysis, duplicate analysis, cross-data set dependency, ‘orphan child’ data rows, and redundancy analysis.

However, potential anomalies, and even true data flaws may not be relevant within the business context unless vetted with the constituency of data consumers. The top-down approach to data quality assessment involves engaging business users to document their business processes and the corresponding critical data dependencies. The top-down approach involves understanding how their processes consume data, and which data elements are critical to the success of the business application. By reviewing the types of reported, documented, and diagnosed data flaws, the data quality analyst can assess the kinds of business impacts that are associated with data issues.

The steps of the analysis process are:

  • Identify a data set for review.
  • Catalog the business uses of that data set.
  • Subject the data set to empirical analysis using data profiling tools and techniques.
  • List all potential anomalies.
  • For each anomaly:
    • Review the anomaly with a subject matter expert to determine if it represents a true data flaw.
    • Evaluate potential business impacts.
  • Prioritize criticality of important anomalies in preparation for defining data quality metrics.

In essence, the process uses statistical analysis of many aspects of data sets to evaluate:

  • The percentage of the records populated.
  • The number of data values populating each data attribute.
  • Frequently occurring values.
  • Potential outliers.
  • Relationships between columns within the same table.
  • Relationships across tables.

Use these statistics to identify any obvious data issues that may have high impact and that are suitable for continuous monitoring as part of ongoing data quality inspection and control. Interestingly, important business intelligence may be uncovered in this analysis step alone. For instance, an event that occurs rarely in the data (an outlier) may point to an important business fact, such as a rare equipment failure that may be linked to an underperforming supplier.
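A minimal sketch of this kind of empirical analysis, assuming a pandas DataFrame and invented column names, might look like the following; a real data profiling tool computes far more, but the statistics listed above map directly onto a few library calls.

```python
import pandas as pd

# Hypothetical data set; column names and values are invented for illustration.
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 4],
    "state":       ["AL", "AL", None, "ZZ", "AL"],
    "order_total": [120.0, 95.0, 101.0, 5200.0, 98.0],
})

# Percentage of records populated per column (completeness).
print((df.notna().mean() * 100).round(1))

# Number of distinct values populating each attribute.
print(df.nunique())

# Frequently occurring values for a column of interest.
print(df["state"].value_counts(dropna=False))

# Simple outlier flag for a numeric column using the interquartile range (IQR) rule.
totals = df["order_total"]
q1, q3 = totals.quantile(0.25), totals.quantile(0.75)
iqr = q3 - q1
outliers = totals[(totals < q1 - 1.5 * iqr) | (totals > q3 + 1.5 * iqr)]
print("Potential outliers:", outliers.tolist())
```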

12.2.5 Define Data Quality Metrics

Unlike many functions, where the metrics development step occurs at the end of the lifecycle in order to maintain performance over time, for DQM it occurs as part of the strategy / design / plan step in order to implement the function in an organization.

Poor data quality affects the achievement of business objectives. The data quality analyst must seek out and use indicators of data quality performance to report the relationship between flawed data and missed business objectives. Seeking these indicators introduces the challenge of devising an approach for identifying and managing “business-relevant” information quality metrics. Approach the measurement of data quality in the same way as monitoring any other type of business performance activity: data quality metrics should exhibit the characteristics of reasonable metrics, defined in the context of the data quality dimensions discussed in the previous section. These characteristics include, but are not limited to:

  • Measurability: A data quality metric must be measurable, and should be quantifiable within a discrete range. Note that while many things are measurable, not all translate into useful metrics, implying the need for business relevance.
  • Business Relevance: The value of the metric is limited if it cannot be related to some aspect of business operations or performance. Therefore, every data quality metric should demonstrate how meeting its acceptability threshold correlates with business expectations.
  • Acceptability: The data quality dimensions frame the business requirements for data quality, and quantifying quality measurements along the identified dimension provides hard evidence of data quality levels. Base the determination of whether the quality of data meets business expectations on specified acceptability thresholds. If the score is equal to or exceeds the acceptability threshold, the quality of the data meets business expectations. If the score is below the acceptability threshold, notify the appropriate data steward and take some action.
  • Accountability / Stewardship: Each metric is associated with defined roles indicating who must be notified when the measurement for the metric shows that the quality does not meet expectations. The business process owner is essentially the one who is accountable, while a data steward may be tasked with taking appropriate corrective action.
  • Controllability: Any measurable characteristic of information that is suitable as a metric should reflect some controllable aspect of the business. In other words, the assessment of the data quality metric’s value within an undesirable range should trigger some action to improve the data being measured.
  • Trackability: Quantifiable metrics enable an organization to measure data quality improvement over time. Tracking helps data stewards monitor activities within the scope of data quality SLAs, and demonstrates the effectiveness of improvement activities. Once an information process is stable, tracking enables instituting statistical control processes to ensure predictability with respect to continuous data quality.

The process for defining data quality metrics is summarized as:

  1. Select one of the identified critical business impacts.
  2. Evaluate the dependent data elements, and data create and update processes associated with that business impact.
  3. For each data element, list any associated data requirements.
  4. For each data expectation, specify the associated dimension of data quality and one or more business rules to use to determine conformance of the data to expectations.
  5. For each selected business rule, describe the process for measuring conformance (explained in the next section).
  6. For each business rule, specify an acceptability threshold (explained in the next section).

The result is a set of measurement processes that provide raw data quality scores that can roll up to quantify conformance to data quality expectations. Measurements that do not meet the specified acceptability thresholds indicate nonconformance, showing that some data remediation is necessary.
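As a hedged illustration of steps 4 through 6, the sketch below rolls raw rule results up into a conformance score and compares it against an acceptability threshold. The rule, field names, and the 95% threshold are all assumptions made for the example.

```python
# Hypothetical records; the "state" completeness rule and the 95% threshold
# are assumptions made for this illustration only.
records = [
    {"customer_id": 1, "state": "AL"},
    {"customer_id": 2, "state": None},
    {"customer_id": 3, "state": "GA"},
    {"customer_id": 4, "state": "TX"},
]

ACCEPTABILITY_THRESHOLD = 0.95  # business-defined acceptability threshold

def rule_state_populated(rec):
    """Business rule: the state attribute must be populated (completeness dimension)."""
    return rec.get("state") not in (None, "")

conforming = sum(rule_state_populated(r) for r in records)
score = conforming / len(records)   # raw data quality score for this metric

print(f"Conformance score: {score:.2%}")
if score < ACCEPTABILITY_THRESHOLD:
    print("Below threshold: notify the data steward and open an incident.")
else:
    print("Meets business expectations for this metric.")
```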

12.2.6 Define Data Quality Business Rules

Instituting the measurement of conformance to specific business rules requires that those rules be defined. Monitoring conformance to these business rules requires:

  • Segregating data values, records, and collections of records that do not meet business needs from the valid ones.
  • Generating a notification event alerting a data steward of a potential data quality issue.
  • Establishing an automated or event driven process for aligning or possibly correcting flawed data within business expectations.

The first process uses assertions of expectations of the data. The data sets conform to those assertions or they do not. More complex rules can incorporate those assertions with actions or directives that support the second and third processes, generating a notification when data instances do not conform, or attempting to transform a data value identified as being in error. Use templates to specify these business rules, such as:

  • Value domain membership: Specifying that a data element’s assigned value is selected from among those enumerated in a defined data value domain, such as 2-Character United States Postal Codes for a STATE field.
  • Definitional Conformance: Confirming that data definitions are understood consistently and used properly in processes across the organization. Confirmation includes algorithmic agreement on calculated fields, including any time or local constraints, and rollup rules.
  • Range conformance: A data element’s assigned value must be within a defined numeric, lexicographic, or time range, such as greater than 0 and less than 100 for a numeric range.
  • Format compliance: One or more patterns specify values assigned to a data element, such as the different ways to specify telephone numbers.
  • Mapping conformance: Indicating that the value assigned to a data element must correspond to one selected from a value domain that maps to other equivalent corresponding value domain(s). The STATE data domain again provides a good example, since state values may be represented using different value domains (USPS Postal codes, FIPS 2-digit codes, full names), and these types of rules validate that “AL” and “01” both map to “Alabama.”
  • Value presence and record completeness: Rules defining the conditions under which missing values are unacceptable.
  • Consistency rules: Conditional assertions that refer to maintaining a relationship between two (or more) attributes based on the actual values of those attributes.
  • Accuracy verification: Compare a data value against a corresponding value in a system of record to verify that the values match.
  • Uniqueness verification: Rules that specify which entities must have a unique representation and verify that one and only one record exists for each represented real world object.
  • Timeliness validation: Rules that indicate the characteristics associated with expectations for accessibility and availability of data.

Other types of rules may involve aggregate functions applied to sets of data instances. Examples include validating reasonableness of the number of records in a file, the reasonableness of the average amount in a set of transactions, or the expected variance in the count of transactions over a specified timeframe.

Providing rule templates helps bridge the gap in communicating between the business team and the technical team. Rule templates convey the essence of the business expectation. It is possible to exploit the rule templates when a need exists to transform rules into formats suitable for execution, such as embedded within a rules engine, or the data analyzer component of a data-profiling tool, or code in a data integration tool.
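To make the idea of rule templates concrete, here is a small hypothetical sketch in which several of the template types above (value presence, value domain membership, range conformance, and format compliance) are expressed as declarative entries evaluated by a simple engine. The field names, domains, ranges, and patterns are invented for the example and do not represent any vendor's rule engine.

```python
import re

# Declarative rule definitions following the template types described above.
# All field names, domains, ranges, and patterns are illustrative assumptions.
RULES = [
    {"type": "presence", "field": "customer_id"},
    {"type": "value_domain", "field": "state", "domain": {"AL", "GA", "TX"}},
    {"type": "range", "field": "discount_pct", "min": 0, "max": 100},
    {"type": "format", "field": "phone", "pattern": r"^\d{3}-\d{3}-\d{4}$"},
]

def evaluate(rule, record):
    """Return True if the record conforms to the given rule."""
    value = record.get(rule["field"])
    if rule["type"] == "presence":
        return value not in (None, "")
    if value is None:
        return False
    if rule["type"] == "value_domain":
        return value in rule["domain"]
    if rule["type"] == "range":
        return rule["min"] < value < rule["max"]
    if rule["type"] == "format":
        return re.match(rule["pattern"], str(value)) is not None
    raise ValueError(f"Unknown rule type: {rule['type']}")

record = {"customer_id": "C-7", "state": "ZZ", "discount_pct": 15, "phone": "205-555-0142"}
failures = [r for r in RULES if not evaluate(r, record)]
print("Failed rules:", [(r["type"], r["field"]) for r in failures])
```

Keeping the rules declarative, as in this sketch, is what allows the same business-facing template to be translated into a rules engine, a data-profiling tool, or data integration code without restating the expectation.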

12.2.7 Test and Validate Data Quality Requirements

Data profiling tools analyze data to find potential anomalies, as described in section 12.3.1. Use these same tools for rule validation as well. Rules discovered or defined during the data quality assessment phase are then referenced in measuring conformance as part of the operational processes.

Most data profiling tools allow data analysts to define data rules for validation, assessing frequency distributions and corresponding measurements, and then applying the defined rules against the data sets.

Reviewing the results, and verifying whether data flagged as non-conformant is truly incorrect, provides one level of testing. In addition, it is necessary to review the defined business rules with the business clients to make sure that they understand them, and that the business rules correspond to their business requirements.

Characterizing data quality levels based on data rule conformance provides an objective measure of data quality. By using defined data rules proactively to validate data, an organization can distinguish those records that conform to defined data quality expectations and those that do not. In turn, these data rules are used to baseline the current level of data quality as compared to ongoing audits.

12.2.8 Set and Evaluate Data Quality Service Levels

Data quality inspection and monitoring are used to measure and monitor compliance with defined data quality rules. Data quality SLAs (Service Level Agreements) specify the organization’s expectations for response and remediation. Data quality inspection helps to reduce the number of errors and enables the isolation and root cause analysis of data flaws, with the expectation that the operational procedures will provide a scheme for remediation of the root cause within an agreed-to timeframe.

Having data quality inspection and monitoring in place increases the likelihood of detection and remediation of a data quality issue before a significant business impact can occur.

Operational data quality control, defined in a data quality SLA, includes:

  • The data elements covered by the agreement.
  • The business impacts associated with data flaws.
  • The data quality dimensions associated with each data element.
  • The expectations for quality for each data element for each of the identified dimensions in each application or system in the value chain.
  • The methods for measuring against those expectations.
  • The acceptability threshold for each measurement.
  • The individual(s) to be notified in case the acceptability threshold is not met.
  • The timelines and deadlines for expected resolution or remediation of the issue.
  • The escalation strategy and possible rewards and penalties when the resolution times are met.

The data quality SLA also defines the roles and responsibilities associated with performance of operational data quality procedures. The operational data quality procedures provide reports on the conformance to the defined business rules, as well as monitoring staff performance in reacting to data quality incidents. Data stewards and the operational data quality staff, while upholding the level of data quality service, should take their data quality SLA constraints into consideration and connect data quality to individual performance plans.

When issues are not addressed within the specified resolution times, an escalation process must exist to communicate non-observance of the level of service up the management chain. The data quality SLA establishes the time limits for notification generation, the names of those in that management chain, and when escalation needs to occur. Given the set of data quality rules, methods for measuring conformance, the acceptability thresholds defined by the business clients, and the service level agreements, the data quality team can monitor compliance of the data to the business expectations, as well as how well the data quality team performs on the procedures associated with data errors.
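The listed SLA elements lend themselves to a structured, machine-readable form. The sketch below is one possible shape, with every data element name, threshold, contact, and timeline invented for illustration; an actual SLA would be negotiated with the business clients and data stewards.

```python
# A hypothetical data quality SLA entry, structured along the elements listed above.
# Names, thresholds, contacts, and timelines are illustrative assumptions only.
dq_sla = {
    "data_element": "customer.state",
    "business_impact": "Mis-routed shipments and incorrect tax calculation",
    "dimension": "validity",
    "expectation": "Value is a valid 2-character US state code",
    "measurement_method": "Batch validation against the state value domain",
    "acceptability_threshold": 0.98,
    "notify": ["state.data.steward@example.com"],
    "resolution_deadline_hours": 48,
    "escalation": ["data.quality.manager@example.com", "dq.oversight.board@example.com"],
}

def handle_measurement(sla, score):
    """Compare a measured score against the SLA and report the required action."""
    if score >= sla["acceptability_threshold"]:
        return "In control: no action required."
    return (f"Notify {sla['notify']}; remediation due within "
            f"{sla['resolution_deadline_hours']} hours, then escalate to {sla['escalation']}.")

print(handle_measurement(dq_sla, 0.95))
```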

12.2.9 Continuously Measure and Monitor Data Quality

The operational DQM procedures depend on available services for measuring and monitoring the quality of data. For conformance to data quality business rules, two contexts for control and measurement exist: in-stream and batch. In turn, apply measurements at three levels of granularity, namely data element value, data instance or record, and data set, making six possible measures. Collect in-stream measurements while creating the data, and perform batch activities on collections of data instances assembled in a data set, likely in persistent storage.

Provide continuous monitoring by incorporating control and measurement processes into the information processing flow. It is unlikely that data set measurements can be performed in-stream, since the measurement may need the entire set. The only in-stream points are when full data sets hand off between processing stages. Incorporate data quality rules using the techniques detailed in Table 12.1. Incorporating the results of the control and measurement processes into both the operational procedures and reporting frameworks enables continuous monitoring of the levels of data quality.
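As a sketch of the two contexts, the hypothetical code below shows an element-level check applied in-stream as each record is created, and a set-level reasonableness check applied in batch over an assembled data set. Both checks and all names are assumptions made for illustration.

```python
# In-stream, data element granularity: validate as each record is created.
def in_stream_check(record):
    """Edit-check style validation applied while the data is being created."""
    if record.get("quantity") is None or record["quantity"] <= 0:
        raise ValueError(f"Rejected record {record}: quantity must be positive")
    return record

# Batch, data set granularity: aggregate reasonableness over a collection.
def batch_check(records, expected_min_count=100):
    """Set-level measure: the record count must be reasonable for the batch."""
    count = len(records)
    return {"record_count": count, "reasonable": count >= expected_min_count}

incoming = [{"order_id": 1, "quantity": 3}, {"order_id": 2, "quantity": 5}]
accepted = [in_stream_check(r) for r in incoming]        # element-level, in-stream
print(batch_check(accepted, expected_min_count=2))       # set-level, batch
```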

12.2.10 Manage Data Quality Issues

Supporting the enforcement of the data quality SLA requires a mechanism for reporting and tracking data quality incidents and activities for researching and resolving those incidents. A data quality incident reporting system can provide this capability. It can log the evaluation, initial diagnosis, and subsequent actions associated with data quality events. Tracking of data quality incidents can also provide performance reporting data, including mean-time-to-resolve issues, frequency of occurrence of issues, types of issues, sources of issues, and common approaches for correcting or eliminating problems. A good issues tracking system will eventually become a reference source of current and historic issues, their statuses, and any factors that may need the actions of others not directly involved in the resolution of the issue.

Data Element (completeness, structural consistency, reasonableness):
  • In-stream: Edit checks in application; data element validation services; specially programmed applications.
  • Batch: Direct queries; data profiling or analyzer tool.

Data Record (completeness, structural consistency, semantic consistency, reasonableness):
  • In-stream: Edit checks in application; data record validation services; specially programmed applications.
  • Batch: Direct queries; data profiling or analyzer tool.

Data Set (aggregate measures, such as record counts, sums, mean, variance):
  • In-stream: Inspection inserted between processing stages.
  • Batch: Direct queries; data profiling or analyzer tool.

Table 12.1 Techniques for incorporating measurement and monitoring.

Many organizations already have incident reporting systems for tracking and managing software, hardware, and network issues. Incorporating data quality incident tracking focuses on organizing the categories of data issues into the incident hierarchies. Data quality incident tracking also requires a focus on training staff to recognize when data issues appear and how they are to be classified, logged, and tracked according to the data quality SLA. The steps involve some or all of these directives:

  • Standardize data quality issues and activities: Since the terms used to describe data issues may vary across lines of business, it is valuable to standardize the concepts used, which can simplify classification and reporting. Standardization will also make it easier to measure the volume of issues and activities, identify patterns and interdependencies between systems and participants, and report on the overall impact of data quality activities. The classification of an issue may change as the investigation deepens and root causes are exposed.
  • Provide an assignment process for data issues: The operational procedures direct the analysts to assign data quality incidents to individuals for diagnosis and to provide alternatives for resolution. The assignment process should be driven within the incident tracking system, by suggesting those individuals with specific areas of expertise.
  • Manage issue escalation procedures: Data quality issue handling requires a well-defined system of escalation based on the impact, duration, or urgency of an issue. Specify the sequence of escalation within the data quality SLA. The incident tracking system will implement the escalation procedures, which helps expedite efficient handling and resolution of data issues.
  • Manage data quality resolution workflow: The data quality SLA specifies objectives for monitoring, control, and resolution, all of which define a collection of operational workflows. The incident tracking system can support workflow management to track progress with issues diagnosis and resolution.

Implementing a data quality issues tracking system provides a number of benefits. First, information and knowledge sharing can improve performance and reduce duplication of effort. Second, an analysis of all the issues will help data quality team members determine any repetitive patterns, their frequency, and potentially the source of the issue. Employing an issues tracking system trains people to recognize data issues early in the information flows, as a general practice that supports their day-to-day operations. The issues tracking system raw data is input for reporting against the SLA conditions and measures. Depending on the governance established for data quality, SLA reporting can be monthly, quarterly or annually, particularly in cases focused on rewards and penalties.
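A minimal sketch of what an incident record and its resolution workflow might look like follows, under the assumption of a very simple status progression; the field names and statuses are invented, and a production system would rely on the organization's existing incident tooling.

```python
from dataclasses import dataclass, field
from datetime import datetime

# Hypothetical incident record; fields and statuses are illustrative assumptions.
@dataclass
class DataQualityIncident:
    incident_id: str
    category: str                  # standardized issue category
    data_element: str
    assigned_to: str
    status: str = "OPEN"           # assumed workflow: OPEN -> DIAGNOSED -> RESOLVED
    opened_at: datetime = field(default_factory=datetime.now)
    history: list = field(default_factory=list)

    def update_status(self, new_status, note=""):
        """Track progress through the resolution workflow for later reporting."""
        self.history.append((datetime.now(), self.status, new_status, note))
        self.status = new_status

incident = DataQualityIncident(
    incident_id="DQ-0042",
    category="completeness",
    data_element="customer.state",
    assigned_to="data.steward@example.com",
)
incident.update_status("DIAGNOSED", "Missing values traced to an upstream feed change")
incident.update_status("RESOLVED", "Feed mapping corrected; backfill completed")
print(incident.status, len(incident.history))
```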

12.2.11 Clean and Correct Data Quality Defects

The use of business rules for monitoring conformance to expectations leads to two operational activities. The first is to determine and eliminate the root cause of the introduction of errors. The second is to isolate the data items that are incorrect, and provide a means for bringing the data into conformance with expectations. In some situations, it may be as simple as throwing away the results and beginning the corrected information process from the point of error introduction. In other situations, throwing away the results is not possible, which means correcting errors.

Perform data correction in three general ways:

  • Automated correction: Submit the data to data quality and data cleansing techniques using a collection of data transformations and rule-based standardizations, normalizations, and corrections. The modified values are committed without manual intervention. An example is automated address correction, which submits delivery addresses to an address standardizer that, using rules, parsing and standardization, and reference tables, normalizes and then corrects delivery addresses. Environments with well-defined standards, commonly accepted rules, and known error patterns, are best suited to automated cleansing and correction.
  • Manual directed correction: Use automated tools to cleanse and correct data, but require manual review before committing the corrections to persistent storage. Name and address cleansing, identity resolution, and pattern-based corrections are applied automatically, and a scoring mechanism proposes a level of confidence in each correction. Corrections with scores above a particular level of confidence may be committed without review, but corrections with scores below the level of confidence are presented to the data steward for review and approval. Commit all approved corrections, and review those not approved to understand whether or not to adjust the applied underlying rules. Environments in which sensitive data sets require human oversight are good examples of where manual directed correction may be suited.
  • Manual correction: Data stewards inspect invalid records and determine the correct values, make the corrections, and commit the updated records.
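The manual directed correction pattern above can be sketched as a simple routing rule on a correction's confidence score. The scoring itself and the 0.9 auto-commit threshold are placeholders for what a real cleansing tool and the business would supply.

```python
# Hypothetical corrections with confidence scores produced by a cleansing tool.
# The 0.9 auto-commit threshold is an assumption for illustration.
AUTO_COMMIT_THRESHOLD = 0.9

corrections = [
    {"record_id": 1, "field": "city", "proposed": "Birmingham", "confidence": 0.97},
    {"record_id": 2, "field": "city", "proposed": "Mobile",     "confidence": 0.72},
]

auto_committed = [c for c in corrections if c["confidence"] >= AUTO_COMMIT_THRESHOLD]
steward_review = [c for c in corrections if c["confidence"] < AUTO_COMMIT_THRESHOLD]

print("Committed without review:", [c["record_id"] for c in auto_committed])
print("Queued for data steward review:", [c["record_id"] for c in steward_review])
```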

12.2.12 Design and Implement Operational DQM Procedures

Using defined rules for validation of data quality provides a means of integrating data inspection into a set of operational procedures associated with active DQM. Integrate the data quality rules into application services or data services that supplement the data life cycle, either through the introduction of data quality tools and technology, the use of rules engines and reporting tools for monitoring and reporting, or custom-developed applications for data quality inspection.

The operational framework requires these services to be available to the applications and data services, and the results presented to the data quality team members. Data quality operations team members are responsible for four activities. The team must design and implement detailed procedures for operationalizing these activities.

  1. Inspection and monitoring: Either through some automated process or via a manually invoked process, subject the data sets to measurement of conformance to the data quality rules, based on full-scan or sampling methods. Use data profiling tools, data analyzers, and data standardization and identity resolution tools to provide the inspection services. Accumulate the results and then make them available to the data quality operations analyst. The analyst must:
    • Review the measurements and associated metrics.
    • Determine if any acceptability thresholds exist that are not met.
    • Create a new data quality incident report.
    • Assign the incident to a data analyst for diagnosis and evaluation.
  2. Diagnosis and evaluation of remediation alternatives: The objective is to review the symptoms exhibited by the data quality incident, trace through the lineage of the incorrect data, diagnose the type of the problem and where it originated, and pinpoint any potential root causes for the problem. The procedure should also describe how the data analyst would:
    • Review the data issues in the context of the appropriate information processing flows, and track the introduction of the error upstream to isolate the location in the processing where the flaw is introduced.
    • Evaluate whether or not there have been any changes to the environment that would have introduced errors into the system.
    • Evaluate whether or not there are any other process issues that contributed to the data quality incident.
    • Determine whether or not there are external data provider issues that have affected the quality of the data.
    • Evaluate alternatives for addressing the issue, which may include modification of the systems to eliminate root causes, introducing additional inspection and monitoring, direct correction of flawed data, or no action based on the cost of correction versus the value of the data correction.
    • Provide updates to the data quality incident tracking system.
  3. Resolving the issue: Having provided a number of alternatives for resolving the issue, the data quality team must confer with the business data owners to select one of the alternatives to resolve the issue. These procedures should detail how the analysts:
    • Assess the relative costs and merits of the alternatives.
    • Recommend one of the alternatives.
    • Provide a plan for developing and implementing the resolution, which may include both modifying the processes and correcting flawed data.
    • Implement the resolution.
    • Provide updates to the data quality incident tracking system.
  4. Reporting: To provide transparency for the DQM process, there should be periodic reports on the performance status of DQM. The data quality operations team will develop and populate these reports, which include:
    • Data quality scorecard, which provides a high-level view of the scores associated with various metrics, reported to different levels of the organization.
    • Data quality trends, which show over time how the quality of data is measured, and whether the quality indicator levels are trending up or down.
    • Data quality performance, which monitors how well the operational data quality staff is responding to data quality incidents for diagnosis and timely resolution.
    • These reports should align to the metrics and measures in the data quality SLA as much as possible, so that the areas important to the achievement of the data quality SLA are covered, at some level, in internal team reports.

12.2.13 Monitor Operational DQM Procedures and Performance

Accountability is critical to the governance protocols overseeing data quality control. All issues must be assigned to some number of individuals, groups, departments, or organizations. The tracking process should specify and document the ultimate issue accountability to prevent issues from dropping through the cracks. Since the data quality SLA specifies the criteria for evaluating the performance of the data quality team, it is reasonable to expect that the incident tracking system will collect performance data relating to issue resolution, work assignments, volume of issues, frequency of occurrence, as well as the time to respond, diagnose, plan a solution, and resolve issues. These metrics can provide valuable insights into the effectiveness of the current workflow, as well as systems and resource utilization, and are important management data points that can drive continuous operational improvement for data quality control.

12.3 Data Quality Tools

DQM employs well-established tools and techniques. These utilities range in focus from empirically assessing the quality of data through data analysis, to the normalization of data values in accordance with defined business rules, to the ability to identify and resolve duplicate records into a single representation, and to schedule these inspections and changes on a regular basis. Data quality tools can be segregated into four categories of activities: Analysis, Cleansing, Enhancement, and Monitoring. The principal tools used are data profiling, parsing and standardization, data transformation, identity resolution and matching, enhancement, and reporting. Some vendors bundle these functions into more complete data quality solutions.

12.3.1 Data Profiling

Before making any improvements to data, one must first be able to distinguish between good and bad data. The attempt to qualify data quality is a process of analysis and discovery. The analysis involves an objective review of the data values populating data sets through quantitative measures and analyst review. A data analyst may not necessarily be able to pinpoint all instances of flawed data. However, the ability to document situations where data values look like they do not belong provides a means to communicate these instances with subject matter experts, whose business knowledge can confirm the existence of data problems.

Data profiling is a set of algorithms for two purposes:

  • Statistical analysis and assessment of the quality of data values within a data set.
  • Exploring relationships that exist between value collections within and across data sets.

For each column in a table, a data-profiling tool will provide a frequency distribution of the different values, providing insight into the type and use of each column. In addition, column profiling can summarize key characteristics of the values within each column, such as the minimum, maximum, and average values.

Cross-column analysis can expose embedded value dependencies, while inter-table analysis explores overlapping values sets that may represent foreign key relationships between entities. In this way, data profiling analyzes and assesses data anomalies. Most data profiling tools allow for drilling down into the analyzed data for further investigation.

Data profiling can also proactively test against a set of defined (or discovered) business rules. The results can be used to distinguish records that conform to defined data quality expectations from those that don’t, which in turn can contribute to baseline measurements and ongoing auditing that supports the data quality reporting processes.

12.3.2 Parsing and Standardization

Data parsing tools enable the data analyst to define sets of patterns that feed into a rules engine used to distinguish between valid and invalid data values. Actions are triggered upon matching a specific pattern. Extract and rearrange the separate components (commonly referred to as “tokens”) into a standard representation when parsing a valid pattern. When an invalid pattern is recognized, the application may attempt to transform the invalid value into one that meets expectations.

Many data quality issues are situations where a slight variance in data value representation introduces confusion or ambiguity, so parsing and standardizing data values is valuable. For example, consider the different ways telephone numbers expected to conform to a Numbering Plan are formatted. Some use only digits, others include alphabetic characters, and different special characters may be used for separation. People can recognize each one as being a telephone number. However, in order to determine if these numbers are accurate (perhaps by comparing them to a master customer directory), or to investigate whether duplicate numbers exist when there should be only one for each supplier, the values must be parsed into their component segments (area code, exchange, and line number) and then transformed into a standard format.

The human ability to recognize familiar patterns contributes to our ability to characterize variant data values belonging to the same abstract class of values; people recognize different types of telephone numbers because they conform to frequently used patterns. An analyst describes the format patterns that all represent a data object, such as Person Name, Product Description, and so on. A data quality tool parses data values that conform to any of those patterns, and even transforms them into a single, standardized form that will simplify the assessment, similarity analysis, and cleansing processes. Pattern-based parsing can automate the recognition and subsequent standardization of meaningful value components.
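A hedged sketch of pattern-based parsing and standardization for the telephone number example follows. The patterns below cover only a few common US-style formats and are assumptions for illustration, not the exhaustive, configurable pattern library a real parsing tool would maintain.

```python
import re

# A few illustrative patterns for US-style telephone numbers.
PHONE_PATTERNS = [
    re.compile(r"^\((\d{3})\)\s*(\d{3})-(\d{4})$"),    # (205) 555-0142
    re.compile(r"^(\d{3})[-. ](\d{3})[-. ](\d{4})$"),  # 205-555-0142 / 205.555.0142
    re.compile(r"^(\d{3})(\d{3})(\d{4})$"),            # 2055550142
]

def standardize_phone(value):
    """Parse the tokens (area code, exchange, line) and emit a standard form."""
    for pattern in PHONE_PATTERNS:
        match = pattern.match(value.strip())
        if match:
            area, exchange, line = match.groups()
            return f"{area}-{exchange}-{line}"
    return None  # invalid pattern: flag for review or attempted transformation

for raw in ["(205) 555-0142", "205.555.0142", "2055550142", "call me"]:
    print(raw, "->", standardize_phone(raw))
```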

12.3.3 Data Transformation

Upon identification of data errors, trigger data rules to transform the flawed data into a format that is acceptable to the target architecture. Engineer these rules directly within a data integration tool or rely on alternate technologies embedded in or accessible from within the tool. Perform standardization by mapping data from some source pattern into a corresponding target representation. A good example is a “customer name,” since names may be represented in thousands of different forms. A good standardization tool will be able to parse the different components of a customer name, such as given name, middle name, family name, initials, titles, generational designations, and then rearrange those components into a canonical representation that other data services will be able to manipulate.

Data transformation builds on these types of standardization techniques. Guide rule-based transformations by mapping data values in their original formats and patterns into a target representation. Parsed components of a pattern are subjected to rearrangement, corrections, or any changes as directed by the rules in the knowledge base. In fact, standardization is a special case of transformation, employing rules that capture context, linguistics, and idioms recognized as common over time, through repeated analysis by the rules analyst or tool vendor.

12.3.4 Identity Resolution and Matching

Employ record linkage and matching in identity recognition and resolution, and incorporate approaches used to evaluate “similarity” of records for use in duplicate analysis and elimination, merge / purge, householding, data enhancement, cleansing, and strategic initiatives such as customer data integration or master data management. A common data quality problem involves two sides of the same coin:

  • Multiple data instances that actually refer to the same real-world entity.
  • The perception, by an analyst or an application, that a record does not exist for a real-world entity, when in fact it really does.

In the first situation, something introduced similar, yet variant representations in data values into the system. In the second situation, a slight variation in representation prevents the identification of an exact match of the existing record in the data set.

Both of these situations are addressed through a process called similarity analysis, in which the degree of similarity between any two records is scored, most often based on weighted approximate matching between a set of attribute values in the two records. If the score is above a specified threshold, the two records are a match and are presented to the end client as most likely to represent the same entity. It is through similarity analysis that slight variations are recognized and data values are connected and subsequently consolidated.

Attempting to compare each record against all the others to provide a similarity score is not only ambitious, but also time-consuming and computationally intensive. Most data quality tool suites use advanced algorithms for blocking records that are most likely to contain matches into smaller sets, whereupon different approaches are taken to measure similarity. Identifying similar records within the same data set probably means that the records are duplicates, and may need cleansing and / or elimination. Identifying similar records in different sets may indicate a link across the data sets, which helps facilitate cleansing, knowledge discovery, and reverse engineering—all of which contribute to master data aggregation.

Two basic approaches to matching are deterministic and probabilistic. Deterministic matching, like parsing and standardization, relies on defined patterns and rules for assigning weights and scores for determining similarity. Alternatively, probabilistic matching relies on statistical techniques for assessing the probability that any pair of records represents the same entity. Deterministic algorithms are predictable in that the patterns matched and the rules applied will always yield the same matching determination. Tie performance to the variety, number, and order of the matching rules. Deterministic matching works out of the box with relatively good performance, but it is only as good as the situations anticipated by the rules developers.

Probabilistic matching relies on the ability to take data samples for training purposes by looking at the expected results for a subset of the records and tuning the matcher to self-adjust based on statistical analysis. These matchers are not reliant on rules, so the results may be nondeterministic. However, because the probabilities can be refined based on experience, probabilistic matchers are able to improve their matching precision as more data is analyzed.
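A minimal sketch of deterministic, weighted approximate matching with a threshold is shown below, using a standard-library similarity measure. The attribute weights and the 0.85 threshold are assumptions chosen for the example; commercial matchers use far more sophisticated scoring and blocking.

```python
from difflib import SequenceMatcher

# Attribute weights and the match threshold are illustrative assumptions.
WEIGHTS = {"name": 0.6, "city": 0.2, "postal_code": 0.2}
MATCH_THRESHOLD = 0.85

def similarity(a, b):
    """Approximate string similarity between two attribute values (0..1)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def record_score(rec1, rec2):
    """Weighted similarity score across the compared attributes."""
    return sum(w * similarity(rec1[f], rec2[f]) for f, w in WEIGHTS.items())

r1 = {"name": "Jon A. Smith", "city": "Birmingham", "postal_code": "35203"}
r2 = {"name": "John A Smith", "city": "Birmingham", "postal_code": "35203"}

score = record_score(r1, r2)
print(f"Similarity score: {score:.2f}")
print("Likely same entity" if score >= MATCH_THRESHOLD else "No match")
```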

12.3.5 Enhancement

Increase the value of an organization’s data by enhancing the data. Data enhancement is a method for adding value to information by accumulating additional information about a base set of entities and then merging all the sets of information to provide a focused view of the data. Data enhancement is a process of intelligently adding data from alternate sources as a byproduct of knowledge inferred from applying other data quality techniques, such as parsing, identity resolution, and data cleansing.

Data parsing assigns characteristics to the data values appearing in a data instance, and those characteristics help in determining potential sources for added benefit. For example, if it can be determined that a business name is embedded in an attribute called name, then tag that data value as a business. Use the same approach for any situation in which data values organize into semantic hierarchies.

Appending information about cleansing and standardizations that have been applied provides additional suggestions for later data matching, record linkage, and identity resolution processes. By creating an associative representation of the data that imposes a meta-context on it, and adding detail about the data, more knowledge is collected about the actual content, not just the structure of that information. An associative representation supports more interesting inferences about the data, and consequently enables the use of more information for data enhancement. Some examples of data enhancement include:

  • Time / Date stamps: One way to improve data is to document the time and date that data items are created, modified, or retired, which can help to track historical data events.
  • Auditing Information: Auditing can document data lineage, which also is important for historical tracking as well as validation.
  • Contextual Information: Business contexts such as location, environment, and access methods are all examples of context that can augment data. Contextual enhancement also includes tagging data records for downstream review and analysis.
  • Geographic Information: There are a number of geographic enhancements possible, such as address standardization and geocoding, which includes regional coding, municipality, neighborhood mapping, latitude / longitude pairs, or other kinds of location-based data.
  • Demographic Information: For customer data, there are many ways to add demographic enhancements such as customer age, marital status, gender, income, ethnic coding; or for business entities, annual revenue, number of employees, size of occupied space, etc.
  • Psychographic Information: Use these kinds of enhancements to segment the target population by specified behaviors, such as product and brand preferences, organization memberships, leisure activities, vacation preferences, commuting transportation style, shopping time preferences, etc.

12.3.6 Reporting

Inspection and monitoring of conformance to data quality expectations, monitoring performance of data stewards conforming to data quality SLAs, workflow processing for data quality incidents, and manual oversight of data cleansing and correction are all supported by good reporting. It is optimal to have a user interface to report results associated with data quality measurement, metrics, and activity. It is wise to incorporate visualization and reporting for standard reports, scorecards, dashboards, and for provision of ad hoc queries as part of the functional requirements for any acquired data quality tools.

12.4 Summary

The guiding principles for implementing DQM into an organization, a summary table of the roles for each DQM activity, and organizational and cultural issues that may arise during data quality management are summarized below.

12.4.1 Setting Data Quality Guiding Principles

When assembling a DQM program, it is reasonable to assert a set of guiding principles that frame the type of processes and uses of technology described in this chapter. Align any activities undertaken to support the data quality practice with one or more of the guiding principles. Every organization is different, with varying motivating factors. Some sample statements that might be useful in a Data Quality Guiding Principles document include:

  • Manage data as a core organizational asset. Many organizations go so far as to place data as an asset on their balance sheets.
  • All data elements will have a standardized data definition, data type, and acceptable value domain.
  • Leverage Data Governance for the control and performance of DQM.
  • Use industry and international data standards whenever possible.
  • Downstream data consumers specify data quality expectations.
  • Define business rules to assert conformance to data quality expectations.
  • Validate data instances and data sets against defined business rules (a minimal sketch follows this list).
  • Business process owners will agree to and abide by data quality SLAs.
  • Apply data corrections at the original source, if possible.
  • If it is not possible to correct data at the source, forward data corrections to the owner of the original source whenever possible. Influence on data brokers to conform to local requirements may be limited.
  • Report measured levels of data quality to appropriate data stewards, business process owners, and SLA managers.
  • Identify a gold record for all data elements.
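The following is a minimal sketch of the validation principle noted in the list above, assuming business rules expressed as simple named predicates; the rule set and record layout are invented for illustration.

    # Business rules expressed as (name, predicate) pairs; both are assumptions for the example.
    RULES = [
        ("customer_id is populated", lambda r: bool(r.get("customer_id"))),
        ("age falls within the acceptable value domain", lambda r: 0 <= r.get("age", -1) <= 130),
    ]

    def validate(record):
        """Return the names of the business rules that the record violates."""
        return [name for name, predicate in RULES if not predicate(record)]

    for rec in ({"customer_id": "C001", "age": 42}, {"customer_id": "", "age": 200}):
        failures = validate(rec)
        print(rec, "->", "conformant" if not failures else failures)

In practice, violations detected this way would be logged as data quality issues and routed to the appropriate data stewards, consistent with the principles above.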

12.4.2 Process Summary

The process summary for the DQM function is shown in Table 12.2. The deliverables, responsible roles, approving roles, and contributing roles are shown for each activity in the data quality management function. The table is also shown in Appendix A9.

10.1 Develop and Promote Data Quality Awareness (O)
  • Deliverables: Data quality training; Data Governance Processes; Established Data Stewardship Council
  • Responsible Roles: Data Quality Manager
  • Approving Roles: Business Managers; DRM Director
  • Contributing Roles: Information Architects; Subject Matter Experts

10.2 Define Data Quality Requirements (D)
  • Deliverables: Data Quality Requirements Document
  • Responsible Roles: Data Quality Manager; Data Quality Analysts
  • Approving Roles: Business Managers; DRM Director
  • Contributing Roles: Information Architects; Subject Matter Experts

10.3 Profile, Analyze, and Assess Data Quality (D)
  • Deliverables: Data Quality Assessment Report
  • Responsible Roles: Data Quality Analysts
  • Approving Roles: Business Managers; DRM Director
  • Contributing Roles: Data Stewardship Council

10.4 Define Data Quality Metrics (P)
  • Deliverables: Data Quality Metrics Document
  • Responsible Roles: Data Quality Manager; Data Quality Analysts
  • Approving Roles: Business Managers; DRM Director
  • Contributing Roles: Data Stewardship Council

10.5 Define Data Quality Business Rules (P)
  • Deliverables: Data Quality Business Rules
  • Responsible Roles: Data Quality Analysts
  • Approving Roles: Business Managers; DRM Director
  • Contributing Roles: Data Quality Manager; Information Architects; Subject Matter Experts; Data Stewardship Council

10.6 Test and Validate Data Quality Requirements (D)
  • Deliverables: Data Quality Test Cases
  • Responsible Roles: Data Quality Analysts
  • Approving Roles: Business Managers; DRM Director
  • Contributing Roles: Information Architects; Subject Matter Experts

10.7 Set and Evaluate Data Quality Service Levels (P)
  • Deliverables: Data Quality Service Levels
  • Responsible Roles: Data Quality Manager
  • Approving Roles: Business Managers; DRM Director
  • Contributing Roles: Data Stewardship Council

10.8 Continuously Measure and Monitor Data Quality (C)
  • Deliverables: Data Quality Reports
  • Responsible Roles: Data Quality Manager
  • Approving Roles: Business Managers; DRM Director
  • Contributing Roles: Data Stewardship Council

10.9 Manage Data Quality Issues (C)
  • Deliverables: Data Quality Issues Log
  • Responsible Roles: Data Quality Manager; Data Quality Analysts
  • Approving Roles: Business Managers; DRM Director
  • Contributing Roles: Data Stewardship Council

10.10 Clean and Correct Data Quality Defects (O)
  • Deliverables: Data Quality Defect Resolution Log
  • Responsible Roles: Data Quality Analysts
  • Approving Roles: Business Managers; DRM Director
  • Contributing Roles: Information Architects; Subject Matter Experts

10.11 Design and Implement Operational DQM Procedures (D)
  • Deliverables: Operational DQM Procedures
  • Responsible Roles: Data Quality Manager; Data Quality Analysts
  • Approving Roles: Business Managers; DRM Director
  • Contributing Roles: Information Architects; Subject Matter Experts; Data Stewardship Council

10.12 Monitor Operational DQM Procedures and Performance (C)
  • Deliverables: Operational DQM Metrics
  • Responsible Roles: Data Quality Manager; Data Quality Analysts
  • Approving Roles: Business Managers; DRM Director
  • Contributing Roles: Data Stewardship Council

Table 12.2 Data Quality Management Process Summary

12.4.3 Organizational and Cultural Issues

Q1: Is it really necessary to have quality data if there are many processes to change the data into information and use the information for business intelligence purposes?

A1: The business intelligence value chain shows that the quality of the data resource directly impacts the business goals of the organization. The foundation of the value chain is the data resource. Information is produced from the data resource through information engineering, much the same as products are developed from raw materials. Knowledge workers use that information to provide the business intelligence necessary to manage the organization. The business intelligence supports the business strategies, which in turn support the business goals. Through the business intelligence value chain, the quality of the data directly impacts how successfully the business goals are met. Therefore, the emphasis for quality must be placed on the data resource itself, not only on the downstream information development and business intelligence processes.

Q2: Is data quality really free?

A2: By analogy with the second law of thermodynamics, a data resource is an open system: left to itself, its entropy continues to increase without limit, meaning the quality of the data resource continues to decrease without limit. Energy must be expended to create and maintain a quality data resource, and that energy comes at a cost. Both achieving initial data resource quality and maintaining that quality come at a cost. Therefore, data quality is not free.

It is less costly to build quality into the data resource from the beginning than it is to build it in later. It is also less costly to maintain data quality throughout the life of the data resource than it is to improve the quality in major steps. When the quality of the data resource is allowed to deteriorate, it becomes far more costly to improve the data quality, and the deterioration has a far greater impact on the business. Therefore, quality is not free, but it is less costly to build in and maintain than to retrofit. What most people mean when they say that data quality is free is that the cost-benefit ratio of maintaining data quality from the beginning is better than the cost-benefit ratio of allowing the data quality to deteriorate.

Q3: Are data quality issues something new that have surfaced recently with evolving technology?

A3: No. Data quality problems have always existed, even back in the 80-column punched card days. The problem is getting worse with the increasing quantity of data being maintained and the age of that data. It is also becoming more visible as processing techniques grow more powerful and draw on a wider range of data. Data that appeared to be of high quality in yesterday’s isolated systems now show their low quality when combined into today’s organization-wide analysis processes.

Every organization must become aware of the quality of its data if it is to use that data effectively and efficiently to support the business. Any organization that treats data quality as a recent issue that can be postponed for later consideration is putting the survival of its business at risk. The current economic climate is not the time to put the company’s survival on the line by ignoring the quality of its data.

Q4: Is there one thing to do more than any other for ensuring high data quality?

A4: The most important thing is to establish a single enterprise-wide data architecture, and then build and maintain all data within that single architecture. A single enterprise-wide data architecture does not mean that all data are stored in one central repository. It does mean that all data are developed and managed within the context of a single enterprise-wide data architecture. The data can be deployed as necessary for operational efficiency.

As soon as any organization allows data to be developed within multiple data architectures, or worse yet, without any data architecture, there will be monumental problems with data quality. Even if an attempt is made to coordinate multiple data architectures, there will be considerable data quality problems. Therefore, the most important thing is to manage all data within a single enterprise-wide data architecture.

12.5 Recommended Reading

The references listed below provide additional reading that support the material presented in Chapter 12. These recommended readings are also included in the Bibliography at the end of the Guide.

Batini, Carlo, and Monica Scannapieco. Data Quality: Concepts, Methodologies and Techniques. Springer, 2006. ISBN 3-540-33172-7. 262 pages.

Brackett, Michael H. Data Resource Quality: Turning Bad Habits into Good Practices. Addison-Wesley, 2000. ISBN 0-201-71306-3. 384 pages.

Deming, W. Edwards. Out of the Crisis. The MIT Press, 2000. ISBN 0262541157. 507 pages.

English, Larry. Improving Data Warehouse and Business Information Quality: Methods for Reducing Costs and Increasing Profits. John Wiley & Sons, 1999. ISBN 0-471-25383-9. 518 pages.

Huang, Kuan-Tsae, Yang W. Lee, and Richard Y. Wang. Quality Information and Knowledge. Prentice Hall, 1999. ISBN 0-130-10141-9. 250 pages.

Loshin, David. Enterprise Knowledge Management: The Data Quality Approach. Morgan Kaufmann, 2001. ISBN 0-124-55840-2. 494 pages.

Loshin, David. Master Data Management. Morgan Kaufmann, 2009. ISBN 0123742250. 288 pages.

Maydanchik, Arkady. Data Quality Assessment. Technics Publications, LLC, 2007. ISBN 0977140024. 336 pages.

McGilvray, Danette. Executing Data Quality Projects: Ten Steps to Quality Data and Trusted Information. Morgan Kaufmann, 2008. ISBN 0123743699. 352 pages.

Olson, Jack E. Data Quality: The Accuracy Dimension. Morgan Kaufmann, 2003. ISBN 1-558-60891-5. 294 pages.

Redman, Thomas. Data Quality: The Field Guide. Digital Press, 2001. ISBN 1-555-59251-6. 256 pages.
