Data Quality Management (DQM) is the tenth Data Management Function in the data management framework shown in Figures 1.3 and 1.4. It is the ninth data management function that interacts with, and is influenced by, the Data Governance function. Chapter 12 defines the data quality management function and explains the concepts and activities involved in DQM.
12.1 Introduction
Data Quality Management (DQM) is a critical support process in organizational change management. Changing business focus, corporate business integration strategies, and mergers, acquisitions, and partnering can mandate that the IT function blend data sources, create gold data copies, retrospectively populate data, or integrate data. The goals of interoperability with legacy or B2B systems need the support of a DQM program.
Data quality is effectively synonymous with information quality, since poor data quality results in inaccurate information and poor business performance. Data cleansing alone may produce short-term and costly improvements that do not address the root causes of data defects. A more rigorous data quality program is necessary to provide an economic solution to improved data quality and integrity.
In a program approach, these issues involve more than just correcting data. Instead, they involve managing the lifecycle for data creation, transformation, and transmission to ensure that the resulting information meets the needs of all the data consumers within the organization.
Institutionalizing processes for data quality oversight, management, and improvement hinges on identifying the business needs for quality data and determining the best ways to measure, monitor, control, and report on the quality of data. After identifying issues in the data processing streams, notify the appropriate data stewards to take corrective action that addresses the acute issue, while simultaneously enabling elimination of its root cause.
DQM is also a continuous process for defining the parameters for specifying acceptable levels of data quality to meet business needs, and for ensuring that data quality meets these levels. DQM involves analyzing the quality of data, identifying data anomalies, and defining business requirements and corresponding business rules for asserting the required data quality. DQM involves instituting inspection and control processes to monitor conformance with defined data quality rules, as well as instituting data parsing, standardization, cleansing, and consolidation, when necessary. Lastly, DQM incorporates issues tracking as a way of monitoring compliance with defined data quality Service Level Agreements.
The context for data quality management is shown in Figure 12.1.
Figure 12.1 Data Quality Management Context Diagram
12.2 Concepts and Activities
Data quality expectations provide the inputs necessary to define the data quality framework. The framework includes defining the requirements, inspection policies, measures, and monitors that reflect changes in data quality and performance. These requirements reflect three aspects of business data expectations: a manner to record the expectation in business rules, a way to measure the quality of data within that dimension, and an acceptability threshold.
12.2.1 Data Quality Management Approach
The general approach to DQM, shown in Figure 12.2, is a version of the Deming cycle. Deming, one of the seminal writers in quality management, proposes a problem-solving model known as ‘plan-do-study-act’ or ‘plan-do-check-act’ that is useful for data quality management. When applied to data quality within the constraints of defined data quality SLAs, it involves:
Figure 12.2 The Data Quality Management Cycle.
The DQM cycle begins by identifying the data issues that are critical to the achievement of business objectives, defining business requirements for data quality, identifying key data quality dimensions, and defining the business rules critical to ensuring high quality data.
In the plan stage, the data quality team assesses the scope of known issues, which involves determining the cost and impact of the issues and evaluating alternatives for addressing them.
In the deploy stage, profile the data and institute inspections and monitors to identify data issues when they occur. During this stage, the data quality team can arrange for fixing flawed processes that are the root cause of data errors or, as a last resort, for correcting errors downstream. When it is not possible to correct errors at their source, correct them at the earliest point in the data flow.
The monitor stage is for actively monitoring the quality of data as measured against the defined business rules. As long as data quality meets defined thresholds for acceptability, the processes are in control and the level of data quality meets the business requirements. However, if the data quality falls below acceptability thresholds, notify data stewards so they can take action during the next stage.
The act stage is for taking action to address and resolve emerging data quality issues.
New cycles begin as new data sets come under investigation, or as new data quality requirements are identified for existing data sets.
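The stages above can be sketched as a simple control loop. This is an illustrative sketch only; the completeness rule, the acceptability threshold, and the remediation step are invented for the example, not taken from the text:

```python
# A minimal, illustrative sketch of the DQM cycle applied to one data set.
# The rule, threshold, and remediation function are placeholder assumptions.

def measure_quality(records, rule):
    """Fraction of records that conform to a data quality rule."""
    if not records:
        return 1.0
    return sum(1 for r in records if rule(r)) / len(records)

def dqm_cycle(records, rule, threshold, remediate):
    """One pass of monitor -> act: remediate only when quality drops below threshold."""
    score = measure_quality(records, rule)
    if score < threshold:                          # monitor: compare against threshold
        records = [remediate(r) for r in records]  # act: correct flawed records
    return records, measure_quality(records, rule)

# Example: a completeness rule on a hypothetical 'customer_id' field.
rule = lambda r: r.get("customer_id") is not None
fix = lambda r: {**r, "customer_id": r.get("customer_id") or "UNKNOWN"}

data = [{"customer_id": "C1"}, {"customer_id": None}]
cleaned, score = dqm_cycle(data, rule, threshold=0.95, remediate=fix)
```

In practice the "act" stage would prefer fixing the flawed upstream process over patching records, as the deploy-stage discussion notes.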
12.2.2 Develop and Promote Data Quality Awareness
Promoting data quality awareness means more than ensuring that the right people in the organization are aware of the existence of data quality issues. Promoting data quality awareness is essential to ensure buy-in of necessary stakeholders in the organization, thereby greatly increasing the chance of success of any DQM program.
Awareness includes relating material impacts to data issues, ensuring systematic approaches to regulatory oversight of the quality of organizational data, and socializing the concept that data quality problems cannot be addressed by technology solutions alone. As an initial step, some level of training on the core concepts of data quality may be necessary.
The next step includes establishing a data governance framework for data quality. Data governance is a collection of processes and procedures for assigning responsibility and accountability for all facets of data management, covered in detail in Chapter 3. DQM data governance tasks include:
Ultimately, a Data Quality Oversight Board can be created that has a reporting hierarchy associated with the different data governance roles. Data stewards who align with business clients, lines of business, and even specific applications, will continue to promote awareness of data quality while monitoring their assigned data assets. The Data Quality Oversight Board is accountable for the policies and procedures for oversight of the data quality community. The guidance provided includes:
The constituent participants work together to define and popularize a data quality strategy and framework; develop, formalize, and approve information policies, data quality standards and protocols; and certify line-of-business conformance to the desired level of business user expectations.
12.2.3 Define Data Quality Requirements
Quality of the data must be understood within the context of ‘fitness for use’. Most applications are dependent on the use of data that meets specific needs associated with the successful completion of a business process. Those business processes implement business policies imposed both through external means, such as regulatory compliance, observance of industry standards, or complying with data exchange formats, and through internal means, such as internal rules guiding marketing, sales, commissions, logistics, and so on. Data quality requirements are often hidden within defined business policies. Incremental detailed review and iterative refinement of the business policies helps to identify those information requirements which, in turn, become data quality rules.
Measuring conformance to ‘fitness for use’ requirements enables the reporting of meaningful metrics associated with well-defined data quality dimensions. The incremental detailed review steps include:
Segment the business rules according to the dimensions of data quality that characterize the measurement of high-level indicators. Include details on the level of granularity of the measurement, such as data value, data element, data record, and data table, that are required for proper implementation. Dimensions of data quality include:
12.2.4 Profile, Analyze and Assess Data Quality
Prior to defining data quality metrics, it is crucial to perform an assessment of the data using two different approaches, bottom-up and top-down.
The bottom-up assessment of existing data quality issues involves inspection and evaluation of the data sets themselves. Direct data analysis will reveal potential data anomalies that should be brought to the attention of subject matter experts for validation and analysis. Bottom-up approaches highlight potential issues based on the results of automated processes, such as frequency analysis, duplicate analysis, cross-data set dependency, ‘orphan child’ data rows, and redundancy analysis.
However, potential anomalies, and even true data flaws may not be relevant within the business context unless vetted with the constituency of data consumers. The top-down approach to data quality assessment involves engaging business users to document their business processes and the corresponding critical data dependencies. The top-down approach involves understanding how their processes consume data, and which data elements are critical to the success of the business application. By reviewing the types of reported, documented, and diagnosed data flaws, the data quality analyst can assess the kinds of business impacts that are associated with data issues.
The steps of the analysis process are:
In essence, the process uses statistical analysis of many aspects of data sets to evaluate:
Use these statistics to identify any obvious data issues that may have high impact and that are suitable for continuous monitoring as part of ongoing data quality inspection and control. Interestingly, important business intelligence may be uncovered just in this analysis step. For instance, an event in the data that occurs rarely (an outlier) may point to an important business fact, such as a rare equipment failure may be linked to a suspected underachieving supplier.
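A simple statistical scan of the kind described can be sketched as follows; the failure-count data and the z-score cutoff are invented for the example:

```python
import statistics

def find_outliers(values, z_cutoff=3.0):
    """Flag values whose z-score exceeds the cutoff; candidates for steward review."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []
    return [v for v in values if abs(v - mean) / stdev > z_cutoff]

# Daily equipment-failure counts; the single spike is the kind of rare event
# that may point to an important business fact rather than a data error.
counts = [2, 3, 2, 4, 3, 2, 3, 2, 40, 3, 2, 3]
suspects = find_outliers(counts)
```

Whether a flagged value is a data flaw or genuine business intelligence still requires vetting with subject matter experts, as the text emphasizes.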
12.2.5 Define Data Quality Metrics
For most functions, metrics development occurs at the end of the lifecycle in order to maintain performance over time. For DQM, however, it occurs during the strategy / design / plan step, in order to implement the function in an organization.
Poor data quality affects the achievement of business objectives. The data quality analyst must seek out and use indicators of data quality performance to report the relationship between flawed data and missed business objectives. Seeking these indicators introduces a challenge of devising an approach for identifying and managing “business-relevant” information quality metrics. View the approach to measuring data quality similarly to monitoring any other type of business performance activity; data quality metrics should exhibit the characteristics of reasonable metrics, defined in the context of the data quality dimensions discussed in a previous section. These characteristics include, but are not limited to:
The process for defining data quality metrics is summarized as:
The result is a set of measurement processes that provide raw data quality scores that can roll up to quantify conformance to data quality expectations. Measurements that do not meet the specified acceptability thresholds indicate nonconformance, showing that some data remediation is necessary.
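A roll-up of raw scores against acceptability thresholds might be sketched like this; the rule names, scores, and thresholds are placeholders, not taken from the text:

```python
# Illustrative roll-up of raw rule scores into a conformance report.

def rollup(scores, thresholds):
    """Compare each raw score to its acceptability threshold; flag nonconformance."""
    report = {}
    for rule, score in scores.items():
        threshold = thresholds[rule]
        report[rule] = {"score": score,
                        "threshold": threshold,
                        "conformant": score >= threshold}
    return report

scores = {"completeness": 0.97, "uniqueness": 0.88}
thresholds = {"completeness": 0.95, "uniqueness": 0.99}
report = rollup(scores, thresholds)

# Rules scoring below threshold indicate that data remediation is necessary.
remediation_needed = [r for r, v in report.items() if not v["conformant"]]
```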
12.2.6 Define Data Quality Business Rules
The process of instituting the measurement of conformance to specific business rules requires definition. Monitoring conformance to these business rules requires:
The first process uses assertions of expectations of the data. The data sets conform to those assertions or they do not. More complex rules can incorporate those assertions with actions or directives that support the second and third processes, generating a notification when data instances do not conform, or attempting to transform a data value identified as being in error. Use templates to specify these business rules, such as:
Other types of rules may involve aggregate functions applied to sets of data instances. Examples include validating reasonableness of the number of records in a file, the reasonableness of the average amount in a set of transactions, or the expected variance in the count of transactions over a specified timeframe.
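Aggregate rules of this kind reduce to simple bounds checks over a data set. In this hedged sketch, the transaction amounts and the reasonableness bounds are invented for illustration:

```python
import statistics

# Aggregate-level reasonableness checks over a set of data instances.

def check_record_count(records, low, high):
    """Is the number of records in the file within a reasonable range?"""
    return low <= len(records) <= high

def check_average_amount(amounts, low, high):
    """Is the average transaction amount within a reasonable range?"""
    return low <= statistics.mean(amounts) <= high

transactions = [120.0, 95.5, 110.25, 99.0]
count_ok = check_record_count(transactions, low=3, high=10_000)
avg_ok = check_average_amount(transactions, low=50.0, high=500.0)
```

A variance check over a specified timeframe would follow the same pattern, comparing `statistics.pvariance` of windowed counts against expected bounds.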
Providing rule templates helps bridge the gap in communicating between the business team and the technical team. Rule templates convey the essence of the business expectation. It is possible to exploit the rule templates when a need exists to transform rules into formats suitable for execution, such as embedded within a rules engine, or the data analyzer component of a data-profiling tool, or code in a data integration tool.
12.2.7 Test and Validate Data Quality Requirements
Data profiling tools analyze data to find potential anomalies, as described in section 12.3.1. Use these same tools for rule validation as well. Rules discovered or defined during the data quality assessment phase are then referenced in measuring conformance as part of the operational processes.
Most data profiling tools allow data analysts to define data rules for validation, assessing frequency distributions and corresponding measurements, and then applying the defined rules against the data sets.
Reviewing the results, and verifying whether data flagged as non-conformant is truly incorrect, provides one level of testing. In addition, it is necessary to review the defined business rules with the business clients to make sure that they understand them, and that the business rules correspond to their business requirements.
Characterizing data quality levels based on data rule conformance provides an objective measure of data quality. By using defined data rules proactively to validate data, an organization can distinguish those records that conform to defined data quality expectations and those that do not. In turn, these data rules are used to baseline the current level of data quality as compared to ongoing audits.
12.2.8 Set and Evaluate Data Quality Service Levels
Data quality inspection and monitoring are used to measure and monitor compliance with defined data quality rules. Data quality SLAs (Service Level Agreements) specify the organization’s expectations for response and remediation. Data quality inspection helps to reduce the number of errors while enabling the isolation and root cause analysis of data flaws, with the expectation that the operational procedures will provide a scheme for remediating the root cause within an agreed-to timeframe.
Having data quality inspection and monitoring in place increases the likelihood of detection and remediation of a data quality issue before a significant business impact can occur.
Operational data quality control defined in a data quality SLA, includes:
The data quality SLA also defines the roles and responsibilities associated with performance of operational data quality procedures. The operational data quality procedures provide reports on the conformance to the defined business rules, as well as monitoring staff performance in reacting to data quality incidents. Data stewards and the operational data quality staff, while upholding the level of data quality service, should take their data quality SLA constraints into consideration and connect data quality to individual performance plans.
When issues are not addressed within the specified resolution times, an escalation process must exist to communicate non-observance of the level of service up the management chain. The data quality SLA establishes the time limits for notification generation, the names of those in that management chain, and when escalation needs to occur. Given the set of data quality rules, methods for measuring conformance, the acceptability thresholds defined by the business clients, and the service level agreements, the data quality team can monitor compliance of the data to the business expectations, as well as how well the data quality team performs on the procedures associated with data errors.
12.2.9 Continuously Measure and Monitor Data Quality
The operational DQM procedures depend on available services for measuring and monitoring the quality of data. For conformance to data quality business rules, two contexts for control and measurement exist: in-stream and batch. In turn, apply measurements at three levels of granularity, namely data element value, data instance or record, and data set, making six possible measures. Collect in-stream measurements while creating the data, and perform batch activities on collections of data instances assembled in a data set, likely in persistent storage.
Provide continuous monitoring by incorporating control and measurement processes into the information processing flow. It is unlikely that data set measurements can be performed in-stream, since the measurement may need the entire set. The only in-stream points are when full data sets hand off between processing stages. Incorporate data quality rules using the techniques detailed in Table 12.1. Incorporating the results of the control and measurement processes into both the operational procedures and reporting frameworks enable continuous monitoring of the levels of data quality.
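An in-stream, element-level check of the kind listed in Table 12.1 can be sketched as a validation service invoked as each record is created. The field names and rules here are illustrative assumptions:

```python
# Sketch of an in-stream, element-level validation hook: each record is checked
# as it is created, and nonconforming records are routed for steward review.

VALIDATORS = {
    "quantity": lambda v: isinstance(v, int) and v >= 0,   # reasonableness
    "sku": lambda v: isinstance(v, str) and len(v) == 8,   # structural consistency
}

def validate_in_stream(record):
    """Return the list of fields that fail their element-level rule."""
    return [f for f, rule in VALIDATORS.items()
            if f not in record or not rule(record[f])]

good = validate_in_stream({"quantity": 3, "sku": "AB123456"})
bad = validate_in_stream({"quantity": -1, "sku": "AB12"})
```

The batch equivalents in Table 12.1 (direct queries, profiling tools) would apply the same rules to a persisted data set rather than record by record.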
12.2.10 Manage Data Quality Issues
Supporting the enforcement of the data quality SLA requires a mechanism for reporting and tracking data quality incidents and activities for researching and resolving those incidents. A data quality incident reporting system can provide this capability. It can log the evaluation, initial diagnosis, and subsequent actions associated with data quality events. Tracking of data quality incidents can also provide performance reporting data, including mean-time-to-resolve issues, frequency of occurrence of issues, types of issues, sources of issues, and common approaches for correcting or eliminating problems. A good issues tracking system will eventually become a reference source of current and historic issues, their statuses, and any factors that may need the actions of others not directly involved in the resolution of the issue.
Granularity | In-stream | Batch
Data Element: Completeness, structural consistency, reasonableness | Edit checks in application; data element validation services; specially programmed applications | Direct queries; data profiling or analyzer tool
Data Record: Completeness, structural consistency, semantic consistency, reasonableness | Edit checks in application; data record validation services; specially programmed applications | Direct queries; data profiling or analyzer tool
Data Set: Aggregate measures, such as record counts, sums, mean, variance | Inspection inserted between processing stages | Direct queries; data profiling or analyzer tool
Table 12.1 Techniques for incorporating measurement and monitoring.
Many organizations already have incident reporting systems for tracking and managing software, hardware, and network issues. Incorporating data quality incident tracking focuses on organizing the categories of data issues into the incident hierarchies. Data quality incident tracking also requires a focus on training staff to recognize when data issues appear and how they are to be classified, logged, and tracked according to the data quality SLA. The steps involve some or all of these directives:
Implementing a data quality issues tracking system provides a number of benefits. First, information and knowledge sharing can improve performance and reduce duplication of effort. Second, an analysis of all the issues will help data quality team members determine any repetitive patterns, their frequency, and potentially the source of the issue. Employing an issues tracking system trains people to recognize data issues early in the information flows, as a general practice that supports their day-to-day operations. The issues tracking system raw data is input for reporting against the SLA conditions and measures. Depending on the governance established for data quality, SLA reporting can be monthly, quarterly or annually, particularly in cases focused on rewards and penalties.
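The performance reporting described above reduces to simple aggregations over the incident log. In this toy sketch, the incident fields, dates, and categories are invented for the example:

```python
from datetime import datetime

# Toy incident log illustrating mean-time-to-resolve and frequency-of-occurrence
# reporting over tracked data quality incidents.

incidents = [
    {"type": "missing value", "opened": datetime(2024, 1, 1), "closed": datetime(2024, 1, 3)},
    {"type": "duplicate record", "opened": datetime(2024, 1, 2), "closed": datetime(2024, 1, 8)},
    {"type": "missing value", "opened": datetime(2024, 1, 5), "closed": datetime(2024, 1, 6)},
]

def mean_time_to_resolve_days(log):
    """Average days between an incident being opened and closed."""
    durations = [(i["closed"] - i["opened"]).days for i in log]
    return sum(durations) / len(durations)

def frequency_by_type(log):
    """Count incidents per category; repetitive patterns suggest a common source."""
    freq = {}
    for i in log:
        freq[i["type"]] = freq.get(i["type"], 0) + 1
    return freq

mttr = mean_time_to_resolve_days(incidents)
freq = frequency_by_type(incidents)
```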
12.2.11 Clean and Correct Data Quality Defects
The use of business rules for monitoring conformance to expectations leads to two operational activities. The first is to determine and eliminate the root cause of the introduction of errors. The second is to isolate the data items that are incorrect, and provide a means for bringing the data into conformance with expectations. In some situations, it may be as simple as throwing away the results and restarting the information process from the point where the errors were introduced. In other situations, throwing away the results is not possible, which means correcting the errors.
Perform data correction in three general ways:
12.2.12 Design and Implement Operational DQM Procedures
Using defined rules for validation of data quality provides a means of integrating data inspection into a set of operational procedures associated with active DQM. Integrate the data quality rules into application services or data services that supplement the data life cycle, either through the introduction of data quality tools and technology, the use of rules engines and reporting tools for monitoring and reporting, or custom-developed applications for data quality inspection.
The operational framework requires these services to be available to the applications and data services, and the results presented to the data quality team members. Data quality operations team members are responsible for four activities. The team must design and implement detailed procedures for operationalizing these activities.
12.2.13 Monitor Operational DQM Procedures and Performance
Accountability is critical to the governance protocols overseeing data quality control. All issues must be assigned to some number of individuals, groups, departments, or organizations. The tracking process should specify and document the ultimate issue accountability to prevent issues from dropping through the cracks. Since the data quality SLA specifies the criteria for evaluating the performance of the data quality team, it is reasonable to expect that the incident tracking system will collect performance data relating to issue resolution, work assignments, volume of issues, frequency of occurrence, as well as the time to respond, diagnose, plan a solution, and resolve issues. These metrics can provide valuable insights into the effectiveness of the current workflow, as well as systems and resource utilization, and are important management data points that can drive continuous operational improvement for data quality control.
12.3 Data Quality Tools
DQM employs well-established tools and techniques. These utilities range in focus from empirically assessing the quality of data through data analysis, to the normalization of data values in accordance with defined business rules, to the ability to identify and resolve duplicate records into a single representation, and to schedule these inspections and changes on a regular basis. Data quality tools can be segregated into four categories of activities: Analysis, Cleansing, Enhancement, and Monitoring. The principal tools used are data profiling, parsing and standardization, data transformation, identity resolution and matching, enhancement, and reporting. Some vendors bundle these functions into more complete data quality solutions.
12.3.1 Data Profiling
Before making any improvements to data, one must first be able to distinguish between good and bad data. The attempt to qualify data quality is a process of analysis and discovery. The analysis involves an objective review of the data values populating data sets through quantitative measures and analyst review. A data analyst may not necessarily be able to pinpoint all instances of flawed data. However, the ability to document situations where data values look like they do not belong provides a means to communicate these instances with subject matter experts, whose business knowledge can confirm the existence of data problems.
Data profiling is a set of algorithms for two purposes:
For each column in a table, a data-profiling tool will provide a frequency distribution of the different values, providing insight into the type and use of each column. In addition, column profiling can summarize key characteristics of the values within each column, such as the minimum, maximum, and average values.
Cross-column analysis can expose embedded value dependencies, while inter-table analysis explores overlapping values sets that may represent foreign key relationships between entities. In this way, data profiling analyzes and assesses data anomalies. Most data profiling tools allow for drilling down into the analyzed data for further investigation.
Data profiling can also proactively test against a set of defined (or discovered) business rules. The results can be used to distinguish records that conform to defined data quality expectations from those that don’t, which in turn can contribute to baseline measurements and ongoing auditing that supports the data quality reporting processes.
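The column-level profiling described above can be sketched in a few lines; the age data and the specific statistics chosen are illustrative assumptions, and real profiling tools compute far more:

```python
from collections import Counter

# Minimal column profiler: frequency distribution of values plus min/max/average
# for the numeric portion of a column, and a null count for completeness.

def profile_column(values):
    numeric = [v for v in values if isinstance(v, (int, float))]
    profile = {
        "frequency": Counter(values),
        "null_count": sum(1 for v in values if v is None),
    }
    if numeric:
        profile.update(minimum=min(numeric),
                       maximum=max(numeric),
                       average=sum(numeric) / len(numeric))
    return profile

ages = [34, 41, 34, None, 29]
p = profile_column(ages)
```

Cross-column and inter-table analysis build on the same idea, comparing value sets across columns to expose dependencies and candidate foreign key relationships.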
12.3.2 Parsing and Standardization
Data parsing tools enable the data analyst to define sets of patterns that feed into a rules engine used to distinguish between valid and invalid data values. Actions are triggered upon matching a specific pattern. Extract and rearrange the separate components (commonly referred to as “tokens”) into a standard representation when parsing a valid pattern. When an invalid pattern is recognized, the application may attempt to transform the invalid value into one that meets expectations.
Many data quality issues arise when a slight variance in data value representation introduces confusion or ambiguity, which makes parsing and standardizing data values valuable. For example, consider the different ways telephone numbers expected to conform to a Numbering Plan are formatted: some contain only digits, others include alphabetic characters, and different special characters may be used for separation. People can recognize each one as being a telephone number. However, in order to determine if these numbers are accurate (perhaps by comparing them to a master customer directory), or to investigate whether duplicate numbers exist when there should be only one for each supplier, the values must be parsed into their component segments (area code, exchange, and line number) and then transformed into a standard format.
The human ability to recognize familiar patterns contributes to our ability to characterize variant data values belonging to the same abstract class of values; people recognize different types of telephone numbers because they conform to frequently used patterns. An analyst describes the format patterns that all represent a data object, such as Person Name, Product Description, and so on. A data quality tool parses data values that conform to any of those patterns, and even transforms them into a single, standardized form that will simplify the assessment, similarity analysis, and cleansing processes. Pattern-based parsing can automate the recognition and subsequent standardization of meaningful value components.
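Pattern-based parsing of the telephone number example can be sketched with a regular expression; this is a simplified illustration covering only a few common North American layouts, not a complete Numbering Plan validator:

```python
import re

# Parse common telephone number layouts into component segments (area code,
# exchange, line number) and emit one canonical, standardized form.
PHONE_PATTERN = re.compile(r"^\(?(\d{3})\)?[\s.-]?(\d{3})[\s.-]?(\d{4})$")

def standardize_phone(raw):
    """Return a canonical form for a valid pattern, or None for an invalid one."""
    match = PHONE_PATTERN.match(raw.strip())
    if not match:
        return None  # invalid pattern: candidate for review or transformation
    area, exchange, line = match.groups()
    return f"({area}) {exchange}-{line}"

variants = ["212-555-1234", "(212) 555 1234", "212.555.1234"]
standardized = {standardize_phone(v) for v in variants}
```

All three variant layouts collapse to a single standardized value, which is what makes later duplicate analysis and matching tractable.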
12.3.3 Data Transformation
Upon identification of data errors, trigger data rules to transform the flawed data into a format that is acceptable to the target architecture. Engineer these rules directly within a data integration tool or rely on alternate technologies embedded in or accessible from within the tool. Perform standardization by mapping data from some source pattern into a corresponding target representation. A good example is a “customer name,” since names may be represented in thousands of different forms. A good standardization tool will be able to parse the different components of a customer name, such as given name, middle name, family name, initials, titles, generational designations, and then rearrange those components into a canonical representation that other data services will be able to manipulate.
Data transformation builds on these types of standardization techniques. Guide rule- based transformations by mapping data values in their original formats and patterns into a target representation. Parsed components of a pattern are subjected to rearrangement, corrections, or any changes as directed by the rules in the knowledge base. In fact, standardization is a special case of transformation, employing rules that capture context, linguistics, and idioms recognized as common over time, through repeated analysis by the rules analyst or tool vendor.
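A deliberately simplified sketch of the customer name case: parse two common layouts into components, then rearrange them into one canonical representation. Real standardization tools handle titles, initials, generational designations, and far more variation than this:

```python
# Parse a couple of common "customer name" layouts into components, then emit
# a single canonical form. The layouts and canonical format are assumptions.

def parse_name(raw):
    raw = raw.strip()
    if "," in raw:                       # "Family, Given" layout
        family, given = [p.strip() for p in raw.split(",", 1)]
    else:                                # "Given Family" layout
        given, _, family = raw.partition(" ")
    return {"given": given, "family": family}

def canonical(raw):
    """Rearrange parsed components into a canonical representation."""
    parts = parse_name(raw)
    return f"{parts['family'].upper()}, {parts['given'].title()}"

# Two variant representations of the same name standardize to one form.
forms = {canonical("Smith, jane"), canonical("jane Smith")}
```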
12.3.4 Identity Resolution and Matching
Employ record linkage and matching in identity recognition and resolution, and incorporate approaches used to evaluate the “similarity” of records for use in duplicate analysis and elimination, merge / purge, householding, data enhancement, cleansing, and strategic initiatives such as customer data integration or master data management. A common data quality problem involves two sides of the same coin:
In the first situation, something introduced similar, yet variant representations in data values into the system. In the second situation, a slight variation in representation prevents the identification of an exact match of the existing record in the data set.
Both of these situations are addressed through a process called similarity analysis, in which the degree of similarity between any two records is scored, most often based on weighted approximate matching between a set of attribute values in the two records. If the score is above a specified threshold, the two records are a match and are presented to the end client as most likely to represent the same entity. It is through similarity analysis that slight variations are recognized and data values are connected and subsequently consolidated.
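Weighted similarity scoring against a threshold can be sketched as follows; the per-field comparison function, the field weights, and the 0.75 threshold are assumptions for the example, and production matchers use far more sophisticated measures:

```python
# Illustrative weighted approximate matching between two records.

def field_similarity(a, b):
    """Crude per-field similarity: exact match scores 1.0, else shared-prefix ratio."""
    if a == b:
        return 1.0
    prefix = 0
    for x, y in zip(a, b):
        if x != y:
            break
        prefix += 1
    return prefix / max(len(a), len(b))

WEIGHTS = {"name": 0.6, "city": 0.4}  # assumed relative importance of fields

def similarity(rec1, rec2):
    """Weighted sum of per-field similarities between two records."""
    return sum(w * field_similarity(rec1[f], rec2[f]) for f, w in WEIGHTS.items())

r1 = {"name": "Jon Smith", "city": "Boston"}
r2 = {"name": "Jon Smyth", "city": "Boston"}
score = similarity(r1, r2)
is_match = score >= 0.75  # above threshold: likely the same entity
```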
Attempting to compare each record against all the others to provide a similarity score is not only ambitious, but also time-consuming and computationally intensive. Most data quality tool suites use advanced algorithms for blocking records that are most likely to contain matches into smaller sets, whereupon different approaches are taken to measure similarity. Identifying similar records within the same data set probably means that the records are duplicates, and may need cleansing and / or elimination. Identifying similar records in different sets may indicate a link across the data sets, which helps facilitate cleansing, knowledge discovery, and reverse engineering—all of which contribute to master data aggregation.
Two basic approaches to matching are deterministic and probabilistic. Deterministic matching, like parsing and standardization, relies on defined patterns and rules for assigning weights and scores for determining similarity. Alternatively, probabilistic matching relies on statistical techniques for assessing the probability that any pair of records represents the same entity. Deterministic algorithms are predictable in that the patterns matched and the rules applied will always yield the same matching determination. Tie performance to the variety, number, and order of the matching rules. Deterministic matching works out of the box with relatively good performance, but it is only as good as the situations anticipated by the rules developers.
Probabilistic matching relies on the ability to take data samples for training purposes by looking at the expected results for a subset of the records and tuning the matcher to self-adjust based on statistical analysis. These matchers are not reliant on rules, so the results may be nondeterministic. However, because the probabilities can be refined based on experience, probabilistic matchers are able to improve their matching precision as more data is analyzed.
12.3.5 Enhancement
Data enhancement increases the value of an organization's data. It is a method for adding value to information by accumulating additional information about a base set of entities and then merging all the sets of information to provide a focused view of the data. Data enhancement is a process of intelligently adding data from alternate sources as a byproduct of knowledge inferred from applying other data quality techniques, such as parsing, identity resolution, and data cleansing.
Data parsing assigns characteristics to the data values appearing in a data instance, and those characteristics help in determining potential sources of added benefit. For example, if a business name can be recognized embedded in an attribute called name, then that data value can be tagged as a business name. The same approach applies to any situation in which data values organize into semantic hierarchies.
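The business-name example can be sketched as a simple pattern-based classifier. The suffix list and the two-way person/business split are illustrative assumptions; a production parser would draw on much richer token dictionaries and probabilistic models.

```python
import re

# Hypothetical indicator tokens for business entities (illustrative, not exhaustive).
BUSINESS_SUFFIXES = re.compile(r"\b(inc|llc|ltd|corp|co)\.?$", re.IGNORECASE)

def classify_name(value: str) -> str:
    """Tag a name value as 'business' or 'person' based on a parsed suffix."""
    return "business" if BUSINESS_SUFFIXES.search(value.strip()) else "person"

tag = classify_name("Acme Widgets LLC")  # tagged as a business name
```

Once tagged, the value can be routed to the appropriate enrichment source, such as a company registry for business names rather than a consumer demographic file.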
Appending information about the cleansing and standardizations that have been applied provides additional input for later data matching, record linkage, and identity resolution processes. By creating an associative representation of the data that imposes a meta-context on it, and by adding detail about the data, more knowledge is collected about the actual content, not just the structure, of that information. The associative representation enables more interesting inferences about the data, and consequently allows more information to be used for data enhancement. Some examples of data enhancement include:
12.3.6 Reporting
Good reporting supports the inspection and monitoring of conformance to data quality expectations, monitoring of data stewards' performance against data quality SLAs, workflow processing for data quality incidents, and manual oversight of data cleansing and correction. Ideally, a user interface reports the results associated with data quality measurement, metrics, and activity. It is wise to include visualization and reporting capabilities for standard reports, scorecards, dashboards, and ad hoc queries among the functional requirements for any acquired data quality tools.
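The scorecard idea behind such reporting can be sketched as rule-conformance percentages computed per measurement run. The rule names, fields, and thresholds below are illustrative assumptions; real tools persist these results over time to drive dashboards and SLA alerts.

```python
def scorecard(records: list, rules: dict) -> dict:
    """Compute a simple data quality scorecard: per-rule conformance percentages."""
    report = {}
    for name, rule in rules.items():
        passed = sum(1 for r in records if rule(r))
        report[name] = round(100.0 * passed / len(records), 1)
    return report

# Hypothetical conformance rules for a customer record set.
rules = {
    "email_populated": lambda r: bool(r.get("email")),
    "age_in_range": lambda r: 0 <= r.get("age", -1) <= 120,
}
records = [
    {"email": "a@example.com", "age": 34},
    {"email": "", "age": 200},
]
report = scorecard(records, rules)
```

Trending these percentages across runs, rather than inspecting a single snapshot, is what makes the scorecard useful for SLA monitoring and steward performance reporting.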
12.4 Summary
The guiding principles for implementing DQM in an organization, a summary table of the roles for each DQM activity, and the organizational and cultural issues that may arise during data quality management are summarized below.
12.4.1 Setting Data Quality Guiding Principles
When assembling a DQM program, it is reasonable to assert a set of guiding principles that frame the type of processes and uses of technology described in this chapter. Align any activities undertaken to support the data quality practice with one or more of the guiding principles. Every organization is different, with varying motivating factors. Some sample statements that might be useful in a Data Quality Guiding Principles document include:
12.4.2 Process Summary
The process summary for the DQM function is shown in Table 12.2. The deliverables, responsible roles, approving roles, and contributing roles are shown for each activity in the data quality management function. The table is also shown in Appendix A9.
| Activities | Deliverables | Responsible Roles | Approving Roles | Contributing Roles |
|---|---|---|---|---|
| 10.1 Develop and Promote Data Quality Awareness (O) | Data quality training; Data Governance processes; Established Data Stewardship Council | Data Quality Manager | Business Managers, DRM Director | Information Architects, Subject Matter Experts |
| 10.2 Define Data Quality Requirements (D) | Data Quality Requirements Document | Data Quality Manager, Data Quality Analysts | Business Managers, DRM Director | Information Architects, Subject Matter Experts |
| 10.3 Profile, Analyze, and Assess Data Quality (D) | Data Quality Assessment Report | Data Quality Analysts | Business Managers, DRM Director | Data Stewardship Council |
| 10.4 Define Data Quality Metrics (P) | Data Quality Metrics Document | Data Quality Manager, Data Quality Analysts | Business Managers, DRM Director | Data Stewardship Council |
| 10.5 Define Data Quality Business Rules (P) | Data Quality Business Rules | Data Quality Analysts | Business Managers, DRM Director, Data Quality Manager | Information Architects, Subject Matter Experts, Data Stewardship Council |
| 10.6 Test and Validate Data Quality Requirements (D) | Data Quality Test Cases | Data Quality Analysts | Business Managers, DRM Director | Information Architects, Subject Matter Experts |
| 10.7 Set and Evaluate Data Quality Service Levels (P) | Data Quality Service Levels | Data Quality Manager | Business Managers, DRM Director | Data Stewardship Council |
| 10.8 Continuously Measure and Monitor Data Quality (C) | Data Quality Reports | Data Quality Manager | Business Managers, DRM Director | Data Stewardship Council |
| 10.9 Manage Data Quality Issues (C) | Data Quality Issues Log | Data Quality Manager, Data Quality Analysts | Business Managers, DRM Director | Data Stewardship Council |
| 10.10 Clean and Correct Data Quality Defects (O) | Data Quality Defect Resolution Log | Data Quality Analysts | Business Managers, DRM Director | Information Architects, Subject Matter Experts |
| 10.11 Design and Implement Operational DQM Procedures (D) | Operational DQM Procedures | Data Quality Manager, Data Quality Analysts | Business Managers, DRM Director | Information Architects, Subject Matter Experts, Data Stewardship Council |
| 10.12 Monitor Operational DQM Procedures and Performance (C) | Operational DQM Metrics | Data Quality Manager, Data Quality Analysts | Business Managers, DRM Director | Data Stewardship Council |
Table 12.2 Data Quality Management Process Summary
12.4.3 Organizational and Cultural Issues
Q1: Is it really necessary to have quality data if there are many processes to change the data into information and use the information for business intelligence purposes?
A1: The business intelligence value chain shows that the quality of the data resource directly impacts the business goals of the organization. The foundation of the value chain is the data resource. Information is produced from the data resource through information engineering, much as products are developed from raw materials. Knowledge workers use that information to provide the business intelligence necessary to manage the organization, and the business intelligence supports the business strategies, which in turn support the business goals. Through the business intelligence value chain, the quality of the data directly affects how successfully the business goals are met. Therefore, the emphasis on quality must be placed on the data resource itself, not only on the downstream information development and business intelligence processes.
Q2: Is data quality really free?
A2: An analogy with the second law of thermodynamics applies: left unmanaged, a data resource tends toward disorder, and its quality will continue to decrease without limit. Energy must be expended to create and maintain a quality data resource, and that energy comes at a cost. Both the initial quality of the data resource and the ongoing maintenance of that quality come at a cost. Therefore, data quality is not free.
It is less costly to build quality into the data resource from the beginning than it is to build it in later. It is also less costly to maintain data quality throughout the life of the data resource than to improve it in major steps. When the quality of the data resource is allowed to deteriorate, improving it becomes far more costly and has a far greater impact on the business. Therefore, quality is not free, but it is less costly to build in and maintain. What most people mean when they say that data quality is free is that the cost-benefit ratio of maintaining data quality from the beginning is better than that of allowing the quality to deteriorate.
Q3: Are data quality issues something new that have surfaced recently with evolving technology?
A3: No. Data quality problems have always existed, even back in the 80-column punched-card days. The problem is getting worse with the increasing quantity and age of the data being maintained. It is also becoming more visible as processing techniques grow more powerful and encompass a wider range of data. Data that appeared to be high quality in yesterday's isolated systems now shows its low quality when combined in today's organization-wide analysis processes.
Every organization must become aware of the quality of its data if it is to use that data effectively and efficiently to support the business. Any organization that considers data quality a recent issue that can be postponed for later consideration is putting the survival of its business at risk. The current economic climate is not the time to put the company's survival on the line by ignoring the quality of its data.
Q4: Is there one thing to do more than any other for ensuring high data quality?
A4: The most important thing is to establish a single enterprise-wide data architecture, and then build and maintain all data within that architecture. A single enterprise-wide data architecture does not mean that all data are stored in one central repository. It does mean that all data are developed and managed within the context of a single enterprise-wide data architecture. The data can be deployed as necessary for operational efficiency.
As soon as any organization allows data to be developed within multiple data architectures, or worse yet, without any data architecture, there will be monumental problems with data quality. Even if an attempt is made to coordinate multiple data architectures, there will be considerable data quality problems. Therefore, the most important thing is to manage all data within a single enterprise-wide data architecture.
12.5 Recommended Reading
The references listed below provide additional reading that supports the material presented in Chapter 12. These recommended readings are also included in the Bibliography at the end of the Guide.
Batini, Carlo, and Monica Scannapieco. Data Quality: Concepts, Methodologies and Techniques. Springer, 2006. ISBN 3-540-33172-7. 262 pages.
Brackett, Michael H. Data Resource Quality: Turning Bad Habits into Good Practices. Addison-Wesley, 2000. ISBN 0-201-71306-3. 384 pages.
Deming, W. Edwards. Out of the Crisis. The MIT Press, 2000. ISBN 0262541157. 507 pages.
English, Larry. Improving Data Warehouse and Business Information Quality: Methods for Reducing Costs and Increasing Profits. John Wiley & Sons, 1999. ISBN 0-471-25383-9. 518 pages.
Huang, Kuan-Tsae, Yang W. Lee, and Richard Y. Wang. Quality Information and Knowledge. Prentice Hall, 1999. ISBN 0-130-10141-9. 250 pages.
Loshin, David. Enterprise Knowledge Management: The Data Quality Approach. Morgan Kaufmann, 2001. ISBN 0-124-55840-2. 494 pages.
Loshin, David. Master Data Management. Morgan Kaufmann, 2009. ISBN 0123742250. 288 pages.
Maydanchik, Arkady. Data Quality Assessment. Technics Publications, LLC, 2007. ISBN 0977140024. 336 pages.
McGilvray, Danette. Executing Data Quality Projects: Ten Steps to Quality Data and Trusted Information. Morgan Kaufmann, 2008. ISBN 0123743699. 352 pages.
Olson, Jack E. Data Quality: The Accuracy Dimension. Morgan Kaufmann, 2003. ISBN 1-558-60891-5. 294 pages.
Redman, Thomas. Data Quality: The Field Guide. Digital Press, 2001. ISBN 1-555-59251-6. 256 pages.