Chapter 11
Data Quality Management

Effective data management involves a set of interrelated processes that enable an organization to use its data to achieve strategic goals. Data management includes the ability to design data for applications, store and access it securely, share it appropriately, and learn from it to meet strategic and operational objectives. Organizations that are trying to get value from their data need to know that their data is reliable and trustworthy; in other words, that it is of high quality. But many factors can undermine data quality:

  • Lack of understanding about the effects of poor quality data on organizational success
  • Bad or insufficient planning
  • Isolated design of processes and systems (‘silos’)
  • Inconsistent technical development processes
  • Incomplete documentation and Metadata
  • Lack of standards and governance

Many organizations simply fail to define what makes data fit for purpose in the first place and therefore lack commitment to data quality.

All data management disciplines contribute to the quality of data, and high-quality data that supports the organization should be the goal of all data management disciplines. Because uninformed decisions or actions by anyone who interacts with data can result in poor quality data, producing high-quality data requires cross-functional commitment and coordination. Organizations and teams should be aware of this and should plan for high-quality data by executing processes and projects in ways that account for the risks related to unexpected or unacceptable conditions in the data.

Because no organization has perfect business processes, perfect technical processes, or perfect data management practices, all organizations experience problems related to the quality of their data. These problems can be very costly. Organizations that formally manage the quality of data have fewer problems than those that leave data quality to chance.

Data quality is becoming a business necessity. The ability to demonstrate that data is of high quality, like the ability to demonstrate that data has been protected properly, is required by some regulations. Business partners and customers expect data to be reliable. An organization that can show that it manages its data well gains a competitive advantage.

This chapter will define key concepts related to data quality and discuss data quality management in relation to overall data management.

Data quality

The term data quality is used to refer both to the characteristics associated with high-quality data and to the processes used to measure or improve the quality of data. This dual usage can be confusing, so it helps to look at both meanings, starting with high-quality data. Later in the chapter we will look at the definition of data quality management.

Data is of high quality to the degree that it meets the expectations and needs of data consumers; that is, data is of high quality if it is fit for the purposes of those consumers. It is of low quality if it is not fit for those purposes. Data quality is thus dependent on context and on the needs of data consumers.

One of the challenges in managing the quality of data is that expectations related to quality are not always known. Customers may not articulate them. Often, the people managing data do not even ask about these requirements. But if data is to be reliable and trustworthy, then data management professionals need to better understand the quality requirements of their customers and how to measure them and meet them. The conversation about expectations needs to be ongoing, because requirements change over time as business needs and external forces evolve.

Dimensions of data quality

A data quality dimension is a measurable feature or characteristic of data. The term dimension is used to make the connection to dimensions in the measurement of physical objects (e.g., length, width, height). Data quality dimensions provide a vocabulary for defining data quality requirements. From there, they can be used to define results of initial data quality assessment as well as ongoing measurement. In order to measure the quality of data, an organization needs to establish characteristics that are not only important to business processes (worth measuring) but also measurable and actionable.

Dimensions provide a basis for measurable rules, which themselves should be directly connected to potential risks in critical processes. For example:

  • A risk: If the data in the customer email address field is incomplete, then we will not be able to send product information to our customers via email, and we will lose potential sales.
  • A means of mitigating the risk: We will measure the percentage of customers for whom we have usable email addresses, and we will improve our processes until we have a usable email address for at least 98% of our customers.
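
To make this concrete, below is a minimal sketch of how such a measurement might be implemented. It assumes customer records with a hypothetical email field, and a simple pattern match stands in for whatever definition of "usable" an organization actually adopts.

    import re

    # Naive pattern used as a stand-in for "usable email address";
    # a real program would apply the organization's own validity rules.
    EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

    def usable_email_rate(customers):
        """Return the fraction of customer records with a usable email address."""
        if not customers:
            return 0.0
        usable = sum(
            1 for c in customers
            if c.get("email") and EMAIL_PATTERN.match(c["email"])
        )
        return usable / len(customers)

    customers = [
        {"id": 1, "email": "pat@example.com"},
        {"id": 2, "email": None},
        {"id": 3, "email": "not-an-address"},
    ]

    rate = usable_email_rate(customers)
    print(f"Usable email addresses: {rate:.1%}")  # 33.3%
    print("Meets the 98% target:", rate >= 0.98)  # False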

Many leading thinkers have written about data quality dimensions.42 While there is not a single, agreed-to set of data quality dimensions, all sets contain common ideas. Dimensions include some characteristics that can be measured objectively (completeness, validity, format conformity) and others that depend heavily on context or on subjective interpretation (usability, reliability, reputation). Whatever names are used, dimensions focus on whether there is enough data (completeness), whether it is correct (accuracy, validity), how well it fits together (consistency, integrity, uniqueness), whether it is up-to-date (timeliness), accessible, usable, and secure.

In 2013, DAMA United Kingdom produced a white paper proposing six core dimensions of data quality. Their set included:

  • Completeness: The proportion of data stored against the potential for 100%.
  • Uniqueness: No entity instance (thing) will be recorded more than once based upon how that thing is identified.
  • Timeliness: The degree to which data represent reality from the required point in time.
  • Validity: Data is valid if it conforms to the syntax (format, type, range) of its definition.
  • Accuracy: The degree to which data correctly describes the ‘real world’ object or event being described.
  • Consistency: The absence of difference, when comparing two or more representations of a thing against a definition.
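
As an illustration of how these dimensions translate into concrete checks, the sketch below computes completeness, uniqueness, and validity for a small hypothetical customer table. The field names, the identifying key, and the reference list of valid country codes are assumptions made for the example.

    records = [
        {"customer_id": "C001", "country": "CA"},
        {"customer_id": "C002", "country": None},
        {"customer_id": "C002", "country": "US"},  # duplicate key
        {"customer_id": "C003", "country": "XX"},  # value outside the domain
    ]

    VALID_COUNTRIES = {"CA", "US", "GB"}  # assumed reference data

    total = len(records)

    # Completeness: proportion of records with a populated country.
    completeness = sum(1 for r in records if r["country"] is not None) / total

    # Uniqueness: each customer (identified by customer_id) recorded only once.
    uniqueness = len({r["customer_id"] for r in records}) / total

    # Validity: country conforms to the defined domain of values.
    validity = sum(1 for r in records if r["country"] in VALID_COUNTRIES) / total

    print(f"Completeness (country): {completeness:.0%}")  # 75%
    print(f"Uniqueness (customer_id): {uniqueness:.0%}")  # 75%
    print(f"Validity (country): {validity:.0%}")          # 50%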

The DAMA UK white paper also describes other characteristics that have an impact on quality:

  • Usability: Is the data understandable, relevant, accessible, maintainable and at the right level of precision?
  • Timing issues (beyond timeliness itself): Is it stable yet responsive to legitimate change requests?
  • Flexibility: Is the data comparable and compatible with other data? Does it have useful groupings and classifications? Can it be repurposed? Is it easy to manipulate?
  • Confidence: Are data governance, data protection and data security in place? What is the reputation of the data, and is it verified or verifiable?
  • Value: Is there a good cost / benefit case for the data? Is it being optimally used? Does it endanger people’s safety or privacy or the legal responsibilities of the enterprise? Does it support or contradict the corporate image or the corporate message?

Any organization that wants to improve the quality of its data should adopt or develop a set of dimensions through which to measure quality. Coming to consensus about dimensions of quality can provide a starting point for a common vocabulary around quality.

Data quality management

As noted above, sometimes the term data quality is used to refer to the processes used to measure or improve the quality of data. These processes constitute data quality management. While all data management functions have the potential to impact the quality of data, formal data quality management focuses on helping the organization:

  • Define high-quality data, through DQ standards, rules, and requirements
  • Assess data against those standards and communicate results to stakeholders
  • Monitor and report on the quality of data in applications and data stores
  • Identify issues and advocate for opportunities for improvement

Formal data quality management is similar to continuous quality management for other products. It includes managing data through its lifecycle by setting standards, building quality into the processes that create, transform, and store data, and measuring data against standards. Managing data to this level usually requires a data quality program team. The data quality program team is responsible for engaging both business and technical data management professionals and driving the work of applying quality management techniques to data to ensure that data is fit for consumption for a variety of purposes.

The team will likely be involved with a series of projects through which they can establish processes and best practices while addressing high priority data issues. Because managing the quality of data involves managing the data lifecycle, a data quality program will also have operational responsibilities related to data usage, such as reporting on data quality levels and engaging in the analysis, quantification, and prioritization of data issues.

The team is also responsible for working with those who need data to do their jobs to ensure the data meets their needs, and working with those who create, update, or delete data in the course of their jobs to ensure they are properly handling the data. Data quality depends on all who interact with the data, not just data management professionals.

As is the case with data governance and with data management as a whole, data quality management is a program, not a project. It will include both project and maintenance work, along with a commitment to communications and training. Most importantly, the long-term success of a data quality improvement program depends on getting an organization to change its culture and adopt a quality mindset. As stated in The Leader’s Data Manifesto: fundamental, lasting change requires committed leadership and involvement from people at all levels in an organization. People who use data to do their jobs – which in most organizations is a very large percentage of employees – need to drive change. And one of the most critical changes to focus on is how their organizations manage and improve the quality of their data.43

DAMA’s data management principles assert that data management is management of the lifecycle of data and that managing data means managing the quality of data. Throughout the data lifecycle, data quality management activities help an organization define and measure expectations related to its data. These expectations may change over time as organizational uses of data evolve (see Figure 27).

Data quality and other data management functions

As noted earlier, all areas of data management have the potential to affect the quality of data. Data governance and stewardship, data modeling, and Metadata management have direct effects on defining what high-quality data looks like. If these are not executed well, it is very difficult to have reliable data. The three are related in that each establishes standards, definitions, and rules related to data. Because data quality is about meeting expectations, together these standards, definitions, and rules describe a common set of expectations against which the quality of data can be measured.

The quality of data is based on how well it meets the requirements of data consumers. Having a robust process by which data is defined supports the ability of an organization to formalize and document the standards and requirements by which the quality of data can be measured.

Metadata defines what the data represents. Data Stewardship and the data modeling processes are sources of critical Metadata. Well-managed Metadata can also support the effort to improve the quality of data. A Metadata repository can house results of data quality measurements so that these are shared across the organization and the data quality team can work toward consensus about priorities and drivers for improvement.

A data quality program is more effective when part of a data governance program, not only because Data Stewardship is often aligned with data governance, but also because data quality issues are a primary reason for establishing enterprise-wide data governance. Incorporating data quality efforts into the overall governance effort enables the data quality program team to work with a range of stakeholders and enablers:

  • Risk and security personnel who can help identify data-related organizational vulnerabilities
  • Business process engineering and training staff who can help teams implement process improvements that increase efficiency and result in data more suitable for downstream uses
  • Business and operational data stewards, and data owners who can identify critical data, define standards and quality expectations, and prioritize remediation of data issues

A Governance Organization can accelerate the work of a data quality program by:

  • Setting priorities
  • Developing and maintaining standards and policies for data quality
  • Establishing communications and knowledge-sharing mechanisms
  • Monitoring and reporting on performance and on data quality measurements
  • Sharing data quality inspection results to build awareness and identify opportunities for improvement

Governance programs also often have responsibility for Master Data Management and Reference Data Management. It is worth noting that Master Data Management and Reference Data Management are both examples of processes focused on curating particular kinds of data for purposes of ensuring its quality. Simply labeling a data set “Master Data” implies certain expectations about its content and reliability.

Data quality and regulation

As noted in the chapter introduction, demonstrable data quality, like demonstrable data security, provides a competitive advantage. Customers and business partners alike expect and are beginning to demand complete and accurate data. Data quality is also a regulatory requirement in some cases. Data management practices can be audited. Regulations that are directly connected with data quality practices include:

  • Sarbanes-Oxley (US), which focuses on the accuracy and validity of financial transaction data
  • Solvency II (EU), which focuses on data lineage and the quality of data underpinning risk models
  • The General Data Protection Regulation (GDPR, EU), which asserts that personal data must be accurate and, where necessary, kept up-to-date, and that reasonable steps must be taken to erase or rectify inaccurate personal data
  • The Personal Information Protection and Electronic Documents Act (PIPEDA, Canada), which asserts that personal data must be as accurate, complete, and up-to-date as is necessary for the purposes for which it is used

It is worth noting that, even where data quality requirements are not specifically called out, the ability to protect personal data depends in part on that data being of high quality.

Data quality improvement cycle

Most approaches to improving data quality are based on the techniques of quality improvement in the manufacture of physical products.44 In this paradigm, data is understood as the product of a set of processes. At its simplest, a process is defined as a series of steps that turns inputs into outputs. A process that creates data may consist of one step (data collection) or many steps: data collection, integration into a data warehouse, aggregation in a data mart, etc. At any step, data can be negatively affected. It can be collected incorrectly, dropped or duplicated between systems, aligned or aggregated incorrectly, etc.

Improving data quality requires the ability to assess the relationship between inputs and outputs to ensure inputs meet the requirements of the process and outputs conform to expectations. Since outputs from one process become inputs to other processes, requirements must be defined along the whole data chain.
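
One way to act on this along the data chain is to have each step explicitly verify that its inputs meet the requirements of the process and that its outputs conform to expectations before passing data on. The sketch below illustrates this pattern; the step names, record layout, and checks are assumptions for the example, not a prescription for any particular tool.

    def check(condition, message):
        """Raise a descriptive error if a data expectation is not met."""
        if not condition:
            raise ValueError(f"Data quality check failed: {message}")

    def collect():
        rows = [{"order_id": 1, "amount": 120.0}, {"order_id": 2, "amount": 35.5}]
        # Output expectation: every collected row carries an order_id.
        check(all("order_id" in r for r in rows), "collected rows missing order_id")
        return rows

    def aggregate(rows):
        # Input expectation: amounts are present and non-negative.
        check(all(r.get("amount", -1) >= 0 for r in rows), "invalid amounts")
        total = sum(r["amount"] for r in rows)
        # Output expectation: the aggregate is itself plausible.
        check(total >= 0, "negative aggregate total")
        return {"order_count": len(rows), "total_amount": total}

    print(aggregate(collect()))  # {'order_count': 2, 'total_amount': 155.5}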

A general approach to data quality improvement, shown in Figure 28, is a version of the Shewhart / Deming cycle45. Based on the scientific method, the Shewhart / Deming cycle is a problem-solving model known as ‘plan-do-check-act’. Improvement comes through a defined set of steps. The condition of data must be measured against standards and, if it does not meet standards, root cause(s) of the discrepancy from standards must be identified and remediated. Root causes may be found in any of the steps of the process, technical or non-technical. Once remediated, data should be monitored to ensure that it continues to meet requirements.

For a given data set, a data quality improvement cycle begins by identifying the data that does not meet data consumers’ requirements and data issues that are obstacles to the achievement of business objectives. Data needs to be assessed against key dimensions of quality and known business requirements. Root causes of issues will need to be identified so that stakeholders can understand the costs of remediation and the risks of not remediating the issues. This work is often done in conjunction with Data Stewards and other stakeholders.

In the Plan stage, the data quality team assesses the scope, impact, and priority of known issues, and evaluates alternatives to address them. This plan should be based on a solid foundation of analysis of the root causes of issues. From knowledge of the causes and the impact of the issues, cost / benefit can be understood, priority can be determined, and a basic plan can be formulated to address them.

In the Do stage, the DQ team leads efforts to address the root causes of issues and plan for ongoing monitoring of data. For root causes that are based on non-technical processes, the DQ team can work with process owners to implement changes. For root causes that require technical changes, the DQ team should work with technical teams and ensure that requirements are implemented correctly and that no unintended errors are introduced by technical changes.

The Check stage involves actively monitoring the quality of data as measured against requirements. As long as data meets defined thresholds for quality, additional actions are not required. The processes will be considered under control and meeting business requirements. However, if the data falls below acceptable quality thresholds, then additional action must be taken to bring it up to acceptable levels.
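
Below is a minimal sketch of the Check stage, assuming measurements and quality thresholds are kept as simple name/value pairs; a real program would typically persist these in a metadata repository and trend them over time.

    # Assumed targets and current measurements for two hypothetical metrics.
    thresholds = {"email_usability": 0.98, "address_completeness": 0.95}
    measurements = {"email_usability": 0.991, "address_completeness": 0.87}

    for metric, target in thresholds.items():
        actual = measurements[metric]
        status = "in control" if actual >= target else "ACTION REQUIRED"
        print(f"{metric}: {actual:.1%} (target {target:.0%}) -> {status}")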

The Act stage is for activities to address and resolve emerging data quality issues. The cycle restarts as the causes of issues are assessed and solutions are proposed. Continuous improvement is achieved by starting a new cycle. New cycles begin as:

  • Existing measurements fall below thresholds
  • New data sets come under investigation
  • New data quality requirements emerge for existing data sets
  • Business rules, standards, or expectations change

Establishing criteria for data quality at the beginning of a process or system build is one sign of a mature data management organization. Doing so takes governance and discipline, as well as cross-functional collaboration.

Building quality into data management processes from the beginning costs less than retrofitting it. Maintaining high-quality data throughout the data lifecycle is less risky than trying to improve quality in an existing process, and it has a far lower impact on the organization.

It is best to do things right the first time, though few organizations have the luxury of doing so. Even if they do, managing quality is an ongoing process. Changing demands and organic growth over time can cause data quality issues that may be costly if unchecked, but can be nipped in the bud if an organization is attentive to the potential risks.

Data quality and leadership commitment

Data quality issues can emerge at any point in the data lifecycle, from creation to disposal. When investigating root causes, analysts should look for potential culprits, like problems with data entry, data processing, system design, and manual intervention in automated processes. Many issues will have multiple causes and contributing factors (especially if people have created ways to work around them). These causes of issues also imply that data quality issues can be prevented through:

  • Improvement to interface design
  • Testing of data quality rules as part of processing
  • A focus on data quality within system design
  • Strict controls on manual intervention in automated processes

Obviously, preventative tactics should be used. However, both common sense and research indicate that many data quality problems are caused by a lack of organizational commitment to high-quality data, which itself stems from a lack of leadership, in the form of both governance and management.

Every organization has information and data assets that are of value to its operations. Indeed, operations depend on the ability to share information. Despite this, few organizations manage these assets with rigor.

Many governance and information asset programs are driven solely by compliance, rather than by the potential value of data as an asset. A lack of recognition on the part of leadership means a lack of commitment within an organization to managing data as an asset, including managing its quality.46 Barriers to the effective management of data quality (see Figure 29) include:47

  • Lack of awareness on the part of leadership and staff
  • Lack of business governance
  • Lack of leadership and management
  • Difficulty in justification of improvements
  • Inappropriate or ineffective instruments to measure value

These barriers have negative effects on customer experience, productivity, morale, organizational effectiveness, revenue, and competitive advantage. They increase costs of running the organization and introduce risks as well.

As with understanding the root cause of any problem, recognition of these barriers – the root causes of poor quality data – gives an organization insight into how to improve its quality. If an organization realizes that it does not have strong business governance, ownership, and accountability, then it can address the problem by establishing business governance, ownership, and accountability. If leadership sees that the organization does not know how to put information to work, then leadership can put processes in place so that the organization can learn how to do so.

Recognition of a problem is the first step to solving it. Actually solving problems takes a lot of work. Most of the barriers to managing information as an asset are cultural. Addressing them requires a formal process of organizational change management.

Figure 29: Barriers to Managing Information as a Business Asset (DMBOK2, p. 467)48

Organization and cultural change

The quality of data will not be improved through a collection of tools and concepts, but through a mindset that helps employees and stakeholders to account for the quality of data needed to serve their organization and its customers. Getting an organization to be conscientious about data quality often requires significant cultural change. Such change requires vision and leadership.

The first step is promoting awareness about the role and importance of data to the organization and defining the characteristics of high-quality data. All employees must act responsibly and raise data quality issues, ask for good quality data as data consumers, and provide quality information to others. Every person who touches the data can impact the quality of that data. Data quality is not just the responsibility of a DQ team, a data governance team, or IT group.

Just as the employees need to understand the cost to acquire a new customer or retain an existing customer, they also need to know the organizational costs of poor quality data, as well as the conditions that cause data to be of poor quality. For example, if customer data is incomplete, a customer may receive the wrong product, creating direct and indirect costs to an organization. Not only will the customer return the product, but he or she may call and complain, using call center time, and creating the potential for reputational damage to the organization. If customer data is incomplete because the organization has not established clear requirements, then everyone who uses this data has a stake in clarifying requirements and following standards.

Ultimately, employees need to think and act differently if they are to produce better quality data and manage data in ways that ensure quality. This requires not only training but also reinforcement by committed leadership.

What you need to know

  • Poor quality data is costly. High-quality data has many benefits.
  • The quality of data can be managed and improved, just as the quality of physical products can be managed and improved.
  • The cost of getting data right the first time is lower than the cost of getting data wrong and fixing it.
  • Data quality management requires a wide skill set and organizational commitment.
  • Organizational commitment to quality requires committed leadership.