Effective data management involves a set of interrelated processes that enable an organization to use its data to achieve strategic goals. Data management includes the ability to design data for applications, store and access it securely, share it appropriately, and learn from it to meet strategic and operational objectives. Organizations that are trying to get value from their data need to know that their data is reliable and trustworthy. In other words, that their data is of high quality. But many factors can undermine data quality:
Many organizations simply fail to define what makes data fit for purpose in the first place and therefore lack commitment to data quality.
All data management disciplines contribute to the quality of data, and high-quality data that supports the organization should be the goal of all data management disciplines. Because uninformed decisions or actions by anyone who interacts with data can result in poor quality data, producing high-quality data requires cross-functional commitment and coordination. Organizations and teams should be aware of this and should plan for high-quality data, by executing processes and projects in ways that account for the risks related to unexpected or unacceptable conditions in the data.
Because no organization has perfect business processes, perfect technical processes, or perfect data management practices, all organizations experience problems related to the quality of their data. These problems can be very costly. Organizations that formally manage the quality of data have fewer problems than those that leave data quality to chance.
Data quality is becoming a business necessity. The ability to demonstrate that data is of high quality, like the ability to demonstrate that data has been protected properly, is required by some regulations. Business partners and customers expect data to be reliable. An organization that can show that it manages its data well gains a competitive advantage.
This chapter will define key concepts related to data quality and discuss data quality management in relation to overall data management.
Data quality
The term data quality is used to refer both to the characteristics associated with high-quality data and to the processes used to measure or improve the quality of data. This dual usage can be confusing, so it helps to look at both meanings, starting with high-quality data. Later in the chapter we will look at the definition of data quality management.
Data is of high quality to the degree that it meets the expectations and needs of data consumers. That is, if the data is fit for the purposes of data consumers. It is of low quality if it is not fit for those purposes. Data quality is thus dependent on context and on the needs of the data consumers.
One of the challenges in managing the quality of data is that expectations related to quality are not always known. Customers may not articulate them. Often, the people managing data do not even ask about these requirements. But if data is to be reliable and trustworthy, then data management professionals need to better understand the quality requirements of their customers and how to measure them and meet them. The conversation about expectations needs to be ongoing, because requirements change over time as business needs and external forces evolve.
Dimensions of data quality
A data quality dimension is a measurable feature or characteristic of data. The term dimension is used to make the connection to dimensions in the measurement of physical objects (e.g., length, width, height). Data quality dimensions provide a vocabulary for defining data quality requirements. From there, they can be used to define results of initial data quality assessment as well as ongoing measurement. In order to measure the quality of data, an organization needs to establish characteristics that are not only important to business processes (worth measuring) but also measurable and actionable.
Dimensions provide a basis for measurable rules, which themselves should be directly connected to potential risks in critical processes. For example:
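The kind of measurable rule meant here can be sketched in code. In this sketch the field name, the 95% threshold, and the link to a billing process are all invented for illustration; they are not taken from the source.

```python
# Hypothetical measurable rule: customer records feeding the billing
# process must have a postal code, and at least 95% of a batch must
# pass before the process proceeds. Field name and threshold are
# illustrative assumptions.

def postal_code_completeness(records, threshold=0.95):
    """Return (passes_rule, measured_completeness) for a batch of records."""
    if not records:
        return False, 0.0
    complete = sum(1 for r in records if r.get("postal_code"))
    ratio = complete / len(records)
    return ratio >= threshold, ratio

customers = [
    {"id": 1, "postal_code": "10001"},
    {"id": 2, "postal_code": ""},        # incomplete record
    {"id": 3, "postal_code": "60614"},
]
ok, ratio = postal_code_completeness(customers)
print(ok, round(ratio, 2))  # → False 0.67
```

Framing the rule this way ties a quality dimension (completeness) to a concrete risk in a critical process: the batch fails the check, so the downstream process knows not to trust it.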
Many leading thinkers have written about data quality dimensions.42 While there is not a single, agreed-to set of data quality dimensions, all sets contain common ideas. Dimensions include some characteristics that can be measured objectively (completeness, validity, format conformity) and others that depend heavily on context or on subjective interpretation (usability, reliability, reputation). Whatever names are used, dimensions focus on whether there is enough data (completeness), whether it is correct (accuracy, validity), how well it fits together (consistency, integrity, uniqueness), whether it is up-to-date (timeliness), accessible, usable, and secure.
In 2013, DAMA United Kingdom produced a white paper proposing six core dimensions of data quality: completeness, uniqueness, timeliness, validity, accuracy, and consistency.
The DAMA UK white paper also describes other characteristics that have an impact on quality.
Any organization that wants to improve the quality of its data should adopt or develop a set of dimensions through which to measure quality. Coming to consensus about dimensions of quality can provide a starting point for a common vocabulary around quality.
Data quality management
As noted above, sometimes the term data quality is used to refer to the processes used to measure or improve the quality of data. These processes constitute data quality management. While all data management functions have the potential to impact the quality of data, formal data quality management focuses on helping the organization:
Formal data quality management is similar to continuous quality management for other products. It includes managing data through its lifecycle by setting standards, building quality into the processes that create, transform, and store data, and measuring data against standards. Managing data to this level usually requires a data quality program team. The data quality program team is responsible for engaging both business and technical data management professionals and driving the work of applying quality management techniques to data to ensure that data is fit for consumption for a variety of purposes.
The team will likely be involved with a series of projects through which they can establish processes and best practices while addressing high priority data issues. Because managing the quality of data involves managing the data lifecycle, a data quality program will also have operational responsibilities related to data usage. For example, reporting on data quality levels, and engaging in the analysis, quantification, and prioritization of data issues.
The team is also responsible for working with those who need data to do their jobs to ensure the data meets their needs, and working with those who create, update, or delete data in the course of their jobs to ensure they are properly handling the data. Data quality depends on all who interact with the data, not just data management professionals.
As is the case with data governance and with data management as a whole, data quality management is a program, not a project. It will include both project and maintenance work, along with a commitment to communications and training. Most importantly, the long-term success of a data quality improvement program depends on getting an organization to change its culture and adopt a quality mindset. As stated in The Leader’s Data Manifesto: fundamental, lasting change requires committed leadership and involvement from people at all levels in an organization. People who use data to do their jobs – which in most organizations is a very large percentage of employees – need to drive change. And one of the most critical changes to focus on is how their organizations manage and improve the quality of their data.43
DAMA’s data management principles assert that data management is management of the lifecycle of data and that managing data means managing the quality of data. Throughout the data lifecycle, data quality management activities help an organization define and measure expectations related to its data. These expectations may change over time as organizational uses of data evolve (see Figure 27).
Data quality and other data management functions
As noted earlier, all areas of data management have the potential to affect the quality of data. Data governance and stewardship, data modeling, and Metadata management have direct effects on defining what high-quality data looks like. If these are not executed well, it is very difficult to have reliable data. The three are related in that they establish standards, definitions, and rules related to data. Data quality is about meeting expectations. Collectively, these describe a set of common expectations for quality.
The quality of data is based on how well it meets the requirements of data consumers. Having a robust process by which data is defined supports the ability of an organization to formalize and document the standards and requirements by which the quality of data can be measured.
Metadata defines what the data represents. Data Stewardship and the data modeling processes are sources of critical Metadata. Well-managed Metadata can also support the effort to improve the quality of data. A Metadata repository can house results of data quality measurements so that these are shared across the organization and the data quality team can work toward consensus about priorities and drivers for improvement.
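As a sketch of what housing measurement results in a Metadata repository could look like, consider the record structure below. The fields and names are assumptions made for illustration; a real repository would define its own schema.

```python
# A minimal, hypothetical structure for storing data quality measurement
# results so they can be shared alongside other Metadata. All field
# names are illustrative, not a prescribed schema.
from dataclasses import dataclass
from datetime import date

@dataclass
class DQMeasurement:
    dataset: str          # which data set was measured
    dimension: str        # e.g., completeness, validity, uniqueness
    rule: str             # the measurable rule that was applied
    result: float         # measured value, e.g., 0.97 = 97% pass
    threshold: float      # the agreed acceptable level
    measured_on: date

    @property
    def meets_threshold(self) -> bool:
        return self.result >= self.threshold

repository: list[DQMeasurement] = []
repository.append(DQMeasurement(
    dataset="customer_master",
    dimension="completeness",
    rule="postal code populated",
    result=0.92,
    threshold=0.95,
    measured_on=date(2024, 1, 15),
))

# Shared results let the organization see which data sets need attention.
failing = [m for m in repository if not m.meets_threshold]
print(len(failing))  # → 1
```

Because results are stored with the rule, threshold, and date, stakeholders can review the same evidence when working toward consensus on priorities and drivers for improvement.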
A data quality program is more effective when part of a data governance program, not only because Data Stewardship is often aligned with data governance, but also because data quality issues are a primary reason for establishing enterprise-wide data governance. Incorporating data quality efforts into the overall governance effort enables the data quality program team to work with a range of stakeholders and enablers:
A Governance Organization can accelerate the work of a data quality program by:
Governance programs also often have responsibility for Master Data Management and Reference Data Management. It is worth noting that Master Data Management and Reference Data Management are both examples of processes focused on curating particular kinds of data for purposes of ensuring its quality. Simply labeling a data set “Master Data” implies certain expectations about its content and reliability.
Data quality and regulation
As noted in the chapter introduction, demonstrable data quality, like demonstrable data security, provides a competitive advantage. Customers and business partners alike expect and are beginning to demand complete and accurate data. Data quality is also a regulatory requirement in some cases. Data management practices can be audited. Regulations that are directly connected with data quality practices include examples noted previously:
It is worth noting that, even where data quality requirements are not specifically called out, the ability to protect personal data depends in part on that data being of high quality.
Data quality improvement cycle
Most approaches to improving data quality are based on the techniques of quality improvement in the manufacture of physical products.44 In this paradigm, data is understood as the product of a set of processes. At its simplest, a process is defined as a series of steps that turns inputs into outputs. A process that creates data may consist of one step (data collection) or many steps: data collection, integration into a data warehouse, aggregation in a data mart, etc. At any step, data can be negatively affected. It can be collected incorrectly, dropped or duplicated between systems, aligned or aggregated incorrectly, etc.
Improving data quality requires the ability to assess the relationship between inputs and outputs to ensure inputs meet the requirements of the process and outputs conform to expectations. Since outputs from one process become inputs to other processes, requirements must be defined along the whole data chain.
A general approach to data quality improvement, shown in Figure 28, is a version of the Shewhart / Deming cycle45. Based on the scientific method, the Shewhart / Deming cycle is a problem-solving model known as ‘plan-do-check-act’. Improvement comes through a defined set of steps. The condition of data must be measured against standards and, if it does not meet standards, root cause(s) of the discrepancy from standards must be identified and remediated. Root causes may be found in any of the steps of the process, technical or non-technical. Once remediated, data should be monitored to ensure that it continues to meet requirements.
For a given data set, a data quality improvement cycle begins by identifying the data that does not meet data consumers’ requirements and data issues that are obstacles to the achievement of business objectives. Data needs to be assessed against key dimensions of quality and known business requirements. Root causes of issues will need to be identified so that stakeholders can understand the costs of remediation and the risks of not remediating the issues. This work is often done in conjunction with Data Stewards and other stakeholders.
In the Plan stage, the data quality team assesses the scope, impact, and priority of known issues, and evaluates alternatives to address them. This plan should be based on a solid foundation of analysis of the root causes of issues. From knowledge of the causes and the impact of the issues, cost / benefit can be understood, priority can be determined, and a basic plan can be formulated to address them.
In the Do stage, the DQ team leads efforts to address the root causes of issues and plan for ongoing monitoring of data. For root causes that are based on non-technical processes, the DQ team can work with process owners to implement changes. For root causes that require technical changes, the DQ team should work with technical teams and ensure that requirements are implemented correctly and that no unintended errors are introduced by technical changes.
The Check stage involves actively monitoring the quality of data as measured against requirements. As long as data meets defined thresholds for quality, additional actions are not required. The processes will be considered under control and meeting business requirements. However, if the data falls below acceptable quality thresholds, then additional action must be taken to bring it up to acceptable levels.
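The Check-stage logic amounts to comparing measured quality levels against agreed thresholds. A minimal sketch follows; the dimension names and threshold values are assumptions for illustration, not standards from the source.

```python
# Hypothetical Check-stage monitor: compare measured quality levels
# against agreed thresholds and flag anything that needs action.
# Dimension names and threshold values are illustrative.

THRESHOLDS = {"completeness": 0.95, "validity": 0.98, "uniqueness": 0.99}

def check_stage(measurements):
    """Return the dimensions that fall below their thresholds.

    An empty result means the process is under control and no
    additional action is required; a non-empty result triggers
    the Act stage.
    """
    return {
        dim: value
        for dim, value in measurements.items()
        if value < THRESHOLDS.get(dim, 1.0)
    }

todays_levels = {"completeness": 0.97, "validity": 0.96, "uniqueness": 0.995}
needs_action = check_stage(todays_levels)
print(needs_action)  # → {'validity': 0.96}
```

Here validity falls below its threshold while the other dimensions pass, so only that measurement would trigger further action.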
The Act stage is for activities to address and resolve emerging data quality issues. The cycle restarts, as the causes of issues are assessed and solutions proposed. Continuous improvement is achieved by starting a new cycle. New cycles begin as:
Establishing criteria for data quality at the beginning of a process or system build is one sign of a mature data management organization. Doing so takes governance and discipline, as well as cross-functional collaboration.
Building quality into the data management processes from the beginning costs less than retrofitting it. Maintaining high-quality data throughout the data lifecycle is less risky than trying to improve quality in an existing process. It also creates a far lower impact on the organization.
It is best to do things right the first time, though few organizations have the luxury of doing so. Even if they do, managing quality is an ongoing process. Changing demands and organic growth over time can cause data quality issues that may be costly if unchecked, but can be nipped in the bud if an organization is attentive to the potential risks.
Data quality and leadership commitment
Data quality issues can emerge at any point in the data lifecycle, from creation to disposal. When investigating root causes, analysts should look for potential culprits, like problems with data entry, data processing, system design, and manual intervention in automated processes. Many issues will have multiple causes and contributing factors (especially if people have created ways to work around them). These causes of issues also imply that data quality issues can be prevented through:
Obviously, preventative tactics should be used. However, both common sense and research indicate that many data quality problems are caused by a lack of organizational commitment to high-quality data, which itself stems from a lack of leadership, in the form of both governance and management.
Every organization has information and data assets that are of value to its operations. Indeed, operations depend on the ability to share information. Despite this, few organizations manage these assets with rigor.
Many governance and information asset programs are driven solely by compliance, rather than by the potential value of data as an asset. A lack of recognition on the part of leadership means a lack of commitment within an organization to managing data as an asset, including managing its quality.46 Barriers to the effective management of data quality (see Figure 29) include:47
These barriers have negative effects on customer experience, productivity, morale, organizational effectiveness, revenue, and competitive advantage. They increase costs of running the organization and introduce risks as well.
As with understanding the root cause of any problem, recognition of these barriers – the root causes of poor quality data – gives an organization insight into how to improve the quality of its data. If an organization realizes that it does not have strong business governance, ownership, and accountability, then it can address the problem by establishing business governance, ownership, and accountability. If leadership sees that the organization does not know how to put information to work, then leadership can put processes in place so that the organization can learn how to do so.
Recognition of a problem is the first step to solving it. Actually solving problems takes a lot of work. Most of the barriers to managing information as an asset are cultural. Addressing them requires a formal process of organizational change management.
Figure 29: Barriers to Managing Information as a Business Asset (DMBOK2, p. 467)48
Organization and cultural change
The quality of data will not be improved through a collection of tools and concepts, but through a mindset that helps employees and stakeholders to account for the quality of data needed to serve their organization and its customers. Getting an organization to be conscientious about data quality often requires significant cultural change. Such change requires vision and leadership.
The first step is promoting awareness about the role and importance of data to the organization and defining the characteristics of high-quality data. All employees must act responsibly and raise data quality issues, ask for good quality data as data consumers, and provide quality information to others. Every person who touches the data can impact the quality of that data. Data quality is not just the responsibility of a DQ team, a data governance team, or an IT group.
Just as the employees need to understand the cost to acquire a new customer or retain an existing customer, they also need to know the organizational costs of poor quality data, as well as the conditions that cause data to be of poor quality. For example, if customer data is incomplete, a customer may receive the wrong product, creating direct and indirect costs to an organization. Not only will the customer return the product, but he or she may call and complain, using call center time, and creating the potential for reputational damage to the organization. If customer data is incomplete because the organization has not established clear requirements, then everyone who uses this data has a stake in clarifying requirements and following standards.
Ultimately, employees need to think and act differently if they are to produce better quality data and manage data in ways that ensure quality. This requires not only training but also reinforcement by committed leadership.
What you need to know