A Process for Data Quality


There is no one-size-fits-all model for data quality. When creating one, it is necessary to take into consideration a company's culture, the MDM approach being implemented, how multiple LOBs interact with each other, the maturity level of the data governance and stewardship teams, the degree of management engagement and sponsorship, technology resources, and personnel skills.

Figure 6.1 depicts the major roles involved in a data quality process and how they interact with one another to create a flexible and effective model to address data quality issues. The arrows represent events or dependencies, while the numbers represent the sequence of activities.

Figure 6.1 A Data Quality Process

img

A description of each of the elements presented in Figure 6.1 is provided next.

Drivers

Drivers are essentially the initiators of a data quality activity and the means by which data quality issues are brought to proper attention. A company with a mature data quality practice should be able to support a multitude of drivers. Not only that, it should also demand that everyone across the company participate in improving the overall quality of the data. After all, data quality is everyone's responsibility.

Continuous training, both formal and informal, is essential to achieve everyone's participation and strengthen a culture of focus on data quality. Actually, several studies have shown informal learning can be more effective than formal learning. With that in mind, companies need to find creative ways to disseminate information, such as mentoring and coaching programs, brown bag sessions, lessons-learned reviews, and so on. Technology should be leveraged to increase collaboration. Social media, generically speaking, still has a long way to go in the workplace, but needs to be considered as a mechanism to increase collaboration and promote information sharing. There are multiple categories of social media applications, such as: blogs, microblogging, social networking, wikis, webcasts, podcasts, and more. Striking the balance of what resources to use and to what extent is a challenge. Companies in certain industries may have less difficulty in adopting some of those applications. As an example, it is likely that high-tech companies are more prepared than health-care companies to embrace and spread the use of social media in general. When defining what will be most effective in a company, it is necessary to take into consideration the company culture, computer resources, and human resources and skills.

Essentially, data quality initiatives fall into two categories: (1) reactive and (2) proactive. In general terms, proactive initiatives are measures established to avoid problems from happening or getting worse, while reactive initiatives are measures adopted after the problem has already occurred and needs correction. Drivers throughout the company, acting on their particular roles and driven by specific business needs, will either be reacting to data quality problems, or will be proactively preventing new problems from happening or existing problems from getting worse.

Users following a particular business process for data entry, for example, may detect irregularities with the data due to a system bug, a bad practice, or weak enforcement of business rules. The users will not necessarily know the root cause of the problem or the best way to resolve it, and that is expected. But they need a mechanism for presenting the problem and requesting a correction. Most companies will implement a type of trouble ticket system that will allow users to communicate the problem they see. These trouble tickets are then categorized and routed to a suitable team for proper actions. In this scenario, the problem entered by the user in the trouble ticket becomes the requirement or problem statement represented by arrow number 1 in Figure 6.1.

A trouble ticket is just one mechanism by which a company should support requests for data quality improvements. Special projects and certain commonly performed activities are very likely to have data management impacts, and should be supported with proper engagement of the data quality team according to pre-established service-level agreements. Here are some examples of activities that will require close data quality participation:

  • Migrating data from one system into another due to mergers and acquisitions or simply to consolidate multiple systems and eliminate redundancy.
  • Changes in system functionality, such as a new tax calculation engine that may require a more complete, consistent, and accurate postal code representation than previously.
  • Regulatory compliance, such as new financial reporting rules, Sarbanes-Oxley Act (SOX), U.S. Patriot Act, or Basel II.
  • Security compliance, such as government data requiring particular access control rules.

The drivers behind the previous activities will vary depending on the organizational structure. Even within a company, the same category of change could come from different organizations. For example, an IT-initiated system consolidation task may be the driver to a data migration activity, while a merger or acquisition is a business-initiated activity that will also lead to a data migration effort. In another example, a regulatory compliance requirement can come either from a financial organization or from an enterprise-wide data governance initiative.

Nonetheless, what is important is that the data quality process supports all data-driven requests, no matter what the driver is. Remember, the goal is to create a culture of focus on data quality. If a company is too selective about its drivers, it will create skepticism regarding its true objectives, which could ultimately lead to a company-wide data management failure.

When evaluating the requirements of one driver, it is necessary to consider the implications on the entire company. This is where the importance of establishing a data quality forum comes in (which is described next). In essence, a driver, through some established process and procedure, will provide the forum with a set of requirements or problem statement.

Data Quality (DQ) Forum

MDM is about bringing data together in a meaningful and fit-for-purpose way. An important consideration for MDM is assessing the effect a particular data change will have on all dependent parties in the company. A data quality forum can efficiently analyze the impact of data changes if it encompasses liaisons from the multiple LOBs that depend on that particular source of data.

Most of the activities are carried out by the lead of the forum, also known as the data quality lead. Depending on the size of the company, the data quality lead may need help from other data quality specialists to form a full-time data quality team. The representatives of the multiple LOBs act as liaisons meeting the data quality lead/team on a regular basis to evaluate requirements, review solutions, and issue approvals. Figure 6.2 shows the data quality forum.

Figure 6.2 The Data Quality (DQ) Forum

img

The data quality lead/team will be in charge of most of the following activities, but obviously will bring in the liaisons according to agreed expectations, roles, and responsibilities:

  • Analyze and review requirements with associated business driver(s).
  • Evaluate data governance rules, policies, and procedures and make sure the required changes are not in violation of any of them, or if a particular change might be necessary to existing data governance directives.
  • Perform data analysis and profiling to fully understand data issues and provide alternatives for resolution.
  • In addition to data analysis results, also take into consideration data governance rules, policies, and procedures when architecting potential solutions.
  • Review solutions and determine the impact of the changes with data governance, the multiple LOBs, and other potential stakeholders.
  • Obtain proper approvals and carry out execution of the changes with necessary team(s). The resolution could be a combination of activities to be performed by the business only, by IT only, or both.

Figure 6.3 shows a flowchart depicting the sequence of the major activities performed by the data quality forum as a whole. Whether the LOBs' liaisons are involved should be clear from the nature of each activity and the explanation presented earlier.

Figure 6.3 Data Quality Forum Flowchart

img

Since data quality and data governance are tightly coupled, it is beneficial for the data quality lead of the forum to be a member of the data governance council, as well. This way, when a data quality issue requires data governance attention, the data quality lead can properly convey the issues and advise on alternatives for resolution.

Controls/Data Governance

Companies have multiple types of controls, such as business rules, policies, procedures, standards, and system constraints. Furthermore, company culture and the amount of regulation in a particular industry will also dictate the quantity of controls imposed. However, the fact that a company has many controls doesn't necessarily mean it is more mature from an MDM perspective. The maturity comes from having data governance more engaged with the multitude of controls across the company. As data governance matures, more controls are likely to be added. Put simply, more controls don't necessarily indicate more MDM maturity, but more MDM maturity will lead to more controls.

Companies will be subject to controls regardless of whether they have a data governance program. Data quality management is impacted by the controls themselves, not necessarily by who manages them. Therefore, without a data governance team managing rules, policies, procedures, and standards, the data quality team will need to be responsible for validating all changes with the many diverse teams imposing, or subject to, those controls. But placing such an onus on the data quality team is not efficient, because a data governance role is much better equipped for this task.

The bottom line is this: Data governance and data quality should be complementing entities, with data governance a bit more strategic and data quality a bit more tactical, but both growing together as data management matures within the company. Plus, the higher the maturity in data governance, the less effort is needed in data quality, since better-governed data needs less correction. Figure 6.4 depicts this relationship.

Figure 6.4 Data Governance versus Data Quality Management

img

Looking at Figure 6.4, keep in mind that data quality management will always be required no matter how mature a company's data governance program is. As a matter of fact, certain components of data quality will increase as governance efforts mature, such as monitoring, dashboards, and scorecards. One unit increase of effort on data governance doesn't mean one unit decrease of effort on data quality management. As companies become better governed and more controlled with quality engineered into the processes, the data quality initiatives are more predictable and require less work.

Obviously, a data governance program can only go so far. Eventually, it will end up reaching a point of diminishing returns. But most companies are far from that point as the majority of them still struggle with implementing data governance. For a more in-depth data governance review, recall that Chapter 4 covers this topic in detail.

Data Analysts

Many companies underestimate the amount of data analysis they need to perform before writing their requirements or proposing a solution to a particular data problem. Data projects need to be supported by data.

The Storage Networking Industry Association (SNIA) has the following definition of information lifecycle management (ILM), sometimes also referred to as data lifecycle management:

Information Lifecycle Management: the policies, processes, practices, services and tools used to align the business value of information with the most appropriate and cost effective infrastructure from the time information is created through its final disposition. Information is aligned with business requirements through management policies and service levels associated with applications, metadata, and data.

The aforementioned life cycle would normally include the creation, distribution, use, maintenance, and disposition of data. Notice that the definition is very rigorous regarding governance, documentation, and the business purpose of the data. These are all solid concepts, which would work perfectly in a company if they had been followed meticulously from the beginning. But that is very unlikely. The majority of companies do not have a good understanding of, or documentation for, all their data elements and business purposes.

To make matters worse, an MDM project is normally the result of many consolidations of multiple repositories, or bringing in data due to mergers and acquisitions. All these activities will lead to fragmented, inconsistent, non-standardized, and potentially inaccurate data elements.

In his book, Tony Fisher provides a more encompassing data management life cycle.1 He calls for five steps: (1) discover, (2) design, (3) enable, (4) maintain, and (5) archive.

The last four steps in this lifecycle model (design, enable, maintain, and archive) are very similar to the more traditional steps: create, distribute, use, maintain, and dispose. But the very important first step, discover, is a clear indication that companies and experts have recognized the significance of understanding the data before deciding what to do with it.

Data profiling is one key component of data discovery. Therefore, data profiling should be applied at multiple stages of the data quality management program. Sometimes data profiling is necessary to help the business define or clarify certain business rules, and consequently, their requirements for a data quality improvement project. For example, contact duplication may be hurting the business by creating inefficiencies in business processes, poor business intelligence decisions, and/or increased marketing campaign costs. However, simply stating in the requirements that contact data needs to be consolidated may not be sufficient. The actual definition of what constitutes a duplicate has to be stated by the business. It could be contacts with the same phone number, the same e-mail, the same first and last name, the same company, or combinations of these. Before profiling the data, it is highly likely that the business doesn't understand the true degree of completeness of these attributes in the system.
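As a concrete illustration of this kind of profiling, the sketch below measures attribute completeness and compares two different duplicate definitions over a handful of contact records. The records, field names, and match keys are hypothetical, invented for illustration; a real exercise would run against the actual repository using the business's own duplicate definition.

```python
from collections import Counter

# Hypothetical contact records; field names are illustrative only.
contacts = [
    {"first": "Ana", "last": "Silva", "email": "ana@acme.com", "phone": "555-0100"},
    {"first": "Ana", "last": "Silva", "email": "",             "phone": "555-0100"},
    {"first": "Bob", "last": "Lee",   "email": "bob@acme.com", "phone": ""},
    {"first": "Bob", "last": "Lee",   "email": "bob@acme.com", "phone": "555-0199"},
]

def completeness(records, field):
    """Fraction of records with a non-empty value for `field`."""
    filled = sum(1 for r in records if r.get(field, "").strip())
    return filled / len(records)

def duplicate_groups(records, key_fields):
    """Group records by a candidate match key; keep keys shared by 2+ records."""
    keys = Counter(tuple(r[f].lower() for f in key_fields) for r in records)
    return {k: n for k, n in keys.items() if n > 1 and any(k)}

for field in ("email", "phone"):
    print(f"{field} completeness: {completeness(contacts, field):.0%}")

# Two different duplicate definitions give different results:
print(duplicate_groups(contacts, ["first", "last"]))  # name-based matching
print(duplicate_groups(contacts, ["email"]))          # email-based matching
```

Note how the name-based definition flags both pairs as duplicates, while the email-based definition flags only one, because one record's e-mail is missing; this is exactly the completeness gap the business is unlikely to know about before profiling.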

In data projects, there is a balance of how much business requirement is needed and how much data analysis is necessary to support the requirements and achieve the objectives. One can't assume there will be a clean set of requirements ready to be executed when there is a chance the business has no clue as to the condition of the data. When the data quality forum gets a request from a driver, it is important for the data quality lead/team to work with the data analysts to ensure a comprehensive data profiling exercise is completed before the requirements are finalized. After all, it is very likely data itself will support or even dictate the business rules.

A seemingly simple question: Who performs data profiling? Data analysts from the business side—or data analysts from IT? The data must be evaluated from a business perspective, because data must have a business purpose to be useful. If a business purpose does not exist, one needs to question the request for the data quality improvement.

A challenging situation can occur, however, when the person with the business perspective may not have the proper technical skills to do the data profiling, or potentially may not have the proper data or system access permissions.

Several vendors provide tools that tremendously facilitate the data profiling activities and allow for collaboration between the business and IT. Some tools are easier for a business person to use than others. A careful evaluation should be done before deciding what tool to buy. Some MDM solutions will have better data profiling/data quality capabilities than others. As stated previously in the book, it may be difficult to find a single vendor providing all the MDM components to satisfaction.

Data access can also be a limiting factor. Business users obviously have access to the data they need, but the access is normally limited to a front-end screen that gives them a view to a small number of elements or records at a time. This type of access is usually not sufficient when running data profiling. Most likely, it will be necessary to have back-end access to the repository to be able to query data in bulk. Certain companies are very sensitive about providing this type of access to business teams, even on a read-only mode.

Considering all these factors, here are the options regarding the selection of a data analyst:

  • Business person with technical abilities and proper data access. This is the most efficient scenario because the business data analyst can more effectively work with other business teams regarding the data profiling needs and results, and quickly make the proper adjustments.
  • Technical person with business abilities and proper data access. This is the second best option assuming there is somebody who can fill this role. Normally, technical people don't necessarily have the inclination to be more engaged on the business side of the company, but this role can be fostered.
  • Business person working together with a technical person with proper data access. This is likely what most companies do, although it is probably the most inefficient method. Data profiling can be a very iterative process. Normally, there is no clearly defined path to follow. It is common for the results of one data analysis to dictate what needs to be analyzed next. If there is a dependency on the technical person to collect the data and the business person to do the analysis and make the decisions, a lot of time will be spent back and forth between the two of them.

When deciding on a data analysis team, take into consideration the company structure, culture, and political issues. But this shouldn't deter you from looking toward the best interests of the company. Challenging the status quo is a healthy practice in any company as long as it is done in a constructive manner and with an end goal of improving the business. Tough choices require courage—hence the use of Aristotle's quote at the beginning of this chapter.

Design Team

Once the problem is clearly understood and the requirements clearly stated, it is time to move on to designing solutions. As stated earlier in this chapter, data quality requests can either be reactive or proactive. As companies mature, it is expected the total number of data quality issues will go down. Not only that, the number of quality requests should shift from the reactive to the proactive category since mature companies are better about predicting potential issues with their data before they impact the multiple LOBs.

Companies can definitely improve regarding the prevention of issues from happening, but reactive data corrections will never go away as data variables and complexities will always exist.

Figure 6.5 shows what is likely to occur as companies mature regarding reactive versus proactive data quality projects. Reactive issues go down as a percentage of total data quality problems identified, while proactive measures go up.

Figure 6.5 Reactive versus Proactive Data Quality Issues as a Percentage of Total in Maturing Companies

img

Data quality requests can lead to a multitude of activities. Reactive data issues will likely need a data correction step and potentially some action to prevent the offending code or practice from causing more of the same issue. This also depends on how well prepared a particular company is regarding managing data. Immature companies might fix the current problem but not make the necessary adjustments to prevent the very same error from reoccurring.

Data correction will normally fall into one of the following categories:

  • Data cleansing or scrubbing. Encompasses correcting corrupt, invalid, or inaccurate records. For example, eliminating invalid characters or extra spaces from a set of records.
  • Data standardization. Includes conforming the data to a list of established rules and standards. For example, replacing all variations of Incorporated, including Incorp, with Inc.
  • Data enrichment. Involves augmenting the data by adding some new required information. For example, adding +4 to U.S. zip codes, or adding province information to a particular European country.
  • Data consolidation. Encompasses eliminating duplicates. This particular data correction activity is likely the most time-consuming within a company. Data duplication is one of the major issues causing operational and business intelligence inefficiencies throughout the company. Before duplicates can be identified, it might be necessary to cleanse, standardize, and enrich the data so proper comparison can be achieved.
  • Data validation. This refers to preventing bad data from entering the system by performing some type of validation at the data entry level. For example, accepting only M or F values for gender on an employee or a contact.
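To make the categories more tangible, here is a minimal sketch of cleansing, standardization, and validation using the examples from the list above (extra spaces, variations of Incorporated conformed to Inc., and M/F gender values). The rules and function names are illustrative assumptions, not a prescribed implementation.

```python
import re

def cleanse(name):
    """Data cleansing: strip invalid characters and collapse extra spaces."""
    name = re.sub(r"[^\w\s&.,'-]", "", name)    # drop stray characters
    return re.sub(r"\s+", " ", name).strip()    # collapse whitespace

def standardize(name):
    """Data standardization: conform suffix variations to 'Inc.'."""
    return re.sub(r"\b(incorporated|incorp|inc)\.?$", "Inc.", name, flags=re.I)

def validate_gender(value):
    """Data validation: accept only 'M' or 'F' at the data entry point."""
    return value in ("M", "F")

raw = "  Acme   Widgets   Incorporated "
print(standardize(cleanse(raw)))   # cleanse first, then standardize
print(validate_gender("X"))
```

Running the records through cleansing and standardization before comparing them is also what makes the consolidation category workable: "Acme Widgets Incorp" and "Acme  Widgets Inc" only match after both conform to "Acme Widgets Inc.".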

In addition to identifying the type of data correction required, categorizing the data quality issue is also significant. Data quality issues can be categorized into quality dimensions. Data quality dimensions are explored further later in this chapter, but for the purpose of the following example, understand the completeness dimension as the level of data missing or unusable.

It is important to categorize the type of issue to ensure the correct data analysis measures are taken and a comprehensive correction plan can be mapped out. For example, a request comes in to add +4 to zip codes in some U.S. address records. An advanced data quality management program would identify this as a completeness problem and recommend that this request requires data enrichment. Furthermore, it would engage data governance to understand whether this rule applies to all U.S. addresses. If applicable to all, a data profiling activity is performed to measure the degree of completeness of ZIP+4 in all U.S. addresses. Lastly, three activities would be spawned: correct the records initially requested, correct the remaining records in the system, and prevent users from adding new U.S. addresses without the +4 information. The last activity could be achieved by adding validation at the data entry point, by changing a particular set of business processes, or if prevention is not possible, by establishing scripts to correct the data on a regular schedule.
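The profiling, correction, and prevention steps in the ZIP+4 example could be supported by small routines like the following sketch: measure completeness, identify records for enrichment, and validate new entries. The record layout and regular expression are assumptions for illustration; the actual enrichment (looking up the missing +4 value) would require an external address reference service.

```python
import re

ZIP4 = re.compile(r"^\d{5}-\d{4}$")  # ZIP+4 format, e.g. 12345-6789

addresses = [  # hypothetical U.S. address records
    {"id": 1, "zip": "94105-1420"},
    {"id": 2, "zip": "10001"},       # missing +4: a completeness gap
    {"id": 3, "zip": "60601-0001"},
]

def zip4_completeness(records):
    """Degree of ZIP+4 completeness across all U.S. addresses (profiling)."""
    ok = sum(1 for r in records if ZIP4.match(r["zip"]))
    return ok / len(records)

def needs_enrichment(records):
    """IDs of records to route to the data enrichment step (correction)."""
    return [r["id"] for r in records if not ZIP4.match(r["zip"])]

def validate_entry(zip_code):
    """Entry-point validation to prevent new gaps (prevention)."""
    return bool(ZIP4.match(zip_code))
```

The same completeness measurement can be rerun after the correction pass, turning the one-off request into a repeatable check.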

Deciding who is on the design team can also be a challenge. The data quality forum and data stewards should certainly be highly engaged in the proposed solutions. The same considerations made to the data analysis team are applicable to the data design team. The data design team needs a combination of business and technical skills. Business skill is important to propose a solution that satisfies the needs of the multiple LOBs, and technical skill is important to propose a solution that is feasible.

When proposing a solution, it is necessary to take into account the potential rules, policies, and procedures that could affect the compliance of the resolution. Only conforming solutions should be considered when seeking final approval. If none of them are compliant, an amendment to existing controls should be requested from the data governance team. If an alteration is possible, the solution can move on to the approval stage. If not, the data quality lead/team will need to work with the driver(s) and impacted LOBs to search for alternatives.

Once approval is issued for the solution that best fits the business need, it is time to move on to the proper team(s) for execution.

IT Support/Data Stewards

The previous section described how a data correction request can turn into a data correction activity plus a root cause analysis and a consequent action for error prevention.

The proposed solution and accompanying actions will dictate who needs to be engaged during the execution phase. It is possible that business data stewards need to be involved, or IT, or both. If the data issue is a consequence of some system bug, IT needs to be involved to implement the fixes. IT may also need to be included to write some sort of automated data correction script. Data stewards will likely be involved most of the time. As described in Chapter 5, data stewardship is a business function; therefore, data stewards are ultimately responsible for assuring the data's fitness for use. Even if IT is fixing the problem, data stewards should be engaged in testing the fix and supporting the drivers throughout customer acceptance.

In Figure 6.1, the targets of step 8, execution, are represented as data sources. That is a generic representation. Depending on the MDM implementation, there could be a single source or multiple sources of data. Many operational data sources may exist during the transition to MDM to achieve the ultimate goal of establishing a single system-of-record for a particular set of information. Therefore, the location varies where the data correction and/or associated prevention activity is executed. For example, when transitioning to a central master data repository, the action to fix the data could be changing it in the legacy system before the data is migrated to the new repository. In another example, a bug fix could be required to a particular interface that is bringing the data together. Even on a complete MDM solution implemented with hub and spoke architecture, it is possible the fix is not at the hub, but in one of the spoke systems, or the underlying interface. Companies need to trust the data quality forum team members to obtain a clear understanding of where the correction should take place and engage the proper support teams to execute it accordingly and in harmony.

Metrics

Data quality metrics measure the quality of data so proper actions can be taken to prevent a recurrence of previous issues and a deterioration of data quality over time—especially as new data enters the company at exceedingly rapid rates. Metrics can be separated into two categories: (1) monitors and (2) scorecards. Monitors are used to detect violations that usually require immediate corrective actions. Scorecards allow for a number to be associated with the quality of the data and are more snapshot-in-time reports as opposed to real-time triggers.
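A minimal sketch of the distinction might look like the following, where a monitor checks a single record against rules as it arrives, while a scorecard summarizes a whole batch as a single number. The rules and field names are invented for illustration.

```python
def monitor(record, rules):
    """Monitor: per-record check; returns violations needing immediate action."""
    return [name for name, check in rules.items() if not check(record)]

def scorecard(records, rules):
    """Scorecard: snapshot-in-time score -- the share of rule checks that pass."""
    total = len(records) * len(rules)
    passed = sum(check(r) for r in records for check in rules.values())
    return passed / total

rules = {  # illustrative quality rules, not from any specific program
    "has_email": lambda r: bool(r.get("email")),
    "has_zip":   lambda r: bool(r.get("zip")),
}

batch = [{"email": "a@x.com", "zip": "94105"}, {"email": "", "zip": "10001"}]
print(monitor(batch[1], rules))                           # triggers corrective action
print(f"quality score: {scorecard(batch, rules):.0%}")    # periodic snapshot
```

In practice the monitor would run at data entry or in a scheduled violation check, while the scorecard result would feed a dashboard tracked over time.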

Data quality metrics will be described in more detail in Chapter 8. At their most basic definition, data quality metrics are resources to be used by drivers and data governance to support requirements, prevent degradation in data quality, and address compliance issues.
