Establishing a Data Quality Baseline

“If you cannot measure it, you cannot improve it.” This observation, attributed to Lord Kelvin, is as relevant to data quality today as it was when he made it in 1883 about science in general.

Many companies have no idea about the quality of their data. Most of the time they recognize they have issues, but they don't realize the extent of those issues. As a result, they continue to make wrong assumptions and poor business decisions because they are relying on incorrect information.

What companies need is an assessment of the quality of their information combined with a strong process for continuous improvement. The first step is to get situated by establishing a data quality baseline. This section will discuss what needs to be considered when creating this baseline, while Chapter 8 will describe a data maintenance model for continuous data quality measurement and improvement.

Many companies fail to scope their data quality projects with an emphasis on business needs. It is not data quality for the sake of data quality. It must serve a purpose, and as such, the business must drive what needs to be measured, and why, in a traditional top-down approach.

It is easy for business teams to get overwhelmed with their day-to-day activities and lose sight of data quality improvements because they may not see immediate benefits. It is not uncommon for data quality projects to take a while before they start showing results. That's why a culture of emphasis on quality is so important, and why executive sponsorship is an absolute requirement for success. Furthermore, to overcome potential business paralysis, or a lack of business support, it might be necessary to combine the top-down approach with a bottom-up approach to achieve quicker results.

Before providing an example for organizing a quality baseline for customer data by combining a top-down with a bottom-up approach, let's take a look at three elements that need to be considered when defining, qualifying, classifying, and cataloging a measurement baseline. They are:

1. Context

2. Data quality dimensions

3. Entities and attributes

Context

Bear in mind that quality of data is context dependent. That means a particular data element may be of sufficient quality in one specific context, but not in another.

Figure 6.6 shows an address element in multiple contexts. An address may be accurate from a postal code perspective because it is a valid location, but if it is not a valid address for a given customer, it could be of little use to marketing, for instance. From a finance perspective, it is very important that the address is the correct bill-to location in addition to being a valid address for the customer.

Figure 6.6 Context in Data Quality

img

The actual technique for assessing the quality of the data will differ according to the context with varying degrees of difficulty. In the example in Figure 6.6, determining if an address is a valid location can be achieved by comparing it against a reference provided by the postal office for that particular country. Validating an address for a given customer will require other references, such as Dun & Bradstreet or OneSource, for example. Finally, validating if the address is indeed the correct bill-to location may have to be done through inspection, or with an integrated invoice system that can flag incorrect addresses based upon proper confirmation from the customer.
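To make the first of those checks concrete, here is a minimal Python sketch of postal-location validation against a reference file. The file name, field names, and normalization rules are assumptions for illustration only, not part of any specific postal or vendor API; the customer-match and bill-to validations described above would require additional reference data and are not covered here.

```python
import csv

def normalize(value: str) -> str:
    """Apply simple normalization so minor formatting differences do not cause false mismatches."""
    return " ".join(value.upper().replace(".", "").split())

def load_postal_reference(path: str) -> set:
    """Load a hypothetical postal reference file with columns: street, city, state, postal_code."""
    with open(path, newline="", encoding="utf-8") as f:
        return {
            (normalize(row["street"]), normalize(row["city"]),
             normalize(row["state"]), normalize(row["postal_code"]))
            for row in csv.DictReader(f)
        }

def is_valid_postal_location(address: dict, reference: set) -> bool:
    """First context in Figure 6.6: the address exists as a deliverable postal location."""
    key = (normalize(address["street"]), normalize(address["city"]),
           normalize(address["state"]), normalize(address["postal_code"]))
    return key in reference

# Usage sketch (reference file and address are illustrative):
# reference = load_postal_reference("us_postal_reference.csv")
# is_valid_postal_location({"street": "1 Main St", "city": "Springfield",
#                           "state": "IL", "postal_code": "62701"}, reference)
```

The other two contexts would layer additional checks on top of this one, each with its own reference source and degree of difficulty.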

Another very important context to consider is country or region/territory. Countries have their own address formatting peculiarities and requirements for proper postal delivery. A lot of rules and regulations are country specific, with business units in large corporations operating differently depending on the country where they are located or where their customers are located. Sales and marketing are normally territory driven from a customer operations perspective as well as from a sales force concentration and compensation perspective. All of these factors impact how to evaluate the quality of the information.

The need for contextual understanding will certainly dictate how much work is necessary to break down the evaluation of the information. For example, a postal code validation is typically required by country. If a finance team has different tax rules dependent on a particular address element, it may be necessary to separate that information by region or territory. All of a sudden, Figure 6.6 becomes a matrix of possibilities, such as: Post Office/U.S., Post Office/CA, Post Office/JP, etc.; and Finance/Americas, Finance/EMEA, Finance/APAC, and so on. The number of combinations can grow quickly depending on the amount of segmentation required.

Data Quality Dimensions

A data quality dimension is a characteristic of information that can be used to effectively organize and measure that information and to properly address its data quality needs. Furthermore, data quality dimensions should be aligned with the business processes to be measured.

In essence, data quality dimensions are distinguishable characteristics of a data element, and the actual distinction should be business driven. Since it is possible to characterize a particular data element in so many ways, nearly 200 terms have been used to describe data quality dimensions. There is no single comprehensive list agreed upon by experts, but most would say it includes the following (a brief measurement sketch follows the list):

  • Completeness. Level of data that is missing or unusable.
  • Conformity. Degree of data stored in a nonstandard format.
  • Consistency. Level of conflicting information across records or systems.
  • Accuracy. Degree of agreement with an identified source of correct information.
  • Uniqueness. Level of non-duplicated data.
  • Integrity. Degree to which references between data elements are valid.
  • Timeliness. Degree to which data is current and available for use in the expected time frame.
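As a minimal sketch of how a few of these dimensions can be turned into actual measurements, the following Python example, assuming the records have been loaded into a pandas DataFrame, computes completeness, conformity, and uniqueness scores on a 0 to 100 scale. The column names, sample values, and the five-digit postal-code pattern are hypothetical stand-ins for rules the business would define.

```python
import re
import pandas as pd

# Hypothetical customer records; in practice these would come from the repository.
customers = pd.DataFrame({
    "customer_name": ["Acme Corp", "  ", None, "Globex"],
    "postal_code":   ["94105", "9410", "30301", "94105"],
})

total = len(customers)

# Completeness: share of customer_name values that are neither null nor blank.
non_blank = customers["customer_name"].fillna("").str.strip() != ""
completeness = non_blank.sum() / total * 100

# Conformity: share of postal codes matching a 5-digit U.S.-style pattern (assumed rule).
pattern = re.compile(r"^\d{5}$")
conformity = customers["postal_code"].fillna("").apply(
    lambda v: bool(pattern.match(v))).sum() / total * 100

# Uniqueness: share of rows that are not duplicates of another row.
uniqueness = (~customers.duplicated()).sum() / total * 100

print(f"Completeness: {completeness:.0f}  Conformity: {conformity:.0f}  Uniqueness: {uniqueness:.0f}")
```

Expressing each dimension as a percentage of good data keeps the measurements comparable, which becomes useful when scores are compiled into a baseline report later in this section.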

Entities and Attributes

As explained previously, business paralysis can be an impediment when looking for a clear definition of what needs to be included as part of the data quality baseline. It may be necessary to start a bottom-up approach by using a conceptual, logical, or physical model of the data repository to help identify the main entities and attributes that need to be assessed. Conceptual and logical models are usually easier for nontechnical people to follow, and are probably the preferred models among business users. Referring to the physical model will likely be necessary to obtain details about what is not represented in the other two models. Metadata, when present, can also be very useful in this activity, but more often than not, metadata does not exist at the level of detail needed.

If the repository is relatively unknown, and there is no metadata information and no database models, it will be necessary to obtain table descriptions directly from the database. An extremely useful exploratory activity here would be data profiling to learn more about the data before deciding what needs to be included in the baseline. This is also fairly common when migrating an unknown system into the primary repository.
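As a rough sketch of that exploratory profiling, assuming the repository is a relational database reachable from Python (the connection string and table name below are placeholders), a few basic per-column statistics already reveal a lot about which attributes deserve a place in the baseline:

```python
import pandas as pd
from sqlalchemy import create_engine

# Connection string and table name are placeholders for the unknown repository.
engine = create_engine("postgresql://user:password@host/customer_db")
customer = pd.read_sql_table("customer", engine)

# Basic profile: how populated, how varied, and what a typical value looks like per column.
profile = pd.DataFrame({
    "null_pct": customer.isna().mean() * 100,        # share of missing values
    "distinct": customer.nunique(),                  # cardinality
    "sample_value": customer.apply(                  # one example value per column
        lambda col: col.dropna().iloc[0] if col.notna().any() else None),
})
print(profile.sort_values("null_pct", ascending=False))
```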

To illustrate, let's use a simplified conceptual model entity relationship diagram (ERD), shown in Figure 6.7, which represents the most common entities in a customer data repository. Please note that all database examples and representations will be given using relational database notation and nomenclature, since that is the dominant technology today for data management systems.

Figure 6.7 Simplified Conceptual Model ERD for Customer Information

img

The model in Figure 6.7 only shows entities, but it can be used as a starting point to question the business for feedback. Chances are the model alone is not sufficient, and it will be necessary to drill down into the attributes those entities carry to determine what needs to be included in the baseline. Some logical models do include attribute information, as do physical models and data dictionaries. If none of those is available, it might be necessary to resort to a direct description of the database tables, which can easily be provided by a database administrator (DBA).

Remember, however, that those descriptions will simply list the names of the attributes and their types, such as integer, character, and so on. A well-designed database will have some minimum standards that can facilitate the understanding of some of the attributes/columns. For example, an operational database will likely be in third normal form (3NF), and its primary and foreign keys will probably follow a standard nomenclature, making it relatively easy to spot relationships between entities.
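As an illustrative sketch, assuming the database exposes the ANSI-standard information_schema views (available in most relational databases, though their exact contents vary by vendor; the connection string and table name are placeholders), that kind of description can also be pulled directly with a query:

```python
from sqlalchemy import create_engine, text

# Connection string is a placeholder; information_schema is part of the SQL standard,
# although its exact contents vary slightly by vendor.
engine = create_engine("postgresql://user:password@host/customer_db")

DESCRIBE = text("""
    SELECT column_name, data_type, is_nullable
    FROM information_schema.columns
    WHERE table_name = :table
    ORDER BY ordinal_position
""")

with engine.connect() as conn:
    for column_name, data_type, is_nullable in conn.execute(DESCRIBE, {"table": "customer"}):
        print(f"{column_name:<30} {data_type:<20} nullable={is_nullable}")
```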

Whatever technique is used, whether models, data dictionaries, a metadata repository, existing documentation, or direct table descriptions, it should be possible to compile a list of the most important entities and attributes to be further analyzed. Obviously, importance is subjective, but keep in mind that it must serve a business purpose. When in doubt, include the entity and attribute until further clarification and review with the business.

One thing to keep in mind, though: the business doesn't know what it doesn't know. It is very common for LOBs to be reactive rather than proactive with regard to data issues. A lot of the time they don't specify a particular data constraint because they assume there is no problem. Later, when they are presented with a given scenario, they come to the painful realization that there is a problem. This is why, even though the main goal is to create a baseline that is representative of the actual business needs, it is vital to be forward thinking and proactive in identifying what has potentially been overlooked. A bottom-up approach can assist with creating a more comprehensive list with elements that went unnoticed by the business.

Next are some sample entities, attributes, data quality dimensions, and contexts used to describe a method for compiling a data quality baseline that meets business needs and goals.

Putting It All Together

Chapter 2 makes the argument that combining a top-down approach with a bottom-up approach is critical to fully harvest the benefits of data management. Creating a data quality baseline is a concrete application of this concept, where the proper combination is normally necessary to arrive at a consolidated list of entities and attributes to be included in a baseline for quality assessment. Context and data quality dimensions are added to the mix to help categorize the many elements and their many facets of data quality.

Figure 6.8 depicts a diagram combining a top-down with a bottom-up approach for compiling and classifying a list of elements to be measured when creating a data quality baseline.

Figure 6.8 Creating a Data Quality (DQ) Baseline

img

A sample baseline report is depicted in Table 6.1.

Table 6.1 Sample Baseline Report.

img

img

Most of the columns in Table 6.1 were discussed previously, such as context, entity, attribute, and DQ dimension. But let's take a look at each of them:

  • Originator: used to track where the requirement came from. For top-down requirements, originators are likely to be a particular LOB, or a combination of LOBs, data governance, or data stewardship. Bottom-up requirements were likely identified by the data quality team, or potentially by the data governance team, which is likely engaged in this activity, as well.
  • Context: discussed in detail previously. It is not uncommon for originator and context to carry the same label. After all, a given requirement by finance, for example, could very well be applicable to that particular context. But they are not the same thing. Originator is simply who made the request, while context takes into account all the conditions affecting the assessment of the quality of the data.
  • Entity: discussed in detail previously. Both top-down and bottom-up requirements can be traced to a given entity or combination of entities.
  • Attribute: discussed in detail previously. Entities combined with attributes ultimately identify what needs to be measured.
  • Data quality dimension: discussed in detail previously. Used to distinguish the characteristic of a data element. Can be further used to roll up the quality of the data into different views, which will be explained in more detail in Chapter 8.
  • Data quality rule: a detailed description of the rule defining what is being measured. It is very important to be precise in the description to avoid incorrect interpretations. For example, when measuring completeness of a name, what does that really mean? Is it any combination of characters, with the exception of nulls and blanks? Is there a minimum acceptable length?
  • Score: a measure of quality. It is important to be consistent with the score and use the same approach for all elements. Don't use a high number for one element to indicate high quality while using a low number for another element to indicate high quality as well. For example, a high number for completeness is normally a good indication, while a high number for duplicates is not; therefore, measure uniqueness instead of duplicates, since a high number for uniqueness is good. A common practice is to use a score between 0 and 100 representing the percentage of good data, so a score of 80 indicates that 80 percent of the data follows the specified rule (see the scoring sketch after this list).
  • Threshold (not in Table 6.1): can be used to define acceptable tolerances. When creating an initial baseline, this particular characteristic may not be of ultimate importance. However, as the understanding of the data increases, this attribute becomes critical. More about thresholds will be presented in Chapter 8.
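As a minimal sketch of how one row of such a baseline report can be represented and scored in Python, the example below pairs the descriptive columns with an executable rule. The entity, attribute, rule, and sample data are hypothetical stand-ins for whatever the business actually defines.

```python
from dataclasses import dataclass
from typing import Callable
import pandas as pd

@dataclass
class BaselineRule:
    originator: str      # who requested the measurement (LOB, governance, DQ team)
    context: str         # conditions under which quality is assessed
    entity: str
    attribute: str
    dimension: str       # data quality dimension being measured
    description: str     # precise statement of the rule
    check: Callable[[pd.Series], pd.Series]  # returns True for each value that passes

    def score(self, data: pd.DataFrame) -> float:
        """Percentage (0-100) of values that satisfy the rule."""
        passed = self.check(data[self.attribute])
        return round(passed.mean() * 100, 1)

# Hypothetical rule: customer names must be non-blank and at least two characters long.
rule = BaselineRule(
    originator="Marketing", context="Global", entity="Customer",
    attribute="customer_name", dimension="Completeness",
    description="customer_name is not null, not blank, and at least 2 characters",
    check=lambda s: s.fillna("").str.strip().str.len() >= 2,
)

customers = pd.DataFrame({"customer_name": ["Acme Corp", "", None, "Globex"]})
print(rule.score(customers))  # 50.0 -> half the records satisfy the rule
```

Keeping the rule executable alongside its description makes it straightforward to recompute scores as the data or the rules evolve.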

Obviously, more columns can be added to the sample assessment in Table 6.1 according to a particular situation. It is easy to envision how the report can become very lengthy and how difficult the elements can be to compile, let alone measure every single one of them. Measuring the data according to the predefined rules can be a daunting task. The good news is that there are automated technologies to support this effort. It will require an excellent data-profiling tool combined with a very skilled group of professionals. It is quite possible that not all rules can be fully measured; therefore, sampling techniques and other statistical data analysis must be considered as part of the profiling exercise.
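Where measuring a rule against the full repository is impractical, one common approach, sketched below under the assumption of simple random sampling, is to estimate the score from a sample and attach an approximate confidence interval using the normal approximation. The sample size, data, and rule are illustrative.

```python
import math
import random

def estimate_score(values, rule, sample_size=1000, z=1.96):
    """Estimate the percentage of values passing `rule` from a simple random sample,
    with an approximate 95% confidence interval (normal approximation)."""
    sample = random.sample(values, min(sample_size, len(values)))
    p = sum(1 for v in sample if rule(v)) / len(sample)
    margin = z * math.sqrt(p * (1 - p) / len(sample))
    return p * 100, margin * 100

# Illustrative data: 100,000 postal codes, roughly 10% of them malformed.
population = ["94105"] * 90_000 + ["BAD"] * 10_000
score, margin = estimate_score(population, lambda v: v.isdigit() and len(v) == 5)
print(f"Estimated score: {score:.1f}% +/- {margin:.1f}%")
```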

One may argue that a bottom-up approach can lead to a set of rules that are difficult to measure and is consequently a wasted effort, since it doesn't necessarily meet a particular business need. Implementing a phased baseline can mitigate this scenario. Keep in mind that the purpose of using the bottom-up approach is to show the business what they are potentially missing (remember, the business doesn't know what it doesn't know!). That means one can start a baseline of bottom-up elements with a simple set of rules; as the results uncover discrepancies, the baseline can be refined further as needed.

The step of creating a baseline and the one described in the next section, data alignment and fitness assessment, may require a few iterations before the actual baseline can be fully completed.
