8 Reference and Master Data Management

Reference and Master Data Management is the sixth Data Management Function in the data management framework shown in Figures 1.3 and 1.4. It is the fifth data management function that interacts with and is influenced by the Data Governance function. Chapter 8 defines the reference and master data management function and explains the concepts and activities involved in reference and master data management.

8.1 Introduction

In any organization, different groups, processes, and systems need the same information. Data created in early processes should provide the context for data created in later processes. However, different groups use the same data for different purposes. Sales, Finance, and Manufacturing departments all care about product sales, but each department has different data quality expectations. Such purpose-specific requirements lead organizations to create purpose-specific applications, each with similar but inconsistent data values in differing formats. These inconsistencies have a dramatically negative impact on overall data quality.

Reference and Master Data Management is the ongoing reconciliation and maintenance of reference data and master data.

  • Reference Data Management is control over defined domain values (also known as vocabularies), including control over standardized terms, code values and other unique identifiers, business definitions for each value, business relationships within and across domain value lists, and the consistent, shared use of accurate, timely and relevant reference data values to classify and categorize data.
  • Master Data Management is control over master data values to enable consistent, shared, contextual use across systems, of the most accurate, timely, and relevant version of truth about essential business entities.

Reference data and master data provide the context for transaction data. For example, a customer sales transaction identifies the customer, the employee making the sale, and the product or service sold, as well as additional reference data such as the transaction status and any applicable accounting codes. Other reference data elements are derived, such as the product type and the sales quarter.

As of publication of this guide, no single term that encompasses both reference and master data management has been popularized. Sometimes one or the other term is used to refer to both. In any conversation using these terms, it is wise to clarify what each participant means by each term.

The context diagram for Reference and Master Data Management is shown in Figure 8.1. The quality of transaction data is very dependent on the quality of reference and master data. Improving the quality of reference and master data improves the quality of all data and has a dramatic impact on business confidence about its own data.

Figure 8.1 Reference and Master Data Management Context Diagram

Hence, all reference and master data management programs are specialized data quality improvement programs, requiring all the data quality management activities described in Chapter 12. These programs are also dependent on active data stewardship and the data governance activities described in Chapter 3. Reference and master data management is most successful when funded as an on-going data quality improvement program, not a single, one-time-only project effort.

The cost and complexity of each program is determined by the business drivers requiring the effort. The two most common drivers for Reference and Master Data Management are:

  • Improving data quality and integration across data sources, applications, and technologies
  • Providing a consolidated, 360-degree view of information about important business parties, roles and products, particularly for more effective reporting and analytics.

Given the cost and complexity of the effort, implement any overall solution iteratively, with clear understanding of the business drivers, supported by existing standards as well as prior lessons learned, in close partnership with business data stewards.

8.2 Concepts and Activities

While both reference data management and master data management share similar purposes and many common activities and techniques, there are some distinct differences between the two functions. In reference data management, business data stewards maintain lists of valid data values (codes, and so on) and their business meanings, through internal definition or external sourcing. Business data stewards also manage the relationships between reference data values, particularly in hierarchies.

Master data management requires identifying and / or developing a “golden” record of truth for each product, place, person, or organization. In some cases, a “system of record” provides the definitive data about an instance. However, even one system may accidentally produce more than one record about the same instance. A variety of techniques are used to determine, as closely as possible, the most accurate and timely data about the instance.

Once the most accurate, current, relevant values are established, reference and master data is made available for consistent shared use across both transactional application systems and data warehouse / business intelligence environments. Sometimes data replicates and propagates from a master database to one or more other databases. Other applications may read reference and master data directly from the master database.

Reference and master data management occurs in both online transaction processing (OLTP) and in data warehousing and business intelligence environments. Ideally, all transaction-processing databases use the same golden records and values. Unfortunately, most organizations have inconsistent reference and master data across their transaction systems, requiring data warehousing systems to identify not only the most truthful system of record, but the most accurate, golden reference and master data values. Much of the cost of data warehousing is in the cleansing and reconciliation of reference and master data from disparate sources. Sometimes, organizations even maintain slowly changing reference data in dimensional tables, such as organizational and product hierarchies, within the data warehousing and business intelligence environment, rather than maintaining the data in a master operational database and replicating to other operational databases and to data warehouses.

To share consistent reference and master data across applications effectively, organizations need to understand:

  • Who needs what information?
  • What data is available from different sources?
  • How does data from different sources differ? Which values are the most valid (the most accurate, timely, and relevant)?
  • How can inconsistencies in such information be reconciled?
  • How can the most valid values be shared effectively and efficiently?

8.2.1 Reference Data

Reference data is data used to classify or categorize other data. Business rules usually dictate that reference data values conform to one of several allowed values. The set of allowable data values is a value domain. Some organizations define reference data value domains internally, such as Order Status: New, In Progress, Closed, Cancelled, and so on. Other reference data value domains are defined externally as government or industry standards, such as the two-letter United States Postal Service standard postal code abbreviations for U.S. states, such as CA for California.

More than one set of reference data value domains may refer to the same conceptual domain. Each value is unique within its own value domain. For example, each state may have:

  • An official name (“California”).
  • A legal name (“State of California”).
  • A standard postal code abbreviation (“CA”).
  • An International Organization for Standardization (ISO) standard code (“US-CA”).
  • A United States Federal Information Processing Standards (FIPS) code (“06”).

Reference data exists in virtually every database across the organization. Reference tables (sometimes called code tables) link via foreign keys to other relational database tables, and the referential integrity functions within the database management system ensure that only valid values from the reference tables are used in other tables.

Some reference data sets are just simple two-column value lists, pairing a code value with a code description, as shown in Table 8.1. The code value, taken from the ISO 3166-1993 Country Code List, is the primary identifier, the short form reference value that appears in other contexts. The code description is the more meaningful name or label displayed in place of the code on screens, drop-down lists, and reports.

Code Value | Description
US | United States of America
GB | United Kingdom (Great Britain)

Table 8.1 Sample ISO Country Code Reference Data

Note that in this example, the code value for United Kingdom is GB according to international standards, and not UK, even though UK is a common short form used in many forms of communication.
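
The referential integrity mechanism described above can be shown in a short sketch. The following Python example uses the standard library sqlite3 module; the table and column names (country_code, customer) are illustrative assumptions, not drawn from any particular system.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

    # The reference (code) table: a code value paired with its description
    conn.execute("""CREATE TABLE country_code (
        code        TEXT PRIMARY KEY,
        description TEXT NOT NULL)""")
    conn.executemany("INSERT INTO country_code VALUES (?, ?)",
                     [("US", "United States of America"),
                      ("GB", "United Kingdom (Great Britain)")])

    # Another table references the code table through a foreign key
    conn.execute("""CREATE TABLE customer (
        customer_id  INTEGER PRIMARY KEY,
        name         TEXT NOT NULL,
        country_code TEXT NOT NULL REFERENCES country_code (code))""")

    conn.execute("INSERT INTO customer VALUES (1, 'Acme Ltd.', 'GB')")  # valid code
    try:
        conn.execute("INSERT INTO customer VALUES (2, 'Bad Co.', 'UK')")
    except sqlite3.IntegrityError as err:
        print("Rejected:", err)  # the DBMS rejects 'UK' because only US and GB are valid values

An application reading reference data directly from a hub database, rather than holding a replicated copy, must perform the equivalent check in its own code.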

Some reference data sets cross-reference multiple code values representing the same things. Different application databases may use different code sets to represent the same conceptual attribute. A master cross-reference data set enables translation from one code set to another. Note that numeric codes, such as the FIPS numeric state codes shown in Table 8.2, consist of digits, but they are identifiers rather than quantities; arithmetic functions cannot meaningfully be performed on them.

USPS State Code | ISO State Code | FIPS Numeric State Code | State Abbreviation | State Name | Formal State Name
CA | US-CA | 06 | Calif. | California | State of California
KY | US-KY | 21 | Ky. | Kentucky | Commonwealth of Kentucky
WI | US-WI | 55 | Wis. | Wisconsin | State of Wisconsin

Table 8.2 Sample State Code Cross-Reference Data
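
A minimal sketch of translating between the code sets in Table 8.2. The list-of-rows structure and the scheme names (usps, iso, fips) are illustrative stand-ins for a master cross-reference table.

    STATE_XREF = [
        {"usps": "CA", "iso": "US-CA", "fips": "06", "name": "California"},
        {"usps": "KY", "iso": "US-KY", "fips": "21", "name": "Kentucky"},
        {"usps": "WI", "iso": "US-WI", "fips": "55", "name": "Wisconsin"},
    ]

    def translate(value, from_scheme, to_scheme):
        """Translate one code value between coding schemes via the cross-reference."""
        for row in STATE_XREF:
            if row[from_scheme] == value:
                return row[to_scheme]
        raise KeyError(f"{value!r} not found in scheme {from_scheme!r}")

    print(translate("06", "fips", "usps"))  # -> CA
    print(translate("KY", "usps", "iso"))   # -> US-KY

Note that the FIPS code is held as the string "06" rather than the number 6; preserving the leading zero matters precisely because these codes are identifiers, not quantities.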

Some reference data sets also include business definitions for each value. Definitions provide differentiating information that the label alone does not provide. Definitions rarely display on reports or drop-down lists, but they may appear in the Help function for applications, guiding the appropriate use of codes in context.

Using the example of help desk ticket status in Table 8.3, without a definition of what each code value indicates, ticket status cannot be tracked effectively and accurately. This type of differentiation is especially necessary for classifications driving performance metrics or other business intelligence analytics.

Code | Description | Definition
1 | New | Indicates a newly created ticket without an assigned resource
2 | Assigned | Indicates a ticket that has a named resource assigned
3 | Work In Progress | Indicates the assigned resource started working on the ticket
4 | Resolved | Indicates request is assumed to be fulfilled per the assigned resource
5 | Cancelled | Indicates request was cancelled based on requester interaction
6 | Pending | Indicates request cannot proceed without additional information
7 | Fulfilled | Indicates request was fulfilled and verified by the requester

Table 8.3 Sample Help Desk Reference Data
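
A minimal sketch of how an application might hold the Table 8.3 vocabulary, pairing each code with its description and business definition so that a Help function can surface the definition in context. The structure and function name are illustrative.

    TICKET_STATUS = {
        1: ("New", "Newly created ticket without an assigned resource"),
        2: ("Assigned", "Ticket that has a named resource assigned"),
        3: ("Work In Progress", "Assigned resource started working on the ticket"),
        4: ("Resolved", "Request assumed fulfilled per the assigned resource"),
        5: ("Cancelled", "Request cancelled based on requester interaction"),
        6: ("Pending", "Request cannot proceed without additional information"),
        7: ("Fulfilled", "Request fulfilled and verified by the requester"),
    }

    def help_text(code):
        """Return the label and definition for display in an application Help function."""
        description, definition = TICKET_STATUS[code]
        return f"{description}: {definition}"

    print(help_text(4))  # Resolved: Request assumed fulfilled per the assigned resource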

Some reference data sets define a taxonomy of data values, specifying hierarchical relationships between the values, such as the United Nations Standard Products and Services Code (UNSPSC) shown in Table 8.4. Taxonomic reference data captures information at different levels of specificity, while each level provides an accurate view of the information.

Taxonomic reference data can be important in many contexts, most significantly for content classification, multi-faceted navigation, and business intelligence. In traditional relational databases, taxonomic reference data would be stored in a recursive relationship. Taxonomy management tools usually maintain hierarchical information, among other things.

Code Value | Description | Parent Code
10161600 | Floral plants | 10160000
10161601 | Rose plants | 10161600
10161602 | Poinsettias plants | 10161600
10161603 | Orchids plants | 10161600
10161700 | Cut flowers | 10160000
10161705 | Cut roses | 10161700

Table 8.4 Sample Hierarchical Reference Data
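
In a relational database, the Parent Code column of Table 8.4 is a recursive relationship: each row refers to another row of the same table. A minimal sketch, assuming the parent map below mirrors Table 8.4, of walking that hierarchy so that detail-level codes can roll up to any ancestor level:

    # code -> parent code, taken from the rows of Table 8.4
    PARENT = {
        "10161600": "10160000",  # Floral plants
        "10161601": "10161600",  # Rose plants
        "10161602": "10161600",  # Poinsettias plants
        "10161603": "10161600",  # Orchids plants
        "10161700": "10160000",  # Cut flowers
        "10161705": "10161700",  # Cut roses
    }

    def ancestors(code):
        """Return the chain of ancestor codes, from most to least specific."""
        chain = []
        while code in PARENT:
            code = PARENT[code]
            chain.append(code)
        return chain

    print(ancestors("10161705"))  # ['10161700', '10160000']: cut roses roll up to cut flowers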

Meta-data about reference data sets may document:

  • The meaning and purpose of each reference data value domain.
  • The reference tables and databases where the reference data appears.
  • The source of the data in each table.
  • The version currently available.
  • When the data was last updated.
  • How the data in each table is maintained.
  • Who is accountable for the quality of the data and meta-data.
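
A minimal sketch of recording such meta-data, as a Python dataclass with one field per item in the list above. All field names and sample values are illustrative assumptions.

    from dataclasses import dataclass
    from datetime import date

    @dataclass
    class ReferenceDataSetMetadata:
        domain_purpose: str       # meaning and purpose of the value domain
        tables: list              # reference tables / databases where the data appears
        source: str               # source of the data in each table
        version: str              # version currently available
        last_updated: date        # when the data was last updated
        maintenance_process: str  # how the data in each table is maintained
        steward: str              # who is accountable for data and meta-data quality

    country_codes = ReferenceDataSetMetadata(
        domain_purpose="Classify parties and locations by country",
        tables=["refdb.country_code"],
        source="ISO 3166 country code list",
        version="ISO 3166-1993",
        last_updated=date(2009, 1, 1),  # illustrative date
        maintenance_process="Steward review of ISO change notices",
        steward="Reference data steward (example role)",
    )
    print(country_codes.version)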

Reference data value domains change slowly. Business data stewards should maintain reference data values and associated meta-data, including code values, standard descriptions, and business definitions. Communicate to consumers any additions and changes to reference data sets.

Business data stewards serve not only as the accountable authority for internally defined reference data sets, but also as the accountable authority on externally defined standard reference data sets, monitoring changes and working with data professionals to update externally defined reference data when it changes.

8.2.2 Master Data

Master data is data about the business entities that provide context for business transactions. Unlike reference data, master data values are usually not limited to pre-defined domain values. However, business rules typically dictate the format and allowable ranges of master data values. Common organizational master data includes data about:

  • Parties, including individuals and organizations, and the roles they play, such as customers, citizens, patients, vendors, suppliers, business partners, competitors, employees, students, and so on.
  • Products, both internal and external.
  • Financial structures, such as general ledger accounts, cost centers, profit centers, and so on.
  • Locations, such as addresses.

Master data is the authoritative, most accurate data available about key business entities, used to establish the context for transactional data. Master data values are considered golden.

The term master data management has its roots in the term master file, a phrase coined before databases became commonplace. Some believe master data management (MDM) to be a fashionable buzzword, soon to be replaced by some other new buzzword. However, the need for high quality reference and master data is timeless and the techniques and activities of reference and master data management will be valuable for many years to come.

Master Data Management is the process of defining and maintaining how master data will be created, integrated, maintained, and used throughout the enterprise. The challenges of MDM are 1) to determine the most accurate, golden data values from among potentially conflicting data values, and 2) to use the golden values instead of other less accurate data. Master data management systems attempt to determine the golden data values and then make that data available wherever needed.

MDM can be implemented through data integration tools (such as ETL), data cleansing tools, operational data stores (ODS) that serve as master data hubs, or specialized MDM applications. There are three primary MDM focus areas:

  1. Identification of duplicate records within and across data sources to build and maintain global IDs and associated cross-references to enable information integration.
  2. Reconciliation across data sources and providing the “golden record” or the best version of the truth. These consolidated records provide a merged view of the information across systems and seek to address name and address inconsistencies.
  3. Provision of access to the golden data across applications, either through direct reads, or by replication feeds to OLTP and DW / BI databases.

MDM challenges organizations to discover:

  • What are the important roles, organizations, places, and things referenced repeatedly?
  • What data is describing the same person, organization, place, or thing?
  • Where is this data stored? What is the source for the data?
  • Which data is more accurate? Which data source is more reliable and credible? Which data is most current?
  • What data is relevant for specific needs? How do these needs overlap or conflict?
  • What data from multiple sources can be integrated to create a more complete view and provide a more comprehensive understanding of the person, organization, place, or thing?
  • What business rules can be established to automate master data quality improvement by accurately matching and merging data about the same person, organization, place, or thing?
  • How do we identify and restore data that was inappropriately matched and merged?
  • How do we provide our golden data values to other systems across the enterprise?
  • How do we identify where and when data other than the golden values is used?

Different groups that interact with different parties have different data quality needs and expectations. Many data inconsistencies cannot be resolved through automated programs and need to be resolved through data governance.

MDM solution requirements may be different, depending on the type of master data (party, financial, product, location, and so on) and the type of support transactions need. Implement different solution architectures based on the solution needs, structure of the organization, and business drivers for MDM. MDM data hubs and applications may specialize in managing particular master data subject areas.

8.2.2.1 Party Master Data

Party master data includes data about individuals, organizations, and the roles they play in business relationships. In the commercial environment, this includes customer, employee, vendor, partner, and competitor data. In the public sector, the focus is on data about citizens. In law enforcement, the focus is on suspects, witnesses, and victims. In not-for-profit organizations, the focus is on members and donors. In healthcare, the focus is on patients and providers, while in education, the focus is on students and faculty.

Customer relationship management (CRM) systems perform MDM for customer data, in addition to other business functions. MDM for customer data is also called Customer Data Integration (CDI). CRM databases attempt to provide the most complete and accurate information about each and every customer. CRM systems compare customer data from multiple sources. An essential aspect of CRM is identifying duplicate, redundant, and conflicting data about the same customer.

  • Is this data about the same customer or two different customers?
  • If the data is about the same customer, which data values conflict, and which are more accurate? Which data sources are more trustworthy?

Other systems may perform similar MDM functions for individuals, organizations and their roles. For example, human resource management (HRM) systems manage master data about employees and applicants. Vendor management systems manage master data about suppliers.

Regardless of industry, managing business party master data poses unique challenges due to:

  • The complexity of roles and relationships played by individuals and organizations.
  • Difficulties in unique identification.
  • The high number of data sources.
  • The business importance and potential impact of the data.

MDM is particularly challenging for parties playing multiple roles.

8.2.2.2 Financial Master Data

Financial master data includes data about business units, cost centers, profit centers, general ledger accounts, budgets, projections, and projects. Typically, an Enterprise Resource Planning (ERP) system serves as the central hub for financial master data (chart of accounts), with project details and transactions created and maintained in one or more spoke applications. This is especially common in organizations with distributed back-office functions.

Financial MDM solutions focus on not only creating, maintaining, and sharing information, but also simulating how changes to existing financial data may affect the organization’s bottom line, such as budgeting and projections. Financial master data simulations are often part of business intelligence reporting, analysis, and planning modules with a focus on hierarchy management. Model different versions of financial structures to understand potential financial impacts. Once a decision is made, the agreed upon structural changes can be disseminated to all appropriate systems.

8.2.2.3 Product Master Data

Product master data can focus on an organization’s internal products or services or the entire industry, including competitor products, and services. Product master data may exist in structured or unstructured formats. It may include information about bill-of-materials component assemblies, part / ingredient usage, versions, patch fixes, pricing, discount terms, auxiliary products, manuals, design documents and images (CAD drawings), recipes (manufacturing instructions), and standard operating procedures. Specialized systems or ERP applications can enable product master data management.

Product Lifecycle Management (PLM) focuses on managing the lifecycle of a product or service from its conception (such as research), through its development, manufacturing, sale / delivery, service, and disposal. Organizations implement PLM systems for a number of reasons. PLM can help reduce time to market by leveraging prior information while improving overall data quality. In industries with long product development cycles (as much as 8 to 12 years in the pharmaceutical industry), PLM systems enable cross-process cost and legal agreements tracking as product concepts evolve from one idea to many potential products under different names and potentially different licensing agreements.

8.2.2.4 Location Master Data

Location master data provides the ability to track and share reference information about different geographies, and create hierarchical relationships or territories based on geographic information to support other processes. The distinction between reference and master data particularly blurs between location reference data and location master data:

  • Location reference data typically includes geopolitical data, such as countries, states / provinces, counties, cities / towns, postal codes, geographic regions, sales territories, and so on.
  • Location master data includes business party addresses and business party location, and geographic positioning coordinates, such as latitude, longitude, and altitude.

Different industries require specialized earth science data (geographic data about seismic faults, flood plains, soil, annual rainfall, and severe weather risk areas) and related sociological data (population, ethnicity, income, and terrorism risk), usually supplied from external sources.

8.2.3 Understand Reference and Master Data Integration Needs

Reference and master data requirements are relatively easy to discover and understand for a single application. It is much more difficult to develop an understanding of these needs across applications, especially across the entire enterprise. Analyzing the root causes of a data quality problem usually uncovers requirements for reference and master data integration. Organizations that have successfully managed reference and master data have focused on one subject area at a time. They analyze all occurrences of a few business entities, across all physical databases and for differing usage patterns.

8.2.4 Identify Reference and Master Data Sources and Contributors

Successful organizations first understand the needs for reference and master data. Then they trace the lineage of this data to identify the original and interim source databases, files, applications, organizations, and even the individual roles that create and maintain the data. Understand both the up-stream sources and the down-stream needs to capture quality data at its source.

8.2.5 Define and Maintain the Data Integration Architecture

As discussed in Chapter 4, effective data integration architecture controls the shared access, replication, and flow of data to ensure data quality and consistency, particularly for reference and master data. Without data integration architecture, local reference and master data management occurs in application silos, inevitably resulting in redundant and inconsistent data.

There are several basic architectural approaches to reference and master data integration. Sometimes an authoritative source is easily identifiable and officially established as the system of record.

A code management system may be the system of record for many reference data sets. Its database would be the database of record. In Figure 8.2, the database of record serves as a reference data “hub” supplying reference data to other “spoke” applications and databases. Some applications can read reference and master data directly from the database of record. Other applications subscribe to published, replicated data from the database of record. Applications reading directly from a hub database must manage their own referential integrity in application code, while application databases with replicated data can implement referential integrity through the DBMS.

Figure 8.2 Reference Data Management Architecture Example

Replicated data may update other databases in real time (synchronous, coordinated updates). More commonly, replicated data is pushed to other application databases through a subscribe-and-publish approach in near-real time (asynchronous updates) as changes are made to the database of record. In other circumstances, snapshot data can be replicated as needed (pulled) from the database of record. For example, an insurance company’s claims system might be a purchased application package with its own database, with policy data replicated from the policy database of record as related claims go through processing, reflecting the current state of the policy at that point in time.
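
A minimal sketch of the asynchronous subscribe-and-publish pattern: the database of record pushes each change to its subscribers, which apply it to their local replicas. Real implementations rely on replication tools or messaging middleware; the class and value names here are illustrative.

    class DatabaseOfRecord:
        """Holds golden values and pushes changes to subscribers as they occur."""

        def __init__(self):
            self.rows = {}
            self.subscribers = []

        def subscribe(self, apply_change):
            self.subscribers.append(apply_change)

        def upsert(self, key, value):
            self.rows[key] = value
            for apply_change in self.subscribers:  # near-real-time push to each replica
                apply_change(key, value)

    claims_db, warehouse = {}, {}
    record = DatabaseOfRecord()
    record.subscribe(lambda k, v: claims_db.__setitem__(k, v))
    record.subscribe(lambda k, v: warehouse.__setitem__(k, v))

    record.upsert("POL-1001", "policy status: active")
    print(claims_db == warehouse == {"POL-1001": "policy status: active"})  # True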

Each master data subject area will likely have its own unique system of record. The human resource system usually serves as the system of record for employee data. A CRM system might serve as the system of record for customer data, while an ERP system might serve as the system of record for financial and product data. Each system’s database may serve as the authoritative master data hub for the master data about its specialization.

Only the reference or master database of record should be the source system for replicated reference or master data supplied to data warehouses and data marts, as shown in Figure 8.3. Updates to reference and master data should be made in the database of record, not in data warehouses and data marts.

Figure 8.3 Master Data Management Architecture Example

Having many authoritative databases of record can create a very complex data integration environment. An alternative implementation of the basic “hub and spokes” design is to have each database of record provide its authoritative reference and master data to a master data operational data store (ODS) that serves as the hub for all reference and master data for all OLTP applications. Some applications may even use the ODS as their driving database, while other applications have their own specialized application databases with replicated data supplied from the ODS data hub through a “subscribe and publish” approach.

In Figure 8.4, four different systems of record (A, B, C, and D) provide four different master subject areas. System A does not need data from Systems B, C, and D, and so provides direct updates to “A” master data without its own database. Systems B, C, and D have their own application databases. System B reads “A” master data directly from the ODS, and provides the ODS with master data about “B”. System C provides the ODS with “C” master data. Like System B, it also reads “A” master data directly from the ODS, but it subscribes to replicated “B” master data from the ODS. System D supplies “D” master data to the ODS, and receives feeds from the ODS for master data about subject areas A, B and C.

The primary advantage of this design is the standardization of interfaces to the ODS and the elimination of point-to-point interfaces. This advantage simplifies maintenance changes.

Figure 8.4 Reference and Master Data Hub Operational Data Store (ODS)

The data hub model is particularly useful when there is no clear system of record for the master data. In this case, multiple systems supply data. New data or updates from one system may need to be reconciled with data already supplied by another system. The ODS becomes the primary (if not sole) source of the data warehouse, reducing the complexity of extracts and the processing time for data transformation, cleansing and reconciliation. Of course, data warehouses must reflect historical changes made to the ODS, while the ODS may need to only reflect the current state.

The data integration architecture should also provide common data integration services, as shown in Figure 8.5. These services include:

  • Change request processing, including review and approval.
  • Data quality checks on externally acquired reference and master data.
  • Consistent application of data quality rules and matching rules.
  • Consistent patterns of processing.
  • Consistent meta-data about mappings, transformations, programs and jobs.
  • Consistent audit, error resolution and performance monitoring data.
  • Consistent approaches to replicating data (including “subscribe and publish”).

Figure 8.5 Data Integration Services Architecture

To reconcile inconsistent reference and master data effectively, it is important to both identify which data elements are represented inconsistently, and determine how best to represent the data. Establishing master data standards can be a time-consuming task, as it may involve multiple stakeholders. Training may also be required for those who have been used to seeing data in other formats. Apply the same data standards, regardless of integration technology, to enable effective standardization, sharing, and distribution of reference and master data.

8.2.6 Implement Reference and Master Data Management Solutions

Reference and master data management solutions cannot be implemented overnight. Given the variety, complexity, and instability of requirements, no single solution or implementation project is likely to meet all reference and master data management needs. Organizations should expect to implement reference and master data management solutions iteratively and incrementally through several related projects and phases, guided by their architecture, business priorities, and an implementation program roadmap.

Some organizations may have a centralized code management system that provides business data stewards with a common, consistent facility for maintaining golden, authoritative reference data values. The code management system serves as the system of record for reference data under its control. Other systems requiring access to reference data either read it directly from the code management database, or receive replicated data from the central code management database as it updates. These other systems include both transaction management systems and data warehouses. Despite best efforts, these systems are rarely complete in scope; somehow, pockets of unmanaged reference data persist.

Several vendors offer master data management applications. Typically, these applications are specialized for customer data integration (CDI), product data integration (PDI) or some other master data subject area, such as other parties, locations, and financial structures. Some also manage hierarchical relationships in business intelligence environments. Other vendors promote use of their data integration software products and implementation services to create custom master data management solutions for the organization.

8.2.7 Define and Maintain Match Rules

One of the greatest on-going challenges in master data management is the matching, merging, and linking of data from multiple systems about the same person, group, place, or thing. Matching is particularly challenging for data about people. Different identifiers in different systems relate to individuals (and organizations to a lesser extent), sometimes for different roles and sometimes for the same role. Matching attempts to remove redundancy, improve data quality, and provide more comprehensive information.

Perform data matching by applying inference rules. Data cleansing tools and MDM applications often include matching inference engines used to match data. These tools are dependent on clearly defined matching rules, including the acceptability of matches at different confidence levels.

Some matches occur with great confidence, based on exact data matches across multiple fields. Other matches are suggested with less confidence due to conflicting values. For example:

  • If two records share the same last name, first name, birth date, and social security number, but the street address differs, is it safe to assume they are about the same person who has changed their mailing address?
  • If two records share the same social security number, street address, and first name, but the last name differs, is it safe to assume they are about the same person who has changed their last name? Would the likelihood be increased or decreased based on gender and age?
  • How do these examples change if the social security number is unknown for one record? What other identifiers are useful to determine the likelihood of a match? How much confidence is required for the organization to assert a match?
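
A minimal sketch of how questions like these become confidence-based match rules: each agreeing field contributes a weight, and the total score determines the action. The weights and thresholds are illustrative assumptions; matching engines typically derive such parameters statistically.

    WEIGHTS = {"ssn": 0.40, "last_name": 0.20, "first_name": 0.15,
               "birth_date": 0.15, "street_address": 0.10}

    def match_score(a, b):
        """Sum the weights of fields where both records agree on a non-empty value."""
        return sum(weight for f, weight in WEIGHTS.items()
                   if a.get(f) and b.get(f) and a[f] == b[f])

    def decide(score):
        if score >= 0.85:
            return "match"           # high confidence
        if score >= 0.60:
            return "steward review"  # suggested match, lower confidence
        return "no match"

    r1 = {"ssn": "123-45-6789", "last_name": "Smith", "first_name": "John",
          "birth_date": "1970-02-01", "street_address": "123 Main St."}
    r2 = dict(r1, street_address="45 Oak Ave.")  # same person with a new address?
    print(decide(match_score(r1, r2)))           # match (score is about 0.90)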

Despite the best efforts, match decisions sometimes prove to be incorrect. It is essential to maintain the history of matches so that matches can be undone when discovered to be incorrect. Match rate metrics enable organizations to monitor the impact and effectiveness of their matching inference rules.

Establish match rules for three primary scenarios with their different associated workflows:

  • Duplicate identification match rules focus on a specific set of fields that uniquely identify an entity and identify merge opportunities without taking automatic action. Business data stewards can review these occurrences and decide to take action on a case-by-case basis.
  • Match-merge rules match records and merge the data from these records into a single, unified, reconciled, and comprehensive record. If the rules apply across data sources, create a single unique and comprehensive record in each database. Minimally, use trusted data from one database to supplement data in other databases, replacing missing values or values thought to be inaccurate.
  • Match-link rules identify and cross-reference records that appear to relate to a master record without updating the content of the cross-referenced record. Match-link rules are easier to implement and much easier to reverse.

Match-merge rules are complex due to the need to identify so many possible circumstances, with different levels of confidence and trust placed on data values in different fields from different sources. The challenges with match-merge rules are 1) the operational complexity of reconciling the data, and 2) the cost of reversing the operation if there is a false merge.

Match-link, on the other hand, is a simple operation, as it acts on the cross-reference table and not the individual fields of the merged master data record, even though it may be more difficult to present comprehensive information from multiple records.

Periodically re-evaluate match-merge and match-link rules, because confidence levels change over time. Many data matching engines provide statistical correlations of data values to help establish confidence levels.

Assign Global IDs to link and reconcile matched records about the same person from different data sources. Generate Global IDs in only one authorized system, so that each value is unique, and then assign them to records across systems as a cross-reference, linking data that carries different identifiers but is thought to be about the same person.
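
A minimal sketch of a Global ID registry: one authorized generator mints each ID, and matched records from different source systems are cross-referenced to the same value. The class name, system names, and identifiers are illustrative.

    import itertools

    class GlobalIdRegistry:
        def __init__(self):
            self._generator = itertools.count(1)  # the one authorized ID source
            self._xref = {}                       # (source_system, local_id) -> global_id

        def link(self, source_system, local_id, global_id=None):
            """Cross-reference a source record, minting a new Global ID if none is given."""
            gid = global_id if global_id is not None else next(self._generator)
            self._xref[(source_system, local_id)] = gid
            return gid

        def resolve(self, source_system, local_id):
            return self._xref[(source_system, local_id)]

    registry = GlobalIdRegistry()
    gid = registry.link("CRM", "C-00017")            # first sighting mints Global ID 1
    registry.link("Billing", "8842", global_id=gid)  # matched record shares the same ID
    print(registry.resolve("CRM", "C-00017") == registry.resolve("Billing", "8842"))  # True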

8.2.8 Establish Golden Records

The techniques used to establish the most accurate and complete reference data are different from the techniques used to provide the most accurate and complete master data. Because reference data sets are value domains with distinct values, manage each reference data set as a controlled vocabulary. Establishing golden master data values requires more inference, application of matching rules, and review of the results.

8.2.8.1 Vocabulary Management and Reference Data

A vocabulary is a collection of terms / concepts and their relationships. Describe the terms / concepts at many levels of detail. The relationships may or may not be strictly hierarchical. Business data stewards maintain vocabularies and their associated reference data sets (codes, labels, meanings, associations). Vocabulary management is defining, sourcing, importing, and maintaining a vocabulary and its associated reference data.

ANSI / NISO Z39.19-2005, which provides the Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies, describes vocabulary management as a way “to improve the effectiveness of information storage and retrieval systems, Web navigation systems, and other environments that seek to both identify and locate desired content via some sort of description using language. The primary purpose of vocabulary control is to achieve consistency in the description of content objects and to facilitate retrieval.”

Some of the key questions to ask to enable vocabulary management are:

  • What information concepts (data attributes) will this vocabulary support?
  • Who is the audience for this vocabulary? What processes do they support, and what roles do they play?
  • Why is the vocabulary needed? Will it support applications, content management, analytics, and so on?
  • Who identifies and approves the preferred vocabulary and vocabulary terms?
  • What are the current vocabularies different groups use to classify this information? Where are they located? How were they created? Who are their subject matter experts? Are there any security or privacy concerns for any of them?
  • Are there existing standards that can be leveraged to fulfill this need? Are there concerns about using an external standard vs. internal? How frequently is the standard updated and what is the degree of change of each update? Are standards accessible in an easy to import / maintain format in a cost efficient manner?

Understanding the answers to these questions will enable more effective data integration.

The most significant activity in vocabulary management is the identification of the standard list of preferred terms and their synonyms (equivalent terms). Data profiling can help assess term values and frequencies in order to assess potential risk and complexity in vocabulary management.

Vocabulary management requires data governance, enabling data stewards to assess stakeholder needs, and the impacts of proposed changes, before making collaborative and formally approved decisions.

8.2.8.2 Defining Golden Master Data Values

Golden data values are the data values thought to be the most accurate, current, and relevant for shared, consistent use across applications. Organizations determine golden values by analyzing data quality, applying data quality rules and matching rules, and incorporating data quality controls into the applications that acquire, create, and update data.

Applications can enforce data quality rules, including:

  • Incorporating simple edit checks against reference data and key business rules.
  • Ensuring that new records being entered, such as addresses, do not already exist in the system, by applying data standardization and search-before-create automation (a sketch follows this list).
  • Creating prompts for the user if data does not meet accuracy expectations (for example, “this address does not exist”), while providing a way to submit exceptions that can be audited in the future.
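
A minimal sketch of the search-before-create automation mentioned above: an entered address is normalized, compared against existing records, and flagged as a duplicate rather than silently inserted. The normalization here is deliberately simple and illustrative.

    existing_addresses = set()

    def normalize(address):
        """Collapse case and whitespace so trivially different entries compare equal."""
        return " ".join(address.upper().split())

    def create_address(raw):
        clean = normalize(raw)
        if clean in existing_addresses:
            # A real application would prompt the user here instead of raising
            raise ValueError(f"Address already exists: {clean}")
        existing_addresses.add(clean)
        return clean

    print(create_address("123 Main St., Dataland, SQ 98765"))
    try:
        create_address("123  main st.,  Dataland, SQ 98765")  # caught as a duplicate
    except ValueError as err:
        print(err)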

Establish data quality measurements to set expectations, measure improvements, and help identify root causes of data quality problems. Assess data quality through a combination of data profiling activities and verification against adherence to business rules.

Term and abbreviation standardization is a type of data cleansing activity that ensures certain terms and short forms of those terms appear consistently in the standardized data set, as shown in Table 8.5. Data cleansing tools typically provide address standardization dictionaries that translate different words and abbreviations to a standard word or abbreviation. For example, “St”, “Str”, and “Street” may all map to “St.”. Sometimes the same abbreviation is used for more than one term; for example, “Saint” may also be abbreviated as “St.”, making any automatic reverse translation from abbreviation to full word extremely difficult. Many other names may need standardization, such as organization names (U., Univ, University, and so on) and product names. All data consumers should have ready access to the definitions for standard abbreviations.

Source ID | Name | Address | Telephone
123 | John Smith | 123 Main, Dataland, SQ 98765 |
234 | J. Smith | 123 Main, Dataland, SQ | 2345678900

Source Data

Source ID | Name | Address | Telephone
123 | John Smith | 123 Main St., Dataland, SQ 98765 |
234 | J. Smith | 123 Main St., Dataland, SQ 98765 | +1 234 567 9800

Cleansed / Standardized Data

Table 8.5 Data Standardization Example
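
A minimal sketch of dictionary-based term and abbreviation standardization of the kind that produced the cleansed addresses in Table 8.5. The mapping is a tiny illustrative fragment; data cleansing tools ship far richer dictionaries.

    STANDARD_FORMS = {"street": "St.", "str": "St.", "st": "St.",
                      "avenue": "Ave.", "ave": "Ave.", "av": "Ave."}

    def standardize(address):
        """Replace each word that has a standard form; leave other words as-is."""
        words = []
        for word in address.split():
            key = word.lower().rstrip(".")
            words.append(STANDARD_FORMS.get(key, word))
        return " ".join(words)

    print(standardize("123 Main Str Dataland SQ 98765"))
    # -> 123 Main St. Dataland SQ 98765

Because “St.” can stand for both “Street” and “Saint”, the dictionary maps toward the abbreviation; as noted above, the reverse translation is ambiguous and much harder to automate.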

Exposing a single set of data quality rules in the integration environment (ETL, web services, and so on) allows every data source to leverage the same validation and standardization rules.

Once the data is standardized and cleansed, the next step is to attempt reconciliation of redundant data through application of matching rules.

8.2.9 Define and Maintain Hierarchies and Affiliations

Vocabularies and their associated reference data sets are often more than lists of preferred terms and their synonyms. They may also include hierarchical relationships between the terms. These relationships may be general-to-specific classifications (“is a kind of” relationships) or whole-part assemblies (“is a part of” relationships). There may also be non-hierarchical relationships between terms that are worth identifying.

Affiliation management is the establishment and maintenance of relationships between master data records. Examples include ownership affiliations (such as Company X is a subsidiary of Company Y, a parent-child relationship) or other associations (such as Person XYZ works at Company X). Managing hierarchies specifically within a business intelligence environment is sometimes called dimension hierarchy management.

8.2.10 Plan and Implement Integration of New Data Sources

Integrating new reference data sources involves (among other tasks):

  • Receiving and responding to new data acquisition requests from different groups.
  • Performing data quality assessment services using data cleansing and data profiling tools.
  • Assessing data integration complexity and cost.
  • Piloting the acquisition of data and its impact on match rules.
  • Determining who will be responsible for data quality.
  • Finalizing data quality metrics.

8.2.11 Replicate and Distribute Reference and Master Data

Reference and master data may be read directly from a database of record, or may be replicated from the database of record to other application databases for transaction processing, and data warehouses for business intelligence. By replicating the data, the application database can more easily ensure referential integrity. In other words, the database can ensure that only valid reference data codes and master data identifiers are used as foreign key values in other tables, providing the context for related data. Data integration procedures must ensure timely replication and distribution of reference and master data to these application databases.

Reference data most commonly appears as pick list values in applications. Reference data values also commonly appear as search criteria in content management engines. Reference data values found in unstructured documents are often indexed to enable quick searches.

8.2.12 Manage Changes to Reference and Master Data

In a managed master data environment, specific individuals have the role of a business data steward. They have the authority to create, update, and retire reference data values, and to a lesser extent, in some circumstances, master data values. Business data stewards work with data professionals to ensure the highest quality reference and master data. Many organizations define more specific roles and responsibilities, with individuals often performing more than one role.

Reference data sets change slowly. Formally control changes to controlled vocabularies and their reference data sets by following the basic change request process:

  1. Create and receive a change request.
  2. Identify the related stakeholders and understand their interests.
  3. Identify and evaluate the impacts of the proposed change.
  4. Decide to accept or reject the change, or recommend a decision to management or governance.
  5. Review and approve or deny the recommendation, if needed.
  6. Communicate the decision to stakeholders prior to making the change.
  7. Update the data.
  8. Inform stakeholders the change has been made.
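
A minimal sketch of this change request process as a guarded state transition table, so that a request cannot skip impact assessment, review, or stakeholder communication. The state names are illustrative condensations of the eight steps.

    ALLOWED = {
        "received":              {"impact_assessed"},
        "impact_assessed":       {"accepted", "rejected", "escalated"},
        "escalated":             {"accepted", "rejected"},  # management / governance review
        "accepted":              {"decision_communicated"},
        "decision_communicated": {"data_updated"},
        "data_updated":          {"stakeholders_informed"},
    }

    def advance(state, next_state):
        if next_state not in ALLOWED.get(state, set()):
            raise ValueError(f"Cannot move from {state!r} to {next_state!r}")
        return next_state

    state = "received"
    for step in ("impact_assessed", "accepted", "decision_communicated",
                 "data_updated", "stakeholders_informed"):
        state = advance(state, step)
    print(state)  # stakeholders_informed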

Changes to internal or external reference data sets may be minor or major. For example, country code lists go through minor revisions as geopolitical space changes. When the Soviet Union broke into many independent states, the term for Soviet Union was deprecated with an end of life date, and new terms added for new countries. On the other hand, the ICD-9 Diagnostic Codes in use for many years are being superseded by a new set of ICD-10 Diagnostic Codes with substantially different data. Manage a major change like this as a small project, identifying stakeholders and system impacts, such as applications, integration, reporting, and so on.

Of course, any changes to reference data that has been replicated elsewhere must also be applied to the replicated copies.

Sometimes terms and codes are retired. Retired codes still appear in the context of historical transaction data, so referential integrity prevents them from simply disappearing, and the codes found in a data warehouse continue to represent historical truth. Code tables, therefore, require effective date and expiration date columns, and application logic must refer to the currently valid codes when establishing new foreign key relationships.

Sometimes codes are added to code tables prior to their effectiveness. For example, new codes that become effective January 1st may be added to their production code table in December, but not used by the application until the New Year.
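
A minimal sketch of a code table with effective and expiration dates, as described above: retired codes remain available for historical transactions, while new foreign key relationships are restricted to the currently valid codes. The dates are illustrative.

    from datetime import date

    # (code, description, effective_date, expiration_date); None = still current
    COUNTRY_CODES = [
        ("SU", "Soviet Union",       date(1974, 1, 1),  date(1992, 8, 30)),
        ("RU", "Russian Federation", date(1992, 8, 31), None),
    ]

    def valid_codes(as_of):
        """Codes that new records may reference as of the given date."""
        return {code for code, _, eff, exp in COUNTRY_CODES
                if eff <= as_of and (exp is None or as_of <= exp)}

    print(valid_codes(date(1990, 1, 1)))  # {'SU'}: historical truth is preserved
    print(valid_codes(date(2009, 1, 1)))  # {'RU'}: only currently valid codes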

By relating new codes to old codes, a data warehouse can depict not only how data aggregated historically, but also how the past might be restated according to today’s coding structures.

Carefully assess the impact of reference data changes. If the term is being retired, approach all consumers of this data to mitigate the impact of such a retirement. Changes to relationships may affect existing integration and data aggregation rules. Changes to reference meta-data (business definitions, data sources, business data steward assignments, and so on.) should also be controlled, and in some cases, reviewed for approval, depending on the impact.

The key to successful master data management is management support for relinquishing local control of shared data. To sustain this support, provide channels to receive and respond to requests for changes to reference and master data. These same channels should also receive and respond to other kinds of requests, including:

  • New data source requests which ask to bring new information into the managed data environment.
  • Data content research requests, raised when an information consumer disagrees about the quality of the data. To respond to these requests, business data stewards and data professionals need to determine where and from whom the information came, and then follow up with corrective action or clarification in a timely manner.
  • Data specification change requests for change of business definitions or data structures. Such changes can have a cascading impact in application and business intelligence environments. Data architects, application architects, and business data stewards must review these requests, and the Data Governance Council may need to decide on a disposition of the request.

8.3 Summary

Guiding principles for implementing reference and master data management in an organization, a summary table of the roles involved in each reference and master data management activity, and organizational and cultural issues that may arise are summarized below.

8.3.1 Guiding Principles

The implementation of the reference and master data management function into an organization follows six guiding principles:

  1. Shared reference and master data belongs to the organization, not to a particular application or department.
  2. Reference and master data management is an on-going data quality improvement program; its goals cannot be achieved by one project alone.
  3. Business data stewards are the authorities accountable for controlling reference data values. Business data stewards work with data professionals to improve the quality of reference and master data.
  4. Golden data values represent the organization’s best efforts at determining the most accurate, current, and relevant data values for contextual use. New data may prove earlier assumptions to be false. Therefore, apply matching rules with caution, and ensure that any changes that are made are reversible.
  5. Replicate master data values only from the database of record.
  6. Request, communicate, and, in some cases, approve changes to reference data values before implementation.

8.3.2 Process Summary

The process summary for the reference and master data management function is shown in Table 8.6. The deliverables, responsible roles, approving roles, and contributing roles are shown for each activity in the reference and master data management function. This table is also shown in Appendix A9.

Activities | Deliverables | Responsible Roles | Approving Roles | Contributing Roles
6.1 Understand Reference Data Integration Needs (P) | Reference and Master Data Requirements | Business Analysts | Stakeholders, Data Governance Council | Business Data Stewards, Subject Matter Experts
6.2 Identify Reference Data Sources and Contributors (P) | Description and Assessment of Sources and Contributors | Data Architects, Data Stewards | Data Governance Council | Data Analysts, Subject Matter Experts
6.3 Define and Maintain the Data Integration Architecture (P) | Reference and Master Data Integration Architecture and Roadmap | Data Architects | Data Governance Council | Application Architects, Data Stewards
6.3 (continued) | Data Integration Services Design Specifications | Data Architects, Application Architects | IT Management | Other IT Professionals, Stakeholders
6.4 Implement Reference and Master Data Management Solutions (D) | Reference Data Management Applications and Databases; Master Data Management Applications and Databases | Application Architects, Data Architects | Data Governance Council | Other IT Professionals
6.4 (continued) | Data Quality Services | Application Architects, Data Architects | Data Governance Council | Data Analysts, Other IT Professionals
6.4 (continued) | Data Replication and Access Services for Applications; Data Replication Services for Data Warehousing | Data Architects, Application Architects, Integration Developers | Data Governance Council | Data Analysts, Other IT Professionals
6.5 Define and Maintain Match Rules (P) | Record Matching Rules (Functional Specifications) | Business Analysts, Data Architects, Business Data Stewards | Data Governance Council | Application Architects, Subject Matter Experts
6.6 Establish Golden Records (C) | Reliable Reference and Master Data | Data Stewards | Stakeholders | Data Analysts, Data Architects, Subject Matter Experts, Other IT Professionals
6.6 (continued) | Cross-Reference Data | Data Stewards | Stakeholders | Data Analysts, Subject Matter Experts
6.6 (continued) | Data Lineage Reports | Data Architects | Data Stewards | Data Analysts
6.6 (continued) | Data Quality Reports | Data Analysts | Data Stewards, Stakeholders | Data Architects
6.7 Define and Maintain Hierarchies and Affiliations (C) | Defined Hierarchies and Affiliations | Data Stewards | Stakeholders | Data Analysts, Data Providers
6.8 Plan and Implement Integration of New Sources (D) | Data Source Quality and Integration Assessments | Data Analysts, Data Architects, Application Architects | Data Stewards, IT Management | Data Providers, Subject Matter Experts
6.8 (continued) | Integrated New Data Source | Data Architects, Application Architects | Data Stewards, Stakeholders | Data Analysts, Other IT Professionals
6.9 Replicate and Distribute Reference and Master Data (O) | Replicated Data | Data Architects, Application Architects | Data Stewards, Stakeholders | Data Analysts, Other IT Professionals
6.10 Manage Changes to Reference and Master Data (C) | Change Request Procedures | Data Architects | Data Governance Council, Data Stewards | Other IT Professionals, Stakeholders
6.10 (continued) | Change Requests and Responses | Data Stewards | Data Governance Council | Stakeholders, Data Analysts, Data Architects, Application Architects
6.10 (continued) | Change Request Metrics | Data Architects | Data Stewards, Data Governance Council | Data Analysts, Other IT Professionals

Table 8.6 Reference and Master Data Management Process Summary

8.3.3 Organizational and Cultural Considerations

Q1: What is the primary focus for Master Data Management?

A1: Effective MDM solutions require a continuing focus on people. Different stakeholders have different needs, different expectations, different attitudes, and different assumptions about the data and the importance of improving data quality. Data professionals need to be exceptionally good listeners, noting both the explicit and implicit messages communicated by stakeholders. Data professionals also need to be great negotiators, forging small agreements that bring people together toward a deeper, shared understanding of enterprise needs and issues. Throughout this process, data professionals must respect local perspectives and needs, not minimize them.

Q2: Do procedures and practices need to be changed in order to improve the quality of reference and master data?

A2: Improving the quality of reference and master data will undoubtedly require changes to procedures and traditional practices. Every organization is unique, and there are few if any approaches that will work well everywhere. Solutions should be scoped and implemented based on both current organizational readiness and the evolutionary needs of the future.

Q3: What is the most challenging aspect of implementing reference and master data management?

A3: Perhaps the most challenging cultural change is determining which individuals are accountable for which decisions (business data stewards, architects, managers, and executives) and which decisions data stewardship teams, program steering committees, and the Data Governance Council should make collaboratively. Data governance involves stakeholders in making and supporting the decisions that affect them. Without effective data governance and data stewardship, MDM solutions become just another data integration utility within the IT organization, unable to deliver their full potential or meet the organization’s expectations.

8.4 Recommended Reading

The references listed below provide additional reading that supports the material presented in Chapter 8. These recommended readings are also included in the Bibliography at the end of the Guide.

Bean, James. XML for Data Architects: Designing for Reuse and Integration. Morgan Kaufmann, 2003. ISBN 1-558-60907-5. 250 pages.

Berson, Alex and Larry Dubov. Master Data Management and Customer Data Integration for a Global Enterprise. McGraw-Hill, 2007. ISBN 0-072-26349-0. 400 pages.

Brackett, Michael. Data Sharing Using A Common Data Architecture. John Wiley & Sons, 1994. ISBN 0-471-30993-1. 478 pages.

Chisholm, Malcolm. Managing Reference Data in Enterprise Databases: Binding Corporate Data to the Wider World. Morgan Kaufmann, 2000. ISBN 1-558-60697-1. 389 pages.

Dreibelbis, Allen, Eberhard Hechler, Ivan Milman, Martin Oberhofer, Paul van Run, and Dan Wolfson. Enterprise Master Data Management: An SOA Approach to Managing Core Information. IBM Press, 2008. ISBN 978-0-13-236625-0. 617 pages.

Dyche, Jill and Evan Levy. Customer Data Integration: Reaching a Single Version of the Truth. John Wiley & Sons, 2006. ISBN 0-471-91697-8. 320 pages.

Finkelstein, Clive. Enterprise Architecture for Integration: Rapid Delivery Methods and Techniques. Artech House Mobile Communications Library, 2006. ISBN 1-580-53713-8. 546 pages.

Loshin, David. Master Data Management. Morgan Kaufmann, 2008. ISBN 978-0-12-374225-4. 274 pages.

Loshin, David. Enterprise Knowledge Management: The Data Quality Approach. Morgan Kaufmann, 2001. ISBN 0-124-55840-2. 494 pages.

National Information Standards Organization (NISO). ANSI/NISO Z39.19-2005: Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies. 2005. 172 pages. www.niso.org.
