Reference data management
Reference data refers to data that is used to categorize other data within enterprise applications and databases. Reference data includes the lookup table and code table data that is found in virtually every enterprise application: data such as country codes, currency codes, and industry codes.
Reference data is distinct from transactional data and master data. Transactional data is the data that is produced by transactions within applications; master data is the data that represents the key business entities that participate in those transactions. Reference data is also distinct from metadata, which describes the structure of an entity. Together, transactional data, master data, and reference data comprise the key business data within an enterprise. Reference data has been a part of enterprise applications since the beginning of the modern computing era. Despite this longevity, and despite the fact that it constitutes a fundamental class of enterprise data, relatively little attention has been paid to reference data and its importance as an enterprise data asset.
Most enterprise applications contain reference data, built into code tables, to classify and categorize product information, customer information, and transaction data. Reference data changes relatively infrequently, but it does change over time, and given its ubiquity, synchronizing reference data values and managing changes across the enterprise is a major challenge.
Ad hoc management of reference data without a formal governance policy can create significant operational risk. For many enterprises, reference data is a major contributor to enterprise data quality problems and carries a high support cost. The demands of complying with national and international industry regulations are causing industry to rethink reference data management and compelling enterprises to manage and control their reference data by using sound data governance principles.
This chapter provides a definition of reference data, describes the problems that are associated with managing reference data, and provides an introduction and functional overview of the InfoSphere Master Data Management Reference Data Management Hub (InfoSphere MDM Ref DM Hub). The InfoSphere MDM Ref DM Hub is designed specifically to support centralized management and governance of enterprise reference data.
1.1 What is reference data
Reference data is, simply defined, data that is used to categorize other data within enterprise applications and databases. The database tables that store reference data within enterprise applications are usually referred to as lookup, code, check, or domain tables. Reference data is typically defined with a code and a description, and has a set of domain values, that is, a list of allowed values. Reference data is read-only: it is used by transactions but not changed or modified by those transactions.
Reference data can take the form of a flat list, for example, the list of US states, or can have a hierarchical structure over the code values, for example, a geographic hierarchy that includes country, state, and city. Reference data is used to classify and categorize transaction and master data.
Reference data is widely used within enterprise applications. Typically, both transaction data and master data have many types of reference data associated with them. Business application users see reference data populating the drop-down menus and selection lists within software application user interfaces. These selection lists constrain the user’s choice to one item from a list of allowed values, thus speeding data entry and reducing errors.
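The basic shape of a reference data set can be sketched in a few lines of code: each entry pairs a code with a description, and the set as a whole defines the list of allowed values against which transaction input is validated. This is an illustrative sketch only; real applications store such sets in lookup or code tables.

```python
# A minimal illustration of a reference data set: each entry has a code
# and a description, and the set defines the complete domain of allowed
# values. (Sketch only; enterprise applications keep this in code tables.)

CURRENCY_CODES = {
    "USD": "US dollar",
    "EUR": "Euro",
    "GBP": "Pound sterling",
}

def validate_currency(code: str) -> str:
    """Reject any value outside the allowed domain, much as a
    drop-down selection list in a user interface would."""
    if code not in CURRENCY_CODES:
        raise ValueError(f"Unknown currency code: {code}")
    return code

validate_currency("EUR")   # accepted
# validate_currency("XYZ")  would raise ValueError
```

Note that transactions only read the set; nothing in the transaction path adds or modifies entries, which is the read-only characteristic described above.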
Reference data can range from the general to the specific to an industry, company, department, or even application, as in the following examples:
ISO 3166-1 Country Code has applications across many industries.
ICD-10, the international standard for classification of diseases, is related to healthcare.
Fictional IBM Redbooks Company employee expense codes might be specific to an enterprise or application.
Many reference data standards are established to support interoperability between applications and organizations, for general commerce, and to support statistical analysis of data across organizations. A wide range of organizations maintain various reference data standards and publish standard code sets and classifications, and the updates to them, in a range of formats.
The standard code tables that are published by the European Union for statistical reporting and hosted on the RAMON Metadata Server illustrate the range of common types of reference data that are used for government and statistical reporting in Europe.
Figure 1-1 shows an example of the standard code lists that are hosted on Eurostat Metadata Server, RAMON.
Figure 1-1 Standard code lists (partial) hosted on Eurostat Metadata Server - RAMON
The standard code lists span a range of reference data domains, from the general, such as SCL-Languages, SCL-Currency, and SCL-Days of the week, to the industry-specific, such as SCL-Loading status [transport-specific] and ICD-10 2007 [health-specific]. Some of these lists are flat; others have hierarchical relationships between the codes. For example, SCL-Classification of Fields of Education and Training (1999) is hierarchical.
Standard classifications and nomenclatures such as NACE, SIC, NAICS, and many others are also listed on RAMON and are also reference data under the definition that was provided previously: reference data is data used to categorize other data within enterprise applications and databases.
A general characteristic of reference data is that it changes slowly relative to master data or transaction data. The list of countries in the ISO 3166-1 country code list does not change often. Changes to larger, more complex sets, such as ICD-10-CM, are published on an annual basis so that they can be easily consumed. The relatively static nature of reference data is one of the reasons that formal governance over this class of data is so often neglected within the enterprise.
A set of common-sense questions can be used to determine whether something should be treated as reference data, from a data governance perspective:
Does it categorize other data?
Is there a well-defined list of allowed values?
Is it unchanged by the transactions that use it?
Is it relatively static and slow changing?
Is there a requirement to manage a lot of additional properties along with the value?
For reference data, the answer to the first four bullets is yes; for the fifth bullet, the answer is typically no. Although data that does not meet these criteria might also benefit from centralized governance and stewardship, such data usually has governance concerns and characteristics that differ from those of reference data, and usually requires separate processes, policies, and tooling for its governance.
In fact, the answers to these questions are the key characteristics that differentiate reference data from master data under the usual definition of the terms.
Figure 1-2 shows the key differences between master data and reference data.
Figure 1-2 The key differences between master data and reference data
Reference variables, such as tax rates and daily currency exchange rates, are one type of data that falls outside our definition of reference data but is used across the organization. Reference variables do not categorize other data and do not have a predefined list of allowed values. However, similar to reference data, they are used by transactional applications in a read-only fashion, are typically authored and defined externally to the enterprise, and might be used across many enterprise applications. Managing reference variables and reference data presents similar problems, so when designing a governance program for reference data, it usually makes sense to include reference variables as a type of reference data. Keep in mind, however, that reference variables can change more frequently than reference data; currency exchange rates, for example, can change daily or more often.
1.2 Structure in reference data: Hierarchies and relationships
Reference data often has a hierarchical structure that is defined over the reference data values. You can create a hierarchy by defining relationships over the values of an individual set, or by defining relationships between the values in different reference data sets. The topology of these hierarchies has different implications for the data stewardship processes.
1.2.1 Tree hierarchies
A tree hierarchy is a hierarchical structure over some or all reference data values within a set. A tree hierarchy has a parent-child relationship between the values, where each value is a node within the hierarchy. For example, within the NACE industry codes, a hierarchy structure is implicit within the code values themselves, representing the subcategorization of industry activities, as shown in Figure 1-3.
Figure 1-3 NACE Rev 2 hierarchy (partial)
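A tree hierarchy of this kind can be sketched as a single set in which each value carries a reference to its parent value. The codes and descriptions below are abbreviated NACE Rev 2 examples used purely for illustration.

```python
# A sketch of a tree hierarchy over a single reference data set: each
# value records its parent value in the same set. Codes and descriptions
# are abbreviated NACE Rev 2 examples, shown for illustration only.

NACE = {
    # code: (description, parent_code)
    "A":     ("Agriculture, forestry and fishing", None),
    "01":    ("Crop and animal production", "A"),
    "01.1":  ("Growing of non-perennial crops", "01"),
    "01.11": ("Growing of cereals", "01.1"),
}

def ancestors(code: str) -> list:
    """Walk the parent-child relationships up to the root of the tree."""
    path = []
    parent = NACE[code][1]
    while parent is not None:
        path.append(parent)
        parent = NACE[parent][1]
    return path

print(ancestors("01.11"))  # ['01.1', '01', 'A']
```

Because all of the nodes belong to one set, the entire tree is versioned and stewarded as a unit, which is the key operational difference from the level-based hierarchies described next.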
1.2.2 Level-based hierarchies
With InfoSphere MDM Ref DM Hub, you can create hierarchies by defining relationships across various reference sets. Each level of the hierarchy is represented by a separately managed reference data set.
These hierarchies are referred to as level-based hierarchies. Level-based hierarchies are similar to tree hierarchies except that each level within the hierarchy is managed and defined as a reference data table or set in its own right. For example, Country → State is a geographic hierarchy where Country and State are usually managed as independent reference data sets.
Figure 1-4 shows the independent set structure of a level-based hierarchy.
Figure 1-4 Independent sets structure of a level-based hierarchy
From a structural perspective, tree and level-based hierarchies are equivalent. However, from a management and governance perspective, the two types of hierarchies have slightly different governance considerations. With a level-based hierarchy, changes can be made independently to the sets at each level because each set goes through its own change lifecycle; there might even be separate stewardship teams managing the sets at each level. The stewardship process for level-based hierarchies must therefore take into account how changes to the underlying sets at each level are reflected in the hierarchies that are built on those sets.
A geographic hierarchy can be represented within the InfoSphere MDM Ref DM Hub application as either a tree hierarchy or a level-based hierarchy; the distinction between them is how the relationships are defined between the values and sets. A hybrid hierarchy structure can also be created, where the hierarchy is defined by using parent-child relationships within the values of one set and also between the values of separate sets, as in a level-based hierarchy. For example, a client might want to manage the nodes of the first three levels of a five-level hierarchy as a single set and use individual sets for the fourth and fifth levels.
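The level-based structure can be sketched with two independently managed sets joined by cross-set relationships. The country and state values below are a hypothetical fragment for illustration; in practice each set would have its own change lifecycle and possibly its own stewardship team.

```python
# A sketch of a level-based hierarchy: Country and State are independent
# reference data sets, and the hierarchy itself is defined by
# relationships between values in the two sets. (Hypothetical fragment.)

COUNTRY = {"US": "United States", "CA": "Canada"}
STATE   = {"NY": "New York", "TX": "Texas", "ON": "Ontario"}

# Cross-set relationships: state value -> parent country value
STATE_TO_COUNTRY = {"NY": "US", "TX": "US", "ON": "CA"}

def country_of(state_code: str) -> str:
    """Resolve a state to its parent country through the cross-set map."""
    return COUNTRY[STATE_TO_COUNTRY[state_code]]
```

A change to the Country set (for example, a renamed country) does not require touching the State set, but the stewardship process must check whether the cross-set relationships remain valid, which is exactly the coordination concern noted above.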
1.2.3 Poly hierarchies
In some cases, a requirement might exist to create more complex relationships between reference sets and the values within those sets, for example, to represent an ontology. Within IBM, the IBM Office of the Chief Information Officer (CIO) uses the IBM Reference Data Management Hub to manage taxonomies of standard terms used within IBM internal and external-facing applications. There is a business need to define relationships between the individual reference sets and the values within those sets. The relationships are not always parent-child relationships; rather, they define a complex network, or poly-hierarchy, structure. The InfoSphere MDM Ref DM Hub mapping capability supports creating these more complex cross-set relationships.
1.3 Challenges of managing reference data
Reference data is relatively static, and managing the reference data in a single table over time might not seem like a lot of work. This section outlines several reasons why managing reference data across all applications in the enterprise, and coordinating changes and mappings across those applications, is a major challenge.
Multiplicity of code tables and code table variations
Many of the same types of reference data are used across industries and applications, but the reference data is commonly defined in the limited context of a particular application. The result is that, within an enterprise, there are many different code set variations describing the same domain.
The differences between reference data sets across various systems can include semantic differences, coding scheme differences, format differences, and value differences. Even within well-established standards, there might be different representations for the same values. For example, the ISO 3166-1 standard defines three sets of country codes:
ISO 3166-1 alpha-2: two-letter country codes
ISO 3166-1 alpha-3: three-letter country codes, which allow a better visual association between the codes and country names
ISO 3166-1 numeric: three-digit country codes, which are identical to those developed and maintained by the United Nations Statistics Division
In addition to ISO 3166-1, many other country code lists are defined and used by other international organizations. Some differ completely from the ISO standard, both in how the values are coded and in the list of country values contained in the set.
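Even staying within the single ISO 3166-1 standard, the same country value has three codings. The sketch below shows a handful of entries and how translating between representations is a straightforward lookup; the sample values are taken from the published standard.

```python
# The three ISO 3166-1 representations for a few countries, illustrating
# how one standard defines several codings of the same values.

ISO_3166_1 = [
    # (alpha-2, alpha-3, numeric, name)
    ("DE", "DEU", "276", "Germany"),
    ("FR", "FRA", "250", "France"),
    ("US", "USA", "840", "United States of America"),
]

# Translating between representations is a simple lookup table:
ALPHA2_TO_NUMERIC = {a2: num for a2, _a3, num, _name in ISO_3166_1}

print(ALPHA2_TO_NUMERIC["FR"])  # 250
```

The real difficulty begins when the lists themselves differ, as with the non-ISO country code lists mentioned above: then there is no guarantee that a lookup table covers every source value.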
Another example of variability within a standard code set is demonstrated by NACE codes. The Statistical Classification of Economic Activities in the European Community, commonly referred to as NACE (an acronym derived from its French name), is a European industry standard classification system consisting of a six-digit code. It is similar in function to the Standard Industry Classification (SIC) and North American Industry Classification System (NAICS) codes that are used in the US. The codes enable standard statistical reporting across industries, and use of the standard is required in regulatory reporting. The first four digits of the NACE code, which represent the first four levels of the classification system, are the same in all European countries. The fifth digit might vary from country to country, and further digits are sometimes added by suppliers of databases. Enterprises with operations in multiple countries often have to manage the different variations of the codes by creating reports within each country using the country-specific standard version and then rolling up results to a single corporate version of NACE for corporate-level reporting. In addition, various versions of the NACE standard exist, and older data within a data warehouse might be classified according to earlier schemes:
NACE Revision 1, the first revision of the original NACE (1970)
NACE Revision 1.1, a minor revision of NACE Rev. 1
NACE Revision 2, adopted at the end of 2006
Similarly in the US, NAICS is a different but analogous industry classification scheme that superseded the earlier SIC standard in 1997. However, some types of business are still required to provide regulatory reports that use the older SIC scheme, so both SIC and NAICS are actively used in the US. A global organization with operations in both the US and Europe might need to translate between NACE, NAICS, and SIC codes to meet various regulatory reporting requirements.
In summary, even for something as commonly used as country codes or industry classification codes, where well-established standards are available, the variability among the standards and versions of the standards means that an organization might need to handle and reconcile data across multiple versions and representations as a matter of course. Moreover, the mappings between codes in separate code-set variations are often not one-to-one. NACE and NAICS represent completely separate code schemes. Although the end purpose is similar, the industry categories within NACE do not correspond one-to-one with the industry codes in NAICS. The categorization structure differs, the coding structure differs, and the number of values within the code sets also differs. One NACE code might map to multiple NAICS categorizations and vice versa, making it difficult to determine how a piece of transaction data categorized in NACE should be categorized against NAICS.
To accurately map data across dissimilar code sets on a one-to-one basis, it might be necessary to use rules and additional information beyond the source code value itself to determine how to map the value.
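The rules-based mapping described above can be sketched as follows. The codes, the candidate targets, and the disambiguation rule are all hypothetical; the point is only the shape of the problem: one source code with several target candidates, resolved by an attribute of the transaction itself.

```python
# A sketch of a non one-to-one mapping between two industry code sets:
# one source code maps to several target candidates, and a rule uses an
# additional transaction attribute to pick one. The codes, candidates,
# and rule here are hypothetical, for illustration only.

SOURCE_TO_TARGETS = {
    "62.01": ["541511", "541512"],  # one source code, two candidates
}

def map_code(source_code: str, txn: dict) -> str:
    """Map a source code to a single target code, applying a
    disambiguation rule when the mapping is not one-to-one."""
    candidates = SOURCE_TO_TARGETS[source_code]
    if len(candidates) == 1:
        return candidates[0]
    # Hypothetical rule: use an extra transaction attribute to choose.
    if txn.get("service") == "programming":
        return candidates[0]
    return candidates[1]
```

The rule itself becomes another governed artifact: when either code set changes, both the candidate lists and the disambiguation logic must be reviewed.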
Just as reference data sets are published as standards, so the mapping tables that map between similar sets are often published as related standards by the standards organizations. For example, the EU RAMON server lists a large number of correspondence tables (that is, mapping tables) between the various versions of code lists and classifications that are used in European statistical reporting.
Although variations exist in and between industry standard code sets, the greatest source of code table variation within the enterprise is the enterprise applications where the code tables are defined and used. Reference data is typically created and managed in a siloed fashion within each individual IT application. In many cases, application developers implement code tables with purely the local processing needs of that application in mind. Thus, a hundred enterprise applications might have a hundred ways of representing a country code list, with various coding schemes and even various sets of country values within each.
Because each application typically has many code tables, the larger the enterprise, the greater the number of code table variations.
Maintaining mappings between reference data representations
If data and reference data were confined to individual applications, the problem of managing reference data within the enterprise would simply be the problem of how to manage the changes to reference data over time within each individual application. In reality, few applications work in isolation, and data must cross application boundaries. Data is consolidated in master data hubs and data warehouses. Data passes from application to application in cross-business processes. Data that is entered in a web application by a client might result in transactions and processes in multiple back-end applications. Data with coded reference data values is received from suppliers, business partners, and customers as part of business transactions. Wherever transaction and master data flows between applications, so must the related reference data. Code tables that are used to categorize transaction data might have different formats and content in the source and target applications of a data movement. When transaction data flows from a source application to a target application, the code values of the source must be mapped to the corresponding code table values of the target if the categorization information that is associated with the reference data is to be carried over correctly.
Typical use cases, where reference data mappings are key to the business, include data warehouse data load and master data hub data load.
Data warehouse load
Transaction and master data are consolidated and sent from back-end applications to the data warehouse. The data warehouse data is the source of business intelligence (BI) reports, financial reports, statutory reports, and so on. The data warehouse dimensions contain reference data, and the reference data that is related to incoming fact data supports statistical analysis of that fact data. A common approach is to run a daily batch load job by using standard extract, transform, and load (ETL) tools, which use mapping tables to map the reference data codes from the source application format to the corresponding data warehouse format, so that the transaction data from back-end applications can be categorized correctly. The accuracy and completeness of the mapping tables directly affects the accuracy with which fact data is mapped to dimensional data.
Without centralized management and coordination of reference data changes, the code table maps between the source application and data warehouse reference data representations are difficult to maintain on an ongoing basis and quickly become outdated. When reference data maps are incorrect or incomplete, the transaction data might not be posted correctly, or might not be posted at all. A common pattern for managing the case where a source reference code is not found in the map by an ETL job is to map the transaction data to a default “unknown” code for the particular dimension within the data warehouse. With this approach, the quality of the categorization within the data warehouse declines over time, because the percentage of data that is correctly mapped to the respective dimensions declines.
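The default-"unknown" ETL pattern can be sketched in a few lines. The map contents and the unknown-member code below are hypothetical; the point is that every source code absent from the map silently lands on the default dimension member.

```python
# A sketch of the ETL mapping pattern described above: source codes are
# translated through a mapping table, and any code missing from the map
# is posted against a default "unknown" member of the target dimension.
# The table contents and the UNKNOWN code are hypothetical.

SOURCE_TO_WAREHOUSE = {"USA": "US", "U.S.": "US", "DEU": "DE"}
UNKNOWN = "UNK"  # default dimension member for unmapped source codes

def map_to_dimension(source_code: str) -> str:
    """Map a source code to the warehouse dimension, defaulting to the
    'unknown' member when the mapping table has no entry."""
    return SOURCE_TO_WAREHOUSE.get(source_code, UNKNOWN)

# As source systems add codes that the map does not cover, the share of
# rows landing on UNKNOWN grows, which is the quality decline described
# in the text.
```

Monitoring the proportion of rows mapped to the unknown member is a simple, useful health metric for the mapping tables themselves.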
Master data hub load and transactions
Master data encompasses the key business entities that participate in business transactions and operations, including customers, products, and accounts. A master data entity, such as a customer, has reference data associated with it, and a typical master data management system has hundreds of code tables associated with the master data. A master data hub can be both a consolidation point and a distribution point for data relating to master data entities. Just as for the data warehouse, map tables allow reference data that is defined in external applications to be mapped to the code tables within the MDM hub, so that master data sourced from those external applications can be loaded correctly into the MDM application. In addition, there might be a need to distribute data from the master data hub to other applications, which requires a reverse mapping.
The difficulty of managing reference data change
An enterprise has many variations of the same code set within its various enterprise applications and many code tables to manage. Although reference data is relatively static, it is not completely static, and managing changes to code tables across the enterprise can be a significant challenge. Many enterprise applications were never designed to accommodate changes to their code tables over time, and changing the tables often requires application development and testing.
Unlike master data management, reference data management is not an established discipline; many organizations have no central management of reference data.
Separate lines of business, application owners, and departments manage their own sets of reference data, and there is often little communication or coordination between the various business areas. Rolling out a change across the enterprise is difficult, as is understanding and managing the individual changes that the different reference data owners make to their code tables.
Examples of changes to reference data are described next.
Changes to a standard
An example of changes to a standard is that a new country code is added to the ISO 3166-1 standard. When a standard is changed, an enterprise has to assess the following questions:
What applications does this change affect?
What will be the impact of adding the new code to each application, and how will the change be implemented and coordinated?
What mappings are affected and need to be updated?
What is the contingency for managing data from applications that cannot be changed or that need to be changed on a staggered timeline?
How will this change be reconciled with data recorded prior to the change?
How will adoption of the changed standard by application owners be tracked and managed over time?
The impending healthcare industry change from ICD-9 to ICD-10 codes in the US is an extreme case of a code standard change. Healthcare organizations are currently engaged in major projects to ensure that their systems are ready for the change in October 2014.
However, even relatively small changes to reference data can cause problems because few organizations have a good understanding of how reference data is being used in each application and what the inter-dependencies are. The impact of making a code change in one or more applications can be difficult to assess.
Changes to an application-specific code set
When an application-specific code set is changed, an enterprise has to assess the following questions:
What other applications does this change affect?
What external mappings are affected and need to be updated?
How is this change coordinated across other application owners?
Changes to mappings between sets
As an enterprise makes changes to the reference data sets that are used within its application ecosystem, it must also update the cross-application reference data maps that are affected by each change. These maps drive translation between one reference data representation and another, and thereby support interoperability between applications. Without governance over who is making changes to which reference data, coordinating changes to the related maps is difficult. As a result, the reference data maps that are used in ETL jobs and elsewhere often become outdated.
Today, enterprises often catalog their reference data and the mappings between reference data sets by using Microsoft Excel spreadsheets and similar manual methods. A manual spreadsheet approach is error-prone, and enforcing good data governance practices and policies for security, audit, and history is difficult.
1.4 The cost of unmanaged reference data
Today, many enterprises have no centralized enterprise governance over reference data; critical reference data is managed using spreadsheets and manual ad-hoc methods. The difficulty of managing change across the complex web of reference data variations is not systematically addressed; errors in reference data mappings and inconsistencies are accepted and tolerated as an everyday reality. Reference data variations and inconsistencies can be a major source of data quality issues within the enterprise and cause business losses through system downtime, incorrect transactions, and incorrect reports.
1.4.1 Costs related to business risk
The following business risks result from unmanaged reference data:
Operational risk
If incorrect reference data is used, transactions or applications might not function correctly and the outputs and results might not be as expected. Reference data often plays a key role in transactional applications and business processes so the potential business impact can be major.
Compliance risk
Regulatory compliance is increasingly a driver for adopting new approaches to managing reference data. More than ever, enterprises must be able to demonstrate, through internal and external audits, how their business processes work; that they work appropriately; that the data they incorporate is accurate; and that only authorized employees have access to the data. There must be a clear audit trail and provenance for data that is used in regulatory reports.
Frequently, enterprises have only the most rudimentary ways of handling governance over reference data, and the common approach of tracking reference data in spreadsheets and doing manual reconciliation is both time-consuming and prone to human error. Because the data stored within data warehouse and financial systems is used to generate financial and regulatory reports, the accuracy of this data is a critical requirement. Incorrectly mapped reference data can have a direct impact on the quality and integrity of the data that is stored within the data warehouse.
Analytical risk
In addition to regulatory reporting, the data warehouse is used to drive business intelligence analysis and reporting. Reference data is fundamental to reporting because it defines the dimensions underpinning BI analysis. Because BI reports increasingly drive business decisions, decisions might be made based on faulty analysis if the transaction data that is mapped to dimensions is incomplete or incorrect.
Distribution risk
Reference data changes might not be correlated across related applications, causing data transfer problems and reporting inconsistencies between applications.
1.4.2 Increased IT cost
Usually, IT has the task of managing the reference data mappings, and there is little involvement from the business. Because reference data might be independently changed within source applications, the data transfer mappings that map reference data from one application to another become inaccurate and incomplete over time, potentially resulting in failed transactions, missing data, or incorrectly categorized transactions. IT often remains in a reactive mode, recognizing changes in back-end application reference data only through the mapping and transaction failures those changes cause.
Separate application owners make their own changes without addressing the implications of those changes on business processes or other parts of the organization.
1.4.3 Cost of business inflexibility
Reference data is a key aspect of any application integration, and the speed with which an enterprise can integrate the reference data from new applications and correctly map that new reference data to existing reference data standards has a bearing on business flexibility. Whether new applications are introduced through ongoing application acquisition or organizational mergers, the ability to accommodate new reference data can directly affect the speed with which the IT infrastructure of an organization can change. Without a formal governance program and the support of a reference data management application, the integration, reconciliation, and ongoing management of new data and reference data sources are slowed.
1.5 Reference data governance with master data management approach
Master data management has become the best practice approach to address the problems that are caused by master data being defined and managed in many separate application and data silos within an enterprise. The goal of master data management applications is to support identification and reconciliation of master data records across an organization, and to provide well-defined processes for managing the stewardship of the master data over time.
Master data management systems support identification and maintenance of a gold record for master data entities within the enterprise. Reference data by definition is not master data, but it presents a similar problem: reference data is defined and used independently across many applications and tends to be maintained in a siloed fashion on an application-by-application basis. Just as master data management systems can support formal, centralized governance and management of master data at an enterprise level, the same approach can be used to manage reference data. The emerging best practice is to treat reference data as a master domain in its own right and to provide centralized management and stewardship of reference data by using a standard set of processes, policies, and tools.
Master data management software provides the tooling foundation for building a specialized reference data management governance solution. IBM has designed a dedicated stewardship and hub application for the centralized management of enterprise reference data using a master data management approach. The IBM InfoSphere Master Data Management Reference Data Management Hub, first launched as a component of IBM InfoSphere Master Data Management Platform in July 2012, provides the repository, services, and stewardship user interface to support a complete governance program for enterprise reference data.
Managing reference data is a key aspect of an enterprise data governance program. The industry best practice for enterprise data governance in general is to establish a Project Management Office (PMO) with oversight of data governance projects. If a data governance PMO already exists, then a reference data management project might fall under its auspices. If not, a reference data management project can be a good starting point for implementing a broader master data management initiative and data governance structure. IBM Lab Services and IBM Global Business Services® have detailed plans for establishing a data governance program, so only certain key elements are covered here.
A data governance initiative for managing reference data should address four key elements:
People: Who owns the reference data and who is responsible for the data, both on the business side and within IT? The answer to this question will establish liaison relationships and simplify communication regarding questions and updates. Separate lines of business might have separate reference data requirements and typically, there will be one or more stewardship teams managing separate reference data sets and mappings. As with master data management, a centralized stewardship approach simplifies implementation at an enterprise level.
The requirements for how the data stewardship team is structured, and for which reference data individual stewards can change, vary from organization to organization. One approach is to organize reference data stewards by business area, assigning a steward responsibility for the reference standards of a particular business area. Business users should be able to raise change requests for additions or changes to reference data, and the change management process should route each request to the appropriate steward, who is then responsible for evaluating the change request and for ensuring the appropriate collaborative review and approval process, depending on the type of reference data and the type of change requested.
Process: What are the processes for stewardship and governance of the data? What are the processes for change management over the data? What are the processes for using the data in business scenarios?
 – Publishing of standards: At its heart, an RDM strategy must create the foundation for reference data standards across the enterprise, taking corporate, industry, and global standards into account. The process should support publishing reference data standards, and active and passive distribution of changes in reference data to the community of subscribing applications.
 – Change management: One of the key precepts underlying an RDM strategy is the ability to accommodate change. While some reference data remains static and changes relatively infrequently, other data might fluctuate on a daily or more frequent basis, for example, currency exchange rates. For that reason, the enterprise must establish a procedure for accommodating changes and updates. What are the change request processes? Who will manage the publication and dissemination of changes to systems that subscribe to the data? How will bidirectional changes between systems and applications be monitored and confirmed? Who among business and IT owners should be notified or consulted about changes? A typical process allows business users to direct change requests that relate to the reference standards to the steward team, and supports a collaborative evaluation, review, and approval process for agreed changes.
 – Deployment and test strategies: How are changes to reference data coordinated across multiple applications? What is the deployment and test cycle for applications affected by reference data change? Publishing of changes to standards should include notification to business and application owners and can be scheduled in such a way as to allow those applications time to be changed to adopt the new standard (for example, monthly publish of standards changes).
 – Discovery: Although reference data might be ubiquitous, it is not always obvious. An RDM strategy should include a way to discover reference data wherever it appears within an enterprise’s applications and spreadsheets. The result of the discovery process should be a catalog that profiles all existing applications, and there should be a process for bringing discovered reference data under centralized governance.
Policies: Organizations should establish policies around security, data ownership, audit, history, data retention, and other non-functional requirements for the reference data under management.
Tools: Technology provides the automation to help manage and implement the governance program. The IBM Reference Data Management hub is designed specifically to support governance and stewardship of reference data, including enforcement of data management policies. The InfoSphere MDM Ref DM Hub supports defining, managing, and publishing the “gold” or canonical reference data standards for the enterprise while managing synchronization and mapping of application-specific reference data representations to the canonical representation and to each other. The InfoSphere MDM Ref DM Hub supports import and onboarding of reference data into the hub, and publishing of reference data and related changes to consuming and subscribing applications. The InfoSphere MDM Ref DM Hub works in conjunction with ETL tools, MDM applications, and many other data quality and data management tools. IBM RDM also integrates with IBM InfoSphere Business Glossary, the IBM business data dictionary. IBM InfoSphere Information Analyzer supports discovery and profiling of reference data in back-end applications so that the reference data can then be easily brought under management in the InfoSphere MDM Ref DM Hub.
1.6 InfoSphere MDM Ref DM Hub feature overview
The IBM InfoSphere Master Data Management Reference Data Management Hub (InfoSphere MDM Ref DM Hub) was released as a separately chargeable component under the IBM Master Data Management Product ID (PID) in July 2012. The hub was developed as a stand-alone reference data domain on the InfoSphere MDM Custom Domain Hub Platform, which itself is the foundation for the InfoSphere MDM Advanced Edition. The InfoSphere MDM Ref DM Hub implements its own specialized domain model specifically for reference data, that is, reference data is supported as a first-class domain entity. The InfoSphere MDM Ref DM Hub includes a dedicated stewardship interface that is designed for managing reference data. The web-based user interface (UI) runs in the browser and no special code is required on the client. The UI is designed for business users, with intuitive and familiar navigation and controls. A flexible data model supports dynamic modelling of reference data properties through the UI, ensuring a quick implementation and minimizing the need for IT involvement on an ongoing basis.
InfoSphere MDM Ref DM Hub is deployed as a stand-alone hub to provide a single point of management and governance for enterprise reference data.
InfoSphere MDM Ref DM Hub provides a robust solution for centralized management, stewardship, and distribution of enterprise reference data. It supports defining and managing reference data as an enterprise standard. It also supports maintaining mappings between the various application-specific representations of reference data that are used within the enterprise. The InfoSphere MDM Ref DM Hub supports formal governance of reference data, putting management of the reference data in the hands of the business users, reducing the burden on IT, and improving the overall quality of data used across the organization.
1.6.1 Key functions of the InfoSphere MDM Ref DM Hub
InfoSphere MDM Ref DM Hub is designed as a ready-to-run application. It is quick to install, easy to use and understand, and delivers real value for immediate use without requiring extensive customization. The key functions include the following items:
Role-based user interface with security and access control including integration with LDAP
Management of reference data sets and values
Management of mappings and relationships between reference data sets
Importing and exporting of reference data in CSV and XML format through both batch and user interface
Versioning support for reference data sets and mappings
Change process controlled through configurable lifecycle management
Hierarchy management
InfoSphere MDM Ref DM Hub is built on the proven InfoSphere MDM platform and delivers a master data management approach to managing enterprise reference data. It helps to reduce business risk, improve enterprise data quality, and enhance operational efficiency. InfoSphere MDM Ref DM Hub is based on a three-tiered component architecture, comprising a client and a server application interacting with a back-end database that hosts the application-specific data and required metadata.
Figure 1-5 depicts a high-level component architecture of InfoSphere MDM Ref DM Hub.
Figure 1-5 InfoSphere MDM Ref DM Hub logical architecture
The InfoSphere MDM Ref DM Hub user interface is a web application UI that supports collaborative authoring of reference data. Reference Data Stewards use the RDM web UI to import, manage, and publish reference data sets. The role-based UI allows a stewardship team to view, author, map, and approve reference data sets within a central repository. With this approach, reference data sets can be created and managed in a controlled manner. User actions on the web UI trigger requests, which are handled by the appropriate service controllers in the REST layer. The REST layer services invoke the server-side transactions to manage CRUD procedures on the RDM database.
The server-side is implemented on the proven InfoSphere Custom Domain Hub engine (the same engine that powers InfoSphere MDM Server and InfoSphere MDM Advanced Edition).
The reference data domain model elevates reference data to be a first-class domain entity within MDM. By implementing the InfoSphere MDM Ref DM Hub as a new domain on the InfoSphere MDM platform, the InfoSphere MDM Ref DM Hub benefits from a wide range of base services and ready-to-use frameworks that InfoSphere MDM provides, such as business rules, event notification, data quality, and audit history. In addition, several reference data management specific services are implemented to achieve key functionality such as import and export, reference data set lifecycle management, transcoding, distribution, and versioning.
The client and server enterprise archives reside in a WebSphere Application Server instance. The currently supported databases are IBM DB2 and Oracle.
1.6.2 Understanding reference data sets
Reference data sets are at the heart of an RDM system and are used to manage and contain reference data. Every data set is associated with a reference data type that defines the properties of the data set.
When a reference data set is created, it is automatically given a version number, starting with 1. You can change the version number to another value, numeric or alphabetic. When you create new versions, you can give them any label to indicate the version number. Each reference data set has at least one version, and all versions of a data set are grouped together. For ease of management, reference data sets can also be grouped in folders.
The InfoSphere MDM Reference Data Management Hub user interface uses a drag and drop hierarchy widget to visualize and manipulate the reference data set folder hierarchy, and the reference data sets within them. Folders, the reference data sets that the folders contain, or individual reference data sets can be dragged from one folder to another.
Organize sets using folders
Folders are an aid for organizing and navigating reference data sets. A reference data steward can give the folders meaningful names that suit the context of the work environment, such as named for the project they belong to, the person who created them, or a date. Folders can also contain child folders to further help organize reference data sets.
The folders are listed in alphabetical order. When you scroll down the Folder View, any reference data sets that are outside of a folder are listed after the folders.
Figure 1-6 shows the reference sets, organized into folders in the InfoSphere RDM user interface.
Figure 1-6 Reference data set folder view
Reference data set versions
Reference data sets support versions. Versions are used to provide a structured approval process for changes to reference data before those changes become active.
Data stewards can also create a copy of a reference data set as a new version of the original, preserving a set before changes are made. If the changes are unwanted, the later version can be deleted, reverting to the previous version.
Lifecycles and states
When creating a reference data set, a steward assigns a lifecycle process to that set. Every reference data set and mapping has a lifecycle process that defines the states and state transitions governing changes to the set.
Each lifecycle has a set of states that correspond to the steps in the change, review, and approval process. Four lifecycle processes, listed next, are available and ready for immediate use with the InfoSphere MDM Ref DM Hub application. Various lifecycles can be defined at implementation time to suit an organization’s specific governance processes.
Simple Approval Process
Simple approval process defines a lifecycle where changes might be made to a draft version of a reference data set, and a single approval is required to publish the change. The set or mapping starts in the Draft state, allowing an authorized data steward to make changes. When the steward completes the changes and is ready to publish them, the steward submits the change for approval, which changes the state to Pending Approval. A user with the Approver role can then review the changes and approve or reject them, which in turn changes the state of the set or mapping to Approved or Rejected. The set or mapping can be moved to the Retired state when it is no longer used, and from the Retired state to the Dropped state, which bars any further edits.
Simple approval process is the most commonly used process for the lifecycle of a reference data set. The simple approval process includes the following states:
Draft: This is the initial state for a reference data set or mapping. Data stewards request approval by changing the state to Pending Approval.
Pending Approval: The reference data set or mapping is awaiting approval from a user with the Approver role.
Approved: The reference data set or mapping is approved for use. Whether the data set or mapping is available to consumers also depends on its effective date and expiration date. Sets and mappings in the Approved state cannot be edited.
Rejected: The reference data set or mapping is not approved for use. It can be edited in the rejected state by the data steward, or it can be changed to the draft state. To submit the rejected reference data set or mapping for approval, the state must be changed first to Draft, and then to Pending Approval.
Retired: The reference data set or mapping is no longer used, although it can be edited, and new data set versions can be created from a retired data set.
Dropped: The reference data set or mapping can no longer be edited, and its state cannot be changed. New data set versions can be created from a dropped data set.
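The states and transitions of the simple approval process can be sketched as a small state machine. The following Python sketch is purely illustrative; the class name and transition table are assumptions for illustration and are not the product's implementation.

```python
# Illustrative sketch of the Simple Approval lifecycle described above.
# The class and transition table are assumptions, not the actual
# InfoSphere MDM Ref DM Hub implementation.

SIMPLE_APPROVAL = {
    "Draft": {"Pending Approval"},
    "Pending Approval": {"Approved", "Rejected"},
    "Approved": {"Retired"},
    "Rejected": {"Draft"},          # rejected changes go back to Draft
    "Retired": {"Dropped"},
    "Dropped": set(),               # terminal: no further transitions
}

class ReferenceDataSet:
    def __init__(self, name):
        self.name = name
        self.state = "Draft"        # every set or mapping starts in Draft

    def transition(self, new_state):
        if new_state not in SIMPLE_APPROVAL[self.state]:
            raise ValueError(f"Cannot move from {self.state} to {new_state}")
        self.state = new_state

countries = ReferenceDataSet("Countries")
countries.transition("Pending Approval")
countries.transition("Approved")
print(countries.state)  # Approved
```

Invalid moves, such as editing an Approved set back to Draft without a new version, are rejected by the transition table.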
State Machine - 2
The mapping or set using this lifecycle process has only two states, Draft and Approved:
Draft - 2: This is the initial state for a reference data set or mapping. Data stewards request approval by changing the state to Approved - 2.
Approved - 2: The reference data set or mapping is approved for use. Whether the data set or mapping is available to consumers also depends on its effective date and expiration date. Sets and mappings in the Approved - 2 state cannot be edited.
State Machine - 2 process might be appropriate where a single person or steward is responsible for editing the mapping or set. In that case, there is no need for a separate approver. This state machine allows changes to be made to a draft copy before being published.
Active Editable
The set or mapping using this lifecycle process is immediately available for use, and can be edited at any time. Active Editable might be appropriate for certain types of reference data where the changes are made automatically by a systemic process and do not require stewardship intervention or a review and approval. Active Editable process supports changes directly to the active sets. Because there is no approval process, the State property is disabled.
Two Step Approval
The set or mapping that uses this lifecycle process requires two approvals before it can become available for use. The two-step approval was designed for customers with more rigorous governance controls who require a multistep approval process. This lifecycle process includes the following states:
Draft: This is the initial state for a reference data set or mapping. Data stewards request approval by changing the state to Pending First Approval.
Pending First Approval: The reference data set or mapping is awaiting approval from a user with the Approver role. The Approver can change the state to Pending Second Approval or Rejected.
Pending Second Approval: The reference data set or mapping is awaiting approval from a user with the Approver2 role. The Approver2 can change the state to Approved or Rejected.
Approved: The reference data set or mapping is approved for use. Whether the data set or mapping is available to consumers also depends on its effective date and expiration date. Sets and mappings in the Approved state cannot be edited.
Rejected: The reference data set or mapping is not approved for use. It can be edited in the rejected state by the data steward, or it can be changed to the draft state. To submit the rejected reference data set or mapping for approval, the state must be changed first to Draft, and then to Pending First Approval.
Retired: The reference data set or mapping is no longer used, although it can be edited, and new data set versions can be created from a retired data set.
Dropped: The reference data set or mapping can no longer be edited, and its state cannot be changed. New data set versions can be created from a dropped data set.
Reference values
Reference data sets contain rows of data values, which can range from a few rows in small data sets to many rows in large data sets, although data sets typically have fewer than one hundred thousand (100,000) rows. The data can be managed by using the InfoSphere MDM Reference Data Management Hub console, data imports, or updates applied by using the batch processor or web services.
The UI supports both simple and advanced filtering to limit what values are displayed for a set.
Each row in the data set can contain multiple properties, as defined by the data set type, and support a number of language translations. In addition, data values can participate in mapping and hierarchy associations.
Data deletion is always handled with a soft deletion mechanism. Data is never physically deleted from the InfoSphere MDM Reference Data Management Hub database on a delete action.
Reference data set translation
Reference data sets can have multiple language translations for each reference value in the set. The translations of a particular reference value can be provided manually on the Translations tab in the Set Values view. Alternatively, the translations can be imported from a comma-separated values (CSV) file.
Reference data set hierarchies
The InfoSphere MDM Reference Data Management Hub console supports the creation of two types of hierarchies over reference data:
Tree-style hierarchy
Level-based hierarchy
Tree-style hierarchies
Reference data set hierarchies within InfoSphere MDM Ref DM Hub are tree-style hierarchy structures, created over the values within a reference data set. You access the hierarchies by using the Set Hierarchies view.
The following actions are available from the Set Hierarchies view:
Save or discard changes to a hierarchy.
Select the hierarchy to work on by selecting Set Hierarchies from the View menu.
Create a new empty or pre-populated hierarchy over the values in a reference data set.
Import a hierarchy definition from a comma-separated values (CSV) file.
A hierarchy structure can be defined between the elements of a set in a CSV file and imported over a set of reference data values. When dealing with hierarchies over a large set of values, an easier approach is to define the hierarchy externally, and create a CSV file to import the hierarchy into InfoSphere MDM Reference Data Management Hub.
Copy a hierarchy.
Delete a hierarchy.
Refresh the list of hierarchies.
 
Note: Each reference value can exist only once within a hierarchy.
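A hierarchy defined externally and imported from a CSV file can be sketched as follows. The two-column child/parent layout and the code values are a hypothetical external format for illustration, not necessarily the hub's required CSV import layout.

```python
# Build a tree-style hierarchy from child/parent CSV rows.
# The column layout and codes are hypothetical, for illustration only.
import csv
import io

csv_text = """child,parent
US,
US-NY,US
US-CA,US
CA,
CA-ON,CA
"""

children = {}                        # parent code -> list of child codes
for row in csv.DictReader(io.StringIO(csv_text)):
    parent = row["parent"] or None   # an empty parent marks a root node
    children.setdefault(parent, []).append(row["child"])

roots = children[None]
print(roots)            # ['US', 'CA']
print(children["US"])   # ['US-NY', 'US-CA']
```

Note that each value appears only once in the file, matching the rule that a reference value can exist only once within a hierarchy.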
Level-based hierarchies
InfoSphere MDM Ref DM Hub supports level-based hierarchies across multiple sets, where each level of the hierarchy is associated with a different reference data set.
An example of such a level-based hierarchy is city/state/country where city, state, and country are each reference data sets. Managing the relationships between the values across the sets is both valuable and meaningful.
Reference data sets can be linked together by using the reference data set property type. The values from one reference data set can be surfaced as a lookup table within the value properties of a related reference data set. For example, a State reference data set has a property for Country, and this property is of type reference data set. State and country become linked by using the Country reference data set as a lookup table within State values. This linkage creates a value-based hierarchy.
The first step in supporting a level-based hierarchy is to create the appropriate reference data set types using the administration tab in the InfoSphere MDM Ref DM Hub UI.
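The State-to-Country linkage described above can be sketched as a lookup from one set's values into another. The set contents and property names below are invented for illustration; they are not the hub's data model.

```python
# Sketch of a level-based hierarchy: the Country property of each State
# value is constrained to codes from the Countries reference data set.
# Set contents and property names are invented for illustration.

countries = {"US": "United States", "CA": "Canada"}   # Countries set

states = [
    {"code": "NY", "name": "New York", "country": "US"},
    {"code": "ON", "name": "Ontario", "country": "CA"},
]

def validate_state(value, lookup):
    # A property of type Reference Data Set must hold a code that exists
    # in the current version of the related set.
    if value["country"] not in lookup:
        raise ValueError(f"Unknown country code: {value['country']}")
    return True

assert all(validate_state(s, countries) for s in states)
```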
Reference data set subscriptions
Subscriptions form part of the definition of the data distribution model within the InfoSphere MDM Ref DM Hub. A subscription allows properties to be set that determine how applications subscribe to the data within a particular reference data set.
A subscription is an association made between a reference data set and a managed system; this association indicates who the publishers and consumers of the reference data set are. A reference data set can therefore have many subscriptions, typically with a single publisher and many consumers. Subscriptions are useful when you perform import or export operations for the associated reference data set.
Figure 1-7 shows subscription properties for a reference data set.
Figure 1-7 Subscription properties for a reference data set
Reference data set history
History data that reflects all changes to a reference data set or mapping is maintained within the InfoSphere MDM Ref DM Hub in special history tables. This history data can be retrieved through customized SQL queries against the database.
Data that is deleted through the user interface is not physically deleted from the InfoSphere MDM Ref DM Hub database, but is marked as logically deleted. (This process is often called a soft delete.) Deleted data is not returned through the InfoSphere MDM Ref DM Hub inquiry services or user interface, but can be retrieved, for audit purposes, from the InfoSphere MDM Ref DM Hub database.
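The soft-delete behavior described above can be illustrated with a generic pattern: a deletion flag rather than a physical DELETE. The table and column names below are hypothetical and are not the hub's actual schema.

```python
# Generic soft-delete pattern, illustrating the behavior described above.
# Table and column names are hypothetical, not the actual RDM schema.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE ref_value (code TEXT, name TEXT, deleted INTEGER DEFAULT 0)")
db.execute("INSERT INTO ref_value (code, name) "
           "VALUES ('US', 'United States'), ('XX', 'Obsolete')")

# A delete action marks the row as logically deleted.
db.execute("UPDATE ref_value SET deleted = 1 WHERE code = 'XX'")

# Inquiry services and the UI would filter out deleted rows...
active = db.execute("SELECT code FROM ref_value WHERE deleted = 0").fetchall()
# ...but an auditor querying the database directly can still see them.
audit = db.execute("SELECT code FROM ref_value").fetchall()
print(active)  # [('US',)]
print(audit)   # [('US',), ('XX',)]
```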
1.6.3 Understanding reference data types
Every reference data set (data table) or mapping that is created within InfoSphere MDM Ref DM Hub has a default set of properties. These reference data set and mapping properties are defined by using reference data types. Every data set and mapping is associated with a data type definition.
The InfoSphere MDM Ref DM Hub system includes two basic data types, Default Reference Data Type and Default Mapping Type, which can be used to define a simple reference data set or mapping. An authorized user can create reference data types with additional properties to define more complex reference data sets or mappings.
Properties that are defined in a reference data type can be applied at the reference data set level, or at the value level for sets and mappings. At the value level, properties apply to all the rows of data in a data table or to all value mappings.
Reference data types have a data type property, which designates whether the reference data type is used to create reference data sets or mappings. After the reference data type is saved, its data type cannot be changed.
Business keys
Reference data types can be defined with compound keys. Up to four properties plus the code can be defined as constituent key parts of a unique reference data value key. Business keys apply only to reference data set data types; mapping set data types do not use business keys.
The business key feature is enabled for a reference data type in the Administration menu in the InfoSphere MDM Ref DM Hub UI. Selecting the business key check box for a reference data type is one way to ensure that the code for a reference data set is unique.
 
Note: Name is also a required field for each reference value but is not used in determining uniqueness of the compound key.
InfoSphere MDM Reference Data Management Hub supports up to five properties within a compound key for a set, where Key 1 is always the set code. The other four properties must have the Key property enabled within the reference data type definition.
Compound keys allow records with the same code to be saved, if the combined key is unique. The uniqueness check is done against the full key that is defined for the type. If the overall compound key is unique, the record can be stored.
Importing values into a reference data set that is defined with a compound key is the same as importing values into any other set. The same rules apply as in the manual case: the key values cannot be null, and the overall combination of keys and code must be unique for each value.
The business key check box and compound key definition can be applied for a reference data set with existing values if the values conform to the rules defined for the keys. The InfoSphere MDM Ref DM Hub checks for data integrity when the compound key settings for a reference data set type are changed, and does not allow actions that would create duplicate entries in the database where uniqueness is required.
Uniqueness is not enforced for the Code property unless the business key option is selected. Uniqueness is not enforced to support effective date-centric use cases that require multiple entries for the same code with different effective dates.
 
Preferred: For most reference data types, the business key is preferred.
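The compound-key uniqueness rule can be sketched as follows: records may share a code as long as the full key is unique. The property name used as a key part is invented for illustration.

```python
# Sketch of compound-key uniqueness: records may share a code as long as
# the full key (code plus the key-enabled properties) is unique.
# The property name "effective_from" is invented for illustration.

def check_unique(values, key_props):
    seen = set()
    for v in values:
        key = tuple(v[p] for p in ("code", *key_props))
        if key in seen:
            raise ValueError(f"Duplicate compound key: {key}")
        seen.add(key)
    return True

values = [
    {"code": "USD", "effective_from": "2012-01-01"},
    {"code": "USD", "effective_from": "2013-01-01"},  # same code, key differs
]
assert check_unique(values, ["effective_from"])
```

This matches the effective date-centric use case: two entries for the same code can coexist because their effective dates differ.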
Reference data set data types
When you create a data type property of type Reference Data Set that points to a data set in the system, the options available must be from a current version of that set. A current version of the set is required to fill that property in a reference value.
The current version of a set is one that meets all of these conditions:
The state is in an approved state for the lifecycle process or state machine.
The effective date is the current date or earlier.
The expiration date is after the current date.
If more than one version meets these conditions, the most recent version is considered the current version.
 
Dates: The effective date and expiration date for individual reference values are not relevant to the definition of a current version.
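The three conditions above can be sketched as a selection function over a set's versions. The field names are illustrative, and the tie-break on the effective date is an assumption about how "most recent" is determined.

```python
# Select the current version of a set: approved state, effective date not
# in the future, expiration date in the future. Field names are
# illustrative; the tie-break on effective date is an assumption.
from datetime import date

def current_version(versions, today):
    candidates = [
        v for v in versions
        if v["state"] == "Approved"
        and v["effective"] <= today
        and v["expiration"] > today
    ]
    # If more than one version qualifies, the most recent one wins.
    return max(candidates, key=lambda v: v["effective"], default=None)

versions = [
    {"label": "1", "state": "Approved",
     "effective": date(2012, 1, 1), "expiration": date(2014, 1, 1)},
    {"label": "2", "state": "Draft",
     "effective": date(2013, 1, 1), "expiration": date(2015, 1, 1)},
]
print(current_version(versions, today=date(2013, 6, 1))["label"])  # 1
```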
Example
You create a reference data set named Countries, which contains reference values for country names. The status for the Countries data set is Approved. Its Effective date was last month, and its Expiration date is next year.
Next, you create a reference data type named Branches, which includes a Value Level property named Country. The data type for the Country property is Reference Data Set, and its Related set is the Countries reference set.
When you create a reference data set named Branches, basing it on the Branches reference data type, the Country property is populated with the values from the Countries data set, provided that the Countries data set has a current version.
1.6.4 Mapping reference data sets
Mappings can be defined between two reference data sets, and the reference values within those sets, to relate associated data.
For example, NACE is the European standard for industry codes, and NAICS is the North American standard. You can map a reference data set that contains NACE codes to one that contains NAICS codes.
The mappings browser is where you perform the primary management tasks for mapping sets. With the mappings browser, you create, read, update, and delete a mapping between two reference data sets and map reference values in those data sets.
Each mapping has its own metadata, including version and lifecycle state. For instance, the version for a mapping is unrelated to the version of the sets that are being mapped. You can easily locate a map by filtering for the source or target set associated with the map.
Every reference data set and mapping has a state that corresponds to its lifecycle process. The lifecycle is used by the data administrators to control the versions of reference data sets and mappings that are in use.
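The NACE-to-NAICS example can be sketched as a simple transcoding lookup. The code pairs below are illustrative only; they are not an authoritative NACE/NAICS crosswalk.

```python
# Transcoding sketch: map values in a source set (NACE) to values in a
# target set (NAICS). The code pairs are illustrative, not an
# authoritative NACE/NAICS crosswalk.

nace_to_naics = {
    "62.01": "541511",   # illustrative pair only
    "64.19": "522110",   # illustrative pair only
}

def transcode(code, mapping):
    try:
        return mapping[code]
    except KeyError:
        # Unmapped codes are surfaced for steward review rather than guessed.
        raise KeyError(f"No mapping defined for source code {code}")

print(transcode("62.01", nace_to_naics))  # 541511
```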
1.6.5 Managed systems
Managed systems are references to external systems that represent the suppliers and consumers of the reference data defined within the InfoSphere MDM Ref DM Hub.
Managed systems support configuration of the integration with applications and other entities that provide data to InfoSphere MDM Ref DM Hub or that subscribe to data exports from the hub.
You can set properties for the managed system to enable InfoSphere MDM Ref DM Hub and integrated applications to communicate with the managed system. Managed systems are associated with reference data by subscribing to a specific reference data set.
Each managed system can have custom properties that assist your external integration services personnel in providing reference data to external systems. Integrating managed systems requires custom code as part of the implementation.
1.6.6 Batch export
You can install and run the batch export function from any supported system that has access to the InfoSphere MDM Ref DM Hub. The hub can be on the same physical system that is used to perform the export, or it can be on a remote system.
Batch export fully supports exporting the following entities:
Data values
Data types
Translation
Mapping
Hierarchies
Batch export can produce CSV or XML formatted output. The formatted output is compatible with the CSV and XML import and export functions accessible with the user interface.
The resource definition and control for batch export is handled with a properties file and command line options. You use the properties file and command line options to specify the type and name of the resource that is exported. A log file is created that contains the result of the export and statistics on the number of rows exported.
Batch export properties
A sample properties file, named RDMBatchExportClient.properties, is provided with the InfoSphere MDM Ref DM Hub installation. The options in this file are used to specify parameters for running the batch export.
The options in the properties file have the following general format:
<option> = value
Table 1-1 lists the options that apply to batch export.
Table 1-1 Batch export options
Option
Description
provider_url
Provider URL of the server instance in the host:port format.
Example: myserver.ibm.com:9080
target_dir
File output directory.
Example: C:\output
format
File format for the output. Supported formats are CSV and XML. Example: format = csv
export_separator
CSV delimiter, such as “,” or “|” delimiter.
Example: export_separator = ,
export_wrapper
The CSV wrapper character, such as double quotation marks.
Example: export_wrapper = "
timestamp_format
A string that defines the format for time stamps used in CSV exports. When exporting XML, batch export uses the time stamp format mandated by the XML specification.
Example: timestamp_format = "yyyy-MM-dd'T'HH:mm:ssz"
timezone_offset
Used in CSV exports to designate the offset that the values have from UTC.
Example: timezone_offset = +00:00
date_format
A string that defines the format for dates.
Example: date_format = yyyy-MM-dd
max_translations
The maximum number of translations to output in CSV files. A value of 0 instructs the software not to export any translations. For XML formatted output, all translations are exported unless this value is 0.
Example: max_translations = 10
hierarchy_export
For hierarchy exports, “tree” or “list” options are available. Example: hierarchy_export = list
user
User credentials for web service authentication.
Example: user = tabs
password
User credentials for web service authentication.
Example: password = tabs
requestid
Request control parameter that identifies the request.
Example: requestid = 100101
requester_name
Request control parameter that identifies the name of the user that makes the request.
Example: requester_name = Some Name
request_lang
Request control parameter to identify the request language.
Example: request_lang = 100
page_size
Page size used while retrieving data from the server. Identifies the number of rows to be returned in a single request.
Example: page_size = 250
debug
When set to false, no debug information is produced. When set to true, debug information is written to the file debug.log in the output directory for the requested file type. If the debug.log file cannot be created, an error message is written to the command window used to start the batch export.
Example: debug = true
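Drawing on the options in Table 1-1, a minimal batch export properties file might look like the following sketch. The host name, credentials, and output directory are placeholders, not defaults.

```
provider_url = myserver.ibm.com:9080
target_dir = C:\output
format = csv
export_separator = ,
export_wrapper = "
date_format = yyyy-MM-dd
max_translations = 10
user = tabs
password = tabs
page_size = 250
debug = false
```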
 