InfoSphere RDM model design
This chapter explores the InfoSphere Master Data Management Reference Data Management Hub (InfoSphere MDM Ref DM Hub) model in more detail and shows how some of its concepts work. First, several of the more complex capabilities within InfoSphere MDM Ref DM Hub are examined, with descriptions of how the features work.
Next, the chapter describes how to use these features to handle various data modeling issues that arise when mapping reference data to InfoSphere MDM Ref DM Hub. The alternative approaches are explained, along with the trade-offs between them.
5.1 InfoSphere MDM Ref DM Hub model
Underlying InfoSphere MDM Ref DM Hub is a data model that provides the structure to store reference data in a controlled and consistent way. A rich set of services allows the data to be maintained, and a user interface (UI) provides ready access to managing reference data.
Figure 5-1 shows a simplified diagram of several major entities of the RDM data model. The managed entities are described in the following sections.
Figure 5-1 Simplified diagram of some of the major RDM data model entities
5.1.1 Managed entities
Managed entities are reference data sets, the values within those sets, and set mappings. These artifacts are proactively managed for authoring and administration in InfoSphere MDM Ref DM Hub. The principal means of administering managed entities is the use of versioning, lifecycle, and ownership functions. Other parts of the application that are more client-specific but still relevant to managed entities include the chosen folder structure, the naming conventions that are used, and any added custom properties.
Managed entities allow different versions to exist and the entities have a governance process around which versions are to be used for a given purpose at any given time. Typically, in an operational InfoSphere Master Data Management (MDM) system, there is only one copy of an entity. There might be historical copies that are used for auditing purposes, but there is only one active copy of the entity. In InfoSphere MDM Ref DM Hub, there can be many versions of an entity and more than one can be active at the same time. The business decides which version is active. For example, a catalog list of fashion products might have Spring and Summer versions both active in May and June.
In an authoring environment, you need to know what the current version is, while one or more future versions might be waiting to become active or might still be worked on by data stewards. In InfoSphere MDM Ref DM Hub, entities that are managed have certain properties and attributes that allow for both holding multiple copies and facilitating the data governance process.
There are two main entities in InfoSphere MDM Ref DM Hub that are subject to versioning: reference data sets and reference data mappings. These managed entities have several child entities that are also versioned along with the parent entity.
Versioning and lifecycle ownership
Versioning support allows for multiple copies of an entity. Versioning is modeled on top of the MDM base entity support. Attributes are added to store the version label and lifecycle state. Versions of an entity are copies of the entity. The MDM framework assigns a unique primary key identifier (pkID) to every instance of an entity. Because versions are entity instances, each has its own pkID. In addition, a common identifier called the baseID is added. The baseID is the same in all versions of an entity. You use the pkID to access a specific version, or use the baseID to refer to all versions of the entity.
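The pkID and baseID scheme can be illustrated with a small sketch. This is a hypothetical in-memory model, not the InfoSphere MDM Ref DM Hub schema or API. The names SetVersion, create_entity, create_version, and versions_of are assumptions made for illustration, as is the choice to reuse the first pkID as the baseID.

```python
from dataclasses import dataclass
from itertools import count

_pk_sequence = count(1)  # hypothetical generator of unique pkIDs


@dataclass
class SetVersion:
    """One version of a managed entity (illustrative only)."""
    name: str
    version_label: str
    lifecycle_state: str
    base_id: int  # shared by every version of the entity
    pk_id: int    # unique to this version


def create_entity(name, version_label, state="Draft"):
    """Create the first version. Assumption: the baseID reuses the first pkID."""
    pk = next(_pk_sequence)
    return SetVersion(name, version_label, state, base_id=pk, pk_id=pk)


def create_version(source, version_label, state="Draft"):
    """Copy an entity into a new version: new pkID, same baseID."""
    return SetVersion(source.name, version_label, state,
                      base_id=source.base_id, pk_id=next(_pk_sequence))


def versions_of(store, base_id):
    """Return every version in the store that shares the given baseID."""
    return [v for v in store if v.base_id == base_id]
```

Looking up by pk_id returns exactly one version, while versions_of gathers all versions of the same entity through the shared base_id.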
Ownership allows the system to store a list of user groups that are allowed to work with entities. The owner field is a list of group names. When a user accesses a managed entity, the user registry is checked to determine whether that user is a member of one of the ownership groups. If not, the user is given read-only access. InfoSphere MDM Ref DM Hub supports the same user registries that WebSphere Application Server does (local and LDAP based).
A group.properties file configures the list of group names that are allowed to be owners in InfoSphere MDM Ref DM Hub. This list is a subset of all the groups available in your registry system.
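The ownership check described above can be sketched as follows. This is a minimal illustration, not the product implementation: the group names, the USER_REGISTRY dictionary (standing in for a local or LDAP registry), and the "edit" access label are all assumptions, and the real format of group.properties is not shown here.

```python
# Stand-in for the owner groups configured in group.properties
# (hypothetical names; the file's real format is not reproduced here).
ALLOWED_OWNER_GROUPS = {"FinanceStewards", "ProductStewards"}

# Stand-in for a user registry lookup (local or LDAP in the real system).
USER_REGISTRY = {
    "alice": {"FinanceStewards", "Staff"},
    "bob": {"Staff"},
}


def set_owners(groups):
    """Accept only group names that are configured as possible owners."""
    unknown = set(groups) - ALLOWED_OWNER_GROUPS
    if unknown:
        raise ValueError(f"not configured as owner groups: {sorted(unknown)}")
    return set(groups)


def access_level(user, owner_groups):
    """Members of an owning group may edit; every other user is read-only."""
    user_groups = USER_REGISTRY.get(user, set())
    return "edit" if user_groups & owner_groups else "read-only"
```

For example, with owners = set_owners(["FinanceStewards"]), the call access_level("alice", owners) yields "edit" and access_level("bob", owners) yields "read-only".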
Lifecycle adds the ability to apply a governance model for how versions are handled. It identifies when a version is approved and active, when it is in a draft state, pending approval, or in testing states. InfoSphere MDM Ref DM Hub includes sample state machines (Simple Approval, Two Level Approval, Active Editable, and so on). The built-in state machines can be modified, or new state machines can be added. Each managed entity is assigned a state machine when it is created. The entity is created in the default or draft state specified by the state machine configuration.
Lifecycle actions are then applied to the entity to move it from one state to the next. Security rules control which lifecycle actions can be performed by each user. This allows users who can edit entities in a draft state to perform the request approval action, but then not be allowed to perform the approval action. Approvals require a user in the Approval role.
The lifecycle attributes that are added to managed entities include a state machine ID that identifies which state machine is being used. The state machine holds the set of states and the transitions that apply to an entity. Each state machine reflects the governance model being applied to the entity.
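A lifecycle of this kind can be sketched as a table of transitions plus a role check per action. The sketch below is loosely modeled on the idea of a simple approval process. The state names, action names, and role names ("Editor", "Approver") are assumptions for illustration and do not reproduce the product's state machine configuration.

```python
SIMPLE_APPROVAL = {
    "initial": "Draft",
    # (current state, action) -> next state
    "transitions": {
        ("Draft", "request_approval"): "Pending Approval",
        ("Pending Approval", "approve"): "Approved",
        ("Pending Approval", "reject"): "Draft",
    },
    # action -> role a user must hold to perform it
    "required_role": {
        "request_approval": "Editor",
        "approve": "Approver",
        "reject": "Approver",
    },
}


def apply_action(machine, state, action, user_roles):
    """Return the next state, or raise if the action is not allowed."""
    next_state = machine["transitions"].get((state, action))
    if next_state is None:
        raise ValueError(f"{action!r} is not valid in state {state!r}")
    if machine["required_role"][action] not in user_roles:
        raise PermissionError(f"{action!r} requires a different role")
    return next_state
```

A user who holds only the hypothetical Editor role can move a draft to Pending Approval but receives a PermissionError when attempting the approve action, which mirrors the separation of duties described above.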
Children of managed entities
The managed entities can contain child entities (versions and copies of that set) that also need to be managed. For example, sets contain reference data values. When a new version of the set is created, the values must be copied from the original set to the new version of the set. Contained children of managed entities do not necessarily inherit all of the same attributes, because children can be modified independently. Some of the management concepts, such as version, owner, and lifecycle, are always inherited from the parent managed entity.
These contained entities must be referenced in a version-independent way. To make this possible, contained entities also carry a common baseID attribute. For example, the baseID in values is the same for all values that are the same but in different versions of the set. The pkID of these values differs in each version.
By using the pkID, you can refer to a specific entity. By using the baseID, you can refer to all versions of that entity. In InfoSphere MDM Ref DM Hub, you see both of these IDs used in different ways.
A value refers to its parent set by the pkID of the set, because a value can exist only within a single version of the set. For relationships between values, the baseID is used. If Texas is in the United States, the relationship should use the baseID so that it refers to all versions of United States.
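The two kinds of reference can be sketched with a hypothetical Value record: the link to the parent set version is held as a pkID, and the link to a related value is held as a baseID. The class, field, and function names here are assumptions for illustration, not the product's data model.

```python
from dataclasses import dataclass
from itertools import count
from typing import Optional

_pk_sequence = count(100)  # hypothetical generator of unique pkIDs


@dataclass
class Value:
    code: str
    pk_id: int                             # unique to this copy of the value
    base_id: int                           # shared by all copies of the value
    set_pk_id: int                         # parent set version, by pkID
    related_base_id: Optional[int] = None  # related value, by baseID


def new_value(code, set_pk_id, related_base_id=None):
    """Create a value in one set version. Assumption: baseID reuses the first pkID."""
    pk = next(_pk_sequence)
    return Value(code, pk, pk, set_pk_id, related_base_id)


def copy_values(values, new_set_pk_id):
    """Copy values into a new set version: new pkIDs, unchanged baseIDs."""
    return [Value(v.code, next(_pk_sequence), v.base_id,
                  new_set_pk_id, v.related_base_id)
            for v in values]
```

If a Texas value stores the baseID of United States in related_base_id, the relationship still resolves after copy_values creates United States in a new version of the country set, because the copy keeps the same baseID.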
Figure 5-2 on page 103 shows two versions of a Country Set. The Abstract Country is not persisted in InfoSphere MDM Ref DM Hub. Each version of the country set is identified by its own pkID and also carries a baseID. The baseID is the same for both versions. The country set contains two values in each version. The values contain their own pkID and a baseID. Again, the baseID is the same for different versions of the same value.
Figure 5-2 Two versions of a set with different values
Business keys and compound keys
InfoSphere MDM Ref DM Hub supports having a business key for values. The business key represents an external unique identifier for a value. By default, the code field from any reference set is used as the business key for values. The system ensures that there are not multiple values with the same key. There is also support for compound keys. Optionally, up to four additional attributes can be selected as part of the key of a value.
Business keys are configured in the reference data set type. There is an option to disable business keys in the data set type definition, but this is not recommended, and the feature may be removed in a future release. When adding properties to the data set type, you choose whether each property is part of the key. Making a property part of the key also makes it a mandatory field. When importing data, the business keys are used to find existing values and update them; if the business keys are not found (the values are new in the import), new values are created. Business keys are also used to find values when establishing relationships between values for set properties, hierarchies, and mappings.
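The import behavior described above amounts to an update-or-create keyed on the business key. The following is a minimal sketch under stated assumptions: values are plain dictionaries, the store is a dictionary keyed by the business key tuple, and the function names are hypothetical. It is not the product's import service.

```python
MAX_KEY_FIELDS = 5  # the code field plus up to four additional attributes


def business_key(row, key_fields):
    """Build the (possibly compound) key. Key fields are mandatory."""
    missing = [f for f in key_fields if row.get(f) in (None, "")]
    if missing:
        raise ValueError(f"missing mandatory key fields: {missing}")
    return tuple(row[f] for f in key_fields)


def import_rows(existing, rows, key_fields=("code",)):
    """Update values whose key already exists; create the rest.

    Returns a (created, updated) pair of counts.
    """
    if len(key_fields) > MAX_KEY_FIELDS:
        raise ValueError("too many key fields")
    created = updated = 0
    for row in rows:
        key = business_key(row, key_fields)
        if key in existing:
            existing[key].update(row)
            updated += 1
        else:
            existing[key] = dict(row)
            created += 1
    return created, updated
```

With a compound key such as key_fields=("code", "region"), two rows that share a code but differ in region are treated as distinct values.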
Hierarchy
Two types of hierarchies can be created: a hierarchy over the values of a set, and a level-base hierarchy. See Chapter 3, “Planning a RDM project” on page 53 for more detail.
Hierarchy over the values of a set
Within one set, any number of hierarchies can be built, with no limit on the number of levels. These hierarchies may contain all or some of the values in the set.
Level-base hierarchy
One set can be linked to a field in another set. For example, you have a set of Towns, within which there is a field for adding Province. That field can be linked to another set where all provinces are held, and only provinces from that set can be chosen to appear in the Towns set.
Mapping
Any reference data set can be compared to any other reference data set in a mapping. It is a business decision as to whether there is any benefit in creating a mapping. For example, two sets of product codes might exist in two established systems, both of which actually have the same meaning and must be combined in a data warehouse. A mapping joins two sets and each of the values within those sets can be mapped to an equivalent value in the other set. Not all values must be mapped, and multiple values from one set can be mapped to the same value in the other set.
5.1.2 Types
InfoSphere MDM Ref DM Hub contains types so you can customize your reference data. There are two types: one for sets and another for mappings. The types allow you to configure options for how your sets and mappings behave.
Custom properties
One of the main reasons to use types is to allow custom properties to be defined for sets and mappings. InfoSphere MDM Ref DM Hub has common attributes on both sets and mappings. However, for various kinds of reference data, you might need to define and capture different properties about the reference data. When needed, additional properties can be created for reference data sets and mappings.
The supported data types for custom properties are as follows:
String: Any character data. The contents can be constrained by applying a regular expression to validate the content.
Text: A multiple-line list of characters.
Integer: A valid integer.
Date: A valid date, containing day, month and year.
TimeStamp: A two-part value that consists of a valid date and valid time of day.
Reference data set: A relationship to another reference data set. The range of values taken by this relationship is decided by the reference data set that it points to, such as a relation called hasState from a City set to a State set.
Boolean: A value representing a true or false state.
URL: A valid URL string.
Validation rules
Certain fields in InfoSphere MDM Ref DM Hub can contain validation rules. Regular expressions are used to limit what data can be entered in the fields. Validations are available for the code field and for string type custom properties.
You can use the following example rule for code to limit the field to two uppercase characters:
[A-Z]{2}
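The effect of this rule can be checked with a short sketch. The text does not state which regular expression dialect the hub uses or how it anchors the pattern, so this example assumes that the whole field must match and uses Python's re module purely for illustration. The function name is_valid_code is hypothetical.

```python
import re

# The example rule from the text: exactly two uppercase letters.
CODE_RULE = re.compile(r"[A-Z]{2}")


def is_valid_code(code):
    """Assumption: the rule must match the entire field, not a substring."""
    return CODE_RULE.fullmatch(code) is not None
```

Under this assumption, "US" is accepted, while "us", "USA", and "U1" are rejected.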
5.1.3 Ancillary entities: Format
When exporting sets, InfoSphere MDM Ref DM Hub allows you to define a format for how you want to export the data. The format specifies which subset of attributes to include, the order in which to include them, and any alternative names to use for them in the export. You can use a format one time during the export process, or the format can be persisted for future exports.
Formats are associated with the set and not with a set version, so a format saved against one version of a set is available for all versions of the set.
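An export format of this kind can be thought of as an ordered list of attribute-to-column-name pairs. The sketch below assumes that values are dictionaries and that the output is comma-separated text. Both are illustrative choices, and export_values is a hypothetical helper rather than a product function.

```python
import csv
import io


def export_values(values, fmt):
    """Export values using a format.

    fmt is an ordered list of (attribute, exported column name) pairs, so it
    selects a subset of attributes, fixes their order, and renames them.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([column for _, column in fmt])
    for value in values:
        writer.writerow([value.get(attribute, "") for attribute, _ in fmt])
    return buffer.getvalue()
```

Because the format is independent of any one version, the same fmt list can be applied to the values of every version of a set.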
5.2 InfoSphere MDM Ref DM Hub model design considerations
When working with InfoSphere MDM Ref DM Hub, you must make design decisions about how to model your reference data. Within InfoSphere MDM Ref DM Hub, you can define sets, mappings, and set hierarchies.
5.2.1 Versioning and implicit versus explicit relationships
Relationships between reference data can be established in two ways: through reference data value properties, which use implicit version selection, and through reference data mappings, which use explicit version selection. Through the use of types and properties, relationships can be defined on values that link the values in one set to a value in another set. These are one-to-many relationships: one value can be linked to by many other values in another set. For example, if you have a City set, you can define a property to record what state the city is in: Austin is in the state of Texas. Relationship properties are part of the set containing the property. When you import the set, you can also import the relationship as part of the set.
However, reference data mappings are maintained independently as managed entities. Reference data mappings provide a way for you to define relationships across reference data sets and set values. In addition to the managed entity properties, a mapping has other properties to represent source and target sets.
After a mapping set is created, selecting a source value from a source set and a target value from a target set can create value mappings. The idea of source and target is a business interpretation of which set you are working “from” and “to,” according to which business process is being considered.
Mappings are imported independently from the source and target sets.
Mappings can also contain custom properties so you can add information that can be helpful in deciding how to use the relationships.
Properties on the mapping can also be relationships to other sets. For example, you might have a mapping relationship to another mapping.
5.2.2 Versioning strategy
A good versioning strategy relies on an understanding of the relationships between separate versions and knowledge of the business processes that determine which version to use under what circumstances.
A common issue with the relationships between versioned data is how to determine which version of the data to use. When there is a change to a Country, which version of the Country does a particular State refer to? Logically all the versions of the State are the same entity, however when you are resolving a relationship you need to know which version to use.
The two ways that relationship target versions are determined in InfoSphere MDM Ref DM Hub are implicit and explicit version selection (Figure 5-3).
Figure 5-3 Relationships and versioning showing Implicit versus Explicit
Implicit version selection
With implicit version selection, RDM applies a business rule to determine which version of a target set to use. Implicit version selection is a simpler model to manage. Extra data does not have to be stored identifying specific versions. When new versions of the reference data are made, data stewards do not have to update the relationships to use the new versions.
For example, if you create a new version of a Countries set and approve it to be the new active version, with implicit version selection all the relationships to the country automatically work against the new version. Without this kind of rule, a user must update the relationships to reflect the new version of the data. In the case of a new Countries set version, a change is required to the State set, where the relationship between States and Countries is stored.
Relationships between sets that are defined through properties use implicit version selection. The business rule that is applied is to use the current version. The current version of a set is the one that is in an approved state and is within the applicable start and end dates. If there are multiple approved and active versions, then the most recently modified version is the current version.
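The current-version rule can be written down as a short function. This is a sketch of the rule as stated in the text, not the product's implementation. The VersionInfo record, the literal state name "Approved", and the treatment of the start and end dates as inclusive are assumptions.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass
class VersionInfo:
    label: str
    state: str
    start: date
    end: date
    modified: datetime


def current_version(versions, on):
    """Pick the approved version in effect on the given date.

    If several qualify, the most recently modified one wins.
    Returns None when no version qualifies.
    """
    candidates = [v for v in versions
                  if v.state == "Approved" and v.start <= on <= v.end]
    return max(candidates, key=lambda v: v.modified, default=None)
```

With approved Spring and Summer versions that overlap in May, a lookup for a date in May returns whichever of the two was modified most recently.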
Implicit version selection is also used in subscriptions. A subscription is a relationship between a managed system and one or more reference data sets. With subscriptions, you want to link to reference data in a particular state, such as Approved or Test. For this purpose, the subscription has a lifecycle state attribute that identifies the version that the subscription is for. Because updating subscriptions each time there is a new version of a set is not a task you want to do, InfoSphere MDM Ref DM Hub can be configured to apply this information as a rule. To summarize, with implicit version selection, the system uses a business rule to compute the version that should be used.
The advantages of using implicit version are as follows:
Relationships evolve automatically as the target of the relationship goes through its lifecycle.
Data steward work is simplified and easier to maintain.
The disadvantages of using implicit version are as follows:
A set must be approved before it can be used in a relationship.
Determining which version you are looking at without examining all possibilities is difficult.
Referential integrity rules are needed to prevent inconsistencies.
Explicit version selection
With explicit version selection, a user explicitly specifies which version is used in a relationship. For example, in healthcare, there are mapping rules to map between versions of the diagnostic code standards ICD-9 and ICD-10. Updates to the standards and the mappings between them are released on a regular basis. These mappings must match the version of the standard that they are released for and must not automatically change to any new version that is created. Explicit version selection allows a user to specify the exact version to be used, and that version does not change unless a user explicitly changes it.
In RDM, explicit version selection is used in Reference Data Mappings to identify the source set version and the target set version that is used in the mapping. To summarize, in explicit version, the user explicitly specifies which version of the target set should be used.
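Explicit version selection can be sketched as a mapping record that stores the pkID of one specific source set version and one specific target set version. The Mapping class, its field names, and the placeholder codes used in the usage note are assumptions for illustration. They are not the product's mapping model and not real diagnostic codes.

```python
from dataclasses import dataclass, field


@dataclass
class Mapping:
    """A mapping pinned to exact set versions (illustrative only)."""
    name: str
    source_set_pk_id: int  # one specific version of the source set
    target_set_pk_id: int  # one specific version of the target set
    value_pairs: list = field(default_factory=list)

    def map_value(self, source_code, target_code):
        """Record that a source value maps to a target value."""
        self.value_pairs.append((source_code, target_code))

    def targets_for(self, source_code):
        """Return the target codes mapped from the given source code."""
        return [t for s, t in self.value_pairs if s == source_code]
```

For example, a mapping created as Mapping("Old to new codes", 11, 21) with map_value("S1", "T1") and map_value("S2", "T1") maps two source values to the same target value. Approving a newer version of either set leaves source_set_pk_id and target_set_pk_id untouched until a data steward changes them.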
The advantages of using explicit version are as follows:
A specific version of a set is used for a relationship.
Relationships do not automatically evolve to a new version of the target entity.
The disadvantages of using explicit version are as follows:
Extra information is needed in the model to store the selected version.
There is extra work for data stewards to evolve relationships in source entities when changing the target entity. For example, if there is a change to the Country set, then all explicit relationships to Country must be changed to reflect the new version.
5.2.3 Versioning mapping relationships
Two mechanisms are available to model a relationship in RDM:
Value relationship properties are simple and are managed as part of the reference data; they provide one-to-many relationships; they have implicit version selection.
Mappings are managed independently from the set data to which they refer. Mappings have their own lifecycle and can represent many-to-many relationships; they have their own properties on relationships and use explicit version selection.
Based on your needs you can choose one model or a mix of both.
5.2.4 Version strategy state machines
With InfoSphere MDM Ref DM Hub, although you can create multiple versions of a set or mapping, you might not want to. In some cases, where the reference data is managed outside of InfoSphere MDM Ref DM Hub, you might want to keep only a single version that contains the current reference data from the external source. InfoSphere MDM Ref DM Hub lifecycle governance models provide a way to have a single version for some sets while providing a richer lifecycle for authoring and approvals for others.
The Active Editable governance model is intended to be used when you want only one version of the reference data. The model allows you to edit that version and provides no operations to move it to any other state. The data is active and available and can be updated by data stewards.
The Simple Approval lifecycle model allows different user roles to participate in an authoring process where data can be edited in draft state, and go through an approval step before becoming active. After the version is approved, it becomes read-only.
These processes are fully configurable and you can define additional processes as needed. Although any number of lifecycles or state machines can be defined, be aware during design that a version cannot always be moved backwards (for example, reopened for editing or moved out of retirement).