InfoSphere CDC features and functionality
This chapter describes features and functionality for IBM InfoSphere Change Data Capture (InfoSphere CDC). InfoSphere CDC administrators learn how the different features and functions can be used to customize the replication flow to meet the unique requirements of their environments. InfoSphere CDC operators gain knowledge about how the different features and functions affect the replication process and how that affects operations.
The topics in this chapter include the following:
Transformational capabilities:
This section describes each group of transformation types, including examples of usage. The intent is to make the reader aware of the diverse transformational possibilities available. InfoSphere CDC provides the ability to manipulate data during the replication flow. These transformations can be made against any or all columns included in a single operation. They provide functionality such as table joins, date and string manipulation, and IF / THEN / ELSE logic.
Replication modes:
This section describes each of the modes and provides details specific to each one. The intent is to make the reader aware of the different replication modes and when to use each one. InfoSphere CDC provides three replication modes: two mirrored modes and one refresh mode. The mirrored modes capture data from the transactional logs of the database and then replicate that data downstream to the target. One mirrored mode runs continuously until stopped; the other allows scheduled stops. The refresh mode is an ETL-style mode that pulls data directly from the tables. It can be used to synchronize the data between the source and target tables. Different types of refreshes, including subset and differential refreshes, are covered.
Filtering:
The intent of this section is to make you aware of the filtering capabilities of InfoSphere CDC and how those capabilities can assist you in targeting specific subsets of data for replication. InfoSphere CDC provides functionality to filter data that is replicated at both the row and column level. The row level filtering is similar to the WHERE clause in a SELECT statement in that it can be used to include or exclude rows of information during replication. The column level filtering can be used to select which columns to include or exclude during replication. It can also be used to identify which columns are considered critical and trigger replication of the row or operation.
Apply methods:
This section describes the potential benefits provided by each apply method and when to use them in a replication environment. InfoSphere CDC can apply data to the target tables using six different apply methods: Standard, IBM LiveAudit™, Adaptive Apply, Summarization, Consolidation One to One, and Consolidation One to Many. Each apply method provides unique functionality that allows the user to create not only the standard replication environment but other specialized replication environments
as well.
Conflict detection and resolution:
The last section describes when to use conflict detection and conflict resolution, how to set them up, and the usages for each of the resolution methods. The intent is to make the user aware of how InfoSphere CDC can assist in resolving conflicts between source and target data without compromising the replication flow.
InfoSphere CDC provides Conflict Detection and Resolution functionality that can be integrated directly into the replication flow. This functionality can be used as part of bidirectional replication or when the source and target data might not be consistent. It allows the user to identify specific columns to be analyzed for conflicts and provides five different methods to resolve the conflict.
5.1 Transformations
InfoSphere CDC allows businesses to share data between heterogeneous sources and targets. This heterogeneity often involves sources and targets that have different data structures, so it might require some of the following:
Mapping between different column names
Conversions between data types
Changes in data sizing
Default values
Creating column values through derived expressions
Accessing other tables for additional data
Running user written code for processing outside of InfoSphere CDC
Mapping for differences in naming, type, size, or default values can typically be handled in a straightforward manner in InfoSphere CDC by using the graphical user interface provided by the Management Console (MC). Creating values through derivations or table joins, or running user written code in the form of User Exits, uses InfoSphere CDC transformational capabilities. Some of those capabilities are described in the following examples of what InfoSphere CDC provides:
A diverse list of column functions that can be used along with standard mathematical operators to build derived expressions. Within Management Console, derived expressions can be manually scripted or built using the drag editor. These derived expressions can be saved and reused when mapping other columns.
Access to additional data beyond that provided by the associated log entry. 'Before' source values, 'current' target values, and journal control field values are all available to be mapped directly to target columns or used within derived expressions.
The ability to include additional data by joining to other tables. This ability can be used to include master information from lookup tables, or it can be used to combine rows from multiple source tables into a single target row.
Multiple points within the replication flow to perform user written user exits. These user exits can be used to provide additional and specialized functionality to meet the unique needs of the client. User exits are typically used when a single derived expression is insufficient to generate the
wanted results.
Within InfoSphere CDC, transformations are performed as part of the replication flow. The transformations can be performed at the source (derived column) or at the target (derived expression). This flexibility allows the user to determine where to expend the additional processing impact required for the transformation.
5.1.1 Column functions
InfoSphere CDC provides a number of column functions that can be used within derived expressions when mapping target columns. These column functions help perform low-level data transformations during the flow of replication.
The column functions can be categorized into the following six functional groups:
String functions can manipulate strings of text. They include functionality such as concatenation, substring, and left and right trimming. Here are examples
of concatenation:
 – Denormalization of data by combining multiple columns of data into report or query friendly formats:
%CONCAT(%RTRIM(FIRSTNAME), ' ', MI, ' ', %RTRIM(LASTNAME))
 – Retrieving subsets of data from within long character string columns, such as acquiring the person's first initial:
%SUBSTRING(FIRSTNAME, 1, 1)
 – Minor cleaning of text data to remove trailing blanks or spaces from fixed-length source columns to variable length target columns:
%RTRIM(FIRSTNAME)
Date and time functions are used to manipulate date and time values. They can add a two-digit century specification to a date, or retrieve the current date and time. Here are some simple examples:
 – Acquiring current date:
%CURDATE("*GMT")
 – Creating a time stamp field by combining date and time fields together:
%TODATE(DATE, "*YMD", TIME)
Conversion functions are used to manipulate data types. They can convert between character, numeric, and date values. Here is an example:
 – DIVISION (VARCHAR 10) column contains 'US2003'. The data needs to be manipulated to add 1000 to the number value in the last four characters, resulting in 'US3003'. The transformation would be:
%CONCAT(%SUBSTRING(DIVISION,1,2), %TOCHAR((%TONUMBER(%SUBSTRING(DIVISION,3,4)) + 1000),4))
Conditional and variable functions provide flexible logic. Use the %IF column function to evaluate an expression and return different results. Use the %VAR function to declare a new variable, assign a value to it, or retrieve the value of an existing variable. Here is an example:
 – Load the SALARY_LEVEL column with the values 'LOW', 'MID', or 'HIGH' based upon the value in the SALARY column. The transformation
would be:
%IF(SALARY < 30000, 'LOW', %IF(SALARY < 60000, 'MID', 'HIGH'))
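The nested %IF logic above maps directly onto ordinary conditional code. The following Python fragment is an illustrative analogue only (it is not InfoSphere CDC syntax and is not part of the product):

```python
def salary_level(salary):
    # Same banding as the nested %IF expression above:
    # below 30000 -> LOW, below 60000 -> MID, otherwise HIGH
    if salary < 30000:
        return 'LOW'
    if salary < 60000:
        return 'MID'
    return 'HIGH'
```

Note that, as in the %IF expression, the boundary values 30000 and 60000 fall into the next band up because the comparisons are strictly less-than.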
Data functions are used to provide additional data beyond that provided by the actual log entry. They provide information such as before or current values. They can join to other tables to acquire additional data values. Here is an example:
 – A client had problems with hackers changing grades on a school database. They resolved the issue by replicating the before image of a grade back into the grade column if the column was changed by a user other than a special admin user ID. Effectively, as fast as the hackers changed the grades, the grades were restored to their previous value. The following transformation was used:
%BEFORE(GRADE)
User exit functions are used to call user written code from within a derived expression. Depending upon the InfoSphere CDC platform installed, these functions could be stored procedures or user functions, such as Java code. The following transformation could be used:
%STPROC('CUSTNAME', CUSTOMER_ID)
Two examples of its use could be:
 – Passing information to a secondary table not targeted by InfoSphere CDC but based upon an event encountered during InfoSphere CDC replication, such as suspected erroneous data being sent to a holding table.
 – Triggering a database event based upon input received during an InfoSphere CDC replication. In this example, a client wanted to perform a backup of certain tables immediately following the complete replication of data from a nightly batch job. The customer updated a record in a status table as the last operation of the batch job. InfoSphere CDC replicated the update for the status table, which resulted in the user exit being initiated. The user exit performed the backup and various
other functions.
5.1.2 Journal control fields
Journal control fields provide information about the log entry on the source system. When a change is made on the source system, the database records it in a log entry that contains the changed data, the type of change, who made the change, and when the change was made. When a relevant log entry triggers a replication event to the target system, InfoSphere CDC replicates the changed data along with the extra log entry information available through journal control fields.
InfoSphere CDC provides many journal control fields that carry log entry information from your source systems. The journal control fields can be:
Mapped directly to columns on the target system. Here are examples:
 – Map &TIMSTAMP and &USER to target columns to show when the original transaction took place and who made it.
 – Map &CCID (commit cycle ID) to target columns to identify transactions that belong together in a single unit of work or commit group.
Used in row level filtering to identify log entries to include or omit:
&JOB = 'PURGE001'
Used in derived expressions:
%IF(%SYSTEM = 'USHDQR', 'Atlanta', 'Toronto')
5.1.3 Joining
InfoSphere CDC provides a column function (%GETCOL) that allows you to join tables to retrieve the value of a column for a specific row. This function allows you to retrieve and use information not found in the actual log entry.
 
Joins: Joins performed at the source occur at the time InfoSphere CDC reads the transaction log. The data retrieved by the join matches the current values in the join table and might not be the same values as when the original log entry occurred.
You can use the %GETCOL function in expressions to perform the
following operations:
Obtain columns from one or more keyed secondary tables and join them with an existing primary table before sending the data to the target. The primary table refers to the source table being replicated. The secondary tables refer to tables referenced in the %GETCOL function.
Specify how keys of the secondary tables are populated, to allow InfoSphere CDC to perform the necessary secondary reads.
Use columns from secondary tables that were retrieved previously to populate keys for subsequent reads (the population of key column values is not restricted to primary columns only).
Specify the order that table reads are performed.
Condition the table reads that are performed.
Read tables external to InfoSphere CDC to perform dynamic translations on the target.
There are two syntax formats for %GETCOL:
The long syntax format is used to read a table and return the value of the column specified, based on the key column values that are identified. If more than one row satisfies the key requirements specified, then this function returns the first row only. If the read is unsuccessful, then this function returns the default value specified.
The short syntax format is used to return the value of the specified column from a row retrieved by a previous %GETCOL function invocation. The short syntax lets you retrieve more than one column from a table (that was read previously using the %GETCOL function), without reading the table again. The previous %GETCOL function invocation must be for the same log entry during continuous mirroring or the same row during refresh. Here are some examples:
 – A client wants to denormalize their target table by joining information from various tables. They use their detail table as the main table to be replicated. They then perform various joins to gather information not included in the detail record, such as customer name, address, and telephone number.
 – A client wants to keep records on a target table synchronized with changes made to master tables on a source database. In this case, the main table is replicated by InfoSphere CDC using standard apply and joins to gather the additional data from the master tables. Anytime a change to the main table is replicated, the joins are performed to gather the additional data from the master tables. The master tables are replicated to the target table, but use the Consolidation One to Many apply method. As changes to the master table are replicated, InfoSphere CDC updates as many records as necessary in the target table to correspond to the new master table values.
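The long and short %GETCOL syntaxes amount to a keyed read of a secondary table, with the fetched row cached so that follow-up column retrievals avoid a second read. The following Python sketch illustrates only that pattern; the table data, function names, and caching mechanism are invented for illustration and do not reflect InfoSphere CDC internals:

```python
# Secondary table keyed on CUSTOMER_ID (illustrative data)
customers = {
    1001: {"NAME": "Acme Corp", "CITY": "Atlanta", "PHONE": "555-0100"},
}

_last_row = {}  # cache of the most recent successful read, per table


def getcol_long(table, key, column, default):
    """Long-syntax analogue: read the keyed table once and return the
    requested column, or the default value if the key is absent."""
    row = table.get(key)
    if row is None:
        return default
    _last_row[id(table)] = row  # remember the row for short-syntax reads
    return row[column]


def getcol_short(table, column, default):
    """Short-syntax analogue: reuse the row fetched by the previous
    long-syntax call instead of reading the table again."""
    row = _last_row.get(id(table))
    return row[column] if row and column in row else default


# Denormalize a detail record with customer name and city
detail = {"ORDER_ID": 7, "CUSTOMER_ID": 1001}
name = getcol_long(customers, detail["CUSTOMER_ID"], "NAME", "UNKNOWN")
city = getcol_short(customers, "CITY", "UNKNOWN")
```

The design point mirrors the product behavior described above: the second column comes from the cached row, so the secondary table is read only once per log entry.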
5.1.4 User exits for customizations
A user exit is a processing point at which a user written program can be started. The program itself is called a user exit, and it can return a result. This mechanism allows you to define operations that need to be performed at the user exit point. The logic defined in the user written program determines the actions that occur.
 
User exit examples: User exit examples are further described in 9.7, “User exits” on page 369.
User exits can be used to:
Define data transformations for a specific working environment
Handle business requirements not handled by the predefined column functions provided by InfoSphere CDC
Perform operations or provide notification in response to specific product conditions
A user exit lets you define a subroutine that InfoSphere CDC can start when a predefined event occurs. The following two types of user exits are available:
Table-level user exits run a set of actions before or after a database event occurs on a specified table.
Subscription-level user exits run a set of actions before or after a commit event occurs on a specified subscription.
Subscription-level user exits can work in tandem with table-level user exits to track which table-level user exits were started during a transaction. The subscription-level user exit could then use that information to apply actions based on the tables in the transaction.
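The cooperation between the two levels can be sketched as follows. This Python fragment is a hypothetical illustration of the idea only; the function names and interfaces are invented and are not the actual InfoSphere CDC user exit API:

```python
# Table-level exits record which tables saw operations in the current
# transaction; the subscription-level exit inspects that record at commit.
fired_tables = []


def after_insert_table_exit(table_name, row):
    # Table-level user exit: runs after an insert on a specific table
    fired_tables.append(table_name)


def after_commit_subscription_exit():
    # Subscription-level user exit: runs after the commit and can apply
    # actions based on the set of tables touched during the transaction
    touched = sorted(set(fired_tables))
    fired_tables.clear()  # reset for the next transaction
    return touched


after_insert_table_exit("ORDERS", {"ORDER_ID": 7})
after_insert_table_exit("ORDER_LINES", {"ORDER_ID": 7, "LINE": 1})
touched = after_commit_subscription_exit()
```

In a real deployment, the subscription-level exit might use the list of touched tables to trigger a downstream job only when specific tables were part of the unit of work.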
5.1.5 Considerations for using transformational functionality
Some things to consider when using InfoSphere CDC transformational functionality are:
Depending on the InfoSphere CDC engine you have installed, some column functions might not be available.
Column function syntax may differ between platforms. For example, the syntax for %GETCOL is different for the IBM i platform than the one used by DB2 on Linux, UNIX, or Windows.
Names for column functions are not case-sensitive.
For some column functions, large object (LOB) columns cannot be specified as a function parameter.
Character literals can be specified in their internal numeric representation, as parameters for column functions. To accomplish this task, use the double-angled bracket notation (<< >>). This notation allows both printable and non-printable characters to be used. Specifying character values as decimal integers can represent either American Standard Code for Information Interchange (ASCII) or Extended Binary Coded Decimal Interchange Code (EBCDIC) characters.
5.2 Replication modes
InfoSphere CDC provides three modes of replication:
Refresh
Continuous
Scheduled End (Net Change)
These modes provide flexibility about how and when to replicate. Refresh replication provides bulk copy functionality. Continuous and Scheduled End replication provide log-based mirroring functionality.
5.2.1 Refresh
The InfoSphere CDC refresh operation is the replication mode used to capture a complete copy of the data in the source table and transfer that data to the target table. Each refresh applies all features of replication (such as row and column filtering, column mappings, and so on) during the transfer.
Because refresh is an operation applied to the source table, the manner in which refreshed data is applied to the target table depends on the type of table apply method (such as standard and audit). Before the source table data is transferred, the target table may be cleared and its indexes dropped; the indexes are rebuilt following the refresh. If the target table is cleared, the data in the source and target tables is synchronized after completion of the refresh.
Typically, a refresh is performed only once to initialize target tables by transferring all the data in the source table or if configuration changes are made that require a refresh. Mirroring is then used to replicate changes as they occur. A refresh can be used to provide snapshot images of tables for use in reporting or for tables that are changed by batch loads only. Refreshes can be scheduled to run at specific intervals or times.
Some factors that determine the amount of time required to complete a
refresh are:
The size of the tables to be replicated and the number of columns and rows selected for replication
The width of the rows in number of columns
The existence of Large Object (LOB) data types
The amount of processing involved with replication (such as derived expressions and translations)
The speed and bandwidth of communications lines
The number of indexes on the target table
A refresh can be canceled at any time, but canceling should be done only when it is necessary to interrupt the refresh. A controlled shutdown ends the process after the current table finishes its refresh activities. A controlled shutdown is the suggested method to end a refresh process. An immediate shutdown ends refresh processes without completing the refresh. This action might require the refresh to be redone to guarantee data consistency between the source and target tables.
Tables can become out of synchronization for various reasons, including the following issues:
Parked tables: If a table is parked from replication to make changes (such as updating the definition of a source table), and the changes on the source table are no longer being replicated, the target table may become inconsistent with the source table.
Configuration changes: A refresh might be necessary when a set of subscriptions is promoted from a test environment to a production environment. The promotion operation may add new transformations or other table mapping changes that require the source and target tables to be refreshed to prepare for mirroring.
Maintenance operations: For large bulk SQL operations performed during maintenance windows on the source table that affect a majority of the rows, it might be faster to resynchronize using refresh. Refresh might be faster than mirroring to replicate millions of changes due to the ability to bulk load rows into the target database.
InfoSphere CDC refreshes all of the flagged tables within a single subscription as one sequential operation that runs to completion. Each table is refreshed individually one at a time until all flagged tables have finished refreshing. Refresh is an operation that applies to a single subscription, so while one subscription is refreshing, other subscriptions are not affected. They may continue mirroring data for different tables or refreshing tables as required. To perform a parallel refresh, multiple subscriptions can be used.
InfoSphere CDC offers two primary types of refresh operations: Standard Refresh and Differential Refresh.
A standard refresh results in a complete copy of the data in a source table being sent to the target table. This type of refresh is typically performed to bring the entire source and target tables into synchronization.
A differential refresh updates the target table by applying only the differences between it and the source table. This type of refresh is typically performed when the target table is already synchronized with the source table. With a differential refresh, you can choose to perform a refresh only, refresh and log differences, or log differences only.
The order in which data is retrieved from the database during a refresh depends on the type of refresh performed. During a standard refresh, no ORDER BY sort is used; the database determines the order in which the data is returned. During a differential refresh, InfoSphere CDC queries the database using an ORDER BY sort on the table keys chosen in the table mapping to sort the source and target tables and determine their differences.
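The key-ordered comparison used by a differential refresh can be sketched as a merge over two row sets sorted on the same key. The following Python fragment is a simplified illustration of that technique, not the product's implementation:

```python
def diff_actions(source_rows, target_rows, key):
    """Walk two lists of dict rows, each sorted ascending on `key`, and
    emit the insert/update/delete actions needed to make the target
    match the source (a sketch of the key-ordered compare)."""
    actions, si, ti = [], 0, 0
    while si < len(source_rows) or ti < len(target_rows):
        s = source_rows[si] if si < len(source_rows) else None
        t = target_rows[ti] if ti < len(target_rows) else None
        if t is None or (s is not None and s[key] < t[key]):
            actions.append(("insert", s))   # row missing on the target
            si += 1
        elif s is None or t[key] < s[key]:
            actions.append(("delete", t))   # extra row on the target
            ti += 1
        else:
            if s != t:
                actions.append(("update", s))  # same key, row differs
            si += 1
            ti += 1
    return actions
```

Because both inputs are key-ordered, each table is scanned exactly once, which is why the ORDER BY sort on the mapped keys matters to this style of comparison.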
When a refresh is performed with multiple tables, the order in which each individual table is refreshed is based on the group order, as set in the Refresh Order option. If no Refresh Order is set, then tables are refreshed in alphabetical order. After a refresh has successfully completed, the subscription can be restarted for mirroring. InfoSphere CDC then processes the backlog of changes.
If Referential Integrity (RI) is in effect on the target tables being replicated to, it is important to set the refresh order. If the tables are refreshed in the incorrect order, database failures can result when InfoSphere CDC tries to rebuild the foreign key constraints. It is also important to make sure all tables that make up the RI are part of the same InfoSphere CDC subscription.
Here are examples of refresh usage:
Perform initial synchronization of the source and target tables.
Restore synchronization following the loss of data integrity on the target table.
Identify differences between the source and target table using
differential refresh.
Subset refresh
Both a standard refresh and a differential refresh can be further refined through the use of a WHERE clause to include only rows within a specified range. This capability is useful for tables where only the most recent data requires a refresh. It requires that one of the following conditions be met:
The table capture point has been set, either explicitly or through the table having been mirrored.
The scraping point for the subscription has been set (using the SETLOGPOS command or dmsetbookmark command where applicable).
Here are examples of when to use subset refreshes:
When the source table's size precludes refreshing the entire table in
one process
When only the most recent data needs to be refreshed
When a particular subset of the data needs to be refreshed, such as a particular customer or part
Differential refresh
A differential refresh updates the target table by applying only the differences between it and the source table. Instead of the target table being cleared at the start of the refresh and repopulated with data, the differential refresh compares each row in the target table with each row in the source table to identify missing, changed, or additional rows. The primary advantage of the differential refresh is that the target table stays online during the refresh operation.
There are three possible methods for the differential refresh:
Refresh Only: Performs a differential refresh by changing any target rows that differ from the source rows.
Refresh and Log Differences: Performs a differential refresh and also creates a log table in the target replication engine metadata to track all changes during the refresh. The log table is identical to the target table, with the addition of a column to indicate the actions taken during the refresh, such as inserting a row, deleting a row, or updating a row. For an update, both the source and target row images are logged. This log table is created in the same database and table space as the TS_CONFAUD table (or DMMD_DMCONFAUD on z/OS), with the same owner as the metadata. The name of the log table is created by combining the subscription name, the target table name, and the refresh start date and time.
Only Log Differences: Creates and populates a log table in the target replication engine metadata to identify all differences between the source and target tables. The target table is not updated. This method allows you to evaluate what the differences are between the target and the source. If you then decide to refresh the table, you can go back to the subscription and select Refresh Only to update the target table or update the target table manually based on the contents of the log table.
Performing a differential refresh has some requirements and restrictions:
Differential refresh is only available for tables that use Standard replication.
The collation sequence of the source and target tables must be identical.
Derived columns on the source table are not supported.
Any target columns that are mapped to derived expressions, constants, or journal control fields are ignored.
The key columns of the target table must be mapped directly to columns on the source table.
Differential refresh sends all the data from the source to the target to perform the compare. Clients might want to consider combining the differential refresh with subset refresh to reduce the impact and timing of a full differential refresh. The differential refresh can also be performed in its own subscription, minimizing the impact to existing subscriptions.
5.2.2 Continuous mirroring
Continuous mirroring replicates changes to the target on a continuous basis. It is used when business requirements dictate that replication run continuously, without a clearly defined point at which replication must end.
Continuous mirroring can be ended with the following options:
Normal: Completes in-progress work and then ends replication. This option might take some time if there are in-progress transactions. The subscription ends in a confirmed and stable state.
Immediate: Stops all in-progress work and then ends replication. Starting replication after using the Immediate option can be slower than using the Normal option.
Abort: Stops all in-progress work and then ends replication rapidly. This option is the fastest way to end replication. Starting replication after using the Abort option can be much slower than using the Normal option.
Scheduled End: Processes all committed changes to the indicated point in the database log and then ends replication normally. You can use the current date and time (Now), a specified date and time, or a specific log position.
5.2.3 Scheduled end (net change)
Scheduled end (net change) mirroring replicates changes to the target up to a user specified point in the source database log and then ends replication. It is used when business requirements call for periodic data replication with a clearly defined endpoint for the state of the target database.
Scheduled end (net change) mirroring allows replication to end based upon the following points in your source database log:
Current time (or now): This option can be used for subscriptions that are latent. The subscriptions replicate from their bookmark position up to the log position at the time the subscription is started. This option is useful when working with subscriptions that you want to guarantee have finished processing the logs before ending.
User specified date and time: This option guarantees that the subscription processes all transactions up to a certain time of the day. Typically this option is used when the user wants to maintain a set amount of latency between the source and target. An example is a reporting instance that is kept 24 hours behind the production database.
User specified log position: This option allows the user to replicate data from one log position to another. An example would be where the user needs to recover transactions up to a certain log position.
These user specified end points ensure that the target database is in a known state when replication ends.
Ending replication allows preparation for transitional activities in business environments and allows moving to the next step in the business process.
5.3 Filtering
InfoSphere CDC provides filtering at both the row and column level. Row level filtering can be used to include or omit rows in the replication flow. Critical columns can be used to filter rows based upon changes to specific columns. Column level filtering can be used to limit which columns are replicated.
5.3.1 Row level
Row level filtering supports both standard InfoSphere CDC and SQL SELECT WHERE expressions. Standard InfoSphere CDC expressions include column functions, arithmetic and Boolean operators, column names, and journal control fields. All row selection expressions must return a Boolean result.
The following are valid row selection expressions:
1 = 1
This expression always returns a true result, so either all rows or no rows are replicated, depending on whether the filter includes or excludes matching rows. The Boolean constants TRUE and FALSE are not supported. A simple comparison such as NAME = 'Monica Flanagan' is also a valid expression.
(SALES < 10000) OR (SALES > 90000)
Use parentheses to group Boolean expressions. Short forms, such as SALES < 10000 OR > 90000, are not allowed. In this case, you must specify
SALES twice.
NOT((AIRPORT = 'ATL') OR (AIRPORT = 'CLT'))
Short forms, such as NOT(AIRPORT = 'ATL' OR 'CLT'), are not allowed.
%RTRIM(DEPT) = 'IT'
Column manipulation functions can be used in row selection expressions, and they can be applied to any operand in a Boolean expression.
PRINCIPAL *(1 + INTRATE) > 20000
Basic numeric operators, such as multiplication (*) and addition (+), can be included in row selection expressions.
%IF(COUNTRY = 'US', PRICE, PRICE * 1.2) > 50
The %IF column function cannot itself return a Boolean result, so the result that is returned by the function must be compared with another value in the row selection expression.
STATE IN ('CA', 'AZ', 'AL', 'GA', 'CT')
The IN operator is valid in SQL SELECT WHERE clauses. This example can be used when working with an InfoSphere CDC product that supports this type
of expression.
&JOB = 'QTRPURGE'
Journal control fields can be used to identify transaction-specific information. In this scenario, &JOB can be used to identify a specific job and then include or omit records pertaining to the job.
InfoSphere CDC provides functionality to declare critical columns. You can use critical columns to control the updates replicated and reduce the workload on the network and target database. When you select a column as critical, InfoSphere CDC only replicates update operations when any critical column value
has changed.
Here are examples of critical column filtering:
Replication is wanted only when the account balance is updated in the customer account table. In this scenario, the CUST_ACCT_BAL column is selected as a critical column. InfoSphere CDC only replicates this row when there are updates made to the CUST_ACCT_BAL column.
Period end processing reads all of the records in a very large table and updates a time stamp field. If the time stamp field is set as a critical column, InfoSphere CDC replicates the entire table, row by row. By making the time stamp field non-critical, that situation is avoided.
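The critical-column decision amounts to comparing the before and after images of the row on the flagged columns only. The following minimal Python sketch illustrates the idea; the function name and row representation are invented for illustration and are not product code:

```python
def should_replicate_update(before, after, critical_columns):
    # Replicate the update only if at least one critical column changed
    return any(before.get(c) != after.get(c) for c in critical_columns)


# Period-end job touches only the (non-critical) time stamp column
before = {"CUST_ACCT_BAL": 100.0, "LAST_TOUCHED": "2011-01-01"}
after = {"CUST_ACCT_BAL": 100.0, "LAST_TOUCHED": "2011-01-02"}
skipped = not should_replicate_update(before, after, ["CUST_ACCT_BAL"])
```

Here the time stamp change is skipped because CUST_ACCT_BAL, the only critical column, is unchanged; a balance update would still replicate.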
5.3.2 Column level
By default, InfoSphere CDC replicates all mapped and unmapped source columns to the target table, because unmapped source columns could be used on the target in derived expressions or by a user exit. If a source column should be excluded from replication, it can be cleared on the Filter tab in Management Console. When column level filtering is configured, a change that affects only non-replicated columns is dismissed by InfoSphere CDC, saving bandwidth.
Examples of when columns might be excluded are:
The target table contains only 20 columns of the 150 columns in the source table. By selecting only the 20 columns required, the amount of data replicated to the target is reduced, saving network bandwidth and
processing impact.
The source table contains confidential information such as social security numbers and salary. The target table should not have access to this information, so these columns are excluded from replication.
Changes might occur only to columns excluded by the column level filter. In this case, InfoSphere CDC does not replicate the change, saving bandwidth.
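The column level filtering behavior described above can be sketched as a projection. The column set below is hypothetical, standing in for the 20-of-150 subset in the first example; this is not product code:

```python
# Sketch of column level filtering: only the configured subset of columns
# is sent downstream, and a change touching nothing but excluded columns
# (for example, SALARY) is dismissed entirely.
REPLICATED_COLUMNS = {"CUST_ID", "NAME", "CITY"}  # hypothetical subset

def project(row):
    return {c: v for c, v in row.items() if c in REPLICATED_COLUMNS}

def change_is_visible(before, after):
    return project(before) != project(after)

before     = {"CUST_ID": 1, "NAME": "Smith", "CITY": "Atlanta", "SALARY": 50000}
raise_only = dict(before, SALARY=55000)   # excluded column only: dismissed
moved      = dict(before, CITY="Charlotte")  # replicated column: sent

print(change_is_visible(before, raise_only))  # False
print(change_is_visible(before, moved))       # True
```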
5.4 Apply methods
InfoSphere CDC provides six apply methods that determine how replicated data is applied to the target table. These methods are Standard, LiveAudit, Adaptive Apply, Summarization, Consolidation One to One, and Consolidation One to Many.
5.4.1 Standard
Under Standard replication, InfoSphere CDC replicates the effect of the source operation to the target table. A row insert operation on the source table would result in an inserted row on the target table, and so on.
Management Console provides two mechanisms for mapping using Standard replication.
One-to-One Mappings: Map multiple source tables to multiple target tables where the source and target tables share a table structure and similar
table names.
One table mapping (Standard): Map one source table to one target table using standard replication. These tables can have different table structures and names.
Standard replication can be used for scenarios such as the following:
Synchronize data between two databases to allow two different environments to work on the same data. This scenario can be used during parallel testing
of applications.
Provide a subset of source data to a reporting instance. This scenario allows reports to be run against the reporting instance and frees up resources on the source system.
Stream data from source systems to a dynamic ODS. This scenario allows nightly ETL jobs to be performed at the ODS system and frees up the source systems to run 24x7.
Consolidate data from multiple source systems into a single target system. This scenario provides a single view of the data and reduces the number of systems required to be accessed for reporting.
Distribute data from one source system to multiple target systems. This scenario could allow subsets of data to be distributed to target systems based upon filtering criteria.
5.4.2 LiveAudit
LiveAudit replication provides an audit trail of source table operations. When data is replicated using LiveAudit, the target tables contain rows that track insert, update, delete, and clear (truncate) operations applied to the mapped
source table.
LiveAudit maintains an audit table containing a row for each source operation. Additional information is included in the target row, such as operation type
(insert / update / delete), time stamp for the transaction, and user that created the transaction. Journal control fields can be mapped to additional target columns to meet specific business needs.
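The audit-table behavior can be sketched as an append-only apply: every source operation becomes a new target row carrying the journal control information. This is an illustration only, with simplified field names, not the product's actual table layout:

```python
import datetime

AUDIT_TABLE = []  # stands in for the target audit table

def live_audit_apply(operation, row, user):
    # Each source operation appends a new audit row; nothing in the
    # audit table is ever updated or deleted by replication itself.
    AUDIT_TABLE.append({
        "ENTRY_TYPE": operation,  # e.g., insert / update / delete code
        "TIMESTAMP": datetime.datetime.now(datetime.timezone.utc),
        "USER": user,
        **row,
    })

live_audit_apply("UP", {"CUST_ID": 7, "CUST_ACCT_BAL": 250.0}, "batchusr")
live_audit_apply("DL", {"CUST_ID": 7, "CUST_ACCT_BAL": 250.0}, "batchusr")
print(len(AUDIT_TABLE))  # 2
```

Because rows are only ever appended, the table grows with every source operation, which is why the disk space caution below applies.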
With LiveAudit, target tables have the potential to grow to be large. Sufficient disk space must be allocated or regular maintenance must be performed to accommodate large subscription tables that are being used for auditing.
Management Console provides two mechanisms for mapping using
LiveAudit replication.
LiveAudit Mappings: Map multiple source tables to multiple target tables using LiveAudit replication.
One table mapping (LiveAudit): Map one source table to one target table using LiveAudit replication. This mechanism allows greater granularity of the options available for LiveAudit mappings.
Examples of where LiveAudit replication can be used are:
Regulatory requirements to keep audit trails of all operations against the data.
Change tracking of sensitive information.
Providing operations processed against source data to downstream applications. For example, an ETL application can use the time stamp values to determine which operations to process to ensure that target data is consistent with a specific point in time. The operation type is used to determine what type of operation needs to be processed.
Providing test data operations. An application can be used to process the audit data and generate a “replay” of the operations against a
test environment.
5.4.3 Adaptive Apply
Adaptive Apply replication provides flexibility when replicating operations to target tables that may not be synchronized with the source tables. With Standard replication, the database operations depend upon the absence (inserts) or existence (updates and deletes) of the target rows. An insert operation replicated from the source results in a database error if a matching target row is found (duplicate key error). With Standard replication, this error causes the subscription to end.
Adaptive Apply replication provides upsert functionality. For a scenario where a matching target row is found, the insert operation converts into an update operation. Replication does not encounter an error and continues. When an update operation cannot find a matching target row, the operation is converted to an insert operation. When a delete operation cannot find a matching target row, the operation is ignored.
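The conversions described above can be summarized in a short sketch. The target table is modeled as a dictionary keyed by primary key; this is an illustration of the upsert semantics, not product code:

```python
# Sketch of Adaptive Apply upsert behavior.
def adaptive_apply(target, operation, key, row):
    if operation == "insert":
        target[key] = row        # becomes an update if the row already exists
    elif operation == "update":
        target[key] = row        # becomes an insert if the row is missing
    elif operation == "delete":
        target.pop(key, None)    # ignored when no matching row exists

target = {}
adaptive_apply(target, "update", 1, {"ID": 1, "QTY": 5})  # no row: inserted
adaptive_apply(target, "insert", 1, {"ID": 1, "QTY": 9})  # row exists: updated
adaptive_apply(target, "delete", 2, {})                   # no row: ignored
print(target)  # {1: {'ID': 1, 'QTY': 9}}
```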
Examples of where Adaptive Apply replication can be used are:
When external applications modify target tables independent of the
source tables.
When you are restoring the contents of a subscription table from recorded log entries. By setting the log position to a specific entry or point in time, Adaptive Apply can be used to populate an empty target table so that it contains the latest data.
5.4.4 Summarization
Summarization replication allows you to maintain numerical totals in selected target table columns. Under summarization, the target table is a repository of numerical data that has been accumulated or deducted in response to source row level operations transferred by refresh or mirroring activity.
Accumulation ensures that numeric changes applied to the target column are directly proportional to changes applied to the corresponding source columns. Deduction ensures that numeric changes applied to the target columns are inversely proportional to changes applied to mapped source columns.
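The accumulation and deduction behavior can be sketched as follows: for each source operation, the delta between the before- and after-images of the mapped column is added to (accumulation) or subtracted from (deduction) the running total. This is an illustrative sketch, not product code:

```python
# Sketch: maintain a running total in a target column.
def summarize(total, before_value, after_value, mode="accumulate"):
    delta = after_value - before_value
    return total + delta if mode == "accumulate" else total - delta

total = 1000
total = summarize(total, 0, 250)    # source insert of 250  -> 1250
total = summarize(total, 250, 100)  # source update 250->100 -> 1100
print(total)  # 1100
```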
Examples of where Summarization replication can be used are:
Summarization replication could be used to simplify accounting, statistical, and other business operations that require intensive addition and subtraction of numeric data. Instead of applying resource-intensive programs to determine totals from an accumulated stack of transactions, aggregates can be maintained in a target table as source row level operations are replicated. As a result, applications designed to generate more complex calculations for reports and other purposes can work directly with the currently
maintained totals.
Summarization replications can also be used to create a soft delete environment. See 5.4.6, “Soft deletes” on page 95 for more information.
5.4.5 Row consolidation
InfoSphere CDC provides two types of row consolidation: Consolidation One to One and Consolidation One to Many. Row consolidation allows flexibility with row ownership of the target table.
Using Standard replication, data warehousing can be implemented by configuring a target table to receive rows from multiple source tables. Each row in the target table can only be inserted, updated, or deleted by one of the source tables contributing data to the warehouse. Each row is effectively “owned” by a source table.
Row ownership by a source table can be too restrictive in some environments. When multiple source tables need to be merged to create a single target table row, row ownership does not work. Consolidation One to One replication allows this merger without the row ownership issues.
Consolidation One to One allows merging different information about a common entity, such as a person, a customer, or a product part, into a single row on the target table. It is intended for environments where information about the entity is scattered across different tables, databases, or servers, but must be centralized to facilitate report generation, data management, data security, and other business objectives and activities.
Consolidation One to Many allows data changes to a lookup table to be applied to all target rows affected by the change. A single change on the source lookup table might require hundreds, thousands, or more rows on the target table to be changed. Consolidation One to Many replication gives you the ability to keep the target rows current with changes not only against the primary source table, but also changes made to secondary tables used to provide
supplemental information.
Consolidation One to Many reacts to operations in the following manner:
Inserts: Inserts to the lookup table do not generate operations on the
target table.
Updates: Updates to the lookup table generate updates to rows matching the consolidation key value.
Deletes: Deletes to the lookup table do not generate operations on the
target table.
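The fan-out behavior above can be sketched as follows. A lookup-table update is applied to every target row sharing the consolidation key, while inserts and deletes on the lookup table generate no target operations. Column and table names here are hypothetical:

```python
# Sketch of Consolidation One to Many: one lookup-table change fans out
# to all target rows that match the consolidation key value.
def apply_lookup_change(target_rows, operation, key_column, key_value, new_values):
    if operation != "update":
        return 0  # inserts and deletes on the lookup table are ignored
    changed = 0
    for row in target_rows:
        if row[key_column] == key_value:
            row.update(new_values)
            changed += 1
    return changed

rows = [
    {"ORDER": 1, "PART": "A1", "PART_DESC": "old"},
    {"ORDER": 2, "PART": "A1", "PART_DESC": "old"},
    {"ORDER": 3, "PART": "B2", "PART_DESC": "other"},
]
print(apply_lookup_change(rows, "update", "PART", "A1", {"PART_DESC": "new"}))  # 2
```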
5.4.6 Soft deletes
InfoSphere CDC provides support for soft deletes. A soft delete is an operation where the target row is not deleted following a delete operation on the source row. Instead, the target row is inserted if it does not exist, or updated if it does. A flag field or entry type field is also updated to indicate that this row was deleted from the source.
All Databases - Soft deletes can be configured for tables using the Summarization apply type by completing the following steps:
1. Choose Summarization as the apply type.
2. Add a flag column on the target table (varchar(2)).
3. Do not select any columns for summarization.
4. Map the journal control field &ENTTYPE to the flag column.
Operations on the source row result in the following actions on the target row:
Insert: InfoSphere CDC attempts to update the row. If the row does not exist, InfoSphere CDC inserts it, and PT is placed in the flag column.
Update: InfoSphere CDC attempts to update the row. If the row does not exist, InfoSphere CDC inserts it, and UP is placed in the flag column.
Delete: InfoSphere CDC attempts to update the row. If the row does not exist, InfoSphere CDC inserts it, and DL is placed in the flag column.
Insert using deleted row key values: The delete flagged target row is overwritten with the values from the insert and PT is placed in the
flag column.
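Taken together, the four steps and the resulting actions amount to an unconditional upsert that stamps the flag column with the journal entry type. The sketch below illustrates that behavior using the PT, UP, and DL values from the text; it is not the product implementation:

```python
# Sketch of soft delete behavior: every source operation becomes an
# upsert on the target, and the flag column records the entry type
# (&ENTTYPE values PT / UP / DL, as described above).
def soft_delete_apply(target, key, row, entry_type):
    target[key] = dict(row, FLAG=entry_type)  # update if present, else insert

target = {}
soft_delete_apply(target, 7, {"CUST_ID": 7, "BAL": 100.0}, "PT")  # insert
soft_delete_apply(target, 7, {"CUST_ID": 7, "BAL": 250.0}, "UP")  # update
soft_delete_apply(target, 7, {"CUST_ID": 7, "BAL": 250.0}, "DL")  # source delete
print(target[7]["FLAG"])  # DL -- the row survives, flagged as deleted
```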
5.4.7 Custom apply methods (user exits)
InfoSphere CDC can use user written and maintained user exits to control the target apply process.
A few examples of where clients have chosen to use user exits to perform applies are:
Web Service: A user exit can be used to interface InfoSphere CDC directly with a web service. The user exit performs the apply process to the web service while InfoSphere CDC manages scraping the transactions from the source transaction logs and performs low-level transformations to the data.
Full Unit of Work to message queue: By default, InfoSphere CDC includes one operation per message when targeting a JMS message queue. Some customers want to see the entire Unit of Work within a single message queue. This action can be facilitated through a user exit when using the InfoSphere CDC Event Server engine.
Soft Delete: To use a standard apply process with soft deletes, a user exit is required. The user exit manages the processing required to not delete the row and update a flag field. See Example 9-83 on page 383 for more details.
5.4.8 Flat files
InfoSphere CDC can use flat files through the InfoSphere CDC DataStage engine. With this engine, two options for connecting to InfoSphere DataStage are available: flat file and direct connect.
With the flat file method, InfoSphere CDC produces a file containing information about one or more records and database operations. The flat file can be configured with the Single Record option so that the before and after images are included together in a single line followed by a delimiter, or with the Multiple Record option, where each record may occupy two lines (before and after images of an update).
InfoSphere CDC flat files can be used to quickly and efficiently capture information from source table transactions and pass the information along to any ETL engine that can read a flat file. This action allows InfoSphere CDC to complement many of the ETL products available on the market by:
Providing transaction capture throughout the day as opposed to at set times
Providing low impact to the source by reading the transaction logs and not the tables directly
Removing the need for batch ETL processing on the source that requires tables to be locked or users to be off the system
Providing transactions for just the data that changed as opposed to having to read through the entire table
Configurable options for flat files are:
Location of flat files
Threshold limits in number of operations or time, whichever comes first, for hardening the flat file
InfoSphere CDC continues to write to a temporary data file until one of the threshold limits is met, or in the case of refreshes, until the entire table is contained in the flat file. After the threshold is met or the refresh is completed, InfoSphere CDC hardens the temporary data file and makes it available for consumption, adding a time stamp to the file name.
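The buffer-then-harden cycle can be sketched as follows. The threshold, file naming, and record format here are illustrative assumptions only; the product's actual scheme differs:

```python
import os
import tempfile
import time

# Sketch of flat file hardening: operations are buffered (standing in for
# the temporary data file) and the file is hardened under a time-stamped
# name once an operation-count threshold is met.
class FlatFileWriter:
    def __init__(self, directory, table, max_ops=2):
        self.directory, self.table, self.max_ops = directory, table, max_ops
        self.ops, self.seq = [], 0

    def write(self, line):
        self.ops.append(line)
        return self.harden() if len(self.ops) >= self.max_ops else None

    def harden(self):
        self.seq += 1  # sequence number avoids same-second name collisions
        name = "%s.%d.%d.csv" % (self.table, int(time.time()), self.seq)
        path = os.path.join(self.directory, name)
        with open(path, "w") as f:
            f.write("\n".join(self.ops) + "\n")
        self.ops = []
        return path

w = FlatFileWriter(tempfile.mkdtemp(), "CUSTOMER")
print(w.write("I,1,Smith"))        # None: below the threshold
print(w.write("U,1,Smith-Jones"))  # hardened file path
```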
InfoSphere CDC creates a <Table_Name>.stopped status file when the refresh operation or mirroring is ended. When InfoSphere CDC restarts, the bookmark is used to determine where the scrape process should begin in the transaction log, so no data is lost.
 
Important: If the refresh or mirroring operation is terminated using the dmterminate command, the temporary data file may not be hardened and no <Table_Name>.stopped status files may be generated for the tables in the subscription. After the replication process is restarted, the subscription uses the last-saved bookmark to reposition the log reader and start generating new data files. The temporary file is not cleaned up. To ensure that the temporary data files are hardened, and the <Table_Name>.stopped status files are created, use a Normal or Scheduled End shutdown in Management Console, or a dmshutdown command with the appropriate flags for the severity level.
5.4.9 DataStage direct connect
InfoSphere CDC provides integration to connect directly with InfoSphere DataStage. This action allows:
Integrated control of both the InfoSphere CDC and InfoSphere DataStage products
Template jobs with metadata information to be created and passed from InfoSphere CDC to InfoSphere DataStage
Commit status on target through bookmarks
The ability to synchronize the restart point for an InfoSphere CDC subscription to guarantee no loss of data
Ability to autostart and securely connect to InfoSphere DataStage jobs
The process for the Direct Connect connection method is similar to the Flat File connection method. The size and time limits set in the InfoSphere DataStage Properties dialog box determine when data is committed to the target database.
With the Direct Connect connection method, you can enable the autostart feature to run in active mode, which allows InfoSphere DataStage to start a job when appropriate and begin to stream data to InfoSphere DataStage. Running with autostart enabled requires both InfoSphere CDC and InfoSphere DataStage to be installed on the same server. If autostart is not enabled, you must start the DataStage jobs using a different mechanism before the InfoSphere CDC process can start replicating the changes. You can start the InfoSphere CDC subscription before starting the DataStage job; the replication waits for the DataStage job to become active.
Examples of when to use InfoSphere CDC with Direct Connect to InfoSphere DataStage are:
As part of the Change Data Delivery (CDD) configuration for using InfoSphere CDC to replicate multiple sources to InfoSphere DataStage (typical with retail customers with multiple stores)
When complex transformations are required to the data through InfoSphere DataStage
To optimize the data capture process for InfoSphere DataStage and minimize the impact to the source database
5.4.10 JMS message queues
InfoSphere CDC Event Server is the target engine used to create XML documents that are placed into JMS message queues. Using Event Server with an InfoSphere CDC source engine, transactions can be captured and delivered to a JMS message queue. The messages are then available to be processed by any application, such as WebSphere MQ, that can read a JMS message queue. The requirement for the ESB application is that it is able to read and process a JMS message queue.
Options for InfoSphere CDC Event Server are:
Map source columns to XML elements and attributes.
Configure the header information for the XML document.
Load to staging table then convert to XML.
Perform low-level transformations to the data.
Import and export mapping projects and XML schemas.
Build XPath expressions.
Query columns from other tables.
Set runtime options.
There are two methods for mapping to a JMS message queue:
Message Destination Mappings: Uses the Map Tables wizard to map a source table to a JMS message destination. InfoSphere CDC Event Server receives the row-level operation and transforms this row into XML. The XML message is sent to a JMS application supported by InfoSphere CDC
Event Server.
One table mapping of any type:
 – Standard: Allows source tables to be mapped to a target table within a staging database before being converted to an XML document. This type allows transformations to be done to the data through a user exit on the target and frees up resources on the source.
 – Adaptive Apply: Similar to Standard, but allows for replication to an empty target table. Adaptive apply functionality converts inserts to updates or updates to inserts as needed.
5.5 Conflict detection and resolution
Conflict detection and resolution let you detect, log, and act on inconsistent data on the target. This function ensures that the replication environment handles data conflicts automatically and in accordance with business rules. Set conflict detection so that InfoSphere CDC can detect and resolve conflicts as they occur. As conflicts are detected and resolved, InfoSphere CDC logs them in a conflict resolutions audit table.
During replication, InfoSphere CDC detects conflicts when you:
Insert a row and the row's key exists in the target table. This action violates the unique key constraint.
Update a row and the row's key does not exist in the target table.
Update a row and the contents of the rows in the source table and target table, before the update, do not match.
Delete a row and the row's key does not exist in the target table.
Delete a row and the contents of the rows in the source table and target table, before the delete, do not match.
InfoSphere CDC does not detect conflicts in target columns that are:
Populated with expressions using the %BEFORE, %CURR, %GETCOL, %STPROC, and %USER column functions
Populated with journal control fields
Not populated by a value
Defined with a Large Object (LOB) data type
Conflict detection and resolution can be applied to individual columns for tables configured for Standard replication. There are five possible conflict resolution methods:
Source Wins: The source row overwrites the target row. If the target row does not exist, the source row is inserted. This resolution helps maintain consistency between the source and target tables.
Target Wins: The target row remains intact and the source row information
is discarded.
Largest Value Wins: The largest value for a column is used to determine whether to use the source row information or the target row information. For example, if the source row contains a time stamp that is newer than the target row's value, the source row overwrites the target row. Null values are treated as the smallest value possible. Therefore, if the target row does not exist, the source row is inserted. If the source and target values are the same, InfoSphere CDC resolves the conflict using the Target Wins method where the target row remains as is.
Smallest Value Wins: The smallest value for a column is used to determine whether to use the source row information or the target row information. If the target row does not exist, InfoSphere CDC uses Null as the smallest value and the row is not inserted.
User Exit: When InfoSphere CDC resolves conflicts with a user exit program, it applies the image returned by the user exit program to the target table. A user exit program can be configured to specify the row InfoSphere CDC uses to resolve the conflict on the target table.
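The Largest Value Wins rules above can be sketched for a single resolution column. This is an illustration of the decision table only (null modeled as None, ties falling back to Target Wins), not product code:

```python
# Sketch of the "Largest Value Wins" resolution on one column.
def largest_value_wins(source_value, target_value):
    if target_value is None:
        return "source"  # null is smallest: missing target row loses
    if source_value is None:
        return "target"
    if source_value > target_value:
        return "source"
    return "target"      # ties resolve as Target Wins

print(largest_value_wins("2011-06-01", "2011-01-01"))  # source (newer stamp)
print(largest_value_wins(5, 5))                        # target (tie)
print(largest_value_wins(None, 3))                     # target
```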
Here are examples of when to use conflict detection and resolution:
Bidirectional replication: Changes can be made to the same data on both sides of the replication. However, this action could result in conflicts.
Expected inconsistencies in the data: The target data is not synchronized with the source data, and business rules need to be applied to record and address the differences.
Refreshes performed outside of InfoSphere CDC: When an application other than InfoSphere CDC is used to refresh data, it is common to start InfoSphere CDC at a log position before the time of the refresh. This action allows InfoSphere CDC to pick up incomplete transactions not reflected in the data refreshed. Conflict detection and resolution can be turned on for the subscription to handle the resulting duplicate transactions. Once past the refresh point, Conflict Detection and Resolution can be turned off to optimize throughput.