Understanding the architecture
This chapter provides an introduction to the general architecture of IBM InfoSphere Change Data Capture (InfoSphere CDC). It introduces some key InfoSphere CDC terms, and specifies the usage of terms that have a specific meaning within the context of InfoSphere CDC.
An architecture is a set of defined terms and rules that are used as instructions to build products. In computer science, architecture describes the organizational structure of a system. An architecture can be recursively decomposed into parts that interact through interfaces, relationships that connect parts, and constraints for assembling parts. Parts that interact through interfaces include classes, components, and subsystems.
The major components of InfoSphere CDC are described here to provide an orientation to the main parts and pieces of the product and provide an understanding of how they interact to form the whole of the InfoSphere CDC replication system.
In addition to this high-level view, this chapter describes some more fundamental architectural concepts within InfoSphere CDC and goes into some detail regarding the implementation of some of the key components, to provide you with a more in-depth understanding.
However, this chapter does not describe specific features, functionality, and usages of InfoSphere CDC; those descriptions are the purview of the other chapters in this book, and a great deal of information is also available in the product documentation. This chapter also does not describe platform-specific topics to any great degree, other than to point out some of the differences in their implementation of InfoSphere CDC.
By the end of this chapter, you should have a good understanding of the major components of InfoSphere CDC and the nuances of how it operates. With this understanding, you can envision how the product could be set up in your environment to move your data efficiently and rapidly from one database system to another.
6.1 Component overview
InfoSphere CDC assumes that you have a supported database that has tables that you want to replicate to a target. This replication can be accomplished in two ways:
In snapshot form, where all current data in a table or tables at a point in time is moved to a target.
Capturing changes to the data soon after they occur and moving only the changed data to the target, either as it is or with specific changes made to the data. This replication process can also include some metadata about the changes.
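The difference between the two approaches can be sketched in a few lines. This is an illustrative model only, not InfoSphere CDC code; the function names and data shapes are assumptions made for the example:

```python
# Contrast the two replication styles: a snapshot copies every current row,
# while change data capture ships only the changes logged since a position.

def snapshot(source_rows):
    """Refresh-style replication: return a full copy of the source table."""
    return list(source_rows)

def capture_changes(log, since_position):
    """CDC-style replication: return only the log entries after a position."""
    return [entry for pos, entry in log if pos > since_position]

source = [(1, "alice"), (2, "bob")]
log = [(101, ("insert", (3, "carol"))), (102, ("update", (1, "alicia")))]

full_copy = snapshot(source)                      # every row moves
delta = capture_changes(log, since_position=101)  # only the newer change moves
```

The snapshot moves both rows regardless of whether they changed, while the change-capture path moves only the single update that occurred after the given position.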
To accomplish this replication, the InfoSphere CDC implementation includes an InfoSphere CDC source engine and an InfoSphere CDC target engine to send, receive, and apply data changes. Most InfoSphere CDC engines can serve as a source engine, capturing database changes from the source database, and as a target engine capable of receiving change data and applying it to the designated target database or other destination, such as DataStage or a JMS queue.
An InfoSphere CDC engine is also commonly referred to as an InfoSphere CDC instance because all the work associated with the engine is performed in an operating system process or set of processes. From the Management Console and Access Manager perspective, an InfoSphere CDC engine is a data store.
Only source and target engines are required for replication to occur, but there are additional interfaces that are used for configuration, control, and monitoring (Figure 6-1).
Figure 6-1 Replication landscape
Here are brief descriptions of the InfoSphere CDC components:
Source and target InfoSphere CDC engines: These engines send, receive, and apply data changes. An InfoSphere CDC engine is also commonly referred to as an InfoSphere CDC instance. All of the work associated with the engine is performed in an operating system process or set of processes. From the Management Console and Access Manager perspective, an InfoSphere CDC engine is seen as a data store.
InfoSphere CDC metadata: Configuration information that is associated with a specific InfoSphere CDC instance, such as database connection information, subscriptions, and table mappings. The subscription and table mapping definitions are distributed across the source and target InfoSphere CDC engines.
InfoSphere CDC Server command-line interface: Native commands that are run on the server that runs the InfoSphere CDC engine. These commands start and stop the entire InfoSphere CDC instance and subscriptions. This command-line interface (CLI) can only control the engine that provides
the commands.
InfoSphere CDC Access Server: This service controls all configuration, control, and monitoring access to the InfoSphere CDC engines other than the InfoSphere CDC Server CLI. The InfoSphere CDC Management Console GUI, Management Console CLI, and Java API all pass through the InfoSphere CDC Access Server to obtain access to the source and target InfoSphere CDC engines. Access Server also has a CLI to control the access of users to the various engines in the environment, but this section does not describe this tool. For more information about the Access Server CLI, see the InfoSphere Management Console Administration Guide at the following address:
InfoSphere CDC Management Console GUI: The most commonly used interface to configure and control InfoSphere CDC. All configuration, control, and monitoring activities are contained in this interface. The Management Console GUI connects to the InfoSphere CDC engines through the InfoSphere CDC Access Server.
InfoSphere CDC Management Console Command Line Interface: A less deployed but useful interface to control InfoSphere CDC operations. The Management Console CLI connects to the InfoSphere CDC engines through the InfoSphere CDC Access Server.
InfoSphere CDC API: This interface is sometimes used in large and complex environments. The InfoSphere CDC API provides full control over configuration, control, and monitoring by using Java classes and methods. This interface connects to the InfoSphere CDC engines through the InfoSphere CDC Access Server.
The Management Console GUI, Management Console CLI, and InfoSphere CDC API all must connect to the Access Server first before being able to do anything with the InfoSphere CDC engines. When connecting to the Access Server, a user name and password must be provided, which controls the access level of the user.
6.1.1 InfoSphere CDC instances
An InfoSphere CDC instance is a change data capture process that is associated with a particular database instance of a specific type (for example, DB2 or Oracle), with a message queue (InfoSphere CDC Event Server), or with IBM DataStage. There are InfoSphere CDC engines available for a wide range of database products and versions, on a wide variety of hardware platforms and operating system environments. An instance can handle both the source and target sides of replication, and can simultaneously be both a source and a target. For example, in a three-tiered scenario, InfoSphere CDC instance A captures change data from database X and sends it to InfoSphere CDC instance B acting as a target and applying the changes to database Y. Instance B can also act, at the same time, as a source that scrapes the logs for the same or a different set of tables and sends the change data to InfoSphere CDC instance C as a target, which applies the change data to database Z.
An InfoSphere CDC instance has a CLI consisting of native commands that are run on the server that runs the InfoSphere CDC engine. These commands start and stop the entire InfoSphere CDC instance and subscriptions. The CLI can only control the engine that provides the commands. Many of the commands that can be run through this CLI can also be run through the Management Console interface.
In addition, each InfoSphere CDC instance has a metadata store specific to the instance. The configuration metadata that is associated to a specific InfoSphere CDC instance includes such information as database connection information, subscriptions, and table mappings. It is important to know that the subscription and table mapping definitions are distributed across the source and target InfoSphere CDC engines.
After InfoSphere CDC has been installed, one or more instances can be created, with each instance running as a separate process. Instances are created after an installation of InfoSphere CDC, or by running (for Java-based engines) the InfoSphere CDC configuration tool dmconfigurets and adding an instance (Figure 6-2).
Figure 6-2 Creating an instance
An instance can be controlled (started and stopped) from the Configuration Tool as well (Figure 6-3). There are separate InfoSphere CDC command tools that can also be used for this purpose, but the Configuration Tool is a convenient interface for this task.
Although the GUI interface to the configuration tool is shown here, there is also a console version for cases where it is preferred or where a GUI cannot be used.
Figure 6-3 The InfoSphere CDC configuration tool
An InfoSphere CDC source instance and a target instance are all that is required to run InfoSphere CDC to capture change data and move it to a target after the system has been configured.
Because InfoSphere CDC incorporates a wide set of features and functionality, configuration is an important aspect of the product. One of the key modules in InfoSphere CDC is the Management Console, a GUI tool that runs on Microsoft Windows and is used to define what data on the source side is replicated, and how and to where it is replicated.
In addition to configuring InfoSphere CDC, there are also operational aspects of the Management Console, such as starting and stopping of replication activity, and monitoring activity while it is in process.
As a graphical user interface, the Management Console makes many otherwise complex tasks intuitive and straightforward. InfoSphere CDC also provides other tools that can be used for these tasks as well, for example, when you want or must automate some aspects of both configuration and control of a replication environment. These additional interfaces to InfoSphere CDC are described elsewhere in this chapter, and in the product documentation.
Although the basic tasks of developing and administering an InfoSphere CDC replication environment are made straightforward by the Management Console, it is a powerful tool that is likely to be used by multiple personnel within an organization. You should plan and assign privileges and roles appropriately. In a replication system, there might be multiple InfoSphere CDC source engines and target instances, which might be configured and controlled by one or more Management Console installations.
The last major component of InfoSphere CDC is the Access Server, which serves as a relay point between InfoSphere CDC instances and Management Console. Access Server knows about InfoSphere CDC data stores and InfoSphere CDC users, both defined in the Management Console. An InfoSphere CDC data store is essentially an InfoSphere CDC instance seen through the Management Console GUI. It is a source of changed data or a target that consumes that data. Users have specific types (which define the tasks they may perform) and are attached to specific data stores. As such, Access Server serves as the validation point for Management Console user logins.
Access Server can run on a separate machine than the InfoSphere CDC engine (and usually does), and is supported on Linux, UNIX, and Windows platforms.
6.1.2 Interoperability between the InfoSphere CDC components
InfoSphere CDC is designed to allow different versions of the product to work together. As a heterogeneous replication solution, a version of InfoSphere CDC for DB2 on System i must, as an example, be able to replicate to and from InfoSphere CDC for Microsoft SQL Server running on Windows Server 2008. In addition, InfoSphere CDC Access Server must be able to communicate with both components, and this is true for all InfoSphere CDC versions that can serve as a replication source or target. As expected, there are limits to this cross-release interoperability, and matrixes of compatibility are readily available in the
product documentation.
6.2 Management Console fundamentals
InfoSphere CDC Management Console is a Java-based rich-client interface for an InfoSphere CDC replication system, and consists of four major functional subdivisions:
Access control
Configuration
Operation
Monitoring
It is also available as a command-line interface (CLI). The Management Console connects to the InfoSphere CDC engines through the InfoSphere CDC
Access Server.
Multiple Management Consoles may be concurrently active for the same system, allowing a team approach to both developing and controlling a replication system with InfoSphere CDC in daily operations. Only one Management Console may be active on a particular machine, so team members should have Management Console installed on their own workstations.
This section describes Management Console concepts to provide a high-level overview of the functionality of InfoSphere CDC, and a framework for the introduction of some key architectural concepts. This section does not describe specific details about the use or operation of Management Console, or specifics of configuration options. For more information, consult the online documentation installed with the product, or visit the InfoSphere CDC Knowledge Center for InfoSphere CDC 6.3 at the following address:
There is also an InfoSphere CDC Knowledge Center for InfoSphere CDC V6.5 at the following address:
6.2.1 Access Manager Interface
The Access Manager Interface in Management Console creates and configures InfoSphere CDC data stores, and creates InfoSphere CDC users and associates them with specific data stores (Figure 6-4).
Figure 6-4 Access Manager Interface
Data stores
To design a replication configuration, InfoSphere CDC needs to have access to a source database from which it captures change data, and know the target to which the change data is delivered. An InfoSphere CDC data store is an InfoSphere CDC component that describes a source or target system, the host name on which InfoSphere CDC is running, and a database user ID and password for the system. In operation, InfoSphere CDC replicates to or from a data store. A data store is essentially a representation of an InfoSphere CDC instance (Figure 6-5).
Figure 6-5 Data store component
InfoSphere CDC users
The Access Manager component of InfoSphere CDC allows the creation and control of InfoSphere CDC users.
There are distinct roles for users. One role is the System Administrator role, which can create and manage data stores and users. Two other roles are the Operator and the Monitor roles, which can view replication state and status, statistics, events, and table mappings, but cannot start or stop replication or perform any configuration.
Users are associated with data stores, and a user's privileges are active for the data stores with which they have been associated.
6.2.2 Configuration Interface
The Management Console configuration interface is used to define what tables from a source system are replicated, in what way (that is, what if any changes should be applied to the data), and which tables on the target system are recipients of the source data.
Types of replication
The two main types of replication in InfoSphere CDC are refresh and mirroring.
Refresh
A refresh operation, also known as a snapshot, generally involves a truncation of a target table and the insertion of the rows in the source table to the target. Under the heading of Refresh, there is a refresh operation called Differential Refresh, where differences between the source and target tables are applied to the target, bringing the two into the same state in a different manner from a full refresh. This type of refresh also has the option of logging any changes that are found and applying them, or logging the differences while not applying any of the changes required to make the source and target identical. There is also a range refresh where only rows from a specified range are brought over to the target. An InfoSphere CDC Refresh operation does not involve capturing change data from the source database log file, but rather reads from the source table and sends rows across to the target as inserts.
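The idea behind Differential Refresh can be illustrated with a small sketch. This is not product code; the data shapes (dicts keyed by primary key) and the function name are assumptions made for the example:

```python
# Compare source and target rows by key and derive only the operations
# needed to make the target match the source, rather than truncating and
# reinserting everything as a full refresh would.

def differential_refresh(source, target):
    """source/target: dict mapping primary key -> row. Returns ops to apply."""
    ops = []
    for key, row in source.items():
        if key not in target:
            ops.append(("insert", key, row))
        elif target[key] != row:
            ops.append(("update", key, row))
    for key in target:
        if key not in source:
            ops.append(("delete", key, None))
    return ops

source = {1: "a", 2: "b", 3: "c"}
target = {1: "a", 2: "x", 4: "d"}
ops = differential_refresh(source, target)
# ops holds an update for key 2, an insert for key 3, and a delete for key 4
```

As the text notes, a real Differential Refresh can either apply these differences or merely log them without applying any changes.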
Mirroring
Mirroring involves the capture of change data from the source database log files, and moving the change data over to the target. Under the heading of mirroring, there is the Mirror Continuous option, where an InfoSphere CDC source engine runs continually, capturing change data on an ongoing basis and moving it to the target engine that is also continuously running.
There is also Mirroring Scheduled End where mirroring is run periodically to capture and move change data since the last time InfoSphere CDC was run.
Having a subscription in mirroring mode means that InfoSphere CDC tracks change data for the subscription. When log entries are read for tables in the subscription that have a Mirroring mode, they are processed and sent to the target (when a commit is read for the transaction of which the change is a part). If a table is not going to be mirrored, putting it into a state of Refresh Park ensures that it has no impact on mirroring activity. When a table is set to Mirroring mode, InfoSphere CDC enables additional logging on the table. When you set a table to Refresh Park, additional logging is disabled, and the stored bookmark position is discarded.
For the InfoSphere CDC engine on z/OS, additional logging (DATA CAPTURE CHANGES) is enabled when the table is first selected for replication (if not already configured on the table). Logging is turned off when the table is removed from a subscription, provided it is not included in other subscriptions.
A common way to begin replicating a table is to set it to refresh before mirroring. This action performs a refresh of the target table so that it is initially synchronized with the source, and when this action completes, mirroring
automatically commences.
Subscriptions
A subscription is a logical container that describes a replication configuration from tables in a source data store to a target data store. All tables in a subscription are kept synchronized together. In addition to metadata about the subscription as a whole, such as latency thresholds for notification, a subscription consists of replication table mappings. A table mapping maps one source table to one target table, and a source table may be mapped once in any subscription. A table mapping also defines the type of replication (such as mirroring or refresh). In the place of a source column, you can have an expression or a Journal Control Field, which is a value derived from the source log file related to the operation, such as the record modification time or record modification user.
A subscription should group tables in such a way as to maintain referential integrity. For example, if table T1 has a foreign key that references table T2, and these tables are mapped to corresponding tables T1 and T2 on the target that have the same referential integrity constraints, then these mappings should be in the same subscription. Within a subscription, operations are applied on the target in the same order that they are run on the source. It is important to understand that the change data replicated as part of a subscription is all committed in the same transaction, thus ensuring data consistency.
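The consistency guarantee described above can be sketched with a generic database example. This uses Python's sqlite3 purely for illustration; the table names and the list of operations are made up, and the point is only that ordered operations committed in one transaction preserve referential integrity:

```python
# Apply a group of replicated operations in source order within a single
# target transaction: either every change in the group is committed, or none.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t2 (id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE t1 (id INTEGER PRIMARY KEY, "
             "t2_id INTEGER REFERENCES t2(id))")
conn.execute("PRAGMA foreign_keys = ON")

ops = [  # source order: parent row first, then the child that references it
    ("INSERT INTO t2 (id) VALUES (?)", (10,)),
    ("INSERT INTO t1 (id, t2_id) VALUES (?, ?)", (1, 10)),
]
with conn:  # one transaction: all operations commit together, in order
    for sql, params in ops:
        conn.execute(sql, params)

count = conn.execute("SELECT COUNT(*) FROM t1").fetchone()[0]
```

If the operations were split across transactions or reordered, the child insert could fail or briefly violate the constraint; applying them in order inside one transaction avoids both problems.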
There are many other powerful features that can be configured, such as row filtering based on some criteria, data translations, or desired encoding changes.
A subscription is shown in Figure 6-6.
Figure 6-6 InfoSphere CDC subscription
For ease of administration, InfoSphere CDC subscriptions can be contained within a Management Console project. A project is not an InfoSphere CDC component, but a way to group subscriptions within the Management Console. A Management Console project is transferable between Management Console installations using import/export.
You might need to have more than one person modify an InfoSphere CDC configuration at the same time, and for this reason, subscriptions can be locked for editing to prevent collisions among multiple users operating with different instances of Management Console.
The Management Console Configuration interface is used to create and configure subscriptions, which map source tables to target tables (Figure 6-7).
Figure 6-7 Management Console configuration interface
6.2.3 Monitoring Interface
One of the properties of a mapping within an InfoSphere CDC subscription is its replication method, which can be set to either Mirror or Refresh (the replication method can be changed using Management Console). After a subscription has been fully set up, it is necessary to start the replication activity, and have InfoSphere CDC begin replicating data.
There are several means of accomplishing this task, but the Monitoring interface of Management Console is the most commonly used means. Other means include the InfoSphere CDC engine command-line tools and the Management Console API.
A subscription may be started in Refresh or Mirroring mode. In Refresh mode, table mappings that have a replication method of Refresh, and tables with a replication method of Mirroring that have been flagged for refresh, are refreshed one at a time until all the refreshes have completed. When starting in Mirroring mode, any table mappings that have been flagged for refresh are refreshed before the start of mirroring replication.
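The start-mode logic above can be expressed as a short control-flow sketch. This is a hypothetical model, not product behavior in every detail; the mapping tuples and function name are assumptions:

```python
# In either start mode, mappings flagged for refresh are refreshed first;
# Mirroring mode then continues with mirroring for the subscription.

def start_subscription(mode, mappings):
    """mappings: list of (name, method, flagged_for_refresh). Returns actions."""
    actions = []
    for name, method, flagged in mappings:
        if mode == "refresh":
            if method == "refresh" or (method == "mirror" and flagged):
                actions.append(("refresh", name))
        elif mode == "mirror" and flagged:
            actions.append(("refresh", name))
    if mode == "mirror":
        actions.append(("mirror", "subscription"))
    return actions

mappings = [("orders", "mirror", True), ("audit", "refresh", False)]
plan = start_subscription("mirror", mappings)
# orders is refreshed first, then mirroring starts for the subscription
```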
While the subscription is replicating, there might be other subscriptions defined that are in an idle or inactive state.
The Management Console Monitoring interface can be used to initiate Refresh operations and start mirroring (Figure 6-8).
Figure 6-8 Refresh
When Refresh or Mirroring begins, the status of the subscription in the Monitoring interface changes to reflect the replication state.
The Management Console Monitoring interface provides many monitoring views of mirroring activity and InfoSphere CDC events (Figure 6-9). If you click Collect Statistics, you can monitor replication activity in terms of operations or bytes, see latency alerts, view a graph of latency or operations, and view the InfoSphere CDC event log for detailed messages.
InfoSphere CDC provides other mechanisms for starting and stopping mirroring activity, but Management Console provides a simple intuitive tool that serves most needs.
Figure 6-9 Monitoring interface
6.2.4 InfoSphere CDC API
Sometimes used in large and complex environments, the InfoSphere CDC API provides full control over configuration, control, and monitoring with Java classes and methods. This interface connects to the InfoSphere CDC engines through the InfoSphere CDC Access Server, and can be used to enable automated configuration and control. Further information about the use of the API is in 9.1, “Options for managing InfoSphere CDC” on page 232.
6.2.5 Access Server fundamentals
Access Server is a service that controls all configuration, control, and monitoring access to the InfoSphere CDC engines other than the InfoSphere CDC Server CLI. The InfoSphere CDC Management Console GUI, Management Console CLI, and Java API all pass through the InfoSphere CDC Access Server to obtain access to the source and target InfoSphere CDC engines. Access Server also has a command-line interface to control access of users to the various engines in the environment. It supports the running of multiple Management Consoles. For more information about the Access Server CLI, see the InfoSphere Management Console Administration Guide at the following address:
6.3 The InfoSphere CDC engine
Because InfoSphere CDC is a heterogeneous product (it can replicate between a wide range of hardware platforms, operating systems, and database products), there are a few different types of engines, all of which can act as a source or target. Because the internal architectures vary to some degree, they are described separately in this section, after a description of the InfoSphere CDC bookmark concept that is common to all of the types.
6.3.1 Bookmarks
The source database system can have many independent connections making concurrent changes to the data. At any time, one or more of these connections may have an open transaction containing all the changes that have been made on that connection since it last committed or rolled back. All these uncommitted changes also have been written to the database log.
InfoSphere CDC maintains these uncommitted changes in its Transaction Queues. There is a separate queue for each open transaction where InfoSphere CDC accumulates the changes for that transaction as they are read from the database log. InfoSphere CDC stores the uncommitted changes it has read from the log into transaction queues until a commit is seen (Figure 6-10). Changes are removed from these queues when rollbacks are done in the source database.
Figure 6-10 Transaction queues
InfoSphere CDC only sends committed changes to the target to be applied. As each transaction is committed, InfoSphere CDC sends the changes for that transaction (taken from its transaction queue) to the target. The flow of change data from the source to the target can be seen as a stream of complete transactions being sent in the order in which they were committed in the source database. This flow is called the Replication Stream.
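This queue-per-transaction behavior can be modeled in a few lines. The sketch below is a simplified illustration (the log-entry shapes are assumptions, not the product's internal format): changes are staged per transaction, shipped only on commit, and discarded on rollback, so the target sees complete transactions in commit order.

```python
# Build the Replication Stream from raw log entries: per-transaction queues
# accumulate changes; commit moves a whole queue into the stream, rollback
# drops it.

def build_replication_stream(log_entries):
    queues = {}   # open transaction id -> list of staged changes
    stream = []   # committed transactions, in commit order
    for txid, action, payload in log_entries:
        if action == "change":
            queues.setdefault(txid, []).append(payload)
        elif action == "commit":
            stream.append((txid, queues.pop(txid, [])))
        elif action == "rollback":
            queues.pop(txid, None)  # uncommitted changes are discarded
    return stream

log = [
    ("T1", "change", "insert r1"),
    ("T2", "change", "update r9"),
    ("T2", "commit", None),        # T2 commits first, so it ships first
    ("T1", "change", "insert r2"),
    ("T1", "rollback", None),      # T1's changes never reach the target
]
stream = build_replication_stream(log)
```

Even though T1's first change appears earlier in the log, only T2's committed transaction enters the stream, in commit order.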
InfoSphere CDC maintains a “bookmark” as part of this Replication Stream. This bookmark is persisted into the target database along with the application data. It contains all the information necessary for InfoSphere CDC to be able to continue replication from that point. It is committed as part of the same transaction in which the change data is written to the target database.
The two primary pieces of information in this bookmark are:
Restart Position: The point where InfoSphere CDC needs to begin rereading the database logs to recreate the transaction queues. This position generally corresponds to the beginning of the oldest open database transaction at the time the bookmark was constructed by InfoSphere CDC. Using this Restart Position, InfoSphere CDC is able to recreate the Replication Stream.
Stream Position: After InfoSphere CDC has recreated the Replication Stream, the stream position indicates exactly where InfoSphere CDC was in that stream when it last applied data to the target.
The InfoSphere CDC bookmark contains all the information necessary to recreate the replication stream on the source side and position it to resume replication at exactly the point of the last operation applied on the target.
Bookmark information is stored on the target side and is updated whenever data is applied to the target. It is communicated to the source engine while running to keep the source informed about log dependency. For restart, it is sent at the time mirroring starts to recreate the replication stream on the source at exactly the right point. It is this key mechanism that allows InfoSphere CDC to always know where scraping must be resumed, to not resend change data already applied on the target, and to not miss rescraping changes that were possibly previously read but lost in-flight as a result of an abnormal termination.
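How the two bookmark positions cooperate on restart can be shown with a toy model. This is an illustrative sketch only (positions as integers, a flat list as the log are assumptions): rereading from the Restart Position rebuilds the stream, and the Stream Position filters out what the target already applied.

```python
# Resume replication from a bookmark without resending applied changes and
# without missing changes lost in-flight before an abnormal termination.

def resume(log, bookmark):
    """log: list of (position, change). Returns the changes still to apply."""
    # 1. Reread the log from the Restart Position to rebuild the stream.
    rebuilt = [(pos, chg) for pos, chg in log if pos >= bookmark["restart"]]
    # 2. Skip everything at or before the Stream Position (already applied).
    return [chg for pos, chg in rebuilt if pos > bookmark["stream"]]

log = [(1, "c1"), (2, "c2"), (3, "c3"), (4, "c4")]
bookmark = {"restart": 2, "stream": 3}  # reread from 2; 2..3 already applied
to_apply = resume(log, bookmark)
```

Only the change after the Stream Position is applied: nothing is resent, and nothing is missed.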
6.3.2 The InfoSphere CDC Linux, UNIX, and Windows engine
The InfoSphere CDC for Linux, UNIX, and Windows engine is a version ported to database products that run on these platforms.
InfoSphere CDC captures change data on a source from the source database log files. This scenario involves two separate components:
A log reader
A log parser
In Figure 6-11, these components are grouped as one, as the granularity of operations performed is not essential to an understanding of the activity at
this level.
Figure 6-11 Source operations component data flow for the Linux, UNIX, and Windows engine
During mirroring, a source InfoSphere CDC engine only sends committed data to a target. A source engine reads the log files, parses the log records and some metadata, such as the transaction ID and the table name, and stores them temporarily in the transaction queues before sending a number of committed transactions to the target.
InfoSphere CDC optimizes the resource utilization on the source server by, in most circumstances, using a single scrape component to only read the log records once and retain the operations derived from them for a time in a staging store. In certain circumstances, such as when the needed log records are not in the staging store, InfoSphere CDC employs a private scrape to get the needed log records. InfoSphere CDC dynamically determines if and when to employ a private scrape and when to remove one by having the subscription rejoin the single scrape.
LOBs do not participate in single scrapes, as they are retrieved directly from the database by the Mirror Moderator.
For refresh operations, the source engine retrieves rows using JDBC and puts them through the normal source processing before sending them to the target as insert operations.
Linux, UNIX, and Windows engine single scrape
To minimize resource consumption, InfoSphere CDC attempts to only read and parse the source database log files once. Employing single scrape optimization provides considerable overall efficiency improvements when replicating
multiple subscriptions.
If a source system has constrained resources that might be taxed by having multiple log reader / log parser threads concurrently active, using single scrape in this scenario can be beneficial.
Because much of the log data is not table-specific, it is better not to have multiple threads processing and discarding the data. This situation is especially true where subscriptions are replicating only a small portion of the data being changed. For example, if two subscriptions have no tables in common but each is replicating only 10% of the total changes, then single scrape provides a huge benefit.
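The arithmetic behind this benefit is easy to demonstrate. The sketch below is a back-of-the-envelope model (the log contents and subscription table sets are invented for the example), counting how many log records get read under each arrangement:

```python
# With independent readers, each subscription parses the whole log; with
# single scrape the log is read once and each subscription filters the
# shared result.

log = [("tab%d" % (i % 10), "change %d" % i) for i in range(1000)]
sub_a = {"tab0"}  # each subscription replicates ~10% of the change volume
sub_b = {"tab1"}

# Independent readers: every subscription scans every log record.
reads_independent = len(log) * 2

# Single scrape: one pass over the log, shared by both subscriptions.
reads_single = len(log)
a_changes = [change for tab, change in log if tab in sub_a]
b_changes = [change for tab, change in log if tab in sub_b]
```

The log is read half as many times, yet both subscriptions still receive their full 10% share of the changes; with more subscriptions the saving grows proportionally.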
Single scrape is enabled by default in Version 6.5.1 for all InfoSphere CDC source instances (except for the Oracle Trigger version, which does not read the Oracle log to obtain source operations). This situation only applies to the Linux, UNIX, and Windows engine versions.
If the single scrape cache is not of sufficient size to allow a subscription to use single scrape, the subscription runs with its own parser, which results in a more manageably sized cache. These situations include where a subscription has fallen too far behind the single scrape process and the content the subscription needs is no longer available. This situation could occur if, for example, a subscription was not running or was idle for a period, and then later restarted.
If a subscription had been idle (that is, not mirroring) for a period and is then started, the log entries the subscription needs might not be found in the staging store. In this case, the subscription has its own log reader and log parser. If at some point the subscription catches up so that the needed content is available in the staging store, the subscription joins single scrape, and its dedicated reader and parser go away.
A subscription might also be too far ahead, where waiting for single scrape to advance to its position is too inefficient compared to having the subscription run on its own. This situation could occur when a subscription has refreshed all of its tables, and so is scraping near the head of the log while Single Scrape is still scraping an older portion of the log.
In either case, for a subscription running with its own reader and parser, InfoSphere CDC can decide to have the subscription join (or rejoin) single scrape when it becomes more efficient than maintaining the reading and parsing
activity separately.
InfoSphere CDC combines the best advantages of a single reader / parser thread with the flexibility of per subscription reader parser threads, and can automatically compute and run with the most advantageous arrangement.
However, there may be cases where you want to force a certain mode manually. You can control whether InfoSphere CDC allows subscriptions to run independently, or turn single scrape off entirely, which forces all subscriptions to run independently.
There are implications both to turning single scrape off and to forcing all subscriptions to use it. Those implications, and the InfoSphere CDC system parameters required to force all the subscriptions, are described later in this chapter.
Linux, UNIX, and Windows engine staging store
As shown in the source instance component diagram in Figure 6-11 on page 123, the source instance maintains a staging store of change data that can be used by active subscriptions. To maintain maximum throughput, InfoSphere CDC keeps this staging store in memory, to the degree that memory constraints allow.
However, InfoSphere CDC might need to move some of this staged data to disk. This could occur if, for example, an active subscription was unable to move data to the target as rapidly as it was accumulating, causing the staging store to grow. That, in turn, could happen because the network between the source and the target temporarily became busy, the target database became busy, or because mirroring for that subscription was stopped temporarily.
Subscriptions that are not actively mirroring are said to be idle. If you allow a subscription to remain idle for a long period, the staging store size on disk might become considerable. For this reason, InfoSphere CDC provides a configuration option to set the disk quota for this storage, which defaults to 100 GB, but can be set as small as 1 GB or as large as you want. Note that single scrape excludes tables that are in a table mapping marked for Refresh mode and parked; tables in mappings that are Mirror / Parked are still included for scraping.
During a controlled shutdown of InfoSphere CDC, the portion of the staging store in memory is persisted to disk, allowing for faster restart of mirroring. If there is an abnormal termination of InfoSphere CDC due to a system crash, the logs are read from the correct point so mirroring can resume with no loss of data. The persistence of the staging store on disk is not essential for restarting replication, as long as the database logs holding the required entries are still available. The format of persisted staging store data is internal to InfoSphere CDC and cannot be used by another application to read change data.
During mirroring of change data, each subscription retrieves the changes for its set of tables from the staging store. Data is removed from the staging store after it has been sent to all subscriptions that are using or could use the store, or when the staging store size threshold has been reached.
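The staging store life cycle described above, with blocks kept in memory where possible, spilled to disk under memory pressure, and discarded once every consuming subscription has read them, can be sketched as follows. This is a simplified illustrative model only; the class name, block granularity, and eviction policy are assumptions, not InfoSphere CDC internals.

```python
from collections import deque

class StagingStore:
    """Toy model of a staging store: in-memory blocks, spill of the
    oldest blocks to disk, and removal of blocks that every
    subscription has consumed."""

    def __init__(self, memory_limit_blocks, subscriptions):
        self.memory = deque()              # blocks held in memory
        self.disk = deque()                # blocks spilled to disk
        self.memory_limit = memory_limit_blocks
        # per-subscription position: sequence number of next block to read
        self.positions = {sub: 0 for sub in subscriptions}
        self.next_seq = 0

    def add_block(self, block):
        """Append newly scraped change data; spill if memory is full."""
        self.memory.append((self.next_seq, block))
        self.next_seq += 1
        while len(self.memory) > self.memory_limit:
            self.disk.append(self.memory.popleft())

    def consume(self, sub):
        """Return the next block for a subscription, if available."""
        seq = self.positions[sub]
        for s, block in list(self.disk) + list(self.memory):
            if s == seq:
                self.positions[sub] = seq + 1
                self._trim()
                return block
        return None                        # subscription is caught up

    def _trim(self):
        """Discard blocks already consumed by every subscription."""
        low = min(self.positions.values())
        for store in (self.disk, self.memory):
            while store and store[0][0] < low:
                store.popleft()
```

In this model, a block stays on disk only until the slowest subscription has read it, mirroring the rule that data is removed once it has been sent to all subscriptions that use the store.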
It is common practice for users to create one or more test subscriptions. These subscriptions should be parked or deleted after work with them is completed, so that single scrape does not take them into account and maintain data for them in the staging store.
Linux, UNIX, and Windows engine staging store size considerations
The value of staging_store_disk_quota_gb only guarantees that InfoSphere CDC does not exceed that amount of storage for the staging store; it does not pre-allocate disk space. You must ensure that the configured amount of disk space is actually available to InfoSphere CDC. If the staging store is below the disk quota but the disk itself runs out of space, an error occurs and mirroring halts.
In general, InfoSphere CDC only adds data to the staging store when at least one subscription is mirroring. For situations where subscriptions are only run periodically, but you want to have InfoSphere CDC capturing changes all the time, you can use command-line commands to control the InfoSphere CDC continuous capture feature. For more information about this feature, see Chapter 8, “Performance analysis and design considerations” on page 211. When enabled, InfoSphere CDC reads logs and adds change data to the staging store all the time, even when no subscriptions are running. This setting affects the amount of data that must be stored in the staging store, and almost certainly causes the staging store to grow bigger than the available memory and thus be partially persisted to disk.
If the InfoSphere CDC staging store disk quota is reached, subscriptions are forced into using their own dedicated log reader and the advantages of having multiple subscriptions use Single Scrape are lost until the situation is rectified.
It is important to avoid out-of-disk-space situations, because they result in abnormal termination of InfoSphere CDC and require rescraping of the database logs. The staging store is only affected by subscriptions with tables configured for mirroring: it contains data for all tables being mirrored (used in a table mapping with a replication method of mirror). Subscriptions used solely for refreshing tables (where all the table mappings have a replication method of refresh) do not affect the behavior of single scrape or contribute to the size of the staging store.
When all the source subscriptions are running and have little latency, the volume of data kept in the staging store is relatively small, corresponding essentially to the delay in apply speed between the fastest and the slowest subscriptions.
When all the source subscriptions are running but some are latent, then the staging store needs to contain all the data for the latent subscriptions. For example, if one of the subscriptions is one hour latent, then that hour's worth of data for that subscription needs to be kept in the staging store along with that hour's data for all subscriptions.
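As a back-of-envelope illustration of these two cases, the staging store must retain roughly the span of change data between the fastest and the slowest subscription positions. The figures below are invented for illustration, not product guidance:

```python
# Back-of-envelope staging store sizing (illustrative figures only;
# the rate and latencies below are assumptions, not product data).
change_rate_mb_per_s = 5.0       # rate at which change data enters the store

# Data can be discarded only after the slowest subscription has read it,
# so the store holds roughly the span between the slowest and the
# fastest subscription positions:
fastest_latency_s = 30           # fastest subscription is 30 seconds behind
slowest_latency_s = 3600         # slowest subscription is one hour behind

retained_mb = change_rate_mb_per_s * (slowest_latency_s - fastest_latency_s)
print(f"approximate staging store size: {retained_mb / 1024:.1f} GB")
```

With all subscriptions nearly caught up, the two latencies converge and the retained span, and therefore the store, stays small; one latent subscription stretches the span to cover its whole backlog.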
Persisting the staging store to disk during operation
The data in the staging store is organized into a sequence of data blocks. If the staging store grows to a size that cannot be kept in memory, then some of these data blocks are persisted to disk and removed from memory. InfoSphere CDC is optimized to select blocks to write out based on the most likely best performance outcome, and there are no available tuning parameters for this situation.
Persisting the staging store to disk at shutdown
When the InfoSphere CDC engine is shut down by running dmshutdown -c (normal shutdown), InfoSphere CDC persists all the data in the staging store to disk so that the data is available when the engine restarts.
Single scrape events and errors
Single scrape errors and events are visible in the event log. A list of these events and errors is provided in Appendix A, “Single scrape events and errors” on page 431.
Transaction queues
Uncommitted transactions accumulate on the source until they are committed. If memory requirements dictate that some of this data must be moved to disk to free memory, some of the transaction queue data may be persisted in temporary files.
Persisting transaction queues at shutdown
When a normal shutdown is called for a subscription that is mirroring with a private parser, the uncommitted data that InfoSphere CDC maintains in memory is written out to the InfoSphere CDC transaction queue storage. This storage is maintained in a repository specific to each InfoSphere CDC instance, located in the instance configuration directory, and is called txqueue.
If all subscriptions running with private parsers are shut down normally (controlled) at the same time, then all uncommitted transaction data needs to be stored in the transaction queue repository at the same time.
Shared scrape also has transaction queues. These queues are persisted whenever the last subscription running with shared scrape stops.
If a subscription is using single scrape, it has no parser and no transaction queue data. If it is stopped, no transaction queue data is persisted at that time. If a subscription that is using a private scraper is stopped, its uncommitted transaction data is persisted. When all subscriptions that are using shared scrape have stopped, the single scrape parser persists its uncommitted transaction data at that time.
Because all private parsers and shared scrape use the same repository, when all subscriptions are stopped, the txqueue repository grows to be as large as the total amount of uncommitted transaction data for all subscriptions that existed when they were shut down.
A long-running large uncommitted transaction, many subscriptions that are shut down (controlled), and subscriptions that are added over time can cause the txqueue database to grow to a large size at shutdown time.
This internal repository does not shrink its files after they have grown, so the txqueue database might appear to contain much content when it does not. If the files have grown large and disk space is a concern, the txqueue database can be deleted.
Failure to persist transaction queues
Abnormal termination of InfoSphere CDC prevents it from persisting the in-memory transaction queue data to disk. Because the data is persisted only when a subscription performs a normal shutdown, if a mirroring subscription is stopped with the shutdown immediate option, or the machine crashes, or the InfoSphere CDC source process is terminated, there is no transaction queue data available for that subscription at startup. InfoSphere CDC must then revert to the source database log files and start from the first operation of the oldest transaction that was still open at the commit of the source transaction most recently applied on the target. If there were no open transactions at that point, InfoSphere CDC starts from the commit of the transaction that was most recently applied on the target.
Persisting the transaction queue data during a controlled shutdown is an optimization that permits a faster startup of the mirroring continuous mode, because the data can be read from the persisted data much faster than it can be reread from the database log files.
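The restart rule described above can be expressed as a small sketch (a conceptual model, not InfoSphere CDC code): resume from the first operation of the oldest transaction still open at the bookmark, or from the bookmark itself if nothing was open.

```python
def restart_position(bookmark_commit_pos, open_transactions):
    """Compute where to resume reading the database log.

    bookmark_commit_pos: log position of the commit most recently
        applied on the target (the bookmark).
    open_transactions: list of (txn_id, first_op_pos) pairs for
        transactions still open at that commit.

    Without persisted transaction queues, reading must restart at the
    first operation of the oldest open transaction; otherwise it can
    restart at the bookmark itself.
    """
    if open_transactions:
        return min(pos for _txn, pos in open_transactions)
    return bookmark_commit_pos
```

The persisted txqueue data simply lets the engine skip rereading the span between the oldest open transaction and the bookmark, which is why its absence slows startup but never loses data.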
This situation also means that the txqueue files can always be deleted from the instance configuration directory. The only effect of this action is that when mirroring is started for a subscription, it needs to begin reading the database log further back than it otherwise would, and so might take longer to start replicating.
The persisted txqueue files have no effect if the only subscriptions that are not running are ones that were not stopped with a controlled shutdown the last time they ran, ones that will be refreshed before mirroring is started again, or ones you do not care about (for example, because all their tables are currently parked or the subscription is never used).
Linux, UNIX, and Windows engine metadata in brief
InfoSphere CDC for the Linux, UNIX, and Windows engine stores most of its metadata in its own separate repository, independent of other repositories, with a small portion of the metadata maintained in the host database. This operational metadata includes the TS_AUTH table with some instance metadata, and the TS_BOOKMARK table that is used to store the InfoSphere CDC bookmark for the last applied transaction on the target. This setup enables InfoSphere CDC to always be able to compute the correct restart position in the source log files.
The InfoSphere CDC configuration metadata for the Linux, UNIX, and Windows engine is stored in a container separate from the database system from which, or to which, it is replicating. For this purpose, InfoSphere CDC uses the IBM PointBase pure-Java lightweight database, which serves several roles: in addition to storing configuration metadata on the source and target sides, PointBase is used to store InfoSphere CDC events and statistics, and to persist transaction queues at shutdown.
The InfoSphere CDC configuration metadata can be viewed by running dmmdconsole -I <instancename>. However, the metadata is not directly editable at the user level, and the connection obtained by the command is read-only. The metadata is defined by configuring InfoSphere CDC through Management Console or the InfoSphere CDC tools, and changes to the metadata must be made through those tools only.
Generally speaking, metadata is an internal implementation detail that you do not need to understand to use the product. However, you might find it helpful to understand the metadata structure, and so these details are provided. A listing of the main InfoSphere CDC metadata tables is available as a PDF for download from the IBM Redbooks website. For information about how to access that PDF, see Appendix B, “Additional material” on page 435.
The configuration metadata should be backed up periodically by running dmbackupmd.
As mentioned previously, the replication subscription and table mapping definitions are distributed across the source and target engine metadata stores. After you change your configuration, back up the metadata on both the source and target sides, because recovering an InfoSphere CDC system to a previous state requires restoring the metadata on both sides. It is best practice to back up the InfoSphere CDC metadata along with, and at the same time as, the user database.
Linux, UNIX, and Windows engine as a target engine
All source operations are moved to the target engine and become target operations. A target operation contains both the before and after image for an update, and maintains all state information about the object (derived expressions and user exit calls). The target engine generates SQL statements and writes the bookmark information to the TS_BOOKMARK table.
The target flow is shown in Figure 6-12.
Figure 6-12 Target operations component data flow
6.3.3 The InfoSphere CDC for System i engine
InfoSphere CDC on System i uses the same concepts as used in other InfoSphere CDC versions. On System i, the equivalent functionality is implemented as different jobs for each component, such as the mirror driver, scraper, source control, target control, and apply.
There is no single scrape mechanism. Instead, each subscription has its own log reader process using the receive journal entry API to capture change data for the tables that are in scope for that subscription. There is one log reader (DMSSCRAPER) per subscription and per journal, because on System i you can have one subscription with tables journaled to different journals. There is one mirror process that merges the changes coming in from the different scraper jobs into source operations to send to the target side.
As change data is read from the log, it is stored in a System i user space object until a commit is read for the transaction. However, most applications on System i do not use commitment control, and therefore the database performs auto-commits when operations are applied to the DB2 tables. For most applications, there is no staging at the source, and operations read from the journal are sent directly to the target. The entries in the user space are not kept after the subscription ends mirroring. When mirroring resumes for the subscription, InfoSphere CDC obtains the bookmark information for the last applied operation from the target and begins reading the journals from that position.
As a target, InfoSphere CDC on System i uses native apply to write to DB2. It builds up the buffer of the operation image and then writes it natively. The bookmarks for all subscriptions targeting a System i engine are kept in a user space in the InfoSphere CDC product library; the user space is called JRN_STATUS.
For refresh as a source, InfoSphere CDC writes user-defined entries into the journal to mark the start and end of the refresh for the refresh-while-active logic. When mirroring is restarted, change data between these markers is read and sent across to the target to be applied there, as is done with other versions of InfoSphere CDC.
For refresh as a target, refresh messages received from the source system are processed as insert records, becoming a series of row-by-row inserts using native apply.
When a user initiates a controlled (normal) shutdown of a subscription on System i, an S user-defined journal entry is written into the journal. When the log reader process reads this entry, it initiates the shutdown procedure. Any processing of entries before that point is completed, and the InfoSphere CDC processes then end.
InfoSphere CDC on System i maintains its metadata files directly in DB2 tables (physical files), which are kept in the InfoSphere CDC product library.
6.3.4 The InfoSphere CDC for z/OS engine
The InfoSphere CDC for DB2/z engine shares many of the same concepts with the Linux, UNIX, and Windows engine previously described.
InfoSphere CDC parameters cannot be set from the Management Console as with the other engines. Parameters must be set in the control data sets for InfoSphere CDC. System parameters for this engine are not dynamic and are only read when the instance is started (started task).
InfoSphere CDC metadata for both source and target descriptions is stored in the DB2 database, and can be accessed through SQL. User access to it is available through reporting utilities that allow documenting of the InfoSphere CDC environment. Metadata can thus be saved and restored for product roll-out.
InfoSphere CDC for z/OS has a single scrape component that acts as a log reader and log parser. The component checks and analyzes the source database logs for all of the subscriptions on the selected data store, and uses Hiperspaces for staging data that it has read until it reads a DB2 log COMMIT record. Single scrape on this engine is designed to reduce MIPS consumption and should only be enabled if there are two or more subscriptions.
Hiperspaces are stored in z/OS Central Storage that is above the bar (64-bit addressable storage), and may be paged to z/OS Auxiliary Storage if the page frames in Central Storage are taken by the operating system on behalf of another address space.
This scraper component, the database synchronous log task (DSL), reads the DB2 log blocks for in-scope tables and reconstructs before and after images for operations on those tables. The images are sorted into ascending date and time sequence and added to a memory file.
The Database Log Scraper (DLS) task reads the memory file written by the DSL task and gathers the before and after row images into commit groups, which are written to a memory file as logical transactions.
A Communication Initialization and Termination (CIT) task reads the logical transactions from the memory file and sends them across to the InfoSphere CDC target engine to be converted to target operations and applied to the target database.
InfoSphere CDC supports concurrently running multiple product address spaces under the same IBM MVS™/ESA image. Each InfoSphere CDC address space (or instance) is associated with only one DB2 subsystem, but a DB2 subsystem can have multiple associated InfoSphere CDC address spaces. There is no communication between multiple InfoSphere CDC address spaces other than InfoSphere CDC source-target communication.
The source flow in a z/OS environment is shown in Figure 6-13.
Figure 6-13 z/OS source flow
As a target, InfoSphere CDC uses storage above the bar (64-bit addressable storage) to cache changes that are applied to tables at the target. As the changes are received from the source environment by the CIT, they are cached and applied to the target tables by a database table change task. This task issues an SQL request to write the table row changes to the DB2 target table, after first applying any transformations specified in the metadata and starting any user exits.
When the applied changes are committed, the changes are purged from the cache.
If DB2 backs out the logical unit of work before the changes can be committed (for example, due to a deadlock or timeout condition), then InfoSphere CDC rereads the changes from the cache and reapplies them.
For refresh, as a source, a Database Table Refresh (DTR) task uses an SQL bulk read to retrieve rows from the source table. Checkpoint information is written into the InfoSphere CDC metadata table, causing DB2 to write a corresponding DB2 log record that marks the refresh operation begin point.
After a commit group is assembled on the source, a CIT task pushes the data to the InfoSphere CDC target.
For refresh as a target, refresh rows are received from the InfoSphere CDC source by the CIT task and passed to the Database Table Change (DTC) task. This task applies any transformations specified in the metadata, starts any user exits, and then issues an SQL request to write the table rows into the target table.
6.4 Communications between source and target
The InfoSphere CDC communications component (Comms) includes a monitor component that coordinates and performs health-check monitoring of other InfoSphere CDC components.
Each InfoSphere CDC instance has a control channel using two TCP/IP sockets, and each subscription has a data channel using two TCP/IP sockets (each socket being used unidirectionally).
The control channel allows control messages to be sent between the source and target independently of whatever is currently on the data channel. This allows, for example, the target to know that mirroring is shutting down for a subscription while there is still data in the data channel for it to process.
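The value of the separate control channel can be illustrated with a small sketch, using in-process queues to stand in for the TCP/IP sockets. This is a conceptual model of the idea, not the InfoSphere CDC wire protocol: the shutdown notice arrives on the control channel while change data is still queued, and the target drains the remaining data before stopping.

```python
import queue

# Two independent channels, as in the control/data split described above.
data_channel = queue.Queue()
control_channel = queue.Queue()

# Three changes are in flight when the shutdown notice is sent: the
# control message does not have to wait behind the queued data.
for i in range(3):
    data_channel.put(f"change {i}")
control_channel.put("shutdown")

def target_side():
    """Drain the data channel, honoring control messages as they arrive."""
    applied, shutting_down = [], False
    while True:
        # Control messages are seen even while data is still pending.
        try:
            if control_channel.get_nowait() == "shutdown":
                shutting_down = True
        except queue.Empty:
            pass
        try:
            applied.append(data_channel.get_nowait())
        except queue.Empty:
            if shutting_down:
                return applied     # data drained, then stop cleanly

result = target_side()
```

Here the target learns of the shutdown immediately, yet still applies all three queued changes before stopping, which is exactly the behavior the separate control channel makes possible.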
For example, when a command is received by a subscription to begin mirroring, the Comms component requests the bookmark position for the subscription from the target engine. The return value is used to compute the starting position in the logs on the source engine side, taking into account data that might exist in the single scrape staging store and in the persisted transaction queue data.
6.5 Summary
This chapter introduced the general architecture of InfoSphere CDC, defined and described some key InfoSphere CDC terms, and provided an introduction to the primary InfoSphere CDC components.
You should now have a good understanding of the InfoSphere CDC approach to replication. It should be sufficient to enable you to understand how InfoSphere CDC could be (or is being) used in your environment, to delve deeper into its workings on the platform you operate, and to better understand the details of the wide range of functionality provided by InfoSphere CDC.