InfoSphere CDC: Empowering information management
Chapter 1, “Introduction and overview” on page 1 describes how the optimized data integration solution addresses the increasing need for timely access to changing data before critical business decisions are made. The solution allows businesses to access, move, and deliver data in a timely and cost-effective manner from the source systems where the data is located to the target systems or applications where the business requires it. That chapter also describes how the solution fits into the broader InfoSphere Information Server landscape within the Deliver Pillar by providing timely and reliable movement of heterogeneous data. InfoSphere data delivery capabilities can be used in multiple projects across the enterprise, such as keeping back-end inventory and front-end web applications synchronized, distributing global data across regional offices, feeding a data warehouse or MDM system, enabling real-time analytics, and optimizing ETL processes by providing a real-time flow of data changes.
This chapter describes the need for dynamic data, the delivery methods that can be used to move data, and how dynamic data can be provided by the IBM InfoSphere Change Data Capture (InfoSphere CDC) technology.
2.1 The need for dynamic data
There are three basic types of data available that can be used for informational purposes:
Persistent or static: Once created, this type of data is not changed. As such, it is typically accessed less frequently.
Streaming: This type of data continuously flows. As such, it must be captured as it passes by, or it is missed. It is dynamic in the sense that it is moving, but there might be no indication that the data elements in the stream have changed, or that they flow in a consistent pattern or period.
Dynamic: This type of data changes as transactional events are executed. Changes are not necessarily made on a consistent periodic basis, and there might be short or long periods with no data change activity.
Dynamic data is prevalent and represents constant change in an environment. Consider a manufacturing business that builds parts or assemblies. The time to build parts or assemblies varies, based on a number of parameters. This data must be captured because it affects manufacturing costs and resource quantities and availability, and is therefore needed in decision-making processes. As such, dynamic data must be delivered as it changes to support application execution, and it enables users to make more informed business decisions by working with the most current data.
As competitive and economic pressures increase, up-to-date and trusted dynamic data is needed to make decisions that benefit the business. To be successful, organizations must report and analyze corporate data quickly and easily, regardless of which applications created the data, which platforms those applications run on, and where or how the data is stored. To make that possible, the data must be synchronized between the systems and applications that use it for informational purposes.
The ability to easily capture and deliver business-critical dynamic data throughout your enterprise provides the following business benefits:
Increase business agility: Make proactive business decisions based on business-relevant events, for example, to notify a customer when their pre-paid phone card is almost empty.
Make better decisions: Enable customers, employees, and partners to base key decisions on up-to-date information, for example, to purchase additional inventory when there are only two parts left.
Access near real-time data without impacting operational systems: Gain access to near real-time data without placing additional load on source systems and database resources.
2.2 Data delivery methods
When it comes to moving data, there are three primary approaches:
Virtual data delivery
Bulk data delivery
Incremental data delivery
The incremental approach can be further segmented into replication and change data capture.
These data delivery methods are shown in Figure 2-1.
Figure 2-1 Data delivery methods
Brief descriptions of the three approaches are as follows:
Virtual or Federation: Generates a virtual, consolidated view over multiple disparate source systems as though they were a single source. This method complements or extends the data warehouse view and is generally used when some data cannot be physically moved for licensing or security reasons. This approach is frequently used as a first step, before a fully implemented data warehouse is in place, to query multiple source systems without physically consolidating them.
Bulk Load or ETL: Data is extracted from the originating source system, transformed, and output to the data warehouse or receiving application. This approach is typically used on a scheduled batch basis when point-in-time data is acceptable to meet the business needs. For example, ETL batch processes are frequently run during end-of-day jobs, resulting in a data warehouse or reporting database that presents data current to the previous day.
Incremental Data Delivery: Businesses that opt for this method of data delivery require their data to provide up-to-the-minute or near real-time information. This method includes both replication and change data capture. Replication is typically used for database-to-database data movement and provides solutions for continuous business availability, live reporting, and database or platform migrations. With change data capture, the target is not necessarily a database. In addition to the solutions included in replication, this approach can also feed changes to an ETL process or deliver data changes to a downstream application by using a message queue.
2.3 Providing dynamic data with InfoSphere CDC
Change data capture is a mature technology for integrating data in near real time. InfoSphere CDC detects changes by monitoring, or scraping, database logs. The capture engine (the log scraper) is a lightweight, small-footprint, low-impact process that runs on the source server where the database changes are detected. After the log scraper finds new changed data on the source, that data is pushed from the source agent to the target apply engine through a standard Internet Protocol network socket. In a typical continuous mirroring scenario, the change data is applied to the target database through standard SQL statements.
The changed data is scraped from the source log, sent over the network, and applied to the target database without passing through any intermediate tables, files, or queues. A simple and intuitive user interface allows users to determine what data needs to be integrated and what transformations need to be performed on the data before it is applied to the target database.
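To make this flow concrete, the following sketch mimics, in miniature, what a capture-and-apply pipeline does: take committed row-level changes and apply them to the target database as standard SQL statements. This is an illustrative sketch only, not InfoSphere CDC code; the ChangeRecord type, the connection URL, and the parts table are hypothetical, and here the changes are faked in memory rather than scraped from a log and received over a socket.

Example 2-1   Conceptual capture-and-apply flow (illustrative sketch)

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

public class ApplySketch {

    // A captured row-level change, as a log scraper might describe it
    // (hypothetical shape, for illustration only).
    record ChangeRecord(String operation, int id, String name) {}

    public static void main(String[] args) throws Exception {
        // In the real product, changes arrive from the source agent over a
        // TCP/IP socket. Here, one small committed transaction is faked.
        List<ChangeRecord> committedChanges = List.of(
                new ChangeRecord("INSERT", 1, "widget"),
                new ChangeRecord("UPDATE", 2, "gadget"));

        // The apply engine turns each change into a standard SQL statement
        // against the target database (connection details are placeholders).
        try (Connection target = DriverManager.getConnection(
                "jdbc:db2://target-host:50000/TGTDB", "user", "password")) {
            target.setAutoCommit(false);
            for (ChangeRecord change : committedChanges) {
                String sql = switch (change.operation()) {
                    case "INSERT" -> "INSERT INTO parts (id, name) VALUES (?, ?)";
                    case "UPDATE" -> "UPDATE parts SET name = ? WHERE id = ?";
                    default -> throw new IllegalStateException(change.operation());
                };
                try (PreparedStatement stmt = target.prepareStatement(sql)) {
                    if (change.operation().equals("INSERT")) {
                        stmt.setInt(1, change.id());
                        stmt.setString(2, change.name());
                    } else {
                        stmt.setString(1, change.name());
                        stmt.setInt(2, change.id());
                    }
                    stmt.executeUpdate();
                }
            }
            target.commit(); // preserve the source transaction boundary
        }
    }
}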
Because InfoSphere CDC interacts only with the database logs, no additional load is put on the source database and no changes are required to the source application. Change data capture uses both online (active) and archive logs, with each source engine optimized for the database and platform on which it runs. For example, when running on the mainframe and monitoring DB2 for z/OS logs, the standard DB2 instrumentation facility interface (IFI) is used.
Incremental data delivery is shown in Figure 2-2.
Figure 2-2 Incremental data delivery
2.3.1 InfoSphere CDC architectural overview
Figure 2-3 provides an overview of the InfoSphere CDC architecture.
Figure 2-3 Architectural overview
The key components of the InfoSphere CDC architecture are:
Access Server: Controls all of the non-command-line access to the replication environment. When you log on to Management Console, you connect to the Access Server.
Admin API: Operates as an optional Java-based programming interface that you can use to script operational configurations or interactions. After you have set up replication, Management Console can be closed on the client workstation without affecting active data replication activities between source and target servers.
Apply Agent: Acts as the agent on the target that processes changes sent by the source.
Command-line interface: Allows you to administer data stores and user accounts, and to perform administration scripting, independent of Management Console.
Communication Layer (TCP/IP): Acts as the dedicated network connection between the source and the target.
Data store: The source and target data stores represent the data files and InfoSphere CDC instances required for data replication. Each data store represents a database to which you want to connect and acts as a container for your tables. Tables made available for replication are contained in a data store.
InfoSphere CDC Management Console: The interactive application that you use to configure and monitor replication. It allows you to manage replication on various servers, specify replication parameters, and initiate refresh and mirroring operations from a client workstation. Management Console also allows you to monitor replication operations, latency, event messages, and other statistics supported by the source or target data store. The monitor in Management Console is intended for time-critical working environments that require continuous analysis of data movement.
Metadata: Represents the information about the relevant tables, mappings, subscriptions, notifications, events, and other particulars of a data replication instance that you set up.
Mirror: Performs the replication of changes to the target table, or accumulates source table changes to be replicated to the target table at a later time. If you have implemented bidirectional replication in your environment, mirroring can occur to and from both the source and target tables.
Refresh: Performs the initial synchronization of the tables from the source database to the target. These tables are read by the Refresh reader.
Replication Engine: Sends and receives data. The process that sends replicated data is the source capture engine and the process that receives replicated data is the target engine. An InfoSphere CDC instance can operate as a source capture engine and a target engine simultaneously.
Single Scrape: Acts as a source-only log reader and a log parser component. It checks and analyzes the source database logs for all of the subscriptions on the selected data store.
Source transformation engine: Processes row filtering, critical columns, column filtering, encoding conversions, and other transformations of the data to propagate to the target data store engine.
Source database logs: Maintained by the source database for its own recovery purposes. The InfoSphere CDC log reader inspects these logs during the mirroring process, but looks only for those tables that are mapped for replication.
Target transformation engine: Processes data and value translations, encoding conversions, user exits, conflict detection, and other processing of the data on the target data store engine.
The two types of target-only destinations for replication that are not databases are:
JMS Messages: Acts as a JMS message destination (queue or topic) for row-level operations that are created as XML documents. A consumer sketch follows this list.
IBM InfoSphere DataStage®: Processes changes delivered from InfoSphere CDC that can be used by InfoSphere DataStage jobs.
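Because the JMS destination is a standard queue or topic, any JMS client can consume the change messages. The following sketch shows a minimal consumer using the standard javax.jms API. The JNDI names, the queue name, and the assumption that each message is a TextMessage carrying one XML document are illustrative only; consult the product documentation for the actual message format.

Example 2-2   Minimal JMS consumer for change messages (illustrative sketch)

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageConsumer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;
import javax.naming.InitialContext;

public class CdcQueueReader {
    public static void main(String[] args) throws Exception {
        // Look up the provider objects from JNDI (names are assumptions).
        InitialContext jndi = new InitialContext();
        ConnectionFactory factory =
                (ConnectionFactory) jndi.lookup("jms/ConnectionFactory");
        Queue queue = (Queue) jndi.lookup("jms/CDC.CHANGES"); // assumed name

        try (Connection connection = factory.createConnection()) {
            Session session =
                    connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageConsumer consumer = session.createConsumer(queue);
            connection.start();

            // Each message body is assumed to be an XML document describing
            // one or more row-level operations (insert, update, delete).
            TextMessage message = (TextMessage) consumer.receive(30_000);
            if (message != null) {
                System.out.println(message.getText());
            }
        }
    }
}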
2.3.2 Reliability and integrity
The InfoSphere CDC fault-tolerant architecture maintains data consistency and provides recovery from network and database outages. Change data is sent from the source database, through the source (log scraper) agent, to the apply agent over a TCP/IP connection. Only committed transactions are sent to the target. To ensure that the source system transaction order is maintained, the transactions are applied to the target in the same order as the change data found in the source database log (Figure 2-4).
Figure 2-4 Reliability and integrity
As part of each replication thread, there is a mechanism, called a bookmark, to track which transaction is being processed. The bookmark marks a point in the flow of committed changes and contains all the information necessary for InfoSphere CDC to restart replication. The InfoSphere CDC agent has a small metadata table stored in the target database that contains the last successfully applied bookmark. Updates to target tables are combined with an update to the bookmark table and applied to the target database as a single unit of work. This setup ensures that the bookmark accurately indicates what changes have been applied to the target database. If there is any failure during the apply, neither the change data nor the bookmark is updated.
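The single-unit-of-work guarantee can be pictured with plain JDBC, as in the following sketch: the row changes and the bookmark update share one transaction, so they commit or roll back together. The table and column names (parts, ts_bookmark) and the bookmark encoding are hypothetical; InfoSphere CDC manages its actual bookmark table internally.

Example 2-3   Bookmark and change data in one unit of work (illustrative sketch)

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BookmarkApplySketch {
    public static void main(String[] args) throws Exception {
        try (Connection target = DriverManager.getConnection(
                "jdbc:db2://target-host:50000/TGTDB", "user", "password")) {
            target.setAutoCommit(false); // one unit of work
            try (PreparedStatement apply = target.prepareStatement(
                         "UPDATE parts SET name = ? WHERE id = ?");
                 PreparedStatement bookmark = target.prepareStatement(
                         "UPDATE ts_bookmark SET position = ? WHERE subscription = ?")) {
                // Apply the replicated row change.
                apply.setString(1, "gadget");
                apply.setInt(2, 42);
                apply.executeUpdate();

                // Advance the bookmark in the same transaction
                // (position encoding is an assumption for illustration).
                bookmark.setString(1, "log-position-001122");
                bookmark.setString(2, "SUB1");
                bookmark.executeUpdate();

                target.commit();   // change data and bookmark move together
            } catch (Exception e) {
                target.rollback(); // neither the change nor the bookmark applies
                throw e;
            }
        }
    }
}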
If there is any disruption in the replication stream of source transactions, InfoSphere CDC must reconstruct the source transactions with the data read from the database log. The bookmark is the position in that stream that includes all the information necessary for InfoSphere CDC to recreate that stream and position it appropriately.
Whenever replication is restarted, either after normal or abnormal termination, the InfoSphere CDC target agent notifies the source agent of the bookmark. The source agent then positions the reader in the source database log and continues replication.