Solution topologies
This chapter describes the solution topologies and the flexible implementations available when using IBM InfoSphere Change Data Capture (InfoSphere CDC). To position these topologies, this chapter describes the potential benefits of each one and provides examples of their use.
Timing and flexibility are everything in today's competitive business environment, especially when it comes to business information. The increasing expectation that services are available 24x7, combined with the growing demand for real-time reporting and analytics, means that data must be constantly accessible.
However, critical business information is not always available to the people who need it. The data might be out of sync or inaccessible due to overloaded system resources, but the result is often the same: reduced productivity and profits, and diminished customer service.
The ability to use existing underutilized systems to offload reporting or data processing helps companies avoid increasing the size or number of single-use systems. InfoSphere CDC has allowed systems to be used in a true master / master (bidirectional) mode for many years. This proven functionality means that data can be selected for reporting from those systems, and users can change data that is then replicated back to the primary or other systems to maintain data consistency. Being able to use the processing power of multiple machines, which might be underutilized, as a single system reduces the need for expensive upgrades or replacement of overloaded production systems.
CDC solutions can support any IT infrastructure or environment. They can replicate or mirror data between heterogeneous database versions, and synchronize data between disparate systems. Many replication products require both the source and target databases to be the same version and on the same platform. This limitation requires that database upgrades be done simultaneously, which in many cases requires a database outage during that time.
CDC allows database upgrades to be done independently, which provides maximum uptime of systems and reduces the impact of testing and migrating to newer versions. Cross-database-version replication also allows applications that depend on a particular database version to be kept in sync. CDC provides similar capabilities across hardware platforms and operating system versions.
4.1 Unidirectional replication
Unidirectional replication is the movement of data in one direction from the source tables to the target tables, and is used for data redundancy and synchronization (Figure 4-1).
Figure 4-1 InfoSphere CDC unidirectional architecture - source to target
Your source and target tables can be of different types. You can transform the data that you replicate between the source and the target. You can map tables one at a time using standard replication.
The InfoSphere CDC Management Console provides the following two mechanisms for mapping using standard replication:
One-to-one: Map tables using one-to-one replication when you want to map multiple source tables to multiple target tables at a time. These tables share a table structure and similar table names. The Map Tables wizard automatically maps tables based on an example mapping that you define and set up.
Custom: Map tables individually when you want to map only one source table to one target table at a time. These source tables might not share a table structure or similar table names with the target tables. Use this option to map each source table one at a time.
4.2 Cascading replication
Cascading replication involves a source system transmitting data to a target system which, in turn, serves as a source for the next system in the integration chain (Figure 4-2).
Figure 4-2 Cascading integration
Tools that support cascading integration enable the most efficient movement of data throughout a large organization.
For example, suppose that a company has 12 branch offices and a head office. If it takes an average of 15 minutes for each site to send its nightly data update to the head office, the total integration time is three hours. In a cascading integration environment, this time can be cut. If three of the remote sites served as cascade points for the three offices closest to them, the head office would receive three consolidated transfers instead of twelve, and the regional collections could run in parallel, so the time required to complete the integration process, and the accompanying communication costs incurred, could be reduced dramatically.
Cascading integration streamlines the integration process by enabling organizations to select regional cascade points.
You can use this type of replication to distribute changes across many servers using a multiplier effect. In Figure 4-3, employee data (EMPLOYEES) is replicated to two separate tables (DIVISION1 and DIVISION2), where each table contains data about employees in a specific division. Data from each division table is then replicated to other tables (HIRES1980 and HIRES1990), where the separation is based on hiring dates. Row selection to filter the data into the correct tables is not required if you want to replicate all data in the EMPLOYEES table to all destinations.
Figure 4-3 Cascading replication
4.3 Bidirectional replication
Bidirectional replication involves replication from the publication server to the subscription server, and replication in the opposite direction (Figure 4-4). If both systems are used to change the same tables, recursive updates occur: a change on System A is replicated to System B, the change is then replicated back to System A, and then to System B again, and so on. Bidirectional replication therefore requires recursion prevention to stop these repetitive changes.
Figure 4-4 Bidirectional replication
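To make the recursion problem concrete, the following minimal Python sketch shows how tagging each change with its originating system lets an apply process discard changes that loop back to their origin. The change format, the origin tag, and the apply_change helper are assumptions for illustration only; they are not InfoSphere CDC interfaces.

def apply_change(system_name, local_table, change):
    """Apply a replicated change unless it originated on this system."""
    if change["origin"] == system_name:
        # The change has looped back to where it started: discard it
        # instead of re-applying (and re-replicating) it.
        return False
    local_table[change["key"]] = change["value"]
    return True

# Two copies of the same table, one per system.
table_a, table_b = {}, {}

# A user updates row 42 on System A ...
change = {"origin": "SystemA", "key": 42, "value": "new address"}
table_a[change["key"]] = change["value"]

# ... the change is replicated to System B and applied there ...
assert apply_change("SystemB", table_b, change) is True

# ... and when replication sends it back to System A, the origin tag stops the loop.
assert apply_change("SystemA", table_a, change) is False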
Figure 4-5 provides another view of bidirectional replication from a systems and processes point of view.
Figure 4-5 Bidirectional replication - a system view
With bidirectional replication, conflict detection and resolution are required, such as in a situation where the same record is updated on both systems at the same time. The greater the data latency, the more likely transaction conflicts become, which is why it is important to have the fastest replication possible for this type of replication. InfoSphere CDC provides real-time and bidirectional data integration and transformation between diverse relational databases and other data stores on different platforms. Through bidirectional mirroring, workload distribution allows data to be on more than one machine, with the users segmented between them. This setup can reduce the cost of maintaining a fragmented IT environment by enabling incompatible applications to coexist.
To implement bidirectional replication, you must install InfoSphere CDC on both servers, and each server must be able to send and receive replicated data. One of the benefits of the bidirectional capability of InfoSphere CDC is that it provides a rollback strategy for a migration, which can result in, for example, a zero downtime migration capability. This strategy lets you resynchronize changes to the original source system after cutover when, for example, a post-migration problem has occurred. Bidirectional replication can also provide e-commerce application synchronization, workload balancing, and application integration.
Another excellent use for bidirectional topology is data synchronization for upgrades, migrations, and workload balancing. This capability keeps data synchronized between the current production server and a deployed server, for example, to test a new application version upgrade or a hardware or OS upgrade. The workload balancing capability (master to master support) allows database instances to remain synchronized where dual or double data entry is required (such as when data entry is occurring on both systems at the same time) (Figure 4-6).
Figure 4-6 Data synchronization for upgrades, migrations, and workload balancing
4.4 Consolidation replication
InfoSphere CDC also supports the implementation of consolidation replication. In this implementation, data from multiple publication servers updates a single subscription table. You must define each publication server separately, and then assign the publication tables to the subscription table that serves as the data warehouse. Because multiple publication tables are updating a single subscription table, the publication tables must have the same attributes.
For example, suppose that you need to create a data warehouse to consolidate employee records created and used by two separate divisions. Each of these divisions has a publication table that is maintained separately. A subscription table is used to consolidate the data from these two tables. On each publication server, you define the publication tables to be mirrored, and then transfer the publication table definitions to the subscription. Then you must work in the subscription environment to assign the publication table definitions to the actual subscription table. When defining the subscription table, a new column is added for the division data. After assigning the publication tables to the subscription table, map the new column to a different constant value in each assignment.
Using the example in Figure 4-7, map the DIVISION column to '01' for the first publication table, and map the DIVISION column to '02' for the second publication table. When data is replicated from one of the publication tables, the corresponding value is written in each row replicated to the subscription table. As a result, rows in the subscription table are identifiable by division. A row identifier defined for a table assignment is used to identify the rows in the table originating from the publication table.
Figure 4-7 Consolidation replication
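The following minimal Python sketch illustrates the constant-column idea: rows from each publication server are tagged with a constant DIVISION value as they arrive at the consolidated table. The table layout and the consolidate helper are assumptions for this example; in InfoSphere CDC the constant is assigned in the column mapping of each table assignment.

def consolidate(rows, division_constant):
    """Return warehouse rows tagged with the originating division."""
    return [dict(row, DIVISION=division_constant) for row in rows]

division1_rows = [{"EMPNO": 100, "NAME": "J. Smith"}]
division2_rows = [{"EMPNO": 100, "NAME": "A. Jones"}]

# Rows with the same EMPNO remain distinguishable by DIVISION, which is
# why the constant column (or a row identifier) is needed.
warehouse = consolidate(division1_rows, "01") + consolidate(division2_rows, "02")
for row in warehouse:
    print(row["DIVISION"], row["EMPNO"], row["NAME"])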
One reason to use the consolidation topology is to build a low latency operational data store (ODS) for operational reporting and auditing (Figure 4-8).
Figure 4-8 Building a low latency operational data store
This ODS is used by companies:
Looking to manage increasing data volumes, shrink batch windows for ETL processing, and mitigate the risk associated with extracting data from heavily used production environments.
Needing to stream data to an operational data store, where an ETL product, such as DataStage or IBM Cognos® Data Manager, extracts, transforms, and loads data from an ODS into the data warehouse.
Already employing ETL products, but looking to gain operational efficiencies by reducing the impact of extracts or increasing the timeliness of data for loading into the data warehouse.
The business value of this setup includes such benefits as:
Real-time operational reporting from the ODS
More frequent ETL processing available by using the ODS as the source for loading the data warehouse
Using your existing investment in ETL processes and tools
Enabling more comprehensive Business Analytics by replicating update and delete operations as inserts into the ODS
There is also technical value, such as:
Low impact on source environments
No changes to existing infrastructure
Another reason to use the consolidation topology is for offloading production query and reporting cycles. The reporting server can also be used for consolidation requirements, such as consolidating financial information from multiple branches into a single corporate instance.
Replication frequency generally varies from continuous (near real time) to periodic. Table level refresh or copy can be used in addition to log based change data capture (Figure 4-9).
Figure 4-9 Off loading production query and reporting cycles
4.5 Data distribution replication
You can use InfoSphere CDC to implement data distribution, where a single publication table is used to update multiple subscription tables. As part of data distribution, you transfer only the data that is relevant to each subscription environment. For example, if data is distributed to separate divisions, you must apply row selection criteria so that only data relevant to a particular division is mirrored to its environment.
Frequently, the publication table has a column, such as DEPARTMENT, DIVISION, or COMPANY, that can be referenced in the row selection criteria to define the destination of replicated data (Figure 4-10). However, you can use any column or combination of columns for the selection criteria. Different row selection criteria can be defined for each possible destination. When replication occurs, each row in the publication table is evaluated against the values defined in the row selection criteria. These values determine the destination of the replicated row.
Figure 4-10 Data distribution implementation
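As a rough Python sketch of this routing idea, each destination below has its own selection criterion, and a replicated row is sent only to the destinations whose criterion it satisfies. The destination names, column values, and route helper are hypothetical; in InfoSphere CDC the criteria are defined on each subscription.

selection_criteria = {
    "DIVISION1_SERVER": lambda row: row["DIVISION"] == "01",
    "DIVISION2_SERVER": lambda row: row["DIVISION"] == "02",
}

def route(row):
    """Return the destinations that should receive this replicated row."""
    return [dest for dest, matches in selection_criteria.items() if matches(row)]

print(route({"EMPNO": 100, "DIVISION": "01"}))  # ['DIVISION1_SERVER']
print(route({"EMPNO": 200, "DIVISION": "02"}))  # ['DIVISION2_SERVER']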
4.6 Hub-and-Spoke replication with propagation
Hub-and-Spoke replication requires a configuration consisting of centrally administered tables on a hub server that are simultaneously maintained on multiple subscription (spoke) servers (Figure 4-11). Changes applied to the tables on subscription servers (in this example, TORONTO and NEWYORK) are replicated to the same tables on the hub server (HQ), and then routed to the same tables on all other subscription servers in the Hub-and-Spoke configuration.
Figure 4-11 Hub-and-Spoke replication
In a Hub-and-Spoke configuration, it is possible for changes originating on one spoke to be replicated back to that spoke recursively. To prevent this situation from happening, you must configure propagation control for each hub-to-spoke subscription. Propagation control allows you to specify the corresponding spoke-to-hub subscription when defining the hub-to-spoke subscription.
Figure 4-12 shows a change on the TORONTO spoke being replicated along the spoke-to-hub subscription to the HQ hub.
Figure 4-12 Replicating along a spoke-to-hub subscription
After applying the change to the HQ hub, the hub server then needs to replicate the change to its spokes. There are two subscriptions that replicate data from the hub, HQ_TO and HQ_NY. When defining these subscriptions, you must declare that you do not want HQ_TO to replicate changes that were originally applied to the hub by the TO_HQ subscription. The same is true for the HQ_NY and NY_HQ subscriptions.
Figure 4-13 illustrates how changes should propagate from the hub.
Figure 4-13 Propagation control
Before you can configure propagation control for a hub-to-spoke subscription, you must define the corresponding spoke-to-hub subscription.
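The following minimal Python sketch restates the propagation control rule: a change that reaches the hub on a spoke-to-hub subscription is forwarded on every hub-to-spoke subscription except the one that leads back to the originating spoke. The dictionaries and the forward_from_hub helper are illustrative assumptions, not product interfaces; the subscription names match the example above.

hub_to_spoke = {"HQ_TO": "TORONTO", "HQ_NY": "NEWYORK"}
spoke_to_hub = {"TO_HQ": "TORONTO", "NY_HQ": "NEWYORK"}

def forward_from_hub(arrived_on):
    """Return the hub-to-spoke subscriptions that should propagate a change
    that reached the hub on the given spoke-to-hub subscription."""
    originating_spoke = spoke_to_hub[arrived_on]
    return [sub for sub, spoke in hub_to_spoke.items() if spoke != originating_spoke]

# A change applied to HQ by TO_HQ is forwarded only over HQ_NY,
# never back to Toronto over HQ_TO.
print(forward_from_hub("TO_HQ"))  # ['HQ_NY']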
A good example of this topology is e-commerce application synchronization. Here the topology provides continuous bidirectional synchronization between web-based applications and mission-critical business applications. It also helps organizations improve the customer online shopping experience through better visibility into inventory and customer shopping activities.
This situation is shown in Figure 4-14.
Figure 4-14 Application synchronization
4.7 Destinations
This section describes the types of InfoSphere CDC destinations (targets) that are available to support the InfoSphere CDC hub (Figure 4-15). Depending on your business and environmental requirements, you can use different destination options. For example, you might use a JMS Message Queue for event detection, such as a notification when inventory or balances reach a critically low level. As another example, you might use destinations such as flat files or DataStage to eliminate high impact nightly ETL batch processes. You can also use web services as a destination as part of an event-driven architecture and service-oriented architecture (SOA).
Figure 4-15 InfoSphere CDC architecture - source to InfoSphere CDC Hub
4.7.1 JMS Message Queue
InfoSphere CDC retrieves database operations from the source database and transforms data into XML messages.
Using the InfoSphere CDC Queue Targeting Mapping Designer, you can map XML elements and attributes to columns in a target table. When you start mirroring, InfoSphere CDC sends the XML message to a JMS message destination. You must install and set up an InfoSphere CDC source product that can scrape table-level operations (inserts, updates, and deletes) from a source database.
This source database represents the production database where your source tables are. When InfoSphere CDC detects a transaction boundary or commit operation, it sends the XML message to the JMS message destination.
Here are some examples of the kinds of business events that you can define on your production database:
A new customer sale has been entered into the source database. You might want InfoSphere CDC to send an XML message to a JMS application that generates an event to different departments.
A credit card balance changes significantly in a short period. You might want to track this change and notify the fraud department with real-time information about the credit card changes.
Inventory levels are running low on a particular product. You might want to detect the low inventory and notify production management.
Figure 4-16 shows the architecture to support these business events.
Figure 4-16 Event synchronization through an Enterprise Service Bus
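As a simplified sketch of the consuming side of this architecture, the following Python snippet parses a hypothetical XML change message and reacts to the low-inventory event described above. The element and attribute names are assumptions for illustration; the actual message layout is determined by your InfoSphere CDC mapping, and the JMS plumbing (connection, queue, message retrieval) is omitted.

import xml.etree.ElementTree as ET

sample_message = """
<change table="INVENTORY" operation="update">
  <column name="ITEM_ID">12345</column>
  <column name="QTY_ON_HAND">3</column>
</change>
"""

def handle_message(xml_text, low_stock_threshold=5):
    root = ET.fromstring(xml_text)
    columns = {c.get("name"): c.text for c in root.findall("column")}
    if root.get("table") == "INVENTORY" and int(columns["QTY_ON_HAND"]) < low_stock_threshold:
        # A real consumer would notify production management here, for
        # example by publishing to another queue or calling a service.
        print("Low inventory for item", columns["ITEM_ID"], "-", columns["QTY_ON_HAND"], "left")

handle_message(sample_message)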
Using the JMS Message Queue as a destination may be beneficial in the following situations:
You are using a JMS compliant message-oriented middleware solution, such as IBM WebSphere MQ Server, WebSphere ESB, Tibco, or BEA WebLogic.
You have built dashboards, KPIs, or composite applications on messaging-oriented middleware, but they suffer from missing or stale data. An example is Emergency Center monitoring used to quickly identify patient bottlenecks, perform short-term trending, and identify process improvement and problem areas.
You need real-time event data for business activity monitoring (BAM), for example, detecting changes to inventory in a retail store inventory system and triggering alarms for the store managers in an executive dashboard.
You need event data to build a business process management (BPM) solution, for example, detecting unusual ATM transactions and feeding them into a fraud prevention process built using Tibco.
4.7.2 Flat files
Flat files can be used by an existing ETL solution or can be used to feed a data warehouse appliance or relational database management system (RDBMS).
When replicating to flat files, choose the directory that holds the flat files, and set the row count and time threshold (in seconds) at which flat files are hardened (marked complete) for processing by the ETL solution. The flat file is closed, and the next one is created and opened, when either value is reached.
A flat file is hardened only on commit boundaries. If commit cycles are larger or longer than the thresholds, the commit cycles take precedence. As a result, you might end up with more rows in the flat file than expected.
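The following Python sketch mimics this hardening rule: an open file is hardened when a row-count or time threshold has been exceeded, but the check happens only at a commit boundary, which is why a file can contain more rows than the threshold. The buffering and file naming are assumptions for illustration, not the InfoSphere CDC implementation.

import time

class FlatFileWriter:
    def __init__(self, row_threshold=1000, time_threshold_secs=60):
        self.row_threshold = row_threshold
        self.time_threshold_secs = time_threshold_secs
        self.rows, self.opened_at, self.file_number = [], time.time(), 0

    def write_row(self, line):
        self.rows.append(line)

    def commit(self):
        """Called at a source commit boundary; hardening happens only here."""
        over_rows = len(self.rows) >= self.row_threshold
        over_time = time.time() - self.opened_at >= self.time_threshold_secs
        if over_rows or over_time:
            self._harden()

    def _harden(self):
        self.file_number += 1
        name = "changes.%06d.csv" % self.file_number
        with open(name, "w") as f:
            f.writelines(line + "\n" for line in self.rows)
        print("hardened", name, "with", len(self.rows), "rows")
        self.rows, self.opened_at = [], time.time()

writer = FlatFileWriter(row_threshold=2)
writer.write_row('"...first change..."')
writer.commit()        # below both thresholds: file stays open
writer.write_row('"...second change..."')
writer.write_row('"...third change..."')
writer.commit()        # threshold exceeded at this commit: three rows are hardened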
When you are using flat files, consider the following items:
Can be used by any ETL solution. Most ETL solutions support flat files as a source.
Suitable for high volumes of changes.
A flat file has a row (or multiple rows) for every table operation; each row contains the following information:
 – Timestamp
 – Transaction sequence number
 – Operation type (Insert/Update/Delete (I/U/D))
 – User
 – Before image of row
 – After image of row
You can log the before and after images in a single row in the flat file, or create one row with the before image and another row with the after image. The latter format can be beneficial if the downstream process needs only the after image of an updated row.
The option to store flat files in single or multiple record format (that is, to log the before and after images in a single row or in multiple rows) is shown in Figure 4-17.
Figure 4-17 Option for single or multiple record format stored for flat files
For a single record, there is one line for an update transaction (U) containing the before update record image followed by the after update record image, as follows:
"2012-01-03 23:20:46","82950","U","DB2INST1","700700700","Timothy Blitz","4 Sugar Forest Dr","IRVING","TEXAS","22598", "2012-01-03", "77", "700700700","Timothy Blitz","80 Grandravine Dr","SAN JOSE", "CALIFORNIA","95101"
For multiple records, there are two lines for an update transaction: one line containing the before update record image (B) and another line containing the after update record image (A), as follows:
"2012-01-03 23:29:36","83161","B","DB2INST1","700700700","Timothy Blitz","4 Sugar Forest Dr","IRVING","TEXAS","22598“
 
“2012-01-03 23:29:36","83161","A","DB2INST1","700700700","Timothy Blitz","80 Grandravine Dr","SAN JOSE","CALIFORNIA","95101”
Choosing between single record and multiple records depends on the design of the DataStage job. For example, if your DataStage job can safely process an UPDATE as a DELETE/INSERT pair, then using the multiple record format allows you to create a DataStage job that does not need special logic to deal with updates. If your job does need to be aware of updates specifically, then the single record format is the most convenient.
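For example, a downstream process that needs only after images could read the multiple record format with a few lines of Python, as in the sketch below. The column positions follow the sample records shown above and should be treated as an assumption to verify against your own flat file definitions.

import csv, io

sample = (
    '"2012-01-03 23:29:36","83161","B","DB2INST1","700700700","Timothy Blitz",'
    '"4 Sugar Forest Dr","IRVING","TEXAS","22598"\n'
    '"2012-01-03 23:29:36","83161","A","DB2INST1","700700700","Timothy Blitz",'
    '"80 Grandravine Dr","SAN JOSE","CALIFORNIA","95101"\n'
)

for timestamp, sequence, image, user, *columns in csv.reader(io.StringIO(sample)):
    if image == "A":          # keep only the after image of each update
        print(sequence, columns)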
Figure 4-18 shows a situation where you replicate to flat files.
Figure 4-18 Replicating to flat files
Figure 4-19 shows replication to a business intelligence (BI) appliance.
Figure 4-19 Replicating to a BI appliance
4.7.3 DataStage
This section describes the four methods of integrating InfoSphere CDC with DataStage (Figure 4-20). The most closely integrated option is direct connect, followed by the file-based method. There are other considerations, such as acceptable latency (the time from when a transaction occurs on the source to when it is applied to the target), data volumes, and the number of tables.
Figure 4-20 Integrating InfoSphere CDC with DataStage
Database staging method
For the database staging method, instead of feeding the changes from InfoSphere CDC directly to DataStage, they are written to a staging area where they are picked up by DataStage and then applied to the target. Having these changes in a staging area might be preferable for customers with multiple applications that require the raw change-only data.
Here is the process for this method:
1. DataStage extracts data for initial load using standard ETL functions.
2. InfoSphere CDC continuously captures changes made to the source database.
3. InfoSphere CDC continuously writes changes to a set of staging tables using audit-type mappings.
4. DataStage reads the changes from the staging tables, and transforms and cleans the data as needed.
5. Update the target database with changes.
6. Update internal tracking with the last InfoSphere CDC bookmark processed (the InfoSphere CDC bookmark is used only to ensure that every change makes it into the staging tables).
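A minimal Python sketch of steps 4 through 6, using SQLite in place of the real staging and target databases: it reads staged changes beyond the last processed bookmark, applies them, and advances the bookmark in the same transaction so that a failure cannot lose or double-apply changes. The table names, bookmark column, and upsert logic are assumptions for illustration.

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE staging  (bookmark INTEGER, op TEXT, empno INTEGER, name TEXT);
    CREATE TABLE target   (empno INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE tracking (last_bookmark INTEGER);
    INSERT INTO tracking VALUES (0);
    INSERT INTO staging VALUES (1, 'I', 100, 'J. Smith'), (2, 'U', 100, 'J. A. Smith');
""")

last = db.execute("SELECT last_bookmark FROM tracking").fetchone()[0]
changes = db.execute(
    "SELECT bookmark, op, empno, name FROM staging WHERE bookmark > ? ORDER BY bookmark",
    (last,)).fetchall()

with db:  # one transaction: apply the changes and advance the bookmark together
    for bookmark, op, empno, name in changes:
        if op == "D":
            db.execute("DELETE FROM target WHERE empno = ?", (empno,))
        else:  # treat inserts and updates as upserts in this sketch
            db.execute("INSERT OR REPLACE INTO target VALUES (?, ?)", (empno, name))
        db.execute("UPDATE tracking SET last_bookmark = ?", (bookmark,))

print(db.execute("SELECT * FROM target").fetchall())    # [(100, 'J. A. Smith')]
print(db.execute("SELECT * FROM tracking").fetchone())  # (2,)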
This method is ideal for the following situations:
Low latency (minutes)
Low / medium data volumes (a few thousand rows per second)
Any number of tables
WebSphere MQ based integration method
The WebSphere MQ based integration is similar to the staging approach. Here, the InfoSphere CDC changes are written to WebSphere MQ and are then picked up by DataStage and applied to the target warehouse. This approach might be preferable for clients who have an existing messaging infrastructure in their environment, or whose message queue serves as the backbone for other services that also want to use the InfoSphere CDC data directly.
Here is the process for this method:
1. DataStage extracts data for the initial load using standard ETL functions.
2. InfoSphere CDC continuously captures changes made to the source database.
3. InfoSphere CDC continuously writes change messages to WebSphere MQ through the InfoSphere CDC event server target.
4. DataStage (through the WebSphere MQ connector) processes messages and passes data off to downstream stages.
5. Updates are written to the target database.
This method is ideal for the following situations:
Near real-time integration (seconds)
Low data volumes (hundreds of changes per second)
When the infrastructure uses WebSphere MQ
File-based method
There is also the file-based method, where InfoSphere CDC changes are written to flat files, which are then processed by DataStage. This approach provides recoverability: if the source or target machine fails, the replication process can resume from the last processed file (a simple sketch of this idea follows the process steps below). It does require more disk space, because a landing area is needed for the files.
Here is the process for this method:
1. DataStage extracts data for the initial load using standard ETL functions, or InfoSphere CDC can be used for the refresh.
2. InfoSphere CDC continuously captures changes made to the source database.
3. InfoSphere CDC for DataStage writes one file per table and periodically hardens the files.
4. DataStage reads the changes from the completed files.
5. Update the target database with changes.
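A rough Python sketch of the recoverability idea mentioned earlier: a consumer of the hardened files records the last file it finished, so after a failure it resumes with the next completed file instead of reprocessing everything. The directory name, file naming convention, and state file are assumptions for illustration; the real completion convention depends on your InfoSphere CDC flat file settings and the DataStage job design.

import glob, os

STATE_FILE = "last_processed.txt"

def unprocessed_files(directory):
    """Return hardened files that have not been processed yet, in name order."""
    done = ""
    if os.path.exists(STATE_FILE):
        done = open(STATE_FILE).read().strip()
    # Assumes hardened files sort into creation order by name.
    return [f for f in sorted(glob.glob(os.path.join(directory, "*.csv"))) if f > done]

def process(path):
    with open(path) as f:
        for line in f:
            pass  # transform and load the change row here
    with open(STATE_FILE, "w") as state:
        state.write(path)  # record progress only after the file is fully processed

for path in unprocessed_files("cdc_output"):
    process(path)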
This method is ideal for the following situations:
Medium latency (a few minutes or more between periodic batches)
High data volumes requiring parallel loading
Less than 100 tables
Direct connect method
Direct connect is the most integrated approach. To summarize this method at a high level, the changes captured by InfoSphere CDC at the source database are streamed directly into DataStage, where they are then applied by a DataStage job.
Here is the process for this method:
1. DataStage extracts data for initial load using standard ETL functions or InfoSphere CDC can be used for the refresh.
2. InfoSphere CDC continuously captures changes made to the source database and streams them over TCP/IP to the InfoSphere CDC Transaction Stage.
3. InfoSphere CDC Transaction Stage passes data off to downstream stages.
4. Update the target database with changed data. The bookmark persists in the target database along with the client data to maintain end-to-end transactional integrity.
5. InfoSphere CDC continuously captures the changed data from the source systems and passes it to a continuously active DataStage job.
6. The DataStage job receives transactions through the InfoSphere CDC Transaction Stage operator.
7. In the job, the data can be transformed or passed on to other downstream locations.
8. The InfoSphere CDC bookmark is maintained as DataStage applies the changes to the target database.
This method is ideal for the following situations:
Near real-time integration (seconds)
Medium data volumes (hundreds to low thousands of rows per second)
Less than 150 tables
Figure 4-21 gives an overview of the process that DataStage uses to consume data from InfoSphere CDC.
Figure 4-21 DataStage consumption by InfoSphere CDC
4.7.4 Web services
This section describes how InfoSphere CDC can be part of an event-driven architecture and a service-oriented architecture (SOA).
Event-driven companies, which acquire, deploy, and use real-time information, are most successful at sensing and responding to the changes or events that drive their businesses. This situation calls for redefining the existing traditional IT architecture towards an event-driven architecture (EDA). The term event-driven architecture refers to any application that reacts intelligently to changes in conditions. Those changes could be a customer registration, a termination of services by a customer, a hardware device malfunction, a power outage in one region of the country that causes a temporary shutdown, or a sudden change in a stock price. Depending on the size of the business, there are hundreds or thousands of notable events that occur every minute, every hour, and every day. By nature, some events are positive and some negative; some might provide a business opportunity while others might pose a threat.
SOA is becoming the de facto standard for technical infrastructure in organizations across all industries. It is an approach to building software applications as collections of autonomous services that interact without regard to each other's platform, data structures, or internal algorithms. SOA improves an organization's ability to react to changing business dynamics and to use its technical assets, helping it gain a competitive advantage.
The term second-generation SOA, also referred to as Web 2.0, is the merger of SOA with EDA. The outcome is the creation of new systems that exceed the sum of their parts. EDA and SOA together enable companies to move more quickly towards becoming what are known as real-time enterprises. These companies compete by using the most current information to enable faster and more informed business decisions. They gain a competitive advantage by using such things as new composite applications, real-time dashboards, and key performance indicators (KPIs), and focus on enabling more automated business processes.
Event-driven architecture: An overview
Events are at the core of EDA. The way in which an event is recognized, enriched, and flowed across a business provides strategic competitive advantage. Companies must aggregate and integrate their data so they can more quickly react to those critical events as they occur.
Discrete data events
Discrete data events are events that are generated from a single system and represent granular activity in the business system. A shipping order, a change in an inventory level, or a new customer added to a customer relationship management system are discrete data events. A discrete data event is analogous to a row in a table, but might be the aggregate result of a few elements from multiple rows in multiple tables. What characterizes a data event as discrete is its linkage to a single business entity, such as a customer or a product.
An overview
InfoSphere CDC supports SOA environments by, for example, packaging data transactions into XML documents and delivering them to applications in real time. This capability can also be used as an event detection solution, capturing mission-critical data transactions in real time and sending them to downstream applications to generate or automate business processes immediately.
Figure 4-22 shows the event detection process.
Figure 4-22 Event detection
Context-rich composite events have the following characteristics:
Combine content associated with the event from other systems.
Route the data to different applications to initiate business processes based on the content of a message.