Performance analysis and design considerations
This chapter describes the high-level concepts behind performance tuning and optimization in IBM InfoSphere Change Data Capture (InfoSphere CDC). It explains the InfoSphere CDC architecture, highlights potential bottlenecks, and documents monitoring methods and design considerations for InfoSphere CDC systems in large environments.
More information about CDC performance tuning can be found at the following address:
8.1 High volume between two systems
Throughput and latency are the two main considerations in large environments. It is important to first understand the concept of throughput and latency in the context of InfoSphere CDC.
8.1.1 Latency and throughput
Latency, sometimes referred to as replication lag, is the amount of time between an update being applied to the source system and the same update being applied to the target system. The shorter the duration, the lower the latency.
There are three types of latency:
Data latency is the time delay in data transfer from source systems to target systems.
Analysis latency is the time delay that includes data creation and processing (ETL/ELT).
Action latency is the time delay in getting information from the data by using BI tools.
Figure 8-1 shows the types of latency.
Figure 8-1 Types of latency
Throughput is the quantity of data processed within a given period of time. High volumes of data changes on the source system require higher throughput if low latency is to be maintained.
Latency is not the same thing as throughput, but in high-volume environments, latency is directly affected by the available throughput.
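As a simple illustration (the figures are assumed for this example only): if the source workload generates 6 GB of change data per hour but the slowest component of the replication pipeline can sustain only 4 GB per hour, approximately 2 GB of backlog accumulates every hour, and end-to-end latency keeps growing until the change rate drops below the available throughput.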
8.1.2 InfoSphere CDC architecture
There are multiple components in the InfoSphere CDC architecture and there could be bottlenecks in each of these components. The major components in InfoSphere CDC include the following:
Log reader: The log reader reads entries from the database logs (including redo logs and transaction logs) and checks whether the transaction is in scope. In-scope entries are placed in a transaction queue. Uncommitted transactions may be read, because transactions run in parallel and are typically committed at different times. The transaction queue acts as a staging area for these transactions.
Log parser: The log parser reads committed transactions from the transaction queue and extracts individual columns from the row-level operations.
Staging store: The staging store is an in-memory staging area for records processed by the log parser. The data remains in the staging store until all subscriptions have completed processing it.
Figure 8-2 shows a high-level view of the architecture.
Figure 8-2 InfoSphere CDC architecture
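To make the flow through these components concrete, the following sketch models the source-side pipeline as a simple producer/consumer chain in Java. It is purely conceptual: the class names, record layouts, and queue size are invented for illustration and do not correspond to the actual InfoSphere CDC implementation or any of its APIs.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Conceptual model of the source-side pipeline described above:
// log reader -> transaction queue -> log parser -> staging store.
public class CdcPipelineSketch {

    record LogEntry(long transactionId, boolean committed, String rawRow) { }

    record ParsedRow(long transactionId, List<String> columns) { }

    // Transaction queue: staging area between the log reader and the log parser.
    private final BlockingQueue<LogEntry> transactionQueue = new ArrayBlockingQueue<>(10_000);

    // Staging store: holds parsed rows until every subscription has consumed them.
    private final List<ParsedRow> stagingStore = new ArrayList<>();

    // Log reader: reads log entries, keeps only the in-scope ones, and queues them.
    // Uncommitted entries may be queued as well, because transactions commit later.
    void logReader(List<LogEntry> databaseLog) throws InterruptedException {
        for (LogEntry entry : databaseLog) {
            if (isInScope(entry)) {
                transactionQueue.put(entry);
            }
        }
    }

    // Log parser: takes committed transactions off the queue and extracts columns.
    void logParser() throws InterruptedException {
        while (!transactionQueue.isEmpty()) {
            LogEntry entry = transactionQueue.take();
            if (entry.committed()) {
                List<String> columns = List.of(entry.rawRow().split(","));
                stagingStore.add(new ParsedRow(entry.transactionId(), columns));
            }
        }
    }

    private boolean isInScope(LogEntry entry) {
        // Real scoping depends on which tables are mapped for replication.
        return true;
    }
}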
Potential bottlenecks
InfoSphere CDC-based data replication operates as a pipeline of components that move data changes from the source database to the target system. As with any pipeline, InfoSphere CDC is only as fast as the slowest component in the architecture. The following components can be potential bottlenecks in the replication pipeline:
Log reader
Source engine
Network communication
Target engine
Target database
Latency is usually the first indication of a performance issue in a data replication environment. Here is a list of some factors that have a direct relationship to increased latency and reduced throughput:
Size of source database transactions
Source database delay in writing record changes to the database logs
Relative size and performance of the source database and target systems
Hardware factors, such as available physical memory and processor, and disk I/O performance
Complexity of data transformations
Available communication bandwidth
8.2 Identification of potential bottlenecks
Performance bottlenecks can occur in three main areas: the source system, the communication layer, and the target system.
Source-side bottlenecks can be caused by one of the following items:
Log reader: Reading the logs involves I/O and, in scenarios where the database logs are placed on an NFS mount, all of the log data must be transferred across the network. Bottlenecks in the log reader can be identified by comparing the elapsed time of copying a log file to another volume with the time taken to scrape that log. The two times should be comparable; if scraping takes much longer than a plain copy, the log reader is the likely bottleneck.
Log parser: If the source tables have many columns or wide records, parsing can take a long time and might even become processor-bound.
Derived expressions: Usage of derived columns and expressions can result in additional processor usage. Bottlenecks due to derived expressions can be identified by replacing derived expressions with constant values.
There can also be bottlenecks in the communication layer, depending on the bandwidth available to InfoSphere CDC. These bottlenecks can be identified by testing replication within a single system (intra-system replication), which takes the network out of the pipeline. Another way of identifying a communication bottleneck is to copy a large file from the source to the target system using FTP, which gives an indication of the achievable data replication speed. If FTP cannot use all of the available bandwidth, the same constraint applies to InfoSphere CDC.
A ping command should not be used to measure throughput, because it sends only small packets of data.
There is also potential for bottlenecks on the target side. Target system-related bottlenecks can be identified by disabling insert, update, and delete operations for individual tables. This method performs all the processing in the InfoSphere CDC layer except for modifying the target system tables.
The InfoSphere CDC Management Console has a feature, the Performance Monitor, that helps identify bottlenecks in the described areas. This feature can be used if both the source and target engines are at InfoSphere CDC Version 6.5 or later.
8.3 Performance monitoring in InfoSphere CDC environments
Diagnosing and resolving performance issues is an iterative process in InfoSphere CDC. High latency and resulting performance issues can be identified using one of the performance monitoring tools described in this section.
8.3.1 Performance monitoring using the Management Console
Management Console includes an event log and monitoring tools. The event log can be used to examine InfoSphere CDC event messages. The monitoring tool allows you to continuously monitor replication operations and latency. The monitor in Management Console is meant for time-critical working environments that require continuous analysis of data movement; all statistics are real-time statistics. The monitoring data can be exported for historical analysis.
After configuring the replication environment and starting the data replication process in InfoSphere CDC, you can monitor and analyze replication activities using the Monitoring perspective.
8.3.2 System monitoring tools
Any system monitoring tool can be used to identify high-level performance issues attributed to processor utilization, memory usage, and any I/O bottlenecks.
8.4 Using workflow for performance issues
There are three high-level steps to identifying and resolving performance issues by analyzing the InfoSphere CDC workflow. These steps are:
1. Configure visual cues to highlight latency issues.
InfoSphere CDC shows a visual cue in the Monitoring perspective when subscription latency exceeds a predetermined value based on business requirements. The latency threshold can be configured in InfoSphere CDC Management Console by completing the following steps:
a. Click Configuration → Subscriptions.
b. Right-click a subscription and select Latency Thresholds.
c. Select the Notify when latency crosses these thresholds check box.
d. Specify a value in Warning threshold (minutes).
e. Specify a value in Problem threshold (minutes).
f. Click Set Notification.
2. Get an overview of the replication environment to identify data replication problems.
Look at the high-level system overview to monitor the data replication process for an initial analysis of data latency issues. To see a data replication system overview, complete the following steps:
a. Click Monitoring → Subscriptions.
b. Select the subscription that needs to be monitored.
c. Click Mirroring or Refresh for the selected subscription.
d. Right-click the subscription and select Collect Statistics.
e. Double-click the titles at the upper right to show an overview of the replication environment. This action also enables quick identification of latency issues that might affect replication performance.
Figure 8-3 Problem identification
3. Analyze performance statistics.
Further analysis can be performed by collecting additional metrics at the instance level. To collect additional metrics, complete the following steps:
a. Select a subscription that is already collecting real-time statistics.
b. Right-click the subscription and click Show in Performance View.
c. Select Statistics Checkboxes. Up to 10 metrics can be chosen.
d. Click Collect Data. Any bottleneck can be shown at the top of this view. A bar next to the name of the InfoSphere CDC component identifies the bottleneck. With InfoSphere CDC, only one component can have a bottleneck at any given point in time.
Figure 8-4 Bottlenecks
8.5 Installation considerations
Large-scale distribution and consolidation implementations of InfoSphere CDC have many source and target instances. Each of these source and target systems needs to have an InfoSphere CDC instance installed on it. It might not always be feasible to manually deploy the solution for every server and database in the entire landscape.
8.5.1 Silent installations and instance creation
The InfoSphere CDC engine supports automated deployment and instance configuration using scripting. InfoSphere CDC uses a response file for the installation and an exported instance configuration to create an InfoSphere CDC instance for every database that it uses as a source or target system. Before InfoSphere CDC can be remotely deployed in an automated way, the following preparatory steps must be completed:
1. Create template databases (for the source and target systems) for the installation of InfoSphere CDC. The InfoSphere CDC user that holds the metadata must be created, and the databases must have the correct settings for (supplemental) logging and archiving of the transaction logs. Scripting all of the database preparation steps enables them to be performed automatically at the remote site.
2. Install InfoSphere CDC on the server that holds the template database and specify the -r responsefile option to create a response file to be used to install the production servers. You need to do this step for both the source and target template servers. An example response file for installation is shown in Figure 8-5.
Figure 8-5 Response file installation
After the installation is finished, a response file is generated. An example is shown in Figure 8-6.
Figure 8-6 Sample response file
3. Create the InfoSphere CDC instance for the template databases.
4. Configure the subscriptions and table mappings.
5. Export the instance configuration by running dmexportconfiguration. This command prompts you for the password of the database user that owns the InfoSphere CDC metadata; this password is used on the production installation when the instance is imported and is encrypted in the generated XML file. You need to do this step for both the source and target instances.
If needed, parameters such as <name>, <tcpPort>, <dbName>, and <dbSchema> can be overridden in the XML file to accommodate the production database settings. Because the database user name and password are encrypted in the XML file, these items should not be changed. Optionally, remove the InfoSphere CDC instances and installation from the template servers.
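The following fragment illustrates the kind of elements that might be adjusted in the exported XML file before it is imported on a production server. It is a simplified sketch only: the element names are the ones listed above, but the surrounding structure and the values are assumed for illustration, and the actual file produced by dmexportconfiguration contains additional elements, including the encrypted database credentials, which must be left untouched.

<!-- Illustrative fragment only; not the complete exported configuration -->
<instance>
  <name>CDC_PROD_SRC</name>      <!-- instance name to create on the production server -->
  <tcpPort>10901</tcpPort>       <!-- listener port for the new instance -->
  <dbName>SALESDB</dbName>       <!-- production database name -->
  <dbSchema>CDCMETA</dbSchema>   <!-- schema that holds the InfoSphere CDC metadata -->
</instance>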
For unattended deployment at the remote site, complete the following steps:
1. Prepare the production databases (source and target) for the installation of InfoSphere CDC. The InfoSphere CDC user that holds the metadata must be created and the database must have the correct settings for (supplemental) logging and archiving transaction logs. If a script has been created for this task, run the script.
2. Install InfoSphere CDC on the production servers using the response files that were created on the template servers (specify -i silent -f responsefile). An example of the response file generation is shown in Figure 8-7.
Figure 8-7 Response file generation
3. After the installation is finished, change the directory to the cdc_home/bin directory and import the instance configuration by running dmimportconfiguration. This step needs to be done for both source and target instances. After the import of the instance configuration is finished, the metadata tables are created in the database. An example of the instance configuration is shown in Figure 8-8.
Figure 8-8 InfoSphere CDC instance configuration
Upon finalizing the import, the InfoSphere CDC instance is automatically started and is ready to be used. For manual and automated configuration options, see Chapter 9, “Customization and automation” on page 231.
8.6 Design considerations
Sometimes a standard InfoSphere CDC configuration might not scale to the workload, or the latency might be too high. In such instances, alternative architectures need to be considered. Some of these alternative architectures might include using:
Multiple parallel subscriptions
Multiple InfoSphere CDC instances
An n-tiered architecture
Cascading replication
Continuous scraping
Each of these alternative architectures is suitable under certain circumstances. However, not all platforms support all of them. The remainder of this section goes through each of the alternative architectures and highlights any exceptions or challenges associated with them.
At the time of the publication of this book, the Performance Monitoring and Tuning guide for InfoSphere CDC is available at the following address:
See this document for more advanced monitoring and tuning techniques.
8.6.1 Using multiple parallel subscriptions
Data replication from a single table can be optimized by using multiple subscriptions. The available data can be partitioned by using row filtering so multiple subscriptions can operate concurrently to replicate changes to the target systems. This scenario is most suitable when there is a numeric column that uniquely identifies every row in the table. This column can be used to partition the data across multiple subscriptions.
This scenario does not work if the column used to partition the data is updated.
A Java-based user exit (see 9.7, “User exits” on page 369 for more information about user exits) can be used to tag rows of data so that they are directed to the corresponding subscriptions. This task is achieved by implementing a MODULO function on the numeric column that uniquely identifies every row of data in the table. As an example, to partition data such as that shown in Table 8-1 into three subscriptions, the Java-based user exit performs a MODULO 3 on the numeric key column and uses the value as a predicate filter for the three subscriptions. Subscription 1 has a predicate of MODULO_CUSTOMER_ID=1, Subscription 2 has a predicate of MODULO_CUSTOMER_ID=2, and Subscription 3 has a predicate of MODULO_CUSTOMER_ID=0.
Table 8-1 Sample source table

Customer ID   MODULO_CUSTOMER_ID   Customer name
31            1                    Customer 31
32            2                    Customer 32
33            0                    Customer 33
42            0                    Customer 42
76            1                    Customer 76
The MODULO_CUSTOMER_ID column is derived by applying the MODULO 3 function on the Customer ID column from within a Java based user exit.
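The following sketch illustrates the core of this calculation. It is a minimal illustration only: it does not implement the InfoSphere CDC user-exit interface, whose class names, registration steps, and method signatures are product- and version-specific. It simply demonstrates the MODULO 3 logic and the resulting subscription assignment for the sample keys in Table 8-1.

// Minimal sketch of the MODULO-based partitioning logic; it does not implement
// the actual InfoSphere CDC user-exit interface.
public class ModuloPartitionSketch {

    private static final int SUBSCRIPTION_COUNT = 3;   // three parallel subscriptions

    // Value carried by the MODULO_CUSTOMER_ID derived column for a given key.
    public static int moduloCustomerId(long customerId) {
        return (int) (customerId % SUBSCRIPTION_COUNT);
    }

    public static void main(String[] args) {
        long[] sampleKeys = {31, 32, 33, 42, 76};   // keys from Table 8-1
        for (long key : sampleKeys) {
            int modulo = moduloCustomerId(key);
            // Predicates: Subscription 1 -> MODULO=1, Subscription 2 -> MODULO=2,
            // Subscription 3 -> MODULO=0
            int subscription = (modulo == 0) ? 3 : modulo;
            System.out.printf("Customer %d: MODULO_CUSTOMER_ID=%d, Subscription %d%n",
                    key, modulo, subscription);
        }
    }
}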
The three subscriptions for this source table have row-based filtering on the MODULO_CUSTOMER_ID column. Subscription 1 handles the rows with Customer IDs 31 and 76, Subscription 2 handles the row with Customer ID 32, and Subscription 3 handles the rows with Customer IDs 33 and 42. All three subscriptions populate the same target system, so the overall throughput is much higher, which reduces latency and increases performance. The architecture for this configuration is shown in Figure 8-9.
Figure 8-9 Multiple parallel subscription architecture
At the time of the publication of this book, InfoSphere CDC V6.5.2 had been released, which introduced fast apply techniques to improve throughput when applying transactions on the target database. More information can be found at the following address:
8.6.2 Using multiple InfoSphere CDC instances
Distributing the workload across multiple InfoSphere CDC instances can also be used to reduce latency and increase throughput (Figure 8-10).
Figure 8-10 Multiple InfoSphere CDC instances architecture
Every InfoSphere CDC instance has its own log reader, which might have some performance implications on the source side, but the overall benefit of higher throughput in large data replication environments outweighs the additional work done on the source systems.
Each InfoSphere CDC instance in the architecture shown in Figure 8-10 can be set up either to replicate different tables or to replicate the same table with different row-based filters. Depending on the data volumes in the individual tables, one or a combination of both methods can be used.
8.6.3 Using an n-tiered architecture
N-tiered data replication in InfoSphere CDC means that the InfoSphere CDC instance is on a different server than the database it sources or populates. An n-tiered architecture can operate as follows:
InfoSphere CDC reads logs coming from a different server (remote log reading).
InfoSphere CDC applies transactions to a database that is on a different server (remote apply).
Typical applications of an n-tiered architecture
The three common scenarios for the n-tiered architecture are:
Compliance, security, and ownership reasons, including:
 – Database administrators not comfortable installing third-party applications (InfoSphere CDC) on the database server.
 – The database server is owned by the business, which wants to mitigate any risk of disruption to the database servers.
Software restrictions on the database server. For example, the operating system patch level might not be at the level required for InfoSphere CDC.
Source system is at capacity.
 – No additional processing capability for InfoSphere CDC.
 – The source system might be an existing system that does not have sufficient cycles to process high volumes of database logs.
The InfoSphere CDC n-tiered architectures are differentiated by the location of the source and target instances.
InfoSphere CDC instances on source database server
In this configuration, the InfoSphere CDC instances for both the source and target systems are located on the source database server. This configuration is called Option 1 (Figure 8-11).
Figure 8-11 N-tiered architecture - Option 1
InfoSphere CDC instances on the target database server
In the configuration shown in Figure 8-12, the InfoSphere CDC instances for both the source and target systems are on the target database server. This option is called Option 2.
Figure 8-12 N-tiered architecture - Option 2
InfoSphere CDC instances on separate database servers
In the configuration shown in Figure 8-13, the InfoSphere CDC instances for both the source and target systems are on a centralized InfoSphere CDC server.
Figure 8-13 N-tiered architecture - Option 3
8.6.4 Using cascading replication to spread the workload
Cascading replication can vastly improve performance and reduce latency as a result of the multiplier effect. Similar to having multiple subscriptions for a given table with row-level filters, this method distributes the workload across multiple subscriptions, and it can be combined with the multiple subscription method. Use the Java-based user exit to perform a MODULO on a numeric column in the source table and replicate the data based on different predicate values to different target systems. These target systems can then be used as a source to load and consolidate data rows in the final destination table. Cascading replication is shown in Figure 8-14.
Figure 8-14 Cascading replication
8.6.5 Continuous scraping
In large distribution environments, one of the biggest challenges is purging source database logs. If all target systems are not simultaneously active or if there are unstable network connections between source and target, there could be some target systems that are not synchronized with the source.
InfoSphere CDC needs access to the database log entry that is represented by the oldest bookmark of any of the subscriptions. When there is a network outage, InfoSphere CDC stops reading the database transaction logs and the bookmark of the subscription is no longer advanced. After it is restarted, the subscription requests the bookmark from the target system and restarts from the position in the log that corresponds to the bookmark. If this log entry is no longer available because the log files have been purged, replication does not start, and the log files must be restored for normal operation to continue. In some instances, the database logs can no longer be recovered, and a resynchronization (refresh) of all tables must be done.
Besides the risk of losing changed data, a prolonged outage of the replication can result in resource utilization spikes when subscriptions are restarted. When replication is started in mirror mode, InfoSphere CDC always attempts to catch up to the head of the database transaction log and thus uses any resource it can obtain from the operating system to achieve this goal. To optimize this operation and reduce the impact of InfoSphere CDC in large distribution environments, consider enabling the Continuous Capture feature, which is available in all engines that support single scrape.
Enabling continuous capture separates the log reader process from the replication of the logical database operations. The InfoSphere CDC source engine actively reads the database logs at all times and processes the in-scope entries. Entries (operations) are sent to transaction queues and, once committed, they are parsed and accumulated in the staging store, where they are kept until all subscriptions have completed processing them. Even when no subscriptions are active, the engine continues to read and process the database log entries and strives to stay at the head of the log.
When the inactive subscriptions are restarted, the parsed log entries are already in the staging store and can be processed immediately by the subscriptions, which saves a fair amount of processing time. The staging store entries are cleaned up as soon as subscriptions no longer need them.
As the staging store on the source server is accumulating the log changes, its size increases. You must plan your source environment to have sufficient disk space. More importantly, the maximum size of the staging store that was specified when the instance was created must be equal to or lower than the available disk space.
 
Quota and disk capacities: Running out of quota for the staging store pauses scrape activities. However, running out of physical disk space causes the log scraper to end abnormally and corrupts the staging store. In such an event, the staging store must be cleared (by running dmclearstagingstore) and the continuous capture process must start reading the logs again from the oldest bookmark.
Continuous scraping of the database logs is incompatible with the independent staging store capability of InfoSphere CDC. If you choose to enable continuous scraping, all subscriptions sourcing the instance must share the single scrape staging store. Should your environment need a more flexible setup where some subscriptions must be able to have their own independent staging store, consider creating an additional InfoSphere CDC instance to run those subscriptions.
The data store system parameter staging_store_can_run_independently controls the use of a single scrape staging store. The default setting of this parameter is true, which means that subscriptions can create their own log reader process if a database log entry cannot be found in the staging store. Changing the parameter to false causes all subscriptions to work off the same staging store, in effect creating a single scrape staging store.
After the single scrape staging store has been activated, the dmenablecontinuouscapture and dmdisablecontinuouscapture commands can be used to start and stop continuous scrape. At all times, the staging store status can be monitored using the dmgetstagingstorestatus command.
Figure 8-15 shows how to enable continuous capture and check its status.
Figure 8-15 Enable continuous capture and check status
In this example, one of the subscriptions sourcing this InfoSphere CDC instance has a table mapped for Mirror and transactions are being generated, but the subscription is inactive. As transactions are generated and the staging store status is monitored, the staging store gradually starts filling up. In Figure 8-16, the staging store is 9% full and the capture process is reading at the head of the log, as indicated by the capture latency of 0%.
Figure 8-16 Staging store status
If the InfoSphere CDC instance is stopped at some point, the single scrape staging store is persisted to disk. When the instance is restarted, continuous capture is automatically restarted and continues from the last log entry that was read.
 
Abnormal termination: Abnormal termination of the InfoSphere CDC instance corrupts the staging store. Always shut down the InfoSphere CDC instance by running dmshutdown to avoid the situation where the staging store must be cleared and the log reader must restart from the earliest bookmark position.
Consider the log retention policy to adopt when using continuous capture. The suggested approach is to keep the database logs until the transactions they contain have been applied to the target, which avoids having to restore them if the InfoSphere CDC instance terminates abnormally. However, in large distribution environments where the target systems are not always active, imposing this log retention policy might pose challenges in terms of disk space and manageability of the database server.
The dmshowlogdependency command can be used to list the database logs that are still required by InfoSphere CDC. The -i -A option of this command lists all database logs starting from the oldest bookmark.
Figure 8-17 shows a list of database logs that are still required by InfoSphere CDC.
Figure 8-17 Identifying database logs required by InfoSphere CDC
As an alternative to a log retention policy that keeps on the system every database log that might need to be read later, you can consider purging all database logs older than the log file that is currently being read by the continuous capture process. The dmshowlogdependency -l -A command shows the currently read log file (Figure 8-18).
Figure 8-18 Identifying logs that can be purged
 
Purging the database logs: If the database logs are purged up to the currently processed log file (disregarding the bookmark) and a system or product failure occurs, the purged logs must be restored or all tables need to be refreshed.
For more information about database log retention, see Chapter 7, “Environmental considerations” on page 137.