Test methodology
This chapter provides insight into the approach that the CICS performance team takes when producing performance benchmark results. The concept of a CICS workload is defined, along with a description of how workloads are designed and coded.
Performance testing requires the combination of several techniques to provide accurate, repeatable measurements. These techniques are presented here, together with demonstrations of some of the tools that were used when collecting performance data.
This chapter includes the following topics:
2.1, “Workloads”
2.2, “Workload design”
2.3, “Repeatable measurements”
2.4, “Driving the workload”
2.5, “Summary of performance monitoring tools”
2.6, “Collecting performance data”
2.1 Workloads
This book uses the term workload extensively. The term refers to the combination of the following key components of the environment that are used when producing performance figures for a specific CICS configuration:
Application code
Application code can be written in any language that is supported by the CICS environment. The number and sequence of EXEC CICS, EXEC SQL, or EXEC DLI commands dictate the flow of control between the application and the IBM CICS Transaction Server (TS) for IBM z/OS (CICS TS) environment under test; this flow of control is known as the workload logic.
Data that is required by the application
The data that is required by the application can be stored in VSAM files or in an IBM DB2® database, provided by the simulated clients, or supplied by some other external system. The data that is used corresponds to the data that is exchanged between components of CICS as part of a customer’s application.
Topology of connected address spaces
The number of CICS regions, the methods that are used to connect these CICS regions, and the logical partition (LPAR) in which the CICS region is executed all form part of the workload.
Configuration of the CICS region
There are many configuration parameters for CICS and the value for each can be modified to achieve a specific effect.
Simulated clients
The number of simulated clients, their method of communication with the CICS regions under test, and the rate at which requests are sent to the CICS regions can be varied to affect the behavior of a workload.
2.2 Workload design
Performance test workloads that are developed by the CICS TS performance team are deliberately lightweight; that is, workloads have little business logic. The phrase business logic refers to language constructs that serve only to manipulate data according to business rules, rather than the workload logic that is used to control program flow between the application and the CICS TS environment.
The CICS TS performance team specifically targets the discovery of performance problems in the CICS TS runtime code, and having lightweight applications maximizes the visibility of any potential problems at the time of development.
The use of a transaction from the Data Systems Workload (DSW), as described in 3.2, “Data Systems Workload” on page 22, helps you understand why minimizing business logic is important. Consider the following coding scenarios for the application:
A minimal business logic case with a total transaction CPU cost of 0.337 ms and consisting of the following values:
 – 0.322 ms of CPU for calls into CICS
 – 0.015 ms of CPU for business logic
A more heavyweight business logic case with a total transaction CPU cost of 1.500 ms and consisting of the following values:
 – 0.322 ms of CPU for calls into CICS
 – 1.178 ms of CPU for business logic
In both cases, the amount of CPU consumed by the CICS TS code to complete the CICS operations that are required for the workload is equal to 0.322 ms.
Now consider an example where a change in the CICS TS product during the development phase inadvertently introduces a CPU overhead of 5 µs for each transaction. With the workload in the first scenario (which contains a minimal amount of business logic), the total transaction cost increases from 0.337 ms to 0.342 ms of CPU, an increase of 1.5%. With the workload in the second scenario (which contains significant business logic), the total transaction cost increases from 1.500 ms to 1.505 ms of CPU, an increase of only 0.3%.
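A short calculation confirms the relative visibility of the regression in the two scenarios. The following sketch simply reproduces the arithmetic from the figures above:

# Relative visibility of a 5 µs per-transaction CPU regression in the
# lightweight and heavyweight scenarios (values in milliseconds).
overhead_ms = 0.005  # 5 µs expressed in milliseconds

for label, base_ms in [("lightweight", 0.337), ("heavyweight", 1.500)]:
    increase_pct = overhead_ms / base_ms * 100
    print(f"{label}: {base_ms:.3f} ms -> {base_ms + overhead_ms:.3f} ms "
          f"(+{increase_pct:.1f}%)")

# lightweight: 0.337 ms -> 0.342 ms (+1.5%)
# heavyweight: 1.500 ms -> 1.505 ms (+0.3%)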
Although techniques that are used to minimize variability in performance test results are described in 2.3, “Repeatable measurements” on page 13 and 2.6, “Collecting performance data” on page 18, note that only a finite level of accuracy in performance test results is achievable. By following leading practices in the CICS TS performance test environment, experience indicates that an accuracy of approximately ±1% can be achieved. The coding in the first scenario produces a relative performance change (1.5%) that is greater than the measurement accuracy, so the small performance degradation can be detected and the defect corrected.
Minimizing the amount of business logic in the test application maximizes the relative change in performance for the whole workload for any specific modification to the CICS TS runtime code. By using this worst-case test scenario approach, the performance test team can be confident that real-world applications do not observe any change in performance behavior.
 
Observation: For the DSW, an IBM zEnterprise® EC12 model HA1 executes at a rate of approximately 1,270 million instructions per second, per central processor (CP). An inadvertent change that added 5 µs to the total transaction cost represents approximately 6,350 instructions.
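The instruction count in the observation follows directly from the processor speed. A minimal worked version of the arithmetic:

# Instruction equivalent of a 5 µs CPU regression on a zEC12 CP that
# executes approximately 1,270 million instructions per second.
mips = 1270                # million instructions per second
overhead_seconds = 5e-6    # 5 µs

instructions = mips * 1_000_000 * overhead_seconds
print(int(instructions))   # 6350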
2.3 Repeatable measurements
Before describing how performance data is collected, it is important to understand that unless hardware is totally dedicated to a benchmark, the CPU that is used can vary each time the benchmark is run. Achieving repeatable results can be difficult. This statement is true for benchmark comparisons and also for CPU usage comparisons after a CICS upgrade.
For more information about how CPU time can be affected by other address spaces in the LPAR and other LPARs on the central processor complex (CPC), see IBM CICS Performance Series: Effective Monitoring for CICS Performance Benchmarks, REDP-5170, which is available from the IBM Redbooks website.
The LPARs that support the CICS regions in all performance benchmarks that are described in this publication use dedicated CPs. Although the CPs are dedicated, the L3 and L4 caches remain shared with other CPs that are used by other LPARs. This situation is not perfect; it can lead to CPU variation because data in those shared caches can be invalidated by the CPs that are used by the other LPARs. Minimizing the magnitude of these external influences is therefore a high priority when producing reliable performance benchmark results.
An automated measurement system is used to execute the benchmarks and collect the performance data. This automated system executes overnight during a period when no human users are permitted to access the LPAR. The use of an automation system reduces variation in results by ending unnecessary address spaces that can potentially disrupt the measurements. The use of overnight automation also minimizes disruption because that is the time frame during which other LPARs on the CPC are least busy.
2.3.1 Repeatability for Java workloads
Java programs consist of classes, which contain Java bytecode that is platform-neutral, meaning that it is not specific to any hardware or operating system platform. At run time, the Java virtual machine (JVM) compiles Java bytecode into IBM z/Architecture® instructions, using the just-in-time compiler (JIT) component.
Producing highly optimized z/Architecture instructions from Java bytecode requires processor time and memory. If all Java methods were compiled to the most aggressive level of optimization on first execution, application initialization times would be long, and significant quantities of CPU time would be wasted optimizing methods that are used only during startup.
To provide a balance between application startup times and long-term performance, the JIT compiler optimizes the bytecode using an iterative process. The JIT compiler maintains a count of the number of times each Java method is called. When the call count of a method exceeds a JIT recompilation threshold, the JIT recompiles the method to a more aggressive level of optimization and resets the method invocation count. This process is repeated until the maximum optimization level is reached. Therefore, often-used methods are compiled soon after the JVM has started, and less-used methods are compiled much later or not at all. The JIT compilation threshold helps the JVM start quickly and still have good long-term performance.
For more information about the operation of the JIT compiler on z/OS, see the topic “The JIT compiler” in IBM Knowledge Center.
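The invocation-counting scheme can be modeled with a short sketch. The following Python model is purely conceptual: the threshold, optimization level names, and data structures are assumptions for illustration and do not reflect the internal implementation of the JIT compiler:

# Conceptual model of invocation-count-triggered JIT recompilation.
# The threshold and optimization levels are illustrative assumptions,
# not the values that are used by the real JIT compiler.
OPT_LEVELS = ["interpreted", "warm", "hot", "scorching"]
RECOMPILE_THRESHOLD = 1000  # assumed per-level recompilation threshold

class MethodProfile:
    def __init__(self, name):
        self.name = name
        self.opt_level = 0   # every method starts at the lowest level
        self.call_count = 0

    def invoke(self):
        self.call_count += 1
        # When the call count exceeds the threshold, recompile at the
        # next optimization level and reset the invocation count.
        if (self.call_count > RECOMPILE_THRESHOLD
                and self.opt_level < len(OPT_LEVELS) - 1):
            self.opt_level += 1
            self.call_count = 0
            print(f"{self.name}: recompiled at {OPT_LEVELS[self.opt_level]}")

# A frequently called method climbs the optimization levels quickly;
# a rarely called method stays interpreted.
hot_method = MethodProfile("com.example.Account.debit")
for _ in range(5000):
    hot_method.invoke()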
This process of progressively optimizing Java methods leads to a change over time in the amount of CPU consumed by otherwise identical transactions. The first time a transaction is executed in Java, the z/Architecture instructions that are produced by the JIT compiler are at a low optimization level, which results in a relatively high CPU cost to execute the Java methods.
As more transactions are executed, the Java method invocation counts increase, and the JIT recompiles methods to more aggressive levels of optimization. This greater level of optimization results in a Java method requiring less CPU to execute than before the recompilation took place, so the CPU that is required to execute the transaction reduces. This process is repeated several times during the lifetime of the JVM.
Figure 2-1 illustrates this process for a complex servlet workload.
Figure 2-1 Plot of CPU cost per transaction over time for a Java workload
Note that the vertical axis in Figure 2-1 uses a logarithmic scale. The first few invocations of the transaction show relatively high CPU usage. As the transaction is executed multiple times, the JIT compiler optimizes the workload more aggressively, so the CPU cost per transaction reduces over time. Steps can be observed in the CPU cost per transaction value; these are events where high-use methods are further optimized. The frequent spikes in the CPU cost per transaction are due to garbage collection events.
When executing benchmarks that use JVMs, ensure that the JIT compiler has fully optimized the most important Java methods in the workload before starting CPU measurements. To minimize variability that is introduced by the JIT compiler, run the CICS Java workload at a constant transaction rate for a period, known as the warm-up time. After the workload has run in a steady state for the warm-up period, it is assumed that the JIT compiler will not optimize the workload further, and CPU measurements can be taken.
The warm-up period for a workload is determined by producing a chart, such as the one in Figure 2-1. The warm-up period ends at the point at which the CPU cost per transaction ceases to show any improvement.
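The warm-up point can also be estimated programmatically from the same per-interval data. The following sketch is one simple approach, assuming the measurements are available as a list of CPU costs per transaction; the tolerance and window values are assumptions for the example:

def warmup_end(cpu_costs, tolerance=0.01, window=3):
    # Return the index of the first interval after which the CPU cost
    # per transaction improves by less than the tolerance across a
    # sliding window of intervals.
    for i in range(len(cpu_costs) - window):
        start, end = cpu_costs[i], cpu_costs[i + window]
        if start > 0 and (start - end) / start < tolerance:
            return i
    return len(cpu_costs)  # still improving at the end of the data

# Costs fall quickly at first, then flatten when the JIT has finished:
costs = [9.0, 4.0, 2.5, 1.8, 1.4, 1.2, 1.1, 1.05, 1.04, 1.04, 1.03, 1.03]
print(warmup_end(costs))  # 8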
Shutting down a JVM discards the JIT-compiled native code; therefore, the iterative process of optimization begins again when the JVM is restarted. The ahead-of-time (AOT) compiler provides the ability to persist generated native code across subsequent executions of the same program, with the primary goal of improving startup times. The AOT compiler generates native code dynamically while an application runs and caches any generated AOT code in the shared data cache. Subsequent JVMs that execute the method can load and use the AOT code from the shared data cache without incurring the processing cost of generating the code again.
Because AOT code must persist over different program executions, AOT-generated code does not perform as well as JIT-generated code, although it usually performs better than interpreted code. For more information about the AOT compiler, see the topic “The AOT compiler” in IBM Knowledge Center.
2.4 Driving the workload
The IBM Workload Simulator for z/OS (Workload Simulator) tool is used to send work into the CICS regions from multiple simulated clients concurrently. For more information about Workload Simulator, see the product web page.
The process of sending work into the CICS regions is commonly referred to as driving the workload. The system under test is on a separate LPAR in the same sysplex. All network traffic is routed by way of a coupling facility from one LPAR to the other.
2.5 Summary of performance monitoring tools
During the benchmark measurement periods, the following tools are used:
2.5.1 RMF Monitor I
IBM RMF™ Monitor I records system resource usage, including CPU, DASD, and storage. It is also used with the workload manager (WLM) configuration to record CPU usage, transaction rates, and response times for CICS service classes and report classes.
SMF records 70 - 79 are written on an interval basis. They can be post-processed by using the ERBRMFPP RMF utility program.
2.5.2 RMF Monitor III
RMF Monitor III records the coupling facility activity for the logger and temporary storage structures.
SMF records 70 - 79 are written on an interval basis, and the records can be post-processed by using the ERBRMFPP RMF utility program. RMF Monitor III can also be used interactively, and the data can be written to VSAM data sets for later review.
2.5.3 CICS TS statistics
CICS statistics are used to monitor and report CICS resource usage, including CPU, storage, file accesses, and the number of requests that were transaction-routed.
With CICS interval statistics, most of the counters are reset at the start of the interval so that any resource consumption that is reported relates only to the observed measurement period. Interval statistics can be activated by using the CEMT SET STATISTICS command. However, when you set the interval, the first interval can be adjusted to a shorter time so that all subsequent intervals are synchronized to the STATEOD parameter. For example, if you use CEMT to set the interval to 15 minutes at 10 minutes past the hour, the first interval expires after 5 minutes so that all future intervals line up on 15-minute wall clock boundaries. The values in this first report can also relate to a much longer period, depending on the time of the last reset.
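The length of that first, shortened interval can be computed from the time of day. A minimal sketch, assuming a statistics end-of-day (STATEOD) of midnight:

from datetime import datetime

def first_interval_minutes(now, interval_minutes=15):
    # Minutes until the next wall clock boundary that is a whole
    # multiple of the interval; this is how long the first, shortened
    # statistics interval lasts (assuming a STATEOD of midnight).
    minutes_past_hour = now.minute + now.second / 60
    remainder = minutes_past_hour % interval_minutes
    return interval_minutes - remainder if remainder else interval_minutes

# A 15-minute interval set at 10 minutes past the hour expires after
# 5 minutes, so later intervals align on 15-minute boundaries.
print(first_interval_minutes(datetime(2024, 1, 1, 10, 10)))  # 5.0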
An alternative to the use of interval statistics is to use CEMT to reset the counters and then, at the end of the measurement period, use CEMT to record all the statistics. Resetting the statistics requires a change of state from ON to OFF or from OFF to ON. To ensure that this change happens, the following commands provide an example of resetting the statistics in one CICS region:
F CICSA001,CEMT SET STAT OFF RESET
 
F CICSA001,CEMT SET STAT ON RESET
The measurement period is between the RESET and the RECORD, as shown in the following example:
F CICSA001,CEMT PERFORM STAT ALL RECORD
Regardless of whether the statistics are ON or OFF, when a PERFORM STAT ALL RECORD command is issued, a statistics record is written.
CICS statistics are written as SMF 110 subtype 2 records. They can be post-processed by using the CICS statistics utility program, DFHSTUP, or CICS Performance Analyzer (CICS PA).
2.5.4 CICS TS performance class monitoring
When CICS Performance Class Monitoring is turned on, either by using MNPER=ON in the CICS startup parameters or dynamically by using the CEMT or CEMN transactions, a performance class monitoring record is generated for every transaction when that transaction ends.
The following command is an example of turning on CICS Performance Class Monitoring and Resource Class Monitoring in one CICS region:
F CICSA001,CEMT SET MON ON PER RESRCE
Performance class and resource class monitoring can then be turned off by using the following command:
F CICSA001,CEMT SET MON ON NOPER NORESRCE
The performance class record of each transaction contains information about the resources that were used by that transaction, how much CPU was used on the various task control blocks (TCBs), and how long the transaction waited for different resources. Resource class monitoring records contain information about the individual files, temporary storage queues, and distributed program links (DPLs) that were used by transactions.
Monitoring records are written as SMF 110 subtype 1 records that can be analyzed by using CICS PA.
2.5.5 Hardware instrumentation counters and samples
The CPU Measurement Facility (CPU MF) is described in 1.1, “CPU Measurement Facility” on page 4. The CPU MF capability is built into the hardware, and a z/OS component called hardware instrumentation services (HIS) sets up buffers that the hardware uses to store the sampling data. When a number of buffers are filled, the hardware generates an interrupt, which enables HIS to asynchronously collect the sampling information and save it to a file in the z/OS UNIX file system. This design also allows the samples to be gathered without requiring the software that collects the data to run at the highest Workload Manager priority level.
HIS can be used to collect the following types of data:
Counters
Instruction samples
HIS counters are written as System Management Facilities (SMF) 113 records and to the z/OS UNIX file system. These counters contain information about key hardware events, such as the number of instructions that were executed, the number of cycles that were used, and the number of instruction cache and data cache misses. Counters are used to provide a high-level understanding of how the address spaces interact with the hardware.
HIS instruction samples are written only to the z/OS UNIX file system. The samples are used to provide a view of CPU activity for individual instructions or groups of instructions. Tooling enables the inspection of this data to help the CICS performance team understand where hot spots exist in the CICS runtime code. Hot spots are short sequences of one or two machine instructions that consume a disproportionately large fraction of the total CPU cost. These hot spots are frequently caused by data access patterns that do not make optimal use of the hardware cache subsystem. Tooling that is written to consume HIS instruction samples also permits the comparison of two benchmark runs, where differences in performance can be analyzed at the instruction level.
For more information about configuring and using HIS, refer to Setting Up and Using the IBM System z CPU Measurement Facility with z/OS, REDP-4727, which is available from the IBM Redbooks website.
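The following sketch illustrates the kind of aggregation that hot spot tooling performs. The input format is a simplification: real HIS sample files are binary, and this example assumes the sampled instruction addresses were already extracted into a list:

from collections import Counter

def find_hot_spots(sample_addresses, top_n=5):
    # Aggregate instruction-address samples and report the addresses
    # that account for a disproportionate share of the total. Parsing
    # the raw binary sample format is outside the scope of this sketch.
    counts = Counter(sample_addresses)
    total = len(sample_addresses)
    return [(hex(addr), n, 100 * n / total)
            for addr, n in counts.most_common(top_n)]

# Example: two adjacent instructions dominate the samples, which is
# the signature of a hot spot.
samples = [0x1A2B0, 0x1A2B4] * 400 + [0x1F000, 0x20010, 0x20018] * 20
for addr, n, pct in find_hot_spots(samples):
    print(f"{addr}: {n} samples ({pct:.1f}%)")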
2.6 Collecting performance data
Performance data is typically collected for five measurement intervals. The rate at which work is driven into CICS is varied by adjusting the Workload Simulator user think time interval (UTI). The UTI value represents the delay between a simulated client receiving a response and sending the next request into CICS. A large think time results in a low rate of transactions in CICS. Reducing the UTI increases the rate at which work is driven into the CICS environment.
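The effect of the UTI on transaction rate can be approximated with a simple closed-loop model. This sketch is an idealization, assuming each simulated client waits for the response and then the think time before sending its next request:

def transaction_rate(clients, uti_seconds, response_seconds):
    # Approximate steady-state rate for a closed-loop workload: each
    # simulated client completes one request every
    # (think time + response time) seconds.
    return clients / (uti_seconds + response_seconds)

# 500 simulated clients, 1-second think time, 10 ms responses:
print(f"{transaction_rate(500, 1.0, 0.010):.0f} transactions per second")
# Halving the UTI roughly doubles the rate at these response times:
print(f"{transaction_rate(500, 0.5, 0.010):.0f} transactions per second")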
The initial measurement period begins by adjusting the UTI to achieve the required transaction rate in the CICS regions. The workload then runs for a period to ensure that all programs are loaded and the local shared resource (LSR) pools are populated. After this stabilization period is complete, the performance data collection is started.
No specific changes to any default CICS parameters are needed to support the data that is collected during performance benchmarks. Data is collected for a 5-minute period, which is relatively short but adequate in our environment when running in a steady-state.
RMF, CICS Performance Class Monitoring, CICS statistics, and HIS are all synchronized; they are started and ended together. An automation tool is used that enters commands on the IBM MVS™ console on a time-based interval.
To generate the RMF interval, RMF is started and stopped at the appropriate times, which creates an interval report for exactly that period, rather than attempting to synchronize RMF’s own intervals on a time basis.
When the workload is running in its stabilized state, the CICS statistics are reset by using the commands that are described in 2.5.3, “CICS TS statistics” on page 16. CICS Performance Class Monitoring is turned on by using the commands that are shown in 2.5.4, “CICS TS performance class monitoring” on page 17. RMF Monitor I is started by using the following MVS command:
S RMF.R
Monitor III is then started by using the following command:
F R,START III
HIS is also started to collect counter data only.
After 5 minutes have elapsed, RMF and HIS are stopped, and the command that is shown in 2.5.3, “CICS TS statistics” on page 16 is issued to request that CICS statistics are recorded.
After the performance data collection period ends, the UTI is reduced, which increases the transaction rate in CICS. Again, the workload is allowed to run for a period to ensure that the system reaches a steady-state. After this stabilization period is complete, the performance data collection is restarted.
After five cycles of UTI adjustment and data collection, a set of data is produced that represents the performance of the CICS regions at several transaction rates. The SMF data set that contains the collected RMF, CICS, and HIS performance data is copied for later post-processing and analysis to examine the performance characteristics of the workload.
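The overall procedure can be summarized as a loop over UTI values. The following sketch outlines the sequence of console commands from the preceding sections; the issue_command and set_uti helpers are hypothetical stand-ins for the automation tool and the Workload Simulator configuration, not real interfaces:

import time

UTI_VALUES = [2.0, 1.0, 0.5, 0.25, 0.125]  # illustrative think times (s)
MEASUREMENT_MINUTES = 5

def run_measurement_cycles(issue_command, set_uti,
                           stabilization_minutes=10):
    # One data-collection cycle per UTI value: stabilize, start the
    # monitors together, measure for 5 minutes, then stop and record.
    for uti in UTI_VALUES:
        set_uti(uti)                            # hypothetical driver control
        time.sleep(stabilization_minutes * 60)  # reach a steady state

        # Reset CICS statistics (the state change forces the reset),
        # then start the monitors together so the intervals line up.
        issue_command("F CICSA001,CEMT SET STAT OFF RESET")
        issue_command("F CICSA001,CEMT SET STAT ON RESET")
        issue_command("S RMF.R")                # RMF Monitor I
        issue_command("F R,START III")          # RMF Monitor III
        # HIS counter collection is also started at this point.

        time.sleep(MEASUREMENT_MINUTES * 60)

        # Stop the monitors and record CICS statistics for the period.
        issue_command("F CICSA001,CEMT PERFORM STAT ALL RECORD")
        issue_command("P RMF")                  # HIS is stopped similarly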
2.6.1 Collecting Java performance data
CICS performance class monitoring data does not account for all the CPU time that is consumed by a CICS region. Areas where time is spent that is not included in the monitoring data include the following examples:
Non-CICS TCBs
Service request blocks (SRBs) for networking or system calls
Request initialization (that is, before a CICS task is established)
Request termination (that is, after the CICS task monitoring data is written)
When running a Java workload, this uncaptured time is larger than that observed for more traditional workloads. This increased discrepancy happens for the following reasons:
A running JVM has several non-CICS TCBs executing to perform critical functions. The most significant of these functions are garbage collection (GC) and JIT compilation. GC and JIT TCBs can use non-trivial amounts of CPU in the JVM.
For applications using an IBM WebSphere Application Server Liberty (WebSphere Liberty) JVM server, the initial HTTP or HTTPS request is accepted in Java code. Therefore, a non-trivial amount of Java code is executed before CICS is notified of the request and, thus, before a CICS task is established.
For Java applications running in an OSGi JVM server, the discrepancy is lower than a WebSphere Liberty JVM server, because a CICS task is always established before invoking the OSGi JVM. The uncaptured time, therefore, is mostly due to the GC and JIT TCBs identified previously. This discrepancy is studied for two Java workloads in 7.16.4, “Comparison of CICS monitoring and RMF data” on page 147.
Given the potential for large amounts of uncaptured CPU time, it is important to use CPU time information measured at an address space level when analyzing the performance of CICS applications that use Java.
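The comparison reduces to a simple calculation: subtract the CPU time that is captured by CICS monitoring from the total address space CPU time that is reported at the RMF level. The numbers and structure in this sketch are illustrative only:

def uncaptured_cpu(address_space_cpu_seconds, task_cpu_seconds):
    # Compare address-space-level CPU time (for example, from RMF)
    # with the sum of per-task CPU times from CICS performance class
    # monitoring records. The difference is the uncaptured time that
    # is spent on non-CICS TCBs, SRBs, and request setup and teardown.
    captured = sum(task_cpu_seconds)
    uncaptured = address_space_cpu_seconds - captured
    return uncaptured, 100 * uncaptured / address_space_cpu_seconds

# Illustrative numbers only: 120 CPU seconds for the region, of which
# the per-transaction monitoring records account for 96 seconds.
seconds, pct = uncaptured_cpu(120.0, [0.012] * 8000)
print(f"uncaptured: {seconds:.1f}s ({pct:.1f}% of region CPU)")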