Operational considerations
This chapter describes operational considerations relating to hosting cloud services on IBM z/OS and IBM CICS Transaction Server for z/OS. The authors share the experience of the Walmart team and demonstrate how Walmart dealt with these challenges.
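This chapter includes the following topics:
6.1, “Capacity planning”
6.2, “Scaling and elasticity”
6.3, “Metering and measurement”
6.4, “Maintenance”
6.5, “Problem determination”
6.6, “Summary”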
6.1 Capacity planning
Capacity planning is an essential component of any mainframe environment. Capacity planners are primarily concerned with ensuring that applications perform reliably and can handle spikes or increases in usage without negative effects on quality of service; that is, while still meeting service level agreements (SLAs). Another concern is to ensure that the CPU usage for the software components is within the desired constraints.
Capacity planners can use many tools to achieve these goals. In a CICS environment, many of these tools focus on collecting and analyzing System Management Facilities (SMF) records, in particular the SMF type 110 records that are written by the CICS Monitoring Facility (CMF). As a service provider, it is essential to understand your capacity planners' concerns and to communicate well with them. This relationship can yield the following benefits:
Sharing tooling and data. Within the Walmart organization, many of the same tools that are used by the capacity planners can be used in monitoring and measurement. For more information, see 6.3, “Metering and measurement” on page 61.
Mitigating any concerns that capacity planners might have by being transparent about which services are hosted in CICS.
Ensuring that your environments have the capacity to grow with your needs.
6.1.1 The concerns of a capacity planner
It is essential to understand the concerns that a capacity planner might have and do all you can to mitigate these concerns.
CPU usage
One initial reaction to adding these services to CICS, or to making these services available, might be concern about higher cost because of an increase in CPU consumption.
Through the implementation of the caching service, Walmart found the opposite was true; that is, CPU usage did not increase. The caching service reduced the processing that is required to fetch resources that are now cached, which resulted in a net reduction in CPU usage.
An example within Walmart was a CICS web service application that ran a complex IBM DB2 query, which, because of its complexity, was run for a two-week period only, six times per year. During the morning batch runs of updates to the DB2 tables, the information was also loaded to this application’s caching table. The query to DB2 in the CICS web service was replaced by a call to the caching service, which significantly reduced I/Os and subsequent CPU cycles.
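The pattern in this example can be illustrated with a minimal Python sketch. It is not Walmart's implementation: the `CacheTable` class, the `complex_query` function, and the sample keys are hypothetical stand-ins for the caching service and the DB2 query. The sketch only shows that the expensive query runs once, in the batch window, while every later request reads the cached result.

```python
from typing import Dict, Optional


class CacheTable:
    """Hypothetical stand-in for a consumer's caching table."""

    def __init__(self) -> None:
        self._rows: Dict[str, str] = {}

    def load(self, rows: Dict[str, str]) -> None:
        # Replace the cached contents with the latest batch results.
        self._rows = dict(rows)

    def get(self, key: str) -> Optional[str]:
        # Return the cached value, or None if the key was never loaded.
        return self._rows.get(key)


query_runs = 0  # counts how often the expensive query is executed


def complex_query() -> Dict[str, str]:
    """Hypothetical stand-in for the complex DB2 query."""
    global query_runs
    query_runs += 1
    return {"ITEM-1": "result-1", "ITEM-2": "result-2"}


cache = CacheTable()
cache.load(complex_query())  # morning batch: the query runs once

# Online requests read from the cache instead of querying the database.
answers = [cache.get("ITEM-1") for _ in range(1000)]
print(query_runs, answers[0])
```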
Monitoring service usage
All current application and system monitoring tools, including tools that use SMF data, are still applicable to monitoring services. These tools give capacity planners the ability to view the usage of the environment as a whole. With an intelligent, well-thought-out design, these tools also give capacity planners, CICS systems programmers, and service providers a granular view of which consumer is using which resources.
6.1.2 Partnership with capacity planners
Capacity planners focus on the mainframe environment at a high level. Within Walmart, the addition of microservices has not affected the mainframe environment at this level.
Even so, as multiple examples in this book show, the true key to success in the new mainframe world is developing partnerships with the teams throughout your organization. With these partnerships in mind, building notifications to interested parties into your provisioning tooling, as part of an overall provisioning solution, can greatly increase the visibility of environment usage among teams.
For example, when a new microservice is provisioned, notifying the capacity planners allows them to properly monitor trends in the environment usage and provide data, such as MIPS usage, which can be used to project future requirements on your systems.
6.2 Scaling and elasticity
Elasticity or elastic scale is one of the key concepts involved when building cloud services. It is also one of the five National Institute of Standards and Technology (NIST) essential characteristics of cloud. For more information about these characteristics, see 1.3, “How z/OS and CICS are relevant to cloud” on page 3, and 3.3, “Five essential characteristics of cloud” on page 17.
Elasticity, or elastic scale, refers to the need for the service to be scalable to meet the needs of the service consumer. For z/OS based services, this scaling should happen on-demand (within pricing restrictions, see 6.3, “Metering and measurement” on page 61) and seamlessly.
The service consumer should not need to be aware of the requirements of the system at any level; the environment should grow according to need. Service consumers are also concerned with the inverse; that is, having a provisioned environment that scales back when demand for a service decreases.
For more information about scaling and elasticity in a cloud environment, see Creating IBM z/OS Cloud Services, SG24-8324.
The following methods are suggested in this section to achieve elasticity of services:
Run with spare capacity: Create a large environment that can handle increases in demand to maximums that the consumer requires without the requirement to dynamically grow.
On-demand growth: Create environments that run at capacity that can grow and shrink to meet increases and decreases in demand.
6.3 Metering and measurement
The key to creating a successful cloud service is good service design. The design of the cloud services was described at great length throughout this book. This section describes how effective design makes measurement of usage of provisioned services easy to monitor and control.
Within the Walmart environment, monitoring of service usage and control of that usage based on limits is achieved on multiple levels. Service providers at Walmart monitor system usage by using the same tools as capacity planners use to monitor the IBM z Systems environments. These tools analyze SMF data, which includes SMF type 110 records output by CICS, SMF type 30 records for common address space work, and SMF type 7x records for IBM Resource Measurement Facility (RMF) data.
How can service providers use SMF data to measure usage for individual consumers? In the Walmart multi-tenant environment, work is isolated by enforcing strict naming conventions throughout the entire service. Figure 6-1 shows how a workload is tracked by using the transaction ID that is unique to a particular consumer.
Figure 6-1 Metering service usage with unique transaction IDs
Each request that comes into the environment is routed by using a URIMAP resource (unique to a consumer) to a unique transaction ID. SMF data for this transaction is then output to SMF log streams. For live analysis of the environment, Walmart does not wait for the data to be output to log streams. By making use of an exit at the XMNOUT Global User Exit Point, Walmart captures transaction data as it is being sent to SMF and avoids the need to decompress the data output to SMF log streams.
The ability to monitor each individual tenant’s service usage is essential when calculating how much to charge for service usage.
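As a rough illustration of per-tenant metering, the following Python sketch aggregates monitoring records by transaction ID. The record layout (`tran_id`, `cpu_ms`) is an assumption made for this example; it is not the SMF type 110 format, and decoding real SMF data is outside the scope of the sketch.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Mapping


@dataclass
class Usage:
    transactions: int = 0
    cpu_ms: float = 0.0


def usage_by_consumer(records: Iterable[Mapping[str, object]]) -> Dict[str, Usage]:
    """Aggregate request counts and CPU time per consumer transaction ID."""
    totals: Dict[str, Usage] = {}
    for record in records:
        # The transaction ID is unique to one consumer, so it is the key.
        usage = totals.setdefault(str(record["tran_id"]), Usage())
        usage.transactions += 1
        usage.cpu_ms += float(record["cpu_ms"])
    return totals


# Hypothetical, already-decoded records (not a real SMF layout).
sample_records = [
    {"tran_id": "UC01", "cpu_ms": 4.0},
    {"tran_id": "UC02", "cpu_ms": 7.5},
    {"tran_id": "UC01", "cpu_ms": 3.0},
]

totals = usage_by_consumer(sample_records)
for tran_id, usage in sorted(totals.items()):
    print(tran_id, usage.transactions, usage.cpu_ms)
```

Because each transaction ID maps to exactly one consumer, no additional lookup is needed to attribute the work.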
6.3.1 Charging for service usage
Through monitoring the transaction information output from SMF data, a service provider can calculate how many times a transaction is started. This transaction count can be used for billing and usage capping. The following approaches are used for billing:
Projected-usage charging
Pay per transaction
Projected-usage charging
By using projected-usage charging, the service consumer provides an estimate for how many requests (which directly map to transactions) they expect to use in a specific period. Consumers then pay based on this projected rate.
If a consumer then exceeds this prepaid limit, policies can be implemented to restrict transaction flow rate until more capacity is arranged. These restrictions are implemented by using transaction classes.
When a service consumer exceeds an agreed capacity by a certain percentage, the service is then halted and it returns an HTTP return code 404 to the requester. The service consumer can then purchase more capacity, service restrictions are removed, and limits are reset based on this newly defined capacity.
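The capping policy can be sketched as a small decision function. This Python example is illustrative only: the 10 percent overage threshold, the function names, and the status code that is returned for allowed requests are assumptions, and the sketch does not show how a restriction is applied to a CICS transaction class.

```python
from enum import Enum


class Action(Enum):
    ALLOW = "allow"        # within the prepaid limit
    THROTTLE = "throttle"  # over the limit: restrict the transaction flow rate
    HALT = "halt"          # over the limit by the agreed percentage: stop service


def capping_action(used: int, prepaid: int, halt_overage_pct: int = 10) -> Action:
    """Decide how to treat a consumer, given usage in the current period.

    halt_overage_pct is an assumed value; the agreed percentage is set per consumer.
    """
    halt_at = prepaid + (prepaid * halt_overage_pct) // 100
    if used > halt_at:
        return Action.HALT
    if used > prepaid:
        return Action.THROTTLE
    return Action.ALLOW


def http_status_for(action: Action) -> int:
    """Return 404 for a halted service, as the text describes; 200 otherwise."""
    return 404 if action is Action.HALT else 200


for used in (900, 1050, 1101):
    action = capping_action(used, prepaid=1000)
    print(used, action.value, http_status_for(action))
```

When the consumer purchases more capacity, the same function is called with the new `prepaid` value, which effectively resets the limits.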
Pay per transaction
An alternative to charging based on projected usage is to charge per transaction; that is, charging for actual usage. This pricing model can be restricted by the organization’s internal charging capabilities; for example, if the organization cannot charge on a per-usage basis. The same SMF-based tools can be used to measure transaction throughput and then charge the consumer for their usage at a specific interval.
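The arithmetic behind pay-per-transaction charging is simple, as the following Python sketch shows. The rate, the rounding rule, and the function name are assumptions for illustration; they do not describe Walmart's internal charging.

```python
from decimal import Decimal, ROUND_HALF_UP


def interval_charge(transaction_count: int, rate_per_transaction: Decimal) -> Decimal:
    """Charge for actual usage in one billing interval, rounded to cents."""
    if transaction_count < 0:
        raise ValueError("transaction_count must not be negative")
    charge = transaction_count * rate_per_transaction
    return charge.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Hypothetical figures: 12,500 transactions at 0.002 per transaction.
charge = interval_charge(12_500, Decimal("0.002"))
print(charge)
```

The sketch uses `Decimal` rather than floating point so that monetary amounts are not affected by binary rounding.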
6.3.2 Use of measurement information for scaling
In addition to using the monitoring data output by CICS for billing, you can use this data to enable further automation within your environment. By using this SMF data, you can automatically provision and deprovision CICS as required.
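One way to picture this automation is a threshold rule that turns a measured transaction rate into a target number of CICS regions. The following Python sketch makes several assumptions that are not in the text: the per-region capacity figure, the 80 percent and 30 percent thresholds, and the minimum of two regions are all hypothetical, and the actual provisioning step is not shown.

```python
def target_region_count(
    tx_per_second: float,
    regions: int,
    per_region_capacity: float,
    scale_up_at: float = 0.80,    # assumed utilization threshold for growth
    scale_down_at: float = 0.30,  # assumed utilization threshold for shrinking
    min_regions: int = 2,         # assumed floor to preserve availability
) -> int:
    """Return how many regions should run, given the measured load."""
    utilization = tx_per_second / (regions * per_region_capacity)
    if utilization > scale_up_at:
        return regions + 1
    if utilization < scale_down_at and regions > min_regions:
        return regions - 1
    return regions


print(target_region_count(900.0, regions=2, per_region_capacity=500.0))
print(target_region_count(200.0, regions=3, per_region_capacity=500.0))
print(target_region_count(600.0, regions=2, per_region_capacity=500.0))
```

Changing the count by one region per evaluation keeps the sketch simple; a real policy would likely also smooth the measurements over time to avoid oscillation.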
6.3.3 Evolution of service-oriented architecture to cloud service
It is important to highlight how the thinking on this topic evolved to meet the new cloud service model. It is not the first time Walmart made core services available through a web services interface. Walmart used technologies, such as CICS web services, in the past to make functions available within CICS.
These previous experiences highlighted to Walmart how metering and measurement are essential for a reliable services interface. An example of this necessity was a previous CICS web services application that was used by an internal development team within Walmart. The use of this service was steady for some time, and then the CICS operations team noticed a significant increase in the usage of the service. The application owners insisted that the new traffic was not because of their application. After some investigation, the operations team realized that it was another development team that was using the service through discovery of the published WSDL.
This experience was taken into account in the design of Walmart’s recent microservices. The multi-tenant, isolated, and consumer-provisioned services provide the ability to monitor exactly who is using the service (through individual URIs, transactions, and user IDs). The consumer is responsible for controlling access to that instance of the service and the credentials to access the service.
6.4 Maintenance
This section describes the considerations in maintaining your CICS environments that host your provisioned services.
The main principles of maintaining this type of environment are the same as maintaining any other highly available CICS environment. The same considerations of maintaining availability of your services through any maintenance cycle are still applicable. As a guide, the points that are described in this section highlight some of these considerations.
z/OS maintenance and upgrade
It is unlikely that z/OS maintenance will be a consideration of the service providers, unless they share z/OS operations responsibilities. However, it is likely that the z/OS operations teams will need to perform rolling maintenance and upgrades to the production environment, performing maintenance on each LPAR, one at a time.
This rolling upgrade approach allows you to use the high availability (HA) features that are built into z/OS to allow normal operations to continue. In Walmart, the service providers used DNS forwarding (allowing switching between sysplexes), IBM Parallel Sysplex, sysplex distributor, virtual IP addresses (VIPAs), and TCP/IP port sharing to achieve HA at the sysplex and LPAR levels.
By using this approach, Walmart performed a full z/OS upgrade with zero downtime while the production environment was running.
CICS and CICSPlex SM maintenance and upgrade
As with the z/OS maintenance and upgrades being performed by a z/OS operations team, CICS and IBM CICSPlex SM maintenance and upgrades are likely to be performed by a CICS operations team, which might be the same as the service provider.
The same HA tools that you use for z/OS upgrades also apply here. You can take a rolling upgrade approach and upgrade each CICS region independently, which allows work to be handled by other regions while a region is being upgraded and cycled.
Initial program load (IPL) of LPAR and sysplex
If you use HA tools, such as sysplex distributor in a multi-LPAR environment, you can re-IPL an LPAR while still allowing service consumers to use the services. During the restart of the LPAR, the workload fails over to the other LPARs in the sysplex. Be sure to shut down all CICS regions on the LPAR before you restart.
If you must IPL the entire sysplex, the DNS forwarding configuration (as described in Chapter 5, “The z/OS systems programmer” on page 45) can fail over to a backup sysplex, if one is configured. It is unlikely that you must complete this process as part of a planned outage.
6.5 Problem determination
This section describes how service providers can aid themselves, systems programmers, and operations teams in analyzing problems that can occur in a service environment.
As described in 6.4, “Maintenance” on page 63, all of the tools that you use for problem determination in a CICS environment still apply to problem determination in a services environment. The extra challenge with a multi-tenancy service environment is how to determine which consumer of the service is being affected by an issue, and resolving that issue for that consumer.
The key to overcoming this challenge (see 6.3, “Metering and measurement” on page 61) is clear, consistent, and logical isolation of service consumers. Isolation of service consumers can be easily achieved through enforcing strict naming conventions.
6.5.1 Isolation of consumers in a multi-tenant environment
In designing isolation of service consumers, Walmart embedded the same consistent ID in all CICS resource definitions, which uniquely identifies the service consumer. This convention also extends to all z/OS resources that are associated with that consumer, such as including the unique ID in a data set qualifier or prefixing all DB2 table names with this identifier.
When Walmart services are provisioned, a new service instance is created that defines the following minimum CICS resources:
URIMAP
TRANSACTION
TRANCLASS
Other z/OS resources can also be created during this provisioning process; for example, the definition of a new VSAM Record Level Sharing (RLS) data set to hold data that is used by the service.
Figure 6-2 shows the CICS resource definitions that are automatically provisioned when a new service instance is requested. The VSAM data set H1.H2.H3.UC01FILE also is created.
EX G(UC01)
ENTER COMMANDS
NAME      TYPE         GROUP
UC01VSAM  FILE         UC01
TCLUC01   TRANCLASS    UC01
UC01      TRANSACTION  UC01
UC01      URIMAP       UC01
Figure 6-2 Example CICS resource definitions
In this example, the service consumer is assigned the unique identifier UC01 after requesting a new service to be provisioned. This unique identifier is used in all resource definitions, and in the associated, dynamically created z/OS resources.
By using this naming convention, the service consumer or a subset of service consumers can be identified when a problem occurs in a multi-tenant environment, which can aid in problem determination.
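The naming convention lends itself to a small helper that derives every resource name from the consumer identifier, and that maps a resource name back to its consumer during problem determination. The following Python sketch follows the names in the UC01 example (UC01VSAM, TCLUC01, and H1.H2.H3.UC01FILE). The function names, the four-character length check, and the default high-level qualifiers are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceResources:
    group: str
    urimap: str
    transaction: str
    tranclass: str
    file: str
    data_set: str


def resources_for(consumer_id: str, qualifiers: str = "H1.H2.H3") -> ServiceResources:
    """Derive resource names for one consumer, following the UC01 example."""
    cid = consumer_id.upper()
    # Assumption: the identifier doubles as a 4-character transaction ID.
    if len(cid) != 4 or not (cid.isascii() and cid.isalnum()):
        raise ValueError("consumer_id must be 4 alphanumeric characters")
    return ServiceResources(
        group=cid,
        urimap=cid,
        transaction=cid,
        tranclass=f"TCL{cid}",
        file=f"{cid}VSAM",
        data_set=f"{qualifiers}.{cid}FILE",
    )


def consumer_from_tranclass(tranclass: str) -> str:
    """Recover the consumer identifier from a TRANCLASS name such as TCLUC01."""
    if not tranclass.startswith("TCL"):
        raise ValueError("not a consumer transaction class name")
    return tranclass[len("TCL"):]


uc01 = resources_for("UC01")
print(uc01.file, uc01.tranclass, uc01.data_set)
print(consumer_from_tranclass(uc01.tranclass))
```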
6.5.2 Standards
Throughout all Walmart services, another key tool in problem determination from the service provider and service consumer perspectives is strict adherence to expected standards.
The primary example of adherence to standards within Walmart is for services that are accessed through HTTP. The HTTP standards for return codes and reasons should be adhered to as the standard specifies. The use of proper error codes (for example, the 4xx and 5xx error codes) can help to pinpoint where a problem might be in a system or if a consumer is using a service incorrectly.
Although not all 4xx and 5xx error codes are perfectly suited for CICS cloud services, an attempt should be made to use them as closely as possible. Another technique Walmart adopted is assigning specific error messages to fully qualify the type of HTTP status code. These messages are returned in the HTTP status text. For example, for a CICS service that accesses a VSAM file (if the file is closed), the service returns an HTTP status code 507 and an HTTP status text indicating that the file being accessed is closed. Although there is no standard HTTP status code for a file closed status, the 507 code does indicate insufficient storage, and the HTTP status text supplies an error message that indicates the file condition, file name, reason code, and the name of the program that detected the error.
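The closed-file example can be sketched as a function that builds the status code and a qualifying status text. In this Python sketch, the wording of the status text, the field order, and the sample file name, reason code, and program name are hypothetical; the text specifies only which pieces of information the message carries.

```python
from typing import NamedTuple


class ServiceError(NamedTuple):
    status_code: int
    status_text: str


def file_closed_error(file_name: str, reason_code: str, program: str) -> ServiceError:
    """Build the response for a request that found its VSAM file closed.

    507 is used because no standard HTTP status code means "file closed";
    the status text carries the detail that qualifies the code.
    """
    text = (
        f"File closed: file={file_name} "
        f"reason={reason_code} program={program}"
    )
    return ServiceError(507, text)


# Hypothetical values for illustration only.
error = file_closed_error("UC01VSAM", "12", "SVCPGM01")
print(f"HTTP/1.1 {error.status_code} {error.status_text}")
```

A consumer or operator who sees this response can tell from the status line alone which file, reason code, and program to investigate.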
6.5.3 Communication with other teams
Throughout this chapter and the book, great emphasis is placed on enabling good communication throughout your whole organization, from service consumers through to service providers and systems programmers.
You can enhance this communication, and in particular aid problem determination, by documenting what your services do and how to use them. From a systems programming perspective, also document what the application structure and logic flow look like. This documentation enables greater collaboration.
Among other tools, Walmart uses wiki pages, blogs, and documentation to aid in this collaboration across their teams. Each team can access these shared social resources and contribute to them.
6.6 Summary
This chapter highlights some of the operational considerations of hosting cloud services in CICS. From the monitoring, measurement, scalability, and problem determination perspectives, the Walmart example shows that true multi-tenant isolation that uses standard tools available in CICS can offer an excellent environment to host cloud services.
 