Chapter 6. Automatic restart management

Automatic restart management

Automatic restart management (ARM) is key to automating the restarting of subsystems and applications (referred to collectively as applications) so they can recover work they were doing at the time of an application or system failure and release resources, such as locks, that they were holding. With the automatic restart management policy, you can optionally control the way restarts are done. You can use the policy defaults that IBM provides, or you can override them.

In a sysplex environment, a program can enhance its recovery potential by registering as an element of automatic restart management. Automatic restart management can reduce the impact of an unexpected error to an element because z/OS can restart it automatically, without operator intervention. In general, z/OS restarts an element when:

•The element itself fails. In this case, z/OS restarts the element on the same system.

•The system on which the element was running unexpectedly fails or leaves the sysplex. In this case, z/OS restarts the element on another system in the sysplex; this is called a cross-system restart.

In general, your installation can use automatic restart management in two ways:

•To control the restarts of applications (such as CICS) that already use automatic restart management as part of their recovery.

•To write or modify installation applications to use automatic restart management as part of recovery.

During planning before installation of the z/OS systems in the sysplex, you can define a sysplex failure management (SFM) policy, and an automatic restart management policy. The goals of SFM and automatic restart management are complementary. SFM keeps the sysplex up and running; automatic restart management keeps specific work in the sysplex (batch jobs or started tasks) up and running. SFM manages system-level problems and partitions systems out of the sysplex in response to those problems. Automatic restart management manages subsystems and restarts them appropriately after either subsystem-level or system-level failures occur.

6.1 Automatic restart management

Figure 6-1 Automatic restart management

z/OS recovery function

Automatic restart management (ARM) is a z/OS recovery function that can improve the availability of specific batch jobs or started tasks. When a job or task fails, or the system on which it is running fails, ARM can restart the job or task without operator intervention.

The goals of sysplex failure management (SFM) and ARM are complementary. While SFM keeps the sysplex running, ARM keeps specific work in the sysplex running. If a job or task fails, ARM restarts it on the same system it was running at the time of failure. If a system fails, ARM restarts the work on other systems in the sysplex; this is called a cross-system restart.

A program cannot use both ARM and checkpoint/restart. If a program using checkpoint/restart tries to register with ARM services, the request is rejected.

The purpose of ARM is to provide fast, efficient restarts for critical applications when they fail. ARM is a z/OS recovery function that reduces the time required to restart an application by automatically restarting the batch job or started task (STC) when it unexpectedly terminates. These unexpected outages may be the result of an abend, system failure, or the removal of a system from the sysplex.

A primary availability requirement for all system environments is to reduce the duration and impact of an outage. The sysplex environment helps to achieve this objective through the following recovery strategies:

•Detection and isolation of failures - detect a failure and route work around it.

•Recovery of resources - release critical resources as soon as possible so that other programs and systems can acquire them. This reduces contention and speeds up the recovery or restart of work affected by the failure.

•Recover lost capacity - restart or resume processing that was affected by the failure. For example, in a database environment, the DB manager should be restarted quickly so that its log recovery can be processed.

Used by transaction and resource managers

The intended exploiters of the ARM function are the jobs and STCs of certain strategic transaction and resource managers. Some of these are:

•CICS/ESA

•CP/SM

•DB2

•IMS/TM

•IMS/DBCTL

•ACF/VTAM

In general, an installation can use automatic restart management in two ways:

•To control the restarts of applications (such as CICS) that already use automatic restart management as part of their recovery

•To write or modify installation applications to use automatic restart management as part of recovery

For more information about automatic restart management, see z/OS MVS Setting Up a Sysplex, SA22-7625, z/OS MVS Sysplex Services Guide, SA22-7617, and z/OS Parallel Sysplex Services Reference, SA22-7618.

6.2 ARM environment

Figure 6-2 ARM environment

ARM environment

In a sysplex environment, a program can enhance its recovery potential by registering as an element of automatic restart management. An automatic restart management element represents a program or an application that:

•Is submitted as a job.

•Is submitted as a started task.

•Is an abstract resource. An abstract resource is a program or a set of programs that is only associated with (or has a bind to) the system on which it is running.

Automatic restart management can reduce the impact of an unexpected error to an element because MVS can restart it automatically, without operator intervention. In general, MVS restarts an element when:

•The element itself fails. In this case, MVS restarts the element on the same system.

•The system on which the element was running unexpectedly fails or leaves the sysplex. In this case, MVS restarts the element on another system in the sysplex; this is called a cross-system restart.

ARM policy

In addition to the policy, three exit points are provided where subsystem and installation-written exit routines can be invoked to cancel a restart or influence how it is done. This gives you extended control and can simplify or enhance the policy that you build and activate. The exits are:

•The workload restart exit (IXC_WORK_RESTART)

•The element restart exit (IXC_ELEM_RESTART)

•The event exit

A system must be connected to an ARM couple data set with an active automatic restart management policy.

ARM couple data set

Automatic restart management allows different levels or functional versions of ARM couple data sets to be defined. Automatic restart management is a function that runs in the cross-system coupling facility (XCF) address space and maintains its own data spaces. ARM requires a primary couple data set to contain policy information as well as status information for registered elements. It supports both JES2 and JES3 environments.

The following couple data set format levels or functional versions can be created using the IXCL1DSU utility, as follows:

•Base format level. This is the initial or base ARM couple data set format level and is created when the ARM couple data set is formatted using a version of IXCL1DSU.

•This format level is created when the ARM couple data set is formatted using a version of IXCL1DSU.

Note: Automatic restarts do not need to be enabled at all times. For example, you might not want automatic restart management enabled for restarts unless certain jobs are running, or during off-shift hours. However, even while automatic restarts are disabled, elements can register with automatic restart management as long as the system is connected to an ARM couple data set.

6.3 Create an ARM couple data set and policy

Figure 6-3 Create ARM couple data set and policy

Create ARM couple data set and policy

Figure 6-3 shows the key activities necessary to create an ARM couple data set and an ARM policy:

•Create a primary and alternate couple data set:
Create and submit an ARM couple data set formatting job (IXCL1DSU).

•Add the primary couple data set to the sysplex:
Use the command SETXCF COUPLE,TYPE=ARM,PCOUPLE=(SYS1.XCF.ARM10)

•Add the alternate couple data set to the sysplex:
Use the command SETXCF COUPLE,TYPE=ARM,ACOUPLE=(SYS1.XCF.ARM20)

•Don’t forget to update the COUPLExx parmlib member with the ARM CDS names:

Figure 6-4 COUPLExx parmlib member example

JCL for ARM couple data sets

There is a sample job in SYS1.SAMPLIB(IXCARMF) to format a primary and alternate couple data set for automatic restart management. In this sample JCL, two couple data sets, SYS1.MIGLIB.ARMCPL01 and SYS1.MIGLIB.ARMCPL02, are allocated. If the two couple data sets differ in size, always use the larger one as the alternate couple data set.

Define an ARM policy

The ARM policy is defined by using the IXCMIAPU administrative data utility, and is used as follows:

•If started tasks or jobs need a special restarting policy, then an ARM policy should be defined.

•Start the defined policy with this command:

SETXCF START,POLICY,TYPE=ARM,POLNAME=name

The policy can be user-written or the IBM-supplied default policy.

6.4 ARM restarts

Figure 6-5 ARM restarts

ARM restarts

The purpose of automatic restart management is to provide fast, efficient restarts for critical applications when they fail. ARM is a z/OS recovery function that reduces the time required to restart an application by automatically restarting the batch job or started task (STC) when it unexpectedly terminates. These unexpected outages may be the result of an abend, system failure, or the removal of a system from the sysplex.

The need for restarts is detected by either XCF, recovery termination management (RTM), or the initiator.

ARM interfaces with MCS console support and JES to provide its recovery mechanisms.

Workload Manager (WLM) provides statistics on the remaining processing capacity in the sysplex.

ARM minimizes failure impact by performing fast restarts without operator intervention. Related groups of programs can be restarted together, and if programs are start-order dependent, ARM ensures that the order dependency is honored. ARM chooses the best system for a restart based on the statistics returned by the Workload Manager, which reduces the chance of work being restarted on a system that lacks the necessary capacity.

6.5 Modifying batch and STC jobs

Figure 6-6 Modifying batch and STC jobs

Modifying batch and STC jobs

To use ARM, jobs must be modified to register with ARM using the IXCARM macro services. In a sysplex environment, a program can enhance its recovery potential by registering as an element of automatic restart management. Automatic restart management can reduce the impact of an unexpected error to an element because z/OS can restart it automatically, without operator intervention. The automatic restart management service routine is given control from the IXCARM macro and is used to:

READY Mark a user of automatic restart management services as ready to accept work.

DEREGISTER Deregister a user of automatic restart management services.

ASSOCIATE Associate a user of automatic restart management services with another user for takeover or restart purposes. This allows a program to identify itself as the backup program for another element of automatic restart management. This identification tells z/OS that the other element should not be restarted unless this backup program is deregistered.

WAITPRED Wait until predecessor elements have been restarted if applicable. This indicates that z/OS should delay the restart for this program until z/OS completes the restart of a related program, which is called a predecessor element.

6.6 Registering with ARM

Figure 6-7 Registering with ARM

Registering with ARM

To make batch jobs or started tasks ARM restartable, they must register with ARM using an authorized macro service, IXCARM. The registration requires the specification of an element name. This element name must be unique in the sysplex and is up to sixteen characters long. The three basic ARM services that a job must use are as follows:

REGISTER Early in the initialization process for a job, a program that wants to use ARM for restart must issue the IXCARM REQUEST=REGISTER macro. Part of the macro parameters is the element name.

Optionally, you can specify restart parameters and an event exit.

After the job or task has issued the REGISTER, z/OS can automatically restart the program when an unexpected failure occurs. You can specify restart parameters on the REGISTER request; however, restart parameters in a user-written ARM policy override IXCARM macro definitions.

READY When a program that has issued the REGISTER request has completed initialization and is ready to run, an IXCARM REQUEST=READY must be issued.

DEREGISTER Before a job completes its normal termination processing, the program must issue an IXCARM REQUEST=DEREGISTER to remove the element and all ARM restart possibilities.
You can deregister from automatic restart management when the job or task no longer needs to be restarted. If a program fails after it deregisters, z/OS will not restart the program.

6.7 ARM element states

Figure 6-8 ARM element states

ARM element states

Figure 6-8 shows an element’s use of ARM services and how they affect the state of that element.

ARM puts each element into one of the following states to identify the last interaction:

Starting From the initial registration (IXCARM-Register) of a program as an ARM element until ready (IXCARM-Ready).

Available From the time the element becomes ready (IXCARM-Ready) until the element either deregisters from ARM or terminates.

Failed From ARM's detection of the termination of an element, or of the system it was running on, until ARM initiates a restart of the element.

Restarting From ARM’s initiation of a restart until the subsequent reregistration of the element.

Recovering From the element’s registration after an ARM restart to its subsequent issuing of IXCARM-Ready.

WAITPRED By issuing IXCARM with the WAITPRED parameter, an element indicates that a predecessor element must become ready before this element can initialize successfully. During restarts, not initial starts, MVS will wait for the predecessor to issue IXCARM REQUEST=READY before allowing this element to complete ready processing. Issuing WAITPRED is most useful when an element and its predecessor are in the same restart group, by specific assignment or by default. Elements should issue WAITPRED after the register request, but before the ready request.

Element processing

Initially, an element is unknown to ARM. The first interaction with ARM occurs when the program makes itself known to ARM by invoking the register service and specifying the element name under which it is to be registered. ARM sets the state of the element to starting.

When the element is ready to accept new work, it invokes the ready service, and ARM sets the state of the element to available. Once the element’s state is available, there are no more interactions with ARM until the element terminates. Before terminating normally, the element invokes the deregister service, which again makes the element unknown to ARM.

Element termination

When ARM detects that the element has unexpectedly terminated, that is, the program has terminated without invoking the deregister service, or that the system on which the element had been running has left the sysplex, ARM sets the element’s state to failed as part of its processing to restart the element. An element will be in the failed state for a very brief interval.

Element restart

When ARM restarts the element, it sets the state to restarting. When a restarting element subsequently invokes the register service, ARM sets the state to recovering. Once the element is prepared to accept new work, it notifies ARM by invoking the ready service, which causes ARM to set the element’s state to available.
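The state changes described above can be sketched as a small transition table. This is only an illustrative model of the text, not ARM's implementation: the event labels (`register`, `ready`, `deregister`, `terminate`, `arm_restart`) are made-up names for this sketch, and only the transitions that the text states explicitly are included.

```python
from enum import Enum


class State(Enum):
    UNKNOWN = "unknown"        # element is not registered with ARM
    STARTING = "starting"
    AVAILABLE = "available"
    FAILED = "failed"
    RESTARTING = "restarting"
    RECOVERING = "recovering"


# (current state, event) -> next state.
# Event names are hypothetical labels for this sketch, not IXCARM keywords.
TRANSITIONS = {
    (State.UNKNOWN, "register"): State.STARTING,
    (State.STARTING, "ready"): State.AVAILABLE,
    (State.AVAILABLE, "deregister"): State.UNKNOWN,
    (State.AVAILABLE, "terminate"): State.FAILED,      # unexpected termination
    (State.FAILED, "arm_restart"): State.RESTARTING,
    (State.RESTARTING, "register"): State.RECOVERING,
    (State.RECOVERING, "ready"): State.AVAILABLE,
}


def next_state(state, event):
    """Return the state that follows `event`, or raise if the pair is not modelled."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} is not modelled in state {state.name}")


def replay(events, state=State.UNKNOWN):
    """Apply a sequence of events, starting from an element unknown to ARM."""
    for event in events:
        state = next_state(state, event)
    return state
```

For example, replaying `register`, `ready`, `terminate`, `arm_restart`, `register` leaves the element in the recovering state until it issues its next ready request.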

Command for ARM status

The state of a given element named DB2$DB2H will be part of the information provided when ARM status is requested with the DISPLAY XCF,ARMSTATUS command for one or more elements.

Figure 6-9 Example of DISPLAY XCF,ARMSTATUS output

6.8 ARM restart methods

Figure 6-10 ARM restart methods

ARM restart methods

ARM specifies under which conditions MVS should restart this element. ARM has three restart types, ALLTERM, ELEMTERM, and SYSTERM, as follows:

ALLTERM Indicates that the element should be restarted for all unexpected failures as appropriate. The value specified on an IXCARM REQUEST=REGISTER request for the ELEMBIND keyword determines what types of failures are appropriate.

ELEMTERM Element termination
When an element unexpectedly fails, ARM is given control during end-of-job and end-of-memory termination after all other recovery has taken place. If the terminating job or task is an element of automatic restart management and should be restarted, then z/OS:

a. Gives control to the element-restart exit.

b. Gives control to the event exit.

c. Restarts the element.

d. Issues an ENF signal when the element reregisters with automatic restart management.

SYSTERM System termination
When a system unexpectedly fails, ARM determines whether any elements were running on this system. If those elements can be restarted on another system, and cross-system restarts are allowed, z/OS does the following for each system on which the elements will be restarted:

a. Gives control to the workload restart exit.

b. For each element that will be restarted:

i. Gives control to the element restart exit.

ii. Gives control to the event exit.

iii. Restarts the element.

iv. Issues an ENF signal when the element reregisters with automatic restart management.

RESTART_METHOD in ARM policy

The ARM policy contains entries for elements that define what type of restart will be performed for the element. In the ARM policy, the RESTART_METHOD(event,restart-type) is optional. The default is RESTART_METHOD(BOTH,PERSIST). For started tasks only, the IXCARM macro parameter that overrides the default specification is STARTTXT.

The RESTART_METHOD has three event options:

ELEMTERM Indicates that the persistent JCL or command text is to be overridden by the JCL data set or the command text specified in restart-type only when the element itself terminates.

SYSTERM Indicates that the persistent JCL or command text is to be overridden by the JCL data set or the command text specified in restart-type only when the system the element was running on terminates.

BOTH Indicates that the persistent JCL or command text is to be overridden by the JCL data set or the command text specified in restart-type when either the system the element was running on terminates, or when the element itself terminates.

Three restart-type options:

STC,'command-text' Tells ARM to restart this element using the command provided with 'command-text'.

PERSIST Indicates that z/OS is to use the JCL or command text that previously started the element.

JOB,'jcl-source' Indicates that z/OS is to restart the element as a batch job. 'jcl-source' is the name of the data set that contains the JCL for restarting the element. This data set name must be enclosed within single quotes. The data set must have the same data set characteristics (for instance, LRECL) as standard procedure libraries.
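How the event and restart-type values combine can be sketched as follows. This is a minimal model of the descriptions above, not ARM code: the function name, the tuple representation of a restart type, and the sample command text are all assumptions made for illustration.

```python
# A restart type is modelled as a (kind, text) pair; PERSIST carries no text.
PERSIST = ("PERSIST", None)


def choose_restart(policy_event, policy_restart_type, termination):
    """Pick the restart type that applies to one termination.

    policy_event        -- "ELEMTERM", "SYSTERM" or "BOTH" (from RESTART_METHOD)
    policy_restart_type -- PERSIST, ("STC", command_text) or ("JOB", jcl_data_set)
    termination         -- "ELEMTERM" (element failed) or "SYSTERM" (system failed)
    """
    if policy_event not in ("ELEMTERM", "SYSTERM", "BOTH"):
        raise ValueError(f"unknown event {policy_event!r}")
    if termination not in ("ELEMTERM", "SYSTERM"):
        raise ValueError(f"unknown termination {termination!r}")

    # The override applies only when the policy event covers this termination;
    # otherwise the persistent JCL or command text is used.
    applies = policy_event == "BOTH" or policy_event == termination
    return policy_restart_type if applies else PERSIST


# Hypothetical command text, for illustration only.
override = ("STC", "S EXAMPLE")
print(choose_restart("SYSTERM", override, "ELEMTERM"))  # falls back to PERSIST
print(choose_restart("SYSTERM", override, "SYSTERM"))   # uses the override
```

With the default RESTART_METHOD(BOTH,PERSIST), the function returns the persistent restart for either kind of termination.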

Restarting an element

When ARM restarts an element that terminated with an ELEMTERM, that element is restarted on the same system where it was previously active. Even if the initiators of the system all are busy or stopped, JES tries to restart the job on that system. This is done by setting the system affinity of the restarting job to the same system.

A job that ARM is restarting after an ELEMTERM termination can never execute on a different system. If it did, the job would try to register with ARM under the same element name on a system other than the restart system. This is not allowed, and ARM issues the following return code and reason code: (RC0008, RSN0150 of IXCARM REQUEST=REGISTER).

6.9 Restart on the same system

Figure 6-11 Restart on the same system

Restart on the same system

When an element abends or otherwise terminates while registered with ARM, ARM restarts that element on the same system on which it had been running. Such ARM restarts are referred to as restarts in place.

You should be aware that ARM restart processing does not include any kind of cleanup or recovery. That is considered internal to the element being restarted. Element-related processing, such as the releasing of local resources or the analysis of the impact of the termination to on-going work, is the responsibility of recovery routines of the element. Figure 6-11 shows the various factors involved in restarting work on the same system.

ARM EOM processing

ARM’s end-of-memory (EOM) resource manager determines if a terminating address space is a registered element. If so, it schedules the restarting of that element. The elements restarted from ARM’s EOM resource manager are started tasks. Restarts for batch jobs are performed in essentially the same way from the ARM end-of-job (EOJ) resource manager.

ARM restart

If the element to be restarted specified an event exit routine when it registered as an ARM element, then ARM calls the specified exit routine before restarting that element. This processing could instruct ARM not to restart the element. If there are any element restart exits defined, ARM also invokes them. An element restart exit routine can cancel the restart of a given element or change the method by which ARM restarts it.

The method by which ARM restarts an element depends on whether the element is a started task or a batch job. You can specify the restart method through either the ARM policy or an exit routine. A registering program can also specify a command (through the REGISTER macro) to be used for ARM restarts.

ARM exit routine for restarts

If an installation policy or an exit routine provides an override JCL data set or command text to restart an element, ARM determines if the data set name or command text contains any symbolic substitution parameters. If it does, these parameters are resolved before the command is issued or the JCL submitted. This resolution of symbolic substitution parameters is done using the values from the system on which the element initially registered. For restarts in place, ARM’s replacement of these values provides the same result as would occur if the normal z/OS resolution had been done.

After initiating the restart, ARM loses contact with the element and cannot track how the restart is progressing. ARM establishes a time interval to wait for the subsequent registration of the element from its initialization process. If the time interval expires before registration is received, then a restart time-out of the element is recognized. An error message is produced, an ENF is issued indicating that ARM is no longer handling restarts for the element, and all resources related to the element are cleaned up. Lastly, ARM deregisters the element. If the element was just delayed and registers after the restart time-out, then ARM will treat this as a new registration.

Once registered, the time-out interval is set for the element to become ready. If this expires, then a warning message is produced, but no other actions occur. The element would remain in the RECOVERING state. Finally, when the element requests the READY service, the state is set to AVAILABLE, and restart processing is considered complete.
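The two time-outs can be illustrated with a short sketch. It is a simplified model of the two paragraphs above under stated assumptions: the function name, the use of seconds, the message strings, and the choice to stop modelling once a late registration is treated as new are all inventions of this example, and the interval values are supplied by the caller because the text does not give them.

```python
def classify_restart(register_delay, ready_delay, restart_timeout, ready_timeout):
    """Model the outcome of one ARM-initiated restart.

    register_delay -- seconds from restart initiation to reregistration, or None
                      if the element never registers
    ready_delay    -- seconds from registration to the READY request, or None
                      if the element never becomes ready
    Returns (state, messages).
    """
    messages = []

    if register_delay is None or register_delay > restart_timeout:
        # Restart time-out: ARM gives up on this restart and deregisters.
        messages.append("error: restart time-out, element deregistered")
        if register_delay is None:
            return "UNKNOWN", messages
        # A late registration is handled as a brand-new registration.
        messages.append("late registration treated as new")
        return "STARTING", messages

    if ready_delay is None or ready_delay > ready_timeout:
        # Ready time-out: only a warning, no other action.
        messages.append("warning: element not ready within interval")
    if ready_delay is None:
        return "RECOVERING", messages
    return "AVAILABLE", messages
```

Note the asymmetry the text describes: missing the registration interval ends ARM's handling of the restart, whereas missing the ready interval only produces a warning.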

6.10 Restart on different systems

Figure 6-12 Restart on different systems

Restart on different systems

Figure 6-12 shows the factors involved in restarting work on different systems. When a system fails or otherwise leaves the sysplex, ARM has the overall responsibility for restarting, on the remaining systems in the sysplex, the registered elements that had been running on the failed system. ARM depends on several other components to achieve this goal.

Operator commands for ARM restart

In order to perform restarts, ARM must issue operator commands from the XCF address space (XCFAS). You must be certain that the XCFAS has enough authority to issue any commands required to restart a failed element.

Note: The XCF address space must have adequate authorization through the z/OS Security Server (RACF) or an equivalent security product.

ARM restart actions

The first action taken is the detection and the initial termination of the failed system. Following is a summary of the actions performed by other components:

1. The detecting z/OS XCF issues a status update missing event because the XCF failure detection interval has elapsed.

2. The XCF indeterminate status interval expires. Available options are:

– Operator prompt

– Reset after isolation

– LPAR system reset or deactivation via ARF (PR/SM only)

3. The failed system is removed from the sysplex (sysplex partitioning) by:

– Logical partition deactivation and acquiring storage resources via ARF (PR/SM only)

– Multisystem XCF exploiters recovering resources owned by the failed system:

• Console services - release console resources owned by the failed system

• GRS - release global ENQs owned by the failed system

Processing on remaining systems

All remaining systems are notified of the system’s termination through the XCF group exit. XCF, on each z/OS image in the sysplex, directly notifies ARM of the system’s termination. The ARM processing that initiates restart processing runs on the z/OS image that first detected the system termination. The ARM processing on this system has several responsibilities:

•Determine if ARM’s threshold for system terminations prohibits restarting the elements that had been running on the failed system

•Determine if any registered elements were executing on the failed system

•Determine which elements can be restarted on another system

•Determine which elements must be restarted on the same system (same restart group)

•Determine the system on which to restart each restart group

Sysplex analysis

When a system leaves the sysplex, ARM determines if there have been two or more other SYSGONE events within the last ten minutes. If so, ARM does not restart any of the elements from the system that experienced this latest SYSGONE event. ARM issues a message for this event. This is done in case there is some problem in ARM or in its elements that causes the system failure. If left unchecked, such a problem could cause ARM restarts to cascade the failure to many or all systems in the sysplex.
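The threshold check can be sketched as a sliding-window count. This is an illustrative model only: the function name and the use of timestamps supplied by the caller are assumptions, and treating an event exactly ten minutes old as still inside the window is a choice made for this sketch that the text does not settle.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=10)   # "within the last ten minutes"
THRESHOLD = 2                    # "two or more other SYSGONE events"


def restarts_allowed(previous_sysgone_times, now):
    """Return True if elements from the system that just left may be restarted.

    previous_sysgone_times -- times of earlier SYSGONE events, NOT including
                              the event being processed now
    now                    -- time of the current SYSGONE event
    """
    recent = [
        t for t in previous_sysgone_times
        if timedelta(0) <= now - t <= WINDOW
    ]
    return len(recent) < THRESHOLD


now = datetime(2024, 1, 1, 12, 0)
earlier = [now - timedelta(minutes=3), now - timedelta(minutes=9)]
print(restarts_allowed(earlier, now))       # two recent events: restarts suppressed
print(restarts_allowed(earlier[:1], now))   # one recent event: restarts proceed
```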

ARM restart on a different system

ARM attempts to restart the elements from a system that left the sysplex on the remaining systems that have the most available capacity. The steps that ARM follows to identify and prioritize the candidate systems are:

•Determine the relative available capacity and available CSA/ECSA. ARM builds a list of these systems, sorted according to the available capacity.

•For each restart group that is to be restarted, ARM will then:

– Identify the systems that are in the same JES XCF group as the system that has terminated. Systems not in the same JES XCF group are deleted from the list of systems just created.

– Determine if the installation’s policy specifies either TARGET_SYSTEM or FREE_CSA for the restart group, and if so, eliminate from the list any system that does not meet the criteria defined by these parameters.

The top system on the list is then assigned to restart this restart group. If there are no candidate systems left after these checks, ARM aborts the restart of that restart group.
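The selection steps above can be sketched as a sort followed by successive filters. This is a simplified model under stated assumptions: the `System` fields, the function name, and the representation of FREE_CSA as a pair of CSA and ECSA minimums are choices made for this example, and the capacity figures would come from the Workload Manager in a real sysplex.

```python
from dataclasses import dataclass


@dataclass
class System:
    name: str
    capacity: float   # relative available capacity (illustrative units)
    free_csa: int     # available CSA
    free_ecsa: int    # available ECSA
    jes_group: str    # JES XCF group the system belongs to


def pick_target(systems, failed_jes_group, target_systems=None, free_csa=(0, 0)):
    """Return the name of the system chosen for one restart group, or None.

    target_systems -- names allowed by TARGET_SYSTEM, or None if not specified
    free_csa       -- (minimum CSA, minimum ECSA) required by FREE_CSA
    """
    # 1. List the systems, most available capacity first.
    candidates = sorted(systems, key=lambda s: s.capacity, reverse=True)
    # 2. Keep only systems in the same JES XCF group as the failed system.
    candidates = [s for s in candidates if s.jes_group == failed_jes_group]
    # 3. Apply the policy criteria for the restart group.
    if target_systems is not None:
        candidates = [s for s in candidates if s.name in target_systems]
    need_csa, need_ecsa = free_csa
    candidates = [
        s for s in candidates
        if s.free_csa >= need_csa and s.free_ecsa >= need_ecsa
    ]
    # 4. The top remaining system wins; None means the restart is aborted.
    return candidates[0].name if candidates else None
```

Because the filters are applied per restart group, two restart groups from the same failed system can end up on different target systems.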

Workload restart exit

On each system where restarts are to occur, ARM calls your workload restart exit if it is installed. This allows the installation to do things such as cancel work that is presently running on that system to prepare for the elements that ARM is about to restart. ARM then initiates restart processing for the elements within the restart group. They are essentially started at the same time to allow for maximum parallelism during restart processing.

6.11 Group restart overview

Figure 6-13 Group restart overview

Group restart overview

This example of a group restart shows how DB2 and CICS coordinate their restarts. Both DB2 and CICS, including AORs and TORs, can be started at the same time. Because they are at different levels, as defined in the ARM policy, the elements at each level must wait for the elements at the preceding level to complete their initialization before they can continue, so they issue a WAITPRED to wait for the preceding level to issue a READY.

When the preceding level signals that it is ready, the next level can continue its initialization.

RESTART_ORDER processing

RESTART_ORDER specifies the order in which elements in the same restart group are to become ready after they are restarted, as shown in Figure 6-13:

RESTART_ORDER The RESTART_ORDER parameter applies to all restart groups. RESTART_ORDER is optional. The default values for RESTART_ORDER are:

•LEVEL(0) - ELEMENT_TYPEs: DB2, DBCTL

•LEVEL(1) - ELEMENT_TYPEs: AOR-1, AOR-2

•LEVEL(2) - ELEMENT_TYPEs: TOR-1, TOR-2

LEVEL(level) Specifies the level associated with elements that must be restarted in a particular order. The elements are restarted from the lowest level to the highest level. The LEVEL keyword can be specified multiple times.

The set of elements associated with a particular level is identified by the ELEMENT_NAME or ELEMENT_TYPE parameters. ELEMENT_NAME or ELEMENT_TYPE must be specified with each LEVEL specification. (You can specify both ELEMENT_NAME and ELEMENT_TYPE.)

Level must be a decimal number from 0 through 65535.
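Level ordering can be sketched as grouping elements by the level that their element type maps to, lowest level first. This is an illustrative model: the function name and the element names are hypothetical, the level table reuses the example levels listed above, and returning elements that match no level in a separate list is a choice made for this sketch, not documented ARM behavior.

```python
# Example levels taken from the list above: level -> element types.
LEVELS = {
    0: {"DB2", "DBCTL"},
    1: {"AOR-1", "AOR-2"},
    2: {"TOR-1", "TOR-2"},
}


def ready_waves(level_types, elements):
    """Group elements into waves that become ready in ascending level order.

    level_types -- dict mapping a level number to a set of element types
    elements    -- list of (element_name, element_type) pairs
    Returns (waves, unassigned), where waves is a list of name lists.
    """
    by_level = {}
    unassigned = []
    for name, etype in elements:
        for level, types in level_types.items():
            if etype in types:
                by_level.setdefault(level, []).append(name)
                break
        else:
            # No level names this element type (sketch-only handling).
            unassigned.append(name)
    waves = [by_level[level] for level in sorted(by_level)]
    return waves, unassigned


# Hypothetical element names, for illustration only.
elements = [("TORX", "TOR-1"), ("DB2X", "DB2"), ("AORX", "AOR-1"), ("AORY", "AOR-2")]
waves, unassigned = ready_waves(LEVELS, elements)
print(waves)
```

Elements in the same wave have no ordering between them, which matches the idea that a level only has to wait for the level before it.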

Group restart considerations

Determine which batch jobs and started tasks will be using automatic restart management for recovery purposes. For the IBM products that use automatic restart management, read their documentation for any policy considerations. Here are the parameters that can be used for the restart processing:

RESTART_GROUP Determine whether any of these elements is interdependent, that is, needs to run on the same system.

Note: Any elements that are not explicitly assigned to a restart group become part of the restart group named DEFAULT. Thus, if these elements are restarted, they are restarted on the same system.

RESTART_ORDER Determine whether there is an order in which MVS should restart these elements. That is, are any elements in the restart group dependent upon other elements being restarted and ready first?

RESTART_PACING Determine whether the elements in a restart group need to be restarted at specific intervals.

TERMTYPE Determine whether the element should be restarted when only the element fails, or when either the element or the system fails.

RESTART_METHOD Determine whether specific JCL or command text is required to restart an element.

FREE_CSA Determine whether a minimum amount of CSA/ECSA is needed on the system where the elements in a restart group are to be restarted.

TARGET_SYSTEM Determine whether you want the elements in the restart group to be restarted on a specific system.

6.12 ARM exit facilities

Figure 6-14 ARM exit facilities

ARM exit facilities

The exits shown in Figure 6-14 are described in this topic.

The workload restart exit (IXC_WORK_RESTART)

Through the IXC_WORK_RESTART exit, your installation can prepare a system to receive additional workload from a failing system in the sysplex. MVS invokes IXC_WORK_RESTART one time on each system that is selected to restart work from a failing system. MVS selects the system most capable of handling the additional work. Because of the system’s resources or unusual workload, your installation might want to improve this system’s capability. Your installation can do so by coding the workload restart exit to perform tasks such as cancelling lower priority work.

This exit cannot cancel or change the restart of an element. Because it runs before ARM restarts the elements that had been running on the terminated system, a routine for this exit can only prepare the receiving system, for example by cancelling low-priority work.

The element restart exit (IXC_ELEM_RESTART)

Through the IXC_ELEM_RESTART exit, your installation can modify or cancel the automatic restart management initiated restart of an element. Your installation can use this exit to coordinate the restart of an element with other automation routines, and to make decisions about how, or if, it will be restarted. ARM invokes this exit once for each element that is to be restarted, on the system where it will be restarted.

This exit is executed before ARM restarts an element. It is used to coordinate restarts with the automation packages you may have installed.
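
The different invocation patterns of the two restart exits can be summarized in a small simulation. The following Python sketch is a model of the calling sequence for a cross-system restart only; real exit routines are installation exits, not Python callbacks. The function name `drive_restarts`, the callback signatures, and the `"CANCEL"` return value are hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, List, Tuple


def drive_restarts(
    placements: Dict[str, List[str]],
    work_exit: Callable[[str], None],
    elem_exit: Callable[[str, str], str],
) -> List[Tuple[str, str]]:
    """Return the (system, element) pairs still restarted after the exits run.

    placements maps each receiving system to the elements from the failed
    system that are to be restarted there.
    """
    restarted = []
    for system, elements in placements.items():
        # Workload restart exit: called once per receiving system. Its result
        # is ignored because it cannot cancel or change an element restart.
        work_exit(system)
        for element in elements:
            # Element restart exit: called once per element, on the system
            # where the element will be restarted. It may cancel the restart.
            if elem_exit(system, element) != "CANCEL":
                restarted.append((system, element))
    return restarted
```

In this model, only the element-level callback can remove an element from the result, which mirrors the division of responsibility between the two exits.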

The event exit

This exit is usually provided by a program that is registering to become an ARM element. ARM invokes this exit for certain events that affect the element. For events that pertain to an element, ARM also issues an asynchronous ENF signal (event code 38) that communicates the element type to programs listening for ENF code 38. For example, you could use the ENFREQ macro to listen for the deregister ENF signal for an element, and then restart the element.

6.13 ARM SMF records

Figure 6-15 ARM SMF records

ARM SMF records

The Type 30 (common address space work) and Type 90 (system status) SMF records have been enhanced to provide information relating to ARM activity, elements, and policy management.

The SMF Type 30 record has been altered in these areas:

•Header section

•Product section

•ARM section

SMF record type 30

The new ARM section contains element identification information and time stamps for when various events complete. If the time stamp for an event is zero, it usually indicates that the event failed. This new section is created each time an element:

•Is started or restarted - The element issued an IXCARM REGISTER request.

•Is explicitly waiting for other elements to become functional - The element issued an IXCARM WAITPRED request.

•Is functional - The element issued an IXCARM READY request.

•Is no longer functional - The element issued an IXCARM DEREGISTER request during normal termination.

As each record is generated, the ARM time stamps that have been collected to that point will be included in the record. If a started task or job running in an address space has not requested an IXCARM REGISTER, then there will be no ARM section in the record. However, once the IXCARM REGISTER is requested, the ARM section will appear in every future Type 30 record that is generated for that started task or job.
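
As an illustration of how the ARM section might be interpreted once its time stamps have been extracted, the following Python sketch classifies the four events listed above. It does not parse real SMF records: the dictionary keys, the integer time stamps, and the function name are hypothetical, and a value of `None` stands for a Type 30 record that has no ARM section.

```python
from typing import Dict, List, Optional, Union

# Hypothetical labels for the four IXCARM requests named in the text.
ARM_EVENTS = ("register", "waitpred", "ready", "deregister")


def summarize_arm_section(
    section: Optional[Dict[str, int]],
) -> Dict[str, Union[bool, List[str]]]:
    """Split ARM events into completed ones and ones with a zero time stamp."""
    if section is None:
        # The job or started task never issued IXCARM REGISTER.
        return {"has_arm_section": False, "completed": [], "zero": []}

    # Events with a nonzero time stamp, ordered by when they completed.
    completed = sorted(
        (name for name in ARM_EVENTS if section.get(name, 0) != 0),
        key=lambda name: section[name],
    )
    # A zero time stamp is only a hint: it usually means the event failed,
    # but the event may simply not have happened when the record was written.
    zero = [name for name in ARM_EVENTS if section.get(name, 0) == 0]
    return {"has_arm_section": True, "completed": completed, "zero": zero}
```

Because each record carries only the time stamps collected up to that point, a zero entry in an early record is best confirmed against later records for the same job.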

SMF record type 90

The SMF Type 90 record helps track the use of ARM policies. ARM generates a Type 90 record with subtype 27, 28, 29, or 30 when one of the following events occurs:

•An ARM policy is defined (new or changed) on the couple data set

•An ARM policy is deleted from the couple data set

•ARM is enabled for restarts via the SETXCF START command (and a policy may be activated or changed)

•ARM is disabled for restarts via the SETXCF STOP command (and a policy may be deactivated)

SMF record type 90 contents

The contents of the SMF Type 90 records include:

•A function code identifying the event related to this record

•The name of the ARM policy that was affected:

– The policy, if any, that was activated via a SETXCF START command

– The policy, if any, that was deactivated via a SETXCF STOP command

– The policy that was either defined (new or changed) or deleted from the ARM couple data set

•The name of the user who performed this function

ARM creates an SMF Record Type 90 (with new subtypes) to provide an audit trail when ARM restarts are enabled or disabled or a new policy is activated. This enables you to determine what ARM policy was active across a given interval if a question arises about an ARM action during that interval.
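
The audit-trail use case can be sketched as a replay of policy events up to the moment in question. The following Python example assumes that the relevant fields have already been extracted from the Type 90 records into simple tuples. The `PolicyEvent` layout and the `"START"` and `"STOP"` labels are hypothetical and do not correspond to actual record field names or function codes. A policy name of `None` after a start is assumed to mean that no named policy was activated.

```python
from datetime import datetime
from typing import Iterable, NamedTuple, Optional, Tuple


class PolicyEvent(NamedTuple):
    when: datetime
    function: str            # hypothetical labels: "START", "STOP", "DEFINE", "DELETE"
    policy: Optional[str]    # policy named in the record, if any
    user: str                # user who performed the function


def arm_state_at(
    events: Iterable[PolicyEvent], when: datetime
) -> Tuple[bool, Optional[str]]:
    """Return (restarts_enabled, active_policy) as of the given time."""
    enabled, policy = False, None
    for ev in sorted(events, key=lambda e: e.when):
        if ev.when > when:
            break
        if ev.function == "START":
            enabled, policy = True, ev.policy
        elif ev.function == "STOP":
            enabled, policy = False, None
        # DEFINE and DELETE change the couple data set contents, not the
        # enabled state, so this simplified replay ignores them.
    return enabled, policy
```

Replaying the events in time order answers the question "which policy was in effect when this restart happened?" without needing any state beyond the records themselves.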
