Recovery concepts
Time synchronization supports two types of Coordinated Timing Network (CTN) configuration:
Mixed CTN
STP-only CTN
A Mixed CTN allows the use of both Sysplex Timer signals and STP messages. An STP-only CTN uses only STP messages to keep servers synchronized.
In this chapter we present a high-level overview of STP recovery concepts and definitions, and how they apply to each type of CTN. Later in this chapter we describe how recovery is achieved for different types of failure.
For a complete description of the concepts refer to Server Time Protocol Recovery Guide, SG24-7380.
4.1 Terminology overview
This section provides a review of STP terminology:
Sysplex Timer offline sequence
 – Sysplex Timer detects a failure. The failing Sysplex Timer transmits an offline sequence symbol on the Control Link Oscillator (CLO) links to signal the other Sysplex Timer in an expanded availability configuration that it is going offline.
 – If Sysplex Timers in an expanded availability configuration lose the capability to synchronize:
 • The primary Sysplex Timer continues to transmit ETR signals whether or not an offline sequence is received.
 • The secondary Sysplex Timer becomes the primary Sysplex Timer if it receives an offline sequence.
 • The secondary Sysplex Timer discontinues transmission of ETR signals if it does not receive an offline sequence.
Synchronization check threshold
 – The server or coupling facility (CF) is considered to be in synchronized state if its time of day (TOD) clock is within the synchronization check threshold of the Coordinated Server Time (CST).
 – The STP synchronization check threshold is 50 microseconds. If a server’s TOD clock differs from the Coordinated Server Time by more than +/- 50 microseconds, the server or CF can become unsynchronized and therefore becomes a stratum 0.
 – A synchronization check can be recoverable: when one occurs, STP might be able to re-establish synchronization.
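The synchronization check threshold can be pictured as a simple comparison. The following Python sketch is illustrative only and is not how STP is implemented; the function name and the representation of clock values as microsecond counts are assumptions made for the example.

```python
# STP synchronization check threshold in microseconds (value taken from the text).
SYNC_CHECK_THRESHOLD_US = 50


def is_synchronized(tod_us: float, cst_us: float) -> bool:
    """Return True if the TOD clock is within the threshold of the CST.

    Hypothetical helper: both clocks are given as microsecond values.
    A difference of more than 50 microseconds in either direction
    is treated as a synchronization check.
    """
    return abs(tod_us - cst_us) <= SYNC_CHECK_THRESHOLD_US
```

For example, a TOD clock that is 30 microseconds ahead of the CST passes the check, while one that is 51 microseconds behind does not.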
4.2 Freewheel interval
The freewheel interval is the amount of time that a stratum 2 or stratum 3 server can remain synchronized without receiving messages from its clock source. It is approximately 1 second for a Mixed CTN or 10 seconds for an STP-only CTN.
Freewheel only occurs if a server has connectivity to at least one potential clock source, stratum 1 or 2, but valid messages are not being received from the clock source. However, if a server loses connectivity to all potential clock sources (link failures, for example), it immediately initiates recovery.
For stratum 2 and stratum 3 servers, the freewheel interval is entered when STP timing messages from the selected time source are not received. If no alternative time source can be found, the server will become unsynchronized at the end of the freewheel interval and switch to stratum 0. If an alternative time source is available, a switch to a different stratum level might be required in order to receive STP timing messages.
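The freewheel rules can be summarized as a small decision function. This is a minimal sketch under stated assumptions: the names `CtnType` and `next_action`, the returned strings, and the idea of passing in the elapsed time since the last valid timing message are all hypothetical and are used only to illustrate the text. The function assumes that it is called only while timing messages from the selected clock source are missing.

```python
from enum import Enum


class CtnType(Enum):
    MIXED = "mixed"
    STP_ONLY = "stp-only"


# Approximate freewheel intervals in seconds (values taken from the text).
FREEWHEEL_SECONDS = {CtnType.MIXED: 1.0, CtnType.STP_ONLY: 10.0}


def next_action(ctn: CtnType,
                has_connectivity: bool,
                seconds_without_messages: float,
                alternative_source_available: bool) -> str:
    """Decide what a stratum 2 or stratum 3 server does when messages stop."""
    if not has_connectivity:
        # No path to any potential clock source: no freewheel, recover at once.
        return "initiate-recovery"
    if alternative_source_available:
        # May involve a switch to a different stratum level.
        return "switch-time-source"
    if seconds_without_messages >= FREEWHEEL_SECONDS[ctn]:
        # Freewheel interval expired with no alternative time source.
        return "stratum-0"
    return "freewheel"
```

With these assumptions, a server in an STP-only CTN that has received no valid messages for 3 seconds is still freewheeling, whereas the same silence in a Mixed CTN has already exceeded the interval.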
4.3 Server offline signal
An offline signal (OLS) is transmitted by the server to indicate that the channel is going offline. This is independent of STP. Conditions when an OLS is issued by the server include:
Server or LPAR dump
Server power off
CHPID configure off
Channel failure (channel checkstop)
An offline signal might not be transmitted for certain failures, such as:
Channel subsystem failure
System Assist Processor (SAP) recovery
Site or server power outage
Link failure
In an STP-only CTN, when a BTS is configured but an Arbiter is not, STP recovery checks whether offline signals have been transmitted on the initialized coupling links between the PTS and the BTS. Furthermore, with STP Version 2, OLSs are considered only if multiple links that went down in the last two seconds have received the offline signal indication:
If the BTS receives OLS on multiple links, including the last link to the CTS within a two-second interval, the BTS will take over as CTS.
If the CTS has sent OLS on multiple links, including the last link to the BTS within a two-second interval, the CTS will release its CTS role.
If, when the last link goes down, all other links went down more than two seconds ago, then the link failure is considered a single link case and OLS rules cited previously do not apply. Also, STP recovery does not use OLS in a CTN where an Arbiter is configured.
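The two-second rule can be illustrated with a short Python sketch. It reflects one reading of the rules above: the OLS rules apply only if the last link to go down received an OLS and at least one other link received an OLS within the preceding two seconds. The function name and the event representation are hypothetical.

```python
from typing import List, Tuple

# Interval within which multiple OLS indications must be seen (from the text).
OLS_WINDOW_SECONDS = 2.0


def ols_rules_apply(link_down_events: List[Tuple[float, bool]]) -> bool:
    """Return True if the multiple-link OLS rules apply.

    Each event is (time_link_went_down_in_seconds, ols_received) for one
    coupling link between the two servers. Hypothetical representation.
    """
    if not link_down_events:
        return False
    last_time, last_ols = max(link_down_events, key=lambda event: event[0])
    if not last_ols:
        return False
    recent_with_ols = [
        t for t, ols in link_down_events
        if ols and last_time - t <= OLS_WINDOW_SECONDS
    ]
    # At least two links (including the last one) must have seen an OLS.
    return len(recent_with_ols) >= 2
```

If all other links went down more than two seconds before the last one, only the last link falls inside the window, so the function treats the failure as a single-link case and returns False.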
When a BTS is configured but an Arbiter is not, STP recovery uses offline signals in conjunction with console-assisted recovery to make the final determination as to whether the BTS initiates a takeover as the CTS. See 4.6, “Console-assisted recovery”.
4.4 Going Away Signal
The Going Away Signal (GAS) is a reliable, unambiguous signal that indicates that the CPC is about to enter a check-stopped state. When the BTS receives a GAS from the CTS, it can safely take over as CTS.
GAS has priority over OLS in a CTN where an Arbiter has not been assigned. The BTS can also use GAS to take over as CTS for CTNs with an Arbiter assigned without communicating with the Arbiter. This is in contrast to OLS, where OLS is ignored for CTNs with an Arbiter assigned.
GAS removes the dependency on OLS and CAR in a CTN without an Arbiter assigned and the dependency on BTS to Arbiter communication for CTNs with an Arbiter assigned.
GAS is sent on InfiniBand (IFB) links on z196 GA2 and later machines, using either HCA3-O to HCA3-O connections (12x IFB or 12x IFB3) or HCA3-O LR to HCA3-O LR connections (1x IFB).
The current recovery design is still used when GAS is not received by BTS and for other failure types.
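How GAS relates to the other recovery mechanisms can be summarized in a short sketch. This is a simplification for illustration only: the function name and the returned strings are hypothetical, and the sketch ignores secondary paths such as console-assisted recovery being used when the Arbiter cannot be reached.

```python
def bts_recovery_mechanism(gas_received: bool, arbiter_assigned: bool) -> str:
    """Pick the mechanism the BTS relies on after losing the CTS (simplified)."""
    if gas_received:
        # GAS is sufficient on its own, with or without an Arbiter.
        return "take-over-on-gas"
    if arbiter_assigned:
        # OLS is ignored when an Arbiter is assigned.
        return "arbiter-assisted-recovery"
    return "ols-and-console-assisted-recovery"
```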
4.5 Arbiter-assisted recovery
Arbiter-assisted recovery is applicable when both a BTS and an Arbiter are assigned. The BTS does not invoke OLS rules because the Arbiter provides additional means to determine whether the BTS can take over.
If the BTS loses communication on all of its established paths to the CTS, it attempts to determine the status of the CTS through the Arbiter:
If both the BTS and the Arbiter cannot communicate with the CTS, then the BTS takes over as the CTS and becomes the Stratum 1.
If the CTS is still alive after losing communication with both the Arbiter and the BTS, it will switch to Stratum 0 so that the CTN does not end up with two Stratum 1 servers.
If the Arbiter can communicate with the CTS, then the BTS will not take over, but instead transition to Stratum 3 and get its timing signals from a Stratum 2 server, for example the Arbiter. The Arbiter takes note that the BTS no longer has connectivity to the CTS and, should it subsequently lose contact with the CTS, the Arbiter will inform the BTS accordingly, causing the BTS to proceed with taking over as the CTS.
If the BTS is unable to communicate with the Arbiter due to connectivity failure, console-assisted recovery is invoked by the BTS as an alternate method for determining the status of the CTS.
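The BTS side of Arbiter-assisted recovery can be sketched as a decision function. The function name, parameters, and returned strings are hypothetical and serve only to restate the cases above; the actual protocol exchanges are not modeled.

```python
def bts_arbiter_assisted_action(bts_sees_cts: bool,
                                bts_sees_arbiter: bool,
                                arbiter_sees_cts: bool) -> str:
    """Return the action taken by the BTS (illustrative only).

    arbiter_sees_cts is ignored when the BTS cannot reach the Arbiter.
    """
    if bts_sees_cts:
        return "no-action"
    if not bts_sees_arbiter:
        # Fall back to console-assisted recovery.
        return "invoke-console-assisted-recovery"
    if arbiter_sees_cts:
        # CTS is alive: do not take over, use a Stratum 2 server as source.
        return "become-stratum-3"
    # Neither the BTS nor the Arbiter can reach the CTS.
    return "take-over-as-cts"
```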
 
Note: In a two-site CTN, the location of the Arbiter is critical: it will determine which site remains operational after a loss of communication between the PTS and BTS. Arbiter location considerations are discussed in Server Time Protocol Recovery Guide, SG24-7380, Section 2.4.8, “Two-site considerations”.
Blocking disruptive actions on STP role servers
Disruptive actions such as a power-on reset (POR) are blocked for the CTS. A new function extends this blocking to any of the STP role servers: PTS, BTS, or Arbiter. This prevents a disruptive action from causing the CTS to give up the Stratum 1 (S1) role and go to Stratum 0 (S0). For example, if the PTS has the CTS role and the Arbiter (or BTS) has a planned or unplanned outage, a subsequent disruptive action on the BTS (or Arbiter) would cause the CTS to give up the S1 role and go to S0, because it loses communication with both the BTS and the Arbiter.
This new function was introduced with the MCL levels shown in Table 4-1.
Table 4-1 Blocking disruptive actions
Driver/Server              MCL          Bundle   Release date
D86E / z196                N29809.277   45       Sept 8, 2011
D86E / z196                N29802.420   45       Sept 8, 2011
D79F / z10                 N24415.078   50       Sept 28, 2011
D79F / z10                 N24409.184   50       Sept 28, 2011
D93G / z114 and z196 GA2   Integrated   N/A      Sept 9, 2011
Arbiter-assisted recovery enhancements
The current STP Arbiter-assisted recovery design handles recovery of single failures in an STP-only CTN. Enhancements have now been made to handle planned and unplanned actions that could affect two of the three STP role servers.
This includes safeguarding against the following potential hazards, which could result in a CTN-wide failure:
Planned disruptive actions on the BTS and Arbiter in parallel as part of the same task. Note that disruptive actions on any of the STP role servers will be blocked via the enhancement described previously.
Unplanned failure of a second STP role server when the STP role for the first unplanned failure is not reassigned or removed.
Failure to remove the STP role of a server being upgraded to a new machine type. If the STP role is not removed prior to the upgrade, the new node descriptor prevents the server from reassuming the STP role. This puts the CTN in a state equivalent to that when the server with that STP role has an unplanned outage.
Arbiter-assisted recovery is enhanced so that a degraded state is entered when any two of the three STP role servers (PTS, BTS or Arbiter) agree that they cannot communicate with the third STP role server. A degraded state is entered when:
The BTS and Arbiter can communicate but cannot communicate with the PTS/CTS. The BTS will take over as the CTS and then Arbiter-assisted recovery is disabled.
The PTS and BTS can communicate but cannot communicate with the Arbiter.
The PTS and Arbiter can communicate but cannot communicate with the BTS.
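The three cases can be expressed as a check on pairwise connectivity between the STP role servers. The following sketch is illustrative only; the function name and its boolean inputs are assumptions, and it returns the role server that the other two agree they cannot reach, or None if no degraded state applies.

```python
from typing import Optional


def unreachable_role_server(pts_bts: bool,
                            pts_arbiter: bool,
                            bts_arbiter: bool) -> Optional[str]:
    """Identify the role server that the other two cannot reach.

    Each argument states whether the named pair can communicate.
    """
    if bts_arbiter and not pts_bts and not pts_arbiter:
        return "PTS"
    if pts_bts and not pts_arbiter and not bts_arbiter:
        return "Arbiter"
    if pts_arbiter and not pts_bts and not bts_arbiter:
        return "BTS"
    return None
```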
This new function was introduced with the MCL levels shown in Table 4-2.
Table 4-2 Arbiter-assisted recovery enhancements
Driver/Server              MCL          Bundle   Release date
D79F / z10                 N24406.094   50       Sept 28, 2011
D86E / z196                N29799.110   44       Aug 24, 2011
D93G / z114 and z196 GA2   Integrated   N/A      Sept 9, 2011
Note: The PTS, BTS, and Arbiter servers must all be at the required MCL level for Arbiter-assisted recovery to be disabled.
Disabling Arbiter-assisted recovery provides safeguards against:
Planned disruptive actions done sequentially on the BTS or PTS (whichever is not the CTS) and the Arbiter
Double failures, unplanned or a combination of planned and unplanned, of the BTS or PTS (whichever is not the CTS) and the Arbiter
It does not provide safeguards against planned disruptive actions initiated as part of the same HMC task on both the PTS or BTS (whichever is not the CTS) and the Arbiter.
While Arbiter-assisted recovery is disabled:
The BTS cannot take over as CTS using Arbiter-assisted recovery.
The CTS will not surrender its role when it loses attachment to the remaining special role server.
The BTS can still take over as CTS using either Console-assisted recovery (CAR) or the STP Going Away Signal (GAS) transmitted from the CTS.
 
Note: After Arbiter-assisted recovery has been disabled it will not be reenabled until there is full connectivity between the PTS, BTS, and Arbiter.
4.6 Console-assisted recovery
Console-assisted recovery uses the HMC in an attempt to determine the status of the PTS (when initiated by the BTS) or the status of the BTS (when initiated by the PTS). Console-assisted recovery helps to determine whether the BTS can take over as the CTS, or whether the PTS can take back its role as the CTS.
4.6.1 Console-assisted recovery in a CTN with BTS
In a CTN that does not have an Arbiter configured, console-assisted recovery is used to determine whether the BTS should take over the CTS role. The BTS initiates console-assisted recovery when the BTS has lost communication with the CTS.
If the failure has already been detected through OLS, the BTS has taken over the CTS role, and console-assisted recovery is used to confirm that the PTS has failed.
If the CTS failure has not yet been recognized through OLS, for example, because the failure involved a single link, the BTS takes over if console-assisted recovery confirms that the CTS has failed.
When it initiates console-assisted recovery, the BTS sends a command to its Support Element (SE) to determine the state of the CTS by communicating through the HMC.
If the response indicates that the CTS has failed, the BTS can take over as the new CTS.
If the response indicates that the status of the CTS is either good or indeterminate, the BTS cannot take over as the new CTS and becomes stratum 0.
Analysis of the error by offline signals or console-assisted recovery is made at the time that each process is invoked. In most cases, OLS and console-assisted recovery are processed almost simultaneously and only one recovery situation is visible to the user. However, if the error conditions change between the time that the OLS check is made and console-assisted recovery is run, the final STP recovery decision is based on the analysis of conditions at the time that console-assisted recovery is run.
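The outcome of console-assisted recovery in a CTN without an Arbiter can be sketched as a mapping from the reported CTS status to the BTS action. The enumeration and function names are hypothetical; the sketch does not model the SE and HMC communication itself.

```python
from enum import Enum


class CtsStatus(Enum):
    FAILED = "failed"
    GOOD = "good"
    INDETERMINATE = "indeterminate"


def bts_action_without_arbiter(status: CtsStatus) -> str:
    """Return the BTS action after console-assisted recovery (no Arbiter)."""
    if status is CtsStatus.FAILED:
        return "take-over-as-cts"
    # A good or indeterminate status means the BTS must not take over.
    return "become-stratum-0"
```

Only a confirmed failure allows the takeover; an indeterminate answer is treated the same way as a healthy CTS.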
4.6.2 Console-assisted recovery in a CTN with BTS and Arbiter
In a CTN that has an Arbiter configured, offline signal indications are not used. Console-assisted recovery can be initiated by the BTS or the PTS.
Console-assisted recovery is initiated by the BTS when it has lost communication with both the CTS and the Arbiter. In this case, because the Arbiter is configured, though not available, the offline signals are ignored. Only console-assisted recovery is used.
The BTS sends a command to its SE to determine the state of the CTS by communicating with the CTS through the HMC.
 – If the response indicates that the CTS has failed, the BTS can take over the CTS role.
 – If the response indicates that the status of the CTS is either good or indeterminate, the BTS cannot take over as the new CTS.
Console-assisted recovery is initiated by the PTS when it has lost communication with both the BTS and Arbiter.
The STP-only CTN must only have one CTS, so the status of the BTS must be determined.
The PTS first surrenders its role of CTS as soon as connectivity to the BTS and Arbiter is lost. Then it uses console-assisted recovery in an attempt to determine whether it should take back the CTS role. The PTS sends a command to its SE to determine the state of the BTS by communicating with the BTS through the HMC.
 – If the response indicates that the BTS has failed, the PTS takes back its CTS role.
 – If the response indicates that the BTS has taken over the CTS role, or is inconclusive, the PTS does not take back the CTS role.
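The PTS-initiated sequence differs from the BTS-initiated one in that the PTS gives up the CTS role first and only then queries the state of the BTS. A minimal sketch of that ordering follows; the function name, the status strings, and the step labels are hypothetical.

```python
from typing import List


def pts_recovery_steps(bts_status: str) -> List[str]:
    """List the steps the PTS takes after losing the BTS and the Arbiter.

    bts_status is the console-assisted recovery result:
    "failed", "took-over", or "inconclusive" (hypothetical values).
    """
    steps = ["surrender-cts-role", "query-bts-status-through-se-and-hmc"]
    if bts_status == "failed":
        steps.append("take-back-cts-role")
    else:
        # The BTS has taken over, or its state could not be determined.
        steps.append("do-not-take-back-cts-role")
    return steps
```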
4.7 Island condition
An island condition occurs when a server detects that one or more servers might be operating as a separate timing network with the same CTN ID, but with a different definition of which servers are performing the PTS, CTS, and Arbiter roles.
4.8 Switch to local timing mode
When a server becomes unsynchronized and transitions to stratum 0, the resident z/OS system images running in STP timing mode switch to local timing mode.
For z/OS systems running on the server, the impact of switching to local timing mode depends on the PLEXCFG parameter in IEASYSxx, and the ETRMODE or STPMODE specified in CLOCKxx.
If a system running in either ETR or STP timing mode loses its time source, then:
 – If PLEXCFG=XCFLOCAL or PLEXCFG=MONOPLEX is specified, the system continues in local timing mode. In ETR mode, message IEA261I NO ETR PORTS ARE USABLE. CPC CONTINUES TO RUN IN LOCAL MODE. is issued; in STP mode, message IEA381I THE STP FACILITY IS NOT USABLE. SYSTEM CONTINUES IN LOCAL MODE. is issued.
 – If PLEXCFG=MULTISYSTEM or PLEXCFG=ANY is specified, message IEA015A is issued in ETR timing mode or message IEA394A is issued in STP timing mode. In a Mixed CTN, z/OS systems on Stratum 1 servers issue IEA015A while z/OS systems on Stratum 2 servers issue IEA394A.
z/OS systems that specify PLEXCFG=MULTISYSTEM or PLEXCFG=ANY in IEASYSxx, and ETRMODE YES or STPMODE YES in CLOCKxx, issue WTOR message IEA015A or IEA394A to allow operator intervention to resolve the problem before a wait state is loaded.
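The combinations of PLEXCFG value and timing mode can be collected in a small lookup function. This Python sketch only restates the message IDs named in this section; the function name is hypothetical and it is not part of z/OS.

```python
def lost_time_source_message(plexcfg: str, timing_mode: str) -> str:
    """Return the message ID issued when a system loses its time source.

    plexcfg: PLEXCFG value from IEASYSxx (XCFLOCAL, MONOPLEX, MULTISYSTEM, ANY).
    timing_mode: "ETR" or "STP".
    """
    plexcfg = plexcfg.upper()
    timing_mode = timing_mode.upper()
    if timing_mode not in ("ETR", "STP"):
        raise ValueError(f"unknown timing mode: {timing_mode}")
    if plexcfg in ("XCFLOCAL", "MONOPLEX"):
        # Informational message: the system continues in local timing mode.
        return "IEA261I" if timing_mode == "ETR" else "IEA381I"
    if plexcfg in ("MULTISYSTEM", "ANY"):
        # WTOR: operator can fix the problem before a wait state is loaded.
        return "IEA015A" if timing_mode == "ETR" else "IEA394A"
    raise ValueError(f"unknown PLEXCFG value: {plexcfg}")
```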
z/OS system images that are using a Sysplex timer as their timing source issue WTOR message IEA015A (Example 4-1). The Sysplex timer connectivity needs to be reestablished before a reply with RETRY will be accepted.
Example 4-1 WTOR message IEA015A
IEA015A THIS SYSTEM HAS LOST ALL CONNECTION TO THE SYSPLEX TIMER.

IF THIS EVENT OCCURRED ON SOME, BUT NOT ALL SYSPLEX MEMBERS THE
LIKELY CAUSE IS A LINK FAILURE. TO FIX, ENSURE THAT EACH AFFECTED
SYSTEM HAS AT LEAST ONE CORRECTLY CONNECTED AND FUNCTIONAL LINK.

IF THIS EVENT OCCURRED ON ALL SYSPLEX MEMBERS, THEN THE LIKELY
CAUSE IS A SYSPLEX TIMER FAILURE. TO FIX, REFER TO THE MESSAGE
IEA015A DESCRIPTION IN MVS SYSTEM MESSAGES.

AFTER FIXING THE PROBLEM, REPLY "RETRY" FROM THE SERVICE CONSOLE
(HMC). IF THE PROBLEM WAS NOT CORRECTED, THIS MESSAGE WILL BE
REISSUED AND YOU MAY TRY AGAIN. REPLY "ABORT" TO EXIT MESSAGE
LOOP. PROBABLE RESULT: 0A2-114 WAITSTATE.
z/OS system images that are using a Stratum 1 or Stratum 2 server as timing source issue WTOR message IEA394A (Example 4-2). Once the CEC is Stratum 1, Stratum 2 or Stratum 3 again, a reply of RETRY will be accepted.
Example 4-2 WTOR message IEA394A
IEA394A THIS SERVER HAS LOST CONNECTION TO ITS SOURCE OF TIME.

IF THIS EVENT OCCURRED ON SOME, BUT NOT ALL NETWORK SERVERS THE
LIKELY CAUSE IS A LINK FAILURE. TO FIX, ENSURE THAT EACH AFFECTED
SERVER HAS AT LEAST ONE CORRECTLY CONNECTED AND FUNCTIONAL LINK.

IF THIS EVENT OCCURRED ON ALL NETWORK SERVERS, THEN THE LIKELY
CAUSE IS A TIMING NETWORK FAILURE. TO FIX, REFER TO THE MESSAGE
IEA394A DESCRIPTION IN MVS SYSTEM MESSAGES.

AFTER FIXING THE PROBLEM, REPLY "RETRY" FROM THE SERVICE CONSOLE
(HMC). IF THE PROBLEM IS NOT CORRECTED, THIS MESSAGE WILL BE
REISSUED AND YOU MAY TRY AGAIN. REPLY "ABORT" TO EXIT MESSAGE
LOOP. PROBABLE RESULT: 0A2-158 WAITSTATE.
4.9 External Time Source (ETS)
In an STP-only CTN, the ETS function is available using three different options:
Using dial-out on the hardware management console (HMC)
Using Network Time Protocol (NTP) client support on the Support Element
Using NTP client support on the Support Element along with a pulse per second input on the ETR cards
There are no specific recovery actions when the ETS is configured to use a dial-out service. Depending on the ETS configuration, there are two recovery concepts available when NTP is used, with or without PPS:
NTP server availability: Two NTP servers configured for one System z server
Continuous NTP server availability: NTP server configured for both the PTS and the BTS
 
Note: If STP loses connectivity to all its NTP servers (with or without PPS), all servers in the CTN remain time synchronized. The CST might drift away from the NTP time source until NTP server communication is re-established.
The following list explains terms that are used within the NTP recovery sections:
Network Time Protocol server
This provides the capability to keep the CST synchronized to the ETS to within 100 milliseconds.
NTP server with pulse per second (PPS)
This provides the capability to keep the CST synchronized to the ETS to within 10 microseconds. To achieve this accuracy, a highly stable and accurate PPS output, provided by the NTP server, is used.
Selected and non-selected NTP server
The user is responsible for selecting the preferred NTP server. The preferred NTP server is called the selected NTP server. If two NTP servers have been configured, the second one is called the non-selected NTP server.
For detailed planning information about how CTN reacts to ETS changes, see Server Time Protocol Recovery Guide, SG24-7380.
 