IBM predictive failure analysis
This chapter describes the IBM predictive failure analysis (PFA) enhancements in z/OS V2R2 and includes the following topics:
4.1 PFA overview
Software detected system failures are categorized into one of the following types:
Masked failures: These failures are software detected system failures that the software component itself detects and corrects.
Hard failures: These failures occur when the software fails completely; for example, when an operator stops a process.
Failures that are caused by abnormal behavior: These failures are unexpected or unusual situations that cause the software component to not provide the requested service. In combination with events that often do not generate failures, secondary effects can occur that might eventually result in a system or sysplex outage.
The idea behind PFA is to predict potential problems that might arise in your z/OS environment. These potential problems are not hard failures; instead, these problems are soft failures that can be categorized into the following areas:
Exhaustion of shared resources
Recurring failures that are caused by damage to critical control structures
Serialization problems, such as classic deadlocks and priority inversions
Unexpected state transitions
These soft failures can lead to situations that are called sick but not dead (SBND). These situations have the following characteristics:
They are difficult for components to detect internally.
They are probabilistic, not deterministic.
Situations that arise in this area cause 20% of the problems. Because of their long duration, these situations generate 80% of the business impact.
PFA was originally made available as a small program enhancement (SPE) in z/OS V1R10 with the first two checks, and then officially in z/OS V1R11 with the next two checks. These checks reviewed the following factors:
Common storage usage
LOGREC arrival rate
Virtual storage usage
Message arrival rate
In z/OS V2R2, PFA is improved to ease the set-up and installation processes by completing more verification when it starts.
4.2 PFA summary of changes
This section describes the new functions that were introduced in IBM z/OS V2R2.
4.2.1 Private storage exhaustion check
In z/OS V2R2, PFA monitors several ranges of private area virtual storage for multiple address spaces and warns you when one or more address spaces exceed criteria that can indicate eventual private area virtual storage exhaustion. PFA is written largely in Java, and most of the processor time that PFA uses can run on IBM System z® Integrated Information Processor (zIIP) specialty engines.
In this release, you can include or exclude address spaces by using the new INCLUDED_JOBS file or the EXCLUDED_JOBS file. For example, you can use these files to monitor critical jobs or persistent batch jobs that process message queues in your installation. Data that is collected for jobs supports dynamic checks for the following components:
PFA_PRIVATE_STORAGE_EXHAUSTION
PFA_JES_SPOOL_USAGE
PFA_MESSAGE_ARRIVAL_RATE
PFA_SMF_ARRIVAL_RATE
PFA_ENQUEUE_REQUEST_RATE
PFA can be dynamically updated by using z/OS console commands for the following private storage exhaustion and JES spool usage checks:
F PFA,UPDATE,CHECK(PFA_P*),INCLUDED_JOBS
F PFA,UPDATE,CHECK(PFA_J*)
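The check names in these commands end in an asterisk, so each command selects checks by name prefix. As an illustration only, the following Python sketch shows which of the check names listed in this section each pattern would select. It assumes that the trailing asterisk behaves like a simple wildcard, it considers only the five names listed above, and it uses the standard fnmatch module rather than anything that z/OS itself does to resolve names. The select_checks function is a hypothetical helper.

```python
from fnmatch import fnmatchcase

# Check names copied from the list earlier in this section.
CHECKS = [
    "PFA_PRIVATE_STORAGE_EXHAUSTION",
    "PFA_JES_SPOOL_USAGE",
    "PFA_MESSAGE_ARRIVAL_RATE",
    "PFA_SMF_ARRIVAL_RATE",
    "PFA_ENQUEUE_REQUEST_RATE",
]


def select_checks(pattern, checks=CHECKS):
    """Return the check names that match a wildcard pattern such as PFA_P*.

    Assumption: the asterisk is treated as a plain wildcard. This is an
    illustration, not the name resolution that the MODIFY command performs.
    """
    return [name for name in checks if fnmatchcase(name, pattern)]


if __name__ == "__main__":
    for pattern in ("PFA_P*", "PFA_J*"):
        print(pattern, "->", select_checks(pattern))
```

Among the names listed in this section, each of the two patterns selects exactly one check.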
The PFA_PRIVATE_STORAGE_EXHAUSTION check detects future exhaustion of private storage that is under 2 GB in the following six storage locations within individual address spaces:
Private user region: USER
Private authorized area: AUTH
Private user and private authorized: BELOW the line
Extended private user region: EUSER
Extended private authorized area: EAUTH
Extended private user and extended private authorized: ABOVE the line
Figure 4-1 shows the different virtual storage locations in z/OS that are detected by this check.
Figure 4-1 Virtual storage locations detected by PFA check PFA_PRIVATE_STORAGE_EXHAUSTION
The following components are shown in Figure 4-1:
USER: User region in the private area
EUSER: User region in the extended private area
AUTH: LSQA, SWA, and subpools 229 and 230 in the private area
EAUTH: LSQA, SWA, and subpools 229 and 230 in the extended private area
ABOVE: The extended user private area above 16 MB (the sum of EUSER+EAUTH)
BELOW: The user private area below 16 MB (the sum of USER+AUTH)
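The relationship between the six locations can be sketched in a few lines of Python. The PrivateStorageSample class and the usage figures are hypothetical and exist only to show that BELOW and ABOVE are derived sums rather than independently measured areas. PFA does not expose its data in this form.

```python
from dataclasses import dataclass

MIB = 2**20  # bytes in one mebibyte


@dataclass
class PrivateStorageSample:
    """Hypothetical usage sample for one address space, in bytes."""

    user: int   # USER: user region in the private area
    auth: int   # AUTH: LSQA, SWA, and subpools 229 and 230 in the private area
    euser: int  # EUSER: user region in the extended private area
    eauth: int  # EAUTH: same subpools in the extended private area

    @property
    def below(self):
        # BELOW is the sum of USER and AUTH.
        return self.user + self.auth

    @property
    def above(self):
        # ABOVE is the sum of EUSER and EAUTH.
        return self.euser + self.eauth


# Made-up values, chosen only to make the sums easy to follow.
sample = PrivateStorageSample(
    user=4 * MIB, auth=1 * MIB, euser=300 * MIB, eauth=50 * MIB
)

if __name__ == "__main__":
    print("BELOW (MiB):", sample.below // MIB)
    print("ABOVE (MiB):", sample.above // MIB)
```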
This check does not detect exhaustion that is caused by the following factors:
Fragmentation
Rapid increases in usage that occur on a machine-time scale or within less than one collection interval
PFA uses dynamic severity, which means that the severity of the PFA exception increases as the predicted time to exhaustion gets closer. Dynamic severity is used only for the PFA_COMMON_STORAGE_USAGE and PFA_PRIVATE_STORAGE_EXHAUSTION checks.
As with the other PFA checks, this check is added to the Health Checker when PFA is started. If you want to see all current values for this check, enter the following command:
F PFA,DISPLAY,CHECK(PFA_P*),DETAIL
The result of this command is shown in Figure 4-2.
F PFA,DISPLAY,CHECK(PFA_P*),DETAIL
AIR018I 17:33:21 PFA CHECK DETAIL 461
CHECK NAME: PFA_PRIVATE_STORAGE_EXHAUSTION
ACTIVE : YES
TOTAL COLLECTION COUNT : 1733
SUCCESSFUL COLLECTION COUNT : 1733
LAST COLLECTION TIME : 08/17/2015 17:31:42
LAST SUCCESSFUL COLLECTION TIME: 08/17/2015 17:31:42
NEXT COLLECTION TIME : 08/17/2015 17:36:42
TOTAL MODEL COUNT : 12
SUCCESSFUL MODEL COUNT : 12
LAST MODEL TIME : 08/17/2015 10:05:57
LAST SUCCESSFUL MODEL TIME : 08/17/2015 10:05:57
NEXT MODEL TIME : 08/17/2015 22:05:57
CHECK SPECIFIC PARAMETERS:
COLLECTINT : 5
MODELINT : 720
COLLECTINACTIVE : 1=ON
DEBUG : 0=OFF
EXCDIRDAYS : 90
FORCEMODEL : NO
COLL% : 20
COLLUPTIME : 180
MOD% : 40
COMP% : 100
E_HIGH : 180
E_MED : 300
E_LOW : MAX
E_NONE : UNUSED
Figure 4-2 Output of the PFA display command
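The timestamps in Figure 4-2 are consistent with the COLLECTINT and MODELINT values if both intervals are read as minutes. The following Python sketch reproduces that arithmetic. It assumes that the next collection or model time is the last time plus the interval, which matches the values in the figure but is not stated as a scheduling rule here. The next_time function is a hypothetical helper.

```python
from datetime import datetime, timedelta

# Timestamp layout used in the display output: MM/DD/YYYY HH:MM:SS
FMT = "%m/%d/%Y %H:%M:%S"


def next_time(last, interval_minutes):
    """Add an interval in minutes to a timestamp in the display format."""
    moment = datetime.strptime(last, FMT) + timedelta(minutes=interval_minutes)
    return moment.strftime(FMT)


if __name__ == "__main__":
    # LAST COLLECTION TIME plus COLLECTINT (5)
    print(next_time("08/17/2015 17:31:42", 5))
    # LAST MODEL TIME plus MODELINT (720)
    print(next_time("08/17/2015 10:05:57", 720))
```

The two results match the NEXT COLLECTION TIME and NEXT MODEL TIME values in the figure.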
Figure 4-2 also shows the following defaults for the time to exhaustion:
E_HIGH(180): If time to exhaustion is predicted to be 0 - 180 minutes from now, a critical eventual action write to operator (WTO) is issued.
E_MED(300): If time to exhaustion is predicted to be from more than E_HIGH minutes to 300 minutes from now, an eventual action WTO is issued.
E_LOW(MAX): If time to exhaustion is predicted to be from more than E_MED minutes to the expiration of the prediction, an informational WTO is issued.
E_NONE(UNUSED): A value of 0 or UNUSED for the number of minutes indicates that this dynamic severity is not used.
This information is also used for the PFA_COMMON_STORAGE_USAGE check.
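The threshold rules above can be sketched as a small lookup. The following Python example is a minimal illustration under stated assumptions, not PFA's implementation: the classify and limit functions are hypothetical, the thresholds are assumed to be evaluated from E_HIGH down to E_NONE, and MAX is treated as having no upper bound.

```python
import math

# Default values shown in Figure 4-2.
DEFAULTS = {"E_HIGH": 180, "E_MED": 300, "E_LOW": "MAX", "E_NONE": "UNUSED"}


def limit(value):
    """Convert a parameter value to an upper bound in minutes.

    A value of 0 or UNUSED means that the severity is not used (None).
    MAX is assumed to mean that there is no upper bound.
    """
    if value in (0, "UNUSED"):
        return None
    if value == "MAX":
        return math.inf
    return float(value)


def classify(minutes_to_exhaustion, params=DEFAULTS):
    """Return the first severity parameter whose range covers the prediction.

    Assumption: severities are checked in the order E_HIGH, E_MED, E_LOW,
    E_NONE, and each range ends at its own limit (inclusive).
    """
    for name in ("E_HIGH", "E_MED", "E_LOW", "E_NONE"):
        upper = limit(params[name])
        if upper is not None and minutes_to_exhaustion <= upper:
            return name
    return None


if __name__ == "__main__":
    for minutes in (60, 180, 181, 300, 301):
        print(minutes, "->", classify(minutes))
```

With the defaults, a prediction of exactly 180 minutes falls under E_HIGH, 181 minutes falls under E_MED, and anything beyond 300 minutes falls under E_LOW because E_NONE is unused.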
Benefit and value
This check improves system availability by warning you, before storage is exhausted, when the rate of storage consumption is excessive.