IBM PowerHA SystemMirror V7.2.0 and V7.2.1 for IBM AIX new features
This chapter covers the specific features that are new to IBM PowerHA SystemMirror for IBM AIX for Version 7.2 and Version 7.2.1.
This chapter covers the following topics:
PowerHA V7.2 related:
 – Network Failure Detection Tunable
 – Built-in NETMON logic
 – Traffic stimulation for better interface failure detection
 – Monitor /var usage
 – New lscluster option -g
 – Quarantine protection against “sick but not dead” nodes
 – NFS Tie Breaker support for split and merge policies
 – Geographic Logical Volume Manager (GLVM) wizard
PowerHA V7.2.1 related:
 – New option for starting PowerHA by using clmgr
 – Graphical user interface
 – Split and Merge enhancements
 – PowerHA SystemMirror Resource Optimized High Availability (ROHA) enhancements
2.1 Resiliency enhancements
Every release of PowerHA SystemMirror aims to make the product even more resilient than its predecessors. PowerHA SystemMirror for AIX 7.2 continues this tradition.
2.1.1 Integrated support for AIX Live Kernel Update
AIX 7.2 introduced a new capability that allows concurrent patching without interruption to the applications. This capability is known as AIX Live Kernel Update (LKU). Initially, this capability is supported only for interim fixes, but it is the foundation for broader patching of service packs and eventually technology levels.
 
Tip: For more information about LKU, see AIX Live Updates.
A demonstration of LKU is available in this YouTube video.
Consider the following key points about PowerHA integrated support for LKUs:
LKU can be performed on only one cluster node at a time.
Support includes all PowerHA SystemMirror Enterprise Edition Storage replication features, including HyperSwap and GLVM.
However, for asynchronous GLVM, you must swap to sync mode before LKU is performed, and then swap back to async mode upon LKU completion.
During LKU operation, enhanced concurrent volume groups (VGs) cannot be changed.
Workloads continue to run without interruption.
PowerHA scripts and checks during Live Kernel Update
PowerHA provides scripts that are called during different phases of the AIX LKU notification mechanism. The following overview shows which PowerHA operations are performed in each phase:
Check phase:
 – Verifies that no other concurrent AIX Live Update is in progress in the cluster.
 – Verifies that the cluster is in stable state.
 – Verifies that there are no active GLVM asynchronous mirror pools.
Pre-phase:
 – Switches the active Enhanced Concurrent VGs to silent mode.
 – Stops the cluster services and SRC daemons.
 – Stops GLVM traffic if required.
Post phase:
 – Restarts GLVM traffic.
 – Restarts System Resource Controller (SRC) daemons and cluster services.
 – Restores the state of the Enhanced Concurrent VGs.
Enabling and disabling AIX Live Kernel Update support of PowerHA
As is the case for most features and functions of PowerHA, this feature can be enabled and disabled by using either the System Management Interface Tool (SMIT) or the clmgr command. In either case, it must be set on each node.
When you enable AIX LKU through SMIT, the option values are yes and no. When you use the clmgr command, the corresponding settings are true and false. The default is enabled (yes/true).
To modify by using SMIT, complete the following steps, as shown in Figure 2-1:
1. Run smitty sysmirror and select Cluster Nodes and Networks → Manage Nodes → Change/Show a Node.
2. Select the wanted node.
3. Set the Enable AIX Live Update operation field as wanted.
4. Press Enter.
                              Change/Show a Node
 
Type or select values in entry fields.
Press Enter AFTER making all wanted changes.
 
                                                        [Entry Fields]
* Node Name                                           Jess
  New Node Name                                       []
  Communication Path to Node                          [Jess]                 +
  Enable AIX Live Update operation                    Yes                     +
Figure 2-1 Enabling the AIX Live Kernel Update operation
Here is an example of how to check the current value of this setting by using the clmgr command:
[root@Jess] /# clmgr view node Jess |grep LIVE
ENABLE_LIVE_UPDATE="true"
Here is an example of how to disable this setting by using the clmgr command:
[root@Jess] /# clmgr modify node Jess ENABLE_LIVE_UPDATE=false
In order for the change to take effect, the cluster must be synchronized.
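Because the setting is node-scoped and a synchronization is required, a complete disable sequence might look like the following sketch. Jess is taken from the examples above, the second node name (Maggie) is purely hypothetical, and clmgr sync cluster performs the required synchronization:
# clmgr modify node Jess ENABLE_LIVE_UPDATE=false
# clmgr modify node Maggie ENABLE_LIVE_UPDATE=false
# clmgr sync cluster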
Logs that are generated during the AIX Live Kernel Update operation
The two logs that are used during the operation of an AIX LKU are both in the /var/hacmp/log directory:
lvupdate_orig.log This log file keeps information from the original source system logical partition (LPAR).
lvupdate_surr.log This log file keeps information from the target surrogate system LPAR.
 
Tip: A demonstration of performing an LKU on a stand-alone AIX system and not a PowerHA node is available in this YouTube video.
2.1.2 Automatic Repository Replacement
Cluster Aware AIX (CAA) detects when a repository disk failure occurs and generates a notification message. The notification messages continue until the failed repository disk is replaced. PowerHA V7.1.1 introduced the ability to define a backup repository disk, but the replacement procedure was a manual one. Beginning with PowerHA V7.2 combined with AIX 7.1.4 or 7.2.0, Automatic Repository Replacement (ARR), also referred to as automatic repository update (ARU), can automatically swap a failed repository disk with the backup repository disk.
A maximum of six repository disks per site can be defined in a cluster. The backup disks are polled once a minute by clconfd to verify that they are still viable for an ARU operation. The steps to define a backup repository disk are the same as in previous versions of PowerHA. These steps and examples of failure situations can be found in 4.2, “Automatic repository update for the repository disk” on page 79.
 
Tip: An overview of configuring and a demonstration of Automatic Repository Replacement (ARR) can be found in this YouTube video.
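Defining the backup repository disk itself is done with the standard clmgr repository actions. The following minimal sketch assumes a spare disk named hdisk5 that is visible to all nodes; the disk name is hypothetical, and the query output format varies by level:
# clmgr add repository hdisk5
# clmgr query repository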
2.1.3 Verification enhancements
Cluster verification is the framework that checks environmental conditions across all nodes in the cluster. Its purpose is to help ensure the proper operation of cluster events when they occur. Every new release of PowerHA provides more verification checks. PowerHA V7.2 adds both new default checks and a new option for detailed verification checks.
The following new checks run by default:
Verify that the reserve_policy setting on shared disks is not set to single_path (see the example after this list).
Verify that the /etc/filesystems entries for shared file systems are consistent across nodes.
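For example, the reserve_policy check can be confirmed manually on all nodes by using the clcmd distributed command that is shown elsewhere in this chapter. The disk name hdisk2 is a hypothetical shared disk:
# clcmd lsattr -El hdisk2 -a reserve_policy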
The new detailed verification checks, which run only when explicitly enabled, include the following actions:
Physical volume identifier (PVID) consistency between the logical volume manager (LVM) and the object data manager (ODM) on each node.
AIX Runtime Expert checks for LVM and Network File System (NFS) settings.
Network errors against a predefined 5% threshold.
GLVM buffer size.
Security configuration, such as password rules.
Kernel parameters, such as network, Virtual Memory Manager (VMM), and so on.
Using the new detailed verification checks might add a significant amount of time to the verification process. To enable them, run smitty sysmirror, select Custom Cluster Configuration → Verify and Synchronize Cluster Configuration (Advanced), and set the Detailed checks option to Yes, as shown in Figure 2-2. This option must be set manually each time because it always defaults to No, and it is available only if cluster services are not running.
              PowerHA SystemMirror Verification and Synchronization
 
Type or select values in entry fields.
Press Enter AFTER making all wanted changes.
 
[Entry Fields]
* Verify, Synchronize or Both [Both] +
* Include custom verification library checks [Yes] +
* Automatically correct errors found during [No] +
verification?
 
* Force synchronization if verification fails? [No] +
* Verify changes only? [No] +
* Logging [Standard] +
* Detailed checks Yes +
* Ignore errors if nodes are unreachable ? No +
 
F1=Help F2=Refresh F3=Cancel F4=List
F5=Reset F6=Command F7=Edit F8=Image
F9=Shell F10=Exit Enter=Do
 
Figure 2-2 Enabling detailed verification checking
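Verification and synchronization can also be started from the command line. The following minimal sketch uses the standard clmgr actions; whether the detailed checks option itself is exposed through clmgr depends on the installed level, so only the basic actions are shown:
# clmgr verify cluster
# clmgr sync cluster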
2.1.4 Using Logical Volume Manager rootvg failure monitoring
AIX LVM recently added the capability to change a VG to be known as a critical VG. Although PowerHA has supported critical VGs in the past, that support applied only to data (non-operating system) VGs. PowerHA V7.2 now also takes advantage of this function specifically for rootvg.
If the VG is set as a critical VG, any input/output (I/O) request failure starts the LVM metadata write operation to check the state of the disk before returning the I/O failure. If rootvg has the critical VG option set and if the system cannot access a quorum of rootvg disks or all rootvg disks if quorum is disabled, then the node is failed with a message sent to the console.
You can set and validate rootvg as a critical VG by running the commands that are shown in Figure 2-3. Because the clcmd CAA distributed command is used, the commands need to be run only once and take effect on all nodes.
# clcmd chvg -r y rootvg
# clcmd lsvg rootvg |grep CRIT
DISK BLOCK SIZE: 512 CRITICAL VG: yes
DISK BLOCK SIZE: 512 CRITICAL VG: yes
Figure 2-3 Enabling rootvg as a critical volume group
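Because the forced halt depends on whether a quorum of rootvg disks is lost (or all disks, if quorum is disabled), it can also be useful to confirm the rootvg quorum setting. A quick check, again by using clcmd so that it runs on all nodes (the grep filter is only an illustration):
# clcmd lsvg rootvg | grep -i quorum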
Testing rootvg failure detection
In this environment, rootvg is on Storwize V7000 logical unit numbers (LUNs) that are connected to the PowerHA nodes by virtual Fibre Channel (FC) adapters. The loss of a disk can be simulated in several ways, but one of the following methods is typically used:
From within the storage management, simply unmap the volumes from the host.
Unmap the virtual FC adapter from the real adapter on the Virtual I/O Server (VIOS).
Unzone the virtual worldwide port names (WWPNs) from the storage area network (SAN).
In this environment, we use the first option of unmapping the volume from the storage side. The other two options usually affect all of the disks rather than only rootvg, although that is usually acceptable as well.
After the loss of the rootvg LUN is detected, a kernel panic ensues. If the failure occurs on a PowerHA node that is hosting a resource group (RG), then an RG fallover occurs, as with any unplanned outage.
If you check the error report after restarting the system successfully, it has a kernel panic entry, as shown in Example 2-1.
Example 2-1 Kernel panic error report entry
---------------------------------------------------------------------------
LABEL: KERNEL_PANIC
IDENTIFIER: 225E3B63
 
Date/Time: Mon Jan 25 21:23:14 CST 2016
Sequence Number: 140
Machine Id: 00F92DB14C00
Node Id: PHA72a
Class: S
Type: TEMP
WPAR: Global
Resource Name: PANIC
 
Description
SOFTWARE PROGRAM ABNORMALLY TERMINATED
 
Recommended Actions
PERFORM PROBLEM DETERMINATION PROCEDURES
 
Detail Data
ASSERT STRING
 
PANIC STRING
Critical VG Force off, halting.
The node must be restarted and cluster services resumed. As always, when a node rejoins the cluster, movement of RGs might be wanted or might happen automatically, depending on the cluster configuration.
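If a manual move is wanted after the node rejoins, the RG can be relocated with clmgr. This is only a sketch: the RG name testRG is illustrative, Jess is the node name used in earlier examples, and the exact attribute names should be confirmed against your level of clmgr:
# clmgr move resource_group testRG node=Jess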
2.1.5 Live Partition Mobility automation
Performing a Live Partition Mobility (LPM) operation of a PowerHA node is supported. However, it is not without risk. Because of the unique nature of LPM, certain events, such as network loss, can be triggered during the operation. As a result, the LPM automation integration feature was created.
 
Note: Previously, it was preferable to unmanage a node before performing LPM, but not many users were aware of this.
PowerHA scripts and checks during Live Partition Mobility
PowerHA provides scripts that are called during different phases of the LPM notification mechanism. The following overview shows which PowerHA operations are performed in each phase:
Check phase:
 – Verify that no other concurrent LPM is in progress in the cluster.
 – Verify that the cluster is in the stable state.
 – Verify network communications between cluster nodes.
Pre-phase:
 – If the LPM node policy is set to unmanage, or if IBM HyperSwap is used, stop cluster services in unmanaged mode.
 – On the local node, and on the peer node in a two-node configuration:
 • Stop the Reliable Scalable Cluster Technology (RSCT) Dead Man Switch (DMS).
 • If HEARTBEAT_FREQUENCY_DURING_LPM is set, change the CAA node timeout.
 • If the CAA deadman_mode at the node level is set to a (assert), set it to e (event).
 
Note: A deadman switch is an action that occurs when CAA detects that a node has become isolated in a multinode environment. This setting occurs when nodes are not communicating with each other through the network and the repository disk.
The AIX operating system can react differently depending on the deadman switch setting or the deadman_mode, which is tunable. The deadman switch mode can be set to either force a system shutdown or generate an Autonomic Health Advisor File System (AHAFS) event.
 – Restrict SAN communications across nodes.
Post phase:
 – Restart cluster services.
 – On the local node, and on the peer node in a two-node configuration:
 • Restart the RSCT DMS.
 • Restore the CAA node timeout.
 • Restore the CAA deadman_mode.
 – Re-enable SAN communications across nodes.
The following new cluster heartbeat settings are associated with the auto handling of LPM:
Node Failure Detection Timeout during LPM
If specified, this timeout value (in seconds) is used during an LPM instead of the Node Failure Detection Timeout value.
You can use this option to increase the Node Failure Detection Timeout for the duration of the LPM operation so that it is greater than the LPM freeze duration, which avoids the risk of unwanted cluster events. Enter a value in the range 10 - 600.
LPM Node Policy
This specifies the action to be taken on the node during an LPM operation.
If unmanage is selected, the cluster services are stopped with the Unmanage Resource Groups option for the duration of the LPM operation. Otherwise, PowerHA SystemMirror continues to monitor the RGs and application availability.
As is common, these options can be set by using both SMIT and the clmgr command line. To change these options by using SMIT, run smitty sysmirror and select Custom Cluster Configuration → Cluster Nodes and Networks → Manage the Cluster → Cluster Heartbeat Settings, as shown in Figure 2-4.
                     Cluster heartbeat settings
 
Type or select values in entry fields.
Press Enter AFTER making all wanted changes.
 
[Entry Fields]
 
* Network Failure Detection Time [20] #
* Node Failure Detection Timeout [30] #
* Node Failure Detection Grace Period [10] #
* Node Failure Detection Timeout during LPM [120]                    #
* LPM Node Policy [unmanage]               +
 
 
F1=Help F2=Refresh F3=Cancel F4=List
F5=Reset F6=Command F7=Edit F8=Image
F9=Shell F10=Exit Enter=Do
 
Figure 2-4 Enabling LPM integration
An example of using clmgr to check and change these settings is shown in Example 2-2.
Example 2-2 Using the clmgr command
[root@Jess] /# clmgr query cluster |grep LPM
LPM_POLICY=""
HEARTBEAT_FREQUENCY_DURING_LPM="0"
 
[root@Jess] /# clmgr modify cluster HEARTBEAT_FREQUENCY_DURING_LPM="120"
[root@Jess] /# clmgr modify cluster LPM_POLICY=unmanage
 
[root@Jess] /# clmgr query cluster |grep LPM
LPM_POLICY="unmanage"
HEARTBEAT_FREQUENCY_DURING_LPM="120"
Even with these new automated steps, there are still a few manual steps when using SAN communication:
Before LPM
Verify that the tme attribute is set to yes on the target system's VIOS FC adapters (see the sketch after this list).
After LPM
Reestablish SAN communication between VIOS and the client LPAR through a virtual local area network (VLAN) 3358 adapter configuration.
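The tme attribute can be checked and, if necessary, changed from the VIOS restricted shell (padmin). The adapter name fcs0 is only an example; changing the attribute on a busy adapter typically requires the -perm flag and does not take effect until the adapter is reconfigured or the VIOS is restarted:
$ lsdev -dev fcs0 -attr tme
$ chdev -dev fcs0 -attr tme=yes -perm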
No matter which method you choose to change these settings, the cluster must be synchronized for the change to take effect cluster-wide.
2.2 Cluster Aware AIX enhancements
In every new AIX level, CAA is also updated. The CAA version typically references the year in which it was released. For example, the AIX 7.2 CAA level is referenced as the 2015 version, also known as release 4. Table 2-1 shows the matching AIX and PowerHA levels to the CAA versions. This section continues with features that are new to CAA (2015/R4).
Table 2-1 IBM AIX and PowerHA levels to CAA versions
Internal version    External release    AIX level       PowerHA level
2011                R1                  6.1.7/7.1.1     7.1.1
2012                R2                  6.1.8/7.1.2     7.1.2
2013                R3                  6.1.9/7.1.3     7.1.3
2015                R4                  7.1.4/7.2.0     7.2
2016                R5                  7.2.1           7.2.1
 
Note: The listed AIX and PowerHA levels are the preferred combinations to use all new features. However, these are not the only possible combinations.
2.2.1 Network failure detection tunable
PowerHA V7.1 had a fixed latency for network failure detection of about 5 seconds. In PowerHA V7.2, the network failure detection time is tunable, and the default is now 20 seconds. The tunable is named network_fdt.
 
Note: The network_fdt tunable is also available for PowerHA V7.1.3. To get it for PowerHA V7.1.3, you must open a PMR and request the “Tunable FDT interim fix bundle”.
The self-adjusting CAA network heartbeat behavior, which was introduced with PowerHA V7.1.0, still exists and is still used. It has no impact on the network failure detection time.
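The tunable can be inspected and changed directly at the CAA level with the clctrl command. This is only a sketch: on most levels the value is specified in milliseconds (20000 corresponds to 20 seconds), and in a PowerHA cluster the value is normally managed through SMIT or clmgr rather than set directly:
# clctrl -tune -L | grep network_fdt
# clctrl -tune -o network_fdt=20000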
2.2.2 Built-in NETMON logic
NETMON logic was previously handled by RSCT. Because it was difficult to keep the CAA and RSCT layers synchronized about the adapter state, the NETMON logic was moved into the CAA layer.
The configuration file remains the same, namely /usr/es/sbin/cluster/netmon.cf. A !REQD entry in the netmon.cf file indicates that special handling is needed that differs from the traditional netmon methods. For more information about netmon.cf file usage and formatting, see IBM Knowledge Center.
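A netmon.cf file that uses the !REQD format lists, on each line, the owner (an interface name, an IP address, or !ALL) followed by the target to ping. The interface names and addresses below are purely illustrative:
!REQD en2 100.12.7.9
!REQD en2 host4.example.com
!REQD !ALL 100.12.7.31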
2.2.3 Traffic stimulation for better interface failure detection
Multicast pings are sent to the all-hosts multicast group just before an interface is marked as down. This ping is distributed to the nodes within the subnet. Any node that receives this request replies (even if the node is not part of the cluster), which generates incoming traffic on the adapter. The multicast ping uses the address 224.0.0.1. All nodes register for this multicast group by default. Therefore, there is a good chance that some incoming traffic is generated by this method.
2.2.4 Monitoring /var usage
Starting with PowerHA V7.2.0, the /var file system is monitored by default. This monitoring is done by the clconfd subsystem, and the following default values are used (a manual equivalent of this check is sketched after the list):
Threshold 75% (range 70 - 95%)
Interval 15 minutes (range 5 - 30 minutes)
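The monitoring itself is performed by clconfd, but an equivalent manual check of the default 75% threshold can be sketched as follows. This is only an illustration that assumes the default AIX df -g column layout, not the clconfd implementation:
# df -g /var | awk 'NR==2 { gsub("%","",$4); if ($4+0 >= 75) print "/var usage " $4 "% exceeds the 75% threshold" }'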
2.2.5 New lscluster option -g
Starting with AIX V7.1 TL4 and AIX V7.2, there is an additional option for the CAA lscluster command.
The new -g option lists the interfaces that can potentially be used as CAA communication paths between the cluster nodes. For a more detailed description, see 4.1.4, “New lscluster option -g” on page 69.
2.2.6 CAA level added to the lscluster -c output
Starting with AIX 7.2.1, the lscluster -c command also displays the CAA level. This command is useful if you need to know whether the new network failure detection tunable is supported by default on your installation. For more information, see 2.2.1, “Network failure detection tunable” on page 25 and 4.1.5, “Interface failure detection” on page 78.
This enhancement was also back-ported, and it is automatically included with AIX 7.1.4.2 or later and with AIX 7.2.0.2 or later.
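Assuming that the level information appears in the lscluster -c output of your installation, a quick check looks like the following (the grep filter is only illustrative because the exact label can vary by level):
# lscluster -c | grep -i level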
2.3 Enhanced split-brain handling
Split-brain, also known as a partitioned cluster, refers to the situation in which all communication between cluster nodes is lost while the nodes are still running. PowerHA V7.2 supports new policies to quarantine a sick or dead active node. These policies help handle cluster-split scenarios and ensure data protection when a split occurs. The following two new policies are supported:
Disk fencing
Disk fencing uses the Small Computer System Interface 3 (SCSI-3) Persistent Reservation mechanism to fence out the sick or dead node and block future writes from it (see the example after this list).
Hardware Management Console (HMC)-based Active node shutdown
With the HMC-based Active node shutdown policy, the standby node works with the HMC to shut down the previously active (sick) node, and only then starts the workload on the standby node.
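Disk fencing relies on the shared disks supporting SCSI-3 persistent reservations. One way to check the reservation capability and current reservation state of a disk is the AIX devrsrv command; hdisk2 is a hypothetical disk name, and the output format varies by AIX level:
# devrsrv -c query -l hdisk2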
2.4 Resource Optimized High Availability fallovers by using enterprise pools
PowerHA has offered integrated support for dynamic LPAR (DLPAR) operations, including the use of capacity on demand (CoD) resources, since IBM HACMP 5.3. However, the type of CoD support was limited. PowerHA V7.2 extends that support to include Enterprise Pool CoD (EPCoD) and Elastic CoD resources. Using these types of resources makes the solution less expensive to acquire and less expensive to own.
PowerHA SystemMirror 7.2.1 has the following enhancements compared to PowerHA SystemMirror 7.2.0:
New ROHA tunable resource_allocation_order
You can use this to define the order in which hardware resources (CPU and memory) are allocated. Resources are released in reverse of the resource allocation order.
New ROHA tunable ALWAYS_START_RG
You can use this tunable to perform the fallover (start the RG) even if not enough resources (CPU or memory) are available.
Cross-HMC support
This supports the new Enterprise Pool capabilities.
PowerHA V7.2.0 has the following requirements:
PowerHA SystemMirror 7.2, Standard Edition or Enterprise Edition
One of the following AIX levels:
 – AIX 6.1 TL09 SP5
 – AIX 7.1 TL03 SP5
 – AIX 7.1 TL4
 – AIX 7.2 or later
HMC requirement
 – HMC V7.8 or later
 – HMC must have a minimum of 2 GB of memory
Hardware requirement for using Enterprise Pool CoD license
 – IBM POWER7+: 9117-MMD, 9179-MHD with FW780.10 or later
 – IBM POWER8®: 9119-MME, 9119-MHE with FW820 or later
Full details about using this integrated support can be found in Chapter 6, “Resource Optimized High Availability” on page 139.
2.5 Nondisruptive upgrades
PowerHA V7.2 enables nondisruptive cluster upgrades. It allows upgrades from PowerHA V7.1.3 to V7.2 without having to roll over the workload from one node to another as part of the migration. The key requirement is that the existing AIX/CAA levels must be either 6.1.9 or 7.1.3. More information about performing nondisruptive upgrades can be found in 5.2.5, “Nondisruptive upgrade from PowerHA V7.1.3” on page 120.
 
Tip: A demonstration of performing a nondisruptive upgrade can be found in this YouTube video.
2.6 Geographic Logical Volume Manager wizard
PowerHA 6.1 introduced the first two-site GLVM configuration wizard. However, it was limited to synchronous implementations and still required some manual steps. PowerHA V7.2 introduces an enhanced GLVM wizard that involves fewer steps and also adds support for asynchronous implementations. More details can be found in Chapter 7, “Geographic Logical Volume Manager configuration assistant” on page 241.
2.7 New option for starting PowerHA by using clmgr
Starting with PowerHA V7.2.1, you can use an additional management option to start the cluster. The new argument for the option is named delayed:
clmgr online cluster manage=delayed
 
Note: This new option was backported to PowerHA V7.1.3 and V7.2.0.
At the time of writing, the only way to obtain the new option is to open a PMR and ask for an interim fix for the defect 100862, or ask for an interim fix for APAR IV90262.
Since PowerHA V7.1, the “Start After” resource group dependency has been available.
2.8 Graphical user interface
PowerHA V7.2.1 contains a new GUI. Its design focuses on the following areas:
Quick and easy status revelation
Easier way to view events
Easier way to view logs
The new GUI has several features. The following list is just a brief overview. For a detailed description, see Chapter 9, “IBM PowerHA SystemMirror User Interface” on page 299.
Visual display of relationships among resources.
Systems with the highest severity problems are highly visible.
Visualizes the health status for each resource.
Formatted events are easy to scan.
Visually distinguishes critical, warning, and maintenance events.
Organized by day and time.
Can filter and search for specific types of events.
You can see the progression of events by using the timeline.
You can zoom in to see details or zoom out to see health over time.
You can search for an event in the event log.
If your system has internet access, you can open a browser to the PowerHA IBM Knowledge Center.
 