Introduction to IBM PowerHA SystemMirror for AIX
This chapter introduces IBM PowerHA SystemMirror for newcomers to the solution and serves as a refresher for users who have implemented and used PowerHA SystemMirror for many years.
1.1 What is PowerHA SystemMirror for AIX
PowerHA SystemMirror for AIX (also referred to as PowerHA) is the IBM Power Systems data center solution that helps protect critical business applications from both planned and unplanned outages. One of the major objectives of PowerHA is to keep business services continuously available by providing redundancy that masks individual component failures. PowerHA depends on Reliable Scalable Cluster Technology (RSCT) and Cluster Aware AIX (CAA).
RSCT is a set of low-level operating system components that allow the implementation of clustering technologies, such as IBM Spectrum™ Scale (formerly GPFS™). RSCT is distributed with AIX. On the current AIX release, AIX 7.2, RSCT is Version 3.2.1.0. After installing PowerHA and CAA file sets, the RSCT topology services subsystem is deactivated and all its functions are performed by CAA.
PowerHA Version 7.1 and later relies heavily on the CAA infrastructure that was introduced in AIX 6.1 TL6 and AIX 7.1. CAA provides communication interfaces and monitoring for PowerHA, and cluster-wide command execution through the CAA clcmd distributed command.
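As a brief sketch of cluster-wide execution, clcmd distributes an ordinary AIX command to every node in the CAA cluster and groups the output per node. The node name shown in the comment is a hypothetical example:

```shell
# Run a command on every cluster node through CAA's distributed command.
# Output is grouped under a per-node header, in a form similar to:
#   -------------------------------
#   NODE node1
#   -------------------------------
clcmd date

# The same mechanism works for any query, for example listing physical volumes:
clcmd lspv
```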
PowerHA Enterprise Edition also provides disaster recovery functions such as cross-site mirroring, IBM HyperSwap®, Geographical Logical Volume Mirroring, and many storage-based replication methods. These cross-site clustering methods support PowerHA functions between two geographic sites. For more information, see the IBM PowerHA SystemMirror 7.1.2 Enterprise Edition for AIX, SG24-8106.
For more information about features that are added in PowerHA V7.1.1 and later, see 1.3, “History and evolution” on page 6.
1.1.1 High availability
In today’s complex environments, providing continuous service for applications is a key component of a successful IT implementation. High availability is one of the components that contributes to providing continuous service for application clients by masking or eliminating both planned and unplanned system and application downtime. A high availability solution ensures that the failure of any component of the solution, whether hardware, software, or system management, does not cause the application and its data to become permanently unavailable to the user.
High availability solutions can help to eliminate single points of failure through appropriate design, planning, selection of hardware, configuration of software, control of applications, a carefully controlled environment, and change management discipline.
In short, you can define high availability as the process of ensuring that an application stays up and available for use by relying on duplicated or shared hardware resources that are managed by a specialized software component.
1.1.2 Cluster multiprocessing
In addition to high availability, PowerHA also provides the multiprocessing component. The multiprocessing capability comes from the fact that in a cluster there are multiple hardware and software resources that are managed by PowerHA to provide complex application functions and better resource utilization.
A short definition for cluster multiprocessing might be multiple applications running over several nodes with shared or concurrent access to the data.
Although desirable, the cluster multiprocessing component depends on the application capabilities and system implementation to efficiently use all resources that are available in a multi-node (cluster) environment. This solution must be implemented by starting with the cluster planning and design phase.
PowerHA is only one of the high availability technologies, and it builds on increasingly reliable operating systems, hot-swappable hardware, and increasingly resilient applications, by offering monitoring and automated response.
A high availability solution that is based on PowerHA provides automated failure detection, diagnosis, application recovery, and node reintegration. PowerHA can also provide excellent horizontal and vertical scalability by combining other advanced functions, such as dynamic logical partitioning (DLPAR) and Capacity on Demand (CoD).
1.2 Availability solutions: An overview
Many solutions can provide a wide range of availability options. Table 1-1 lists various types of availability solutions and their characteristics.
Table 1-1 Types of availability solutions

Solution                      Downtime   Data availability       Observations
Stand-alone                   Days       From last backup        Basic hardware and software
Enhanced stand-alone          Hours      Until last transaction  Double most hardware components
High availability clustering  Seconds    Until last transaction  Double hardware and additional software costs
Fault-tolerant                Zero       No loss of data         Specialized hardware and software, and expensive
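The downtime column in Table 1-1 can be made concrete by converting an availability target into allowable downtime per year. The following is an illustrative sketch only; the availability percentages are generic industry examples, not figures taken from the table:

```python
# Convert an availability percentage into maximum allowable downtime per year.
# Illustrative only; the targets below are generic examples.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Return the allowable downtime (minutes per year) for a given availability %."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct:>7}% availability -> "
          f"{downtime_minutes_per_year(pct):8.2f} minutes/year")
```

For example, a 99.999% ("five nines") target allows only about 5.3 minutes of downtime per year, which is why that territory typically requires fault-tolerant rather than clustered solutions.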
High availability solutions, in general, offer the following benefits:
Standard hardware and networking components (can be used with the existing hardware)
Works with nearly all applications
Works with a wide range of disks and network types
Excellent availability at a reasonable cost
The highly available solution for IBM Power Systems offers distinct benefits:
Proven solution with 27 years of product development
Using off-the-shelf hardware components
Proven commitment for supporting your customers
IP version 6 (IPv6) support for both internal and external cluster communication
Smart Assist technology, which enables high availability support for many prominent applications
Flexibility (virtually any application running on a stand-alone AIX system can be protected with PowerHA)
When you plan to implement a PowerHA solution, consider the following aspects:
Thorough high availability (HA) design and detailed planning from end to end
Elimination of single points of failure
Selection of appropriate hardware
Correct implementation (do not take shortcuts)
Disciplined system administration practices and change control
Documented operational procedures
Comprehensive test plan and thorough testing
Figure 1-1 shows a typical PowerHA environment with both IP and non-IP heartbeat networks. Non-IP heartbeat uses the cluster repository disk and an optional storage area network (SAN).
Figure 1-1 PowerHA cluster example
1.2.1 Downtime
Downtime is the period when an application is not available to serve its clients. Downtime can be classified in two categories: planned and unplanned.
Planned
 – Hardware upgrades
 – Hardware or software repair or replacement
 – Software updates or upgrades
 – Backups (offline backups)
 – Testing (Periodic testing is required for good cluster maintenance.)
 – Development
Unplanned
 – Administrator errors
 – Application failures
 – Hardware failures
 – Operating system errors
 – Environmental disasters
The role of PowerHA is to manage the application recovery after the outage. PowerHA provides monitoring and automatic recovery of the resources on which your application depends.
1.2.2 Single point of failure
A single point of failure (SPOF) is any individual cluster component whose failure renders the application unavailable to users.
Good design can remove single points of failure in the cluster: Nodes, storage, and networks. PowerHA manages these components and also the resources that are required by the application (including the application start/stop scripts).
Ultimately, the goal of any IT solution in a critical environment is to provide continuous application availability and data protection. High availability is one building block in achieving the continuous operation goal. High availability depends on the availability of the hardware, the software (the operating system and its components), the application, and the network components.
To avoid single points of failure, use the following items:
Redundant servers
Redundant network paths
Redundant storage (data) paths
Redundant (mirrored and RAID) storage
Monitoring of components
Failure detection and diagnosis
Automated application fallover
Automated resource reintegration
A good design avoids single points of failure, and PowerHA can manage application availability through individual component failures. Table 1-2 lists the cluster objects whose failure can result in loss of application availability. Each cluster object can be a physical or logical component.
Table 1-2 Single points of failure (each entry lists the cluster object and how its SPOF is eliminated)

Node (servers): Multiple nodes.
Power/power supply: Multiple circuits, power supplies, or uninterruptible power supply (UPS).
Network: Multiple networks connected to each node, and redundant network paths with independent hardware between each node and the clients.
Network adapters: Redundant adapters, and other HA features such as EtherChannel or Shared Ethernet Adapters (SEAs) by way of the Virtual I/O Server (VIOS).
I/O adapters: Redundant I/O adapters and multipathing software.
Controllers: Redundant controllers.
Storage: Redundant hardware, enclosures, disk mirroring or RAID technology, or redundant data paths.
Application: Application monitoring, and backup nodes that are configured to acquire the application engine and data.
Sites: Use of more than one site for disaster recovery.
Resource groups: A resource group (RG) is a container of the resources that are required to run the application. The SPOF is removed by moving the RG around the cluster to avoid failed components.
PowerHA also optimizes availability by allowing for dynamic reconfiguration of running clusters. Maintenance tasks such as adding or removing nodes can be performed without stopping and restarting the cluster.
In addition, by using Cluster Single Point of Control (C-SPOC), other management tasks such as modifying storage and managing users can be performed without interrupting access to the applications that are running in the cluster. C-SPOC also ensures that changes that are made on one node are replicated across the cluster in a consistent manner.
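As an illustrative sketch, C-SPOC tasks are typically reached through the SMIT fast path, and many of the same cluster-wide operations are scriptable with the clmgr command. The volume group, node, and disk names below are hypothetical examples:

```shell
# Enter the C-SPOC system management menus (SMIT fast path):
smitty cl_admin

# Many C-SPOC-style operations can also be scripted with clmgr.
# For example, creating a shared volume group across two nodes
# (names and disk are hypothetical; verify syntax on your release):
clmgr add volume_group datavg NODES=node1,node2 PHYSICAL_VOLUMES=hdisk2
```

In both cases, the change is propagated to all cluster nodes in a consistent manner, which is the point of C-SPOC.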
1.3 History and evolution
IBM High Availability Cluster Multi-Processing (HACMP) development started in 1990 to provide high availability solutions for applications running on IBM RS/6000® servers. We do not provide information about the early releases, which are no longer supported or were not in use at the time this publication was written. Instead, we provide highlights about the most recent versions.
Originally designed as a stand-alone product (known as HACMP classic), HACMP adopted the IBM high availability infrastructure known as RSCT when it became available, becoming HACMP Enhanced Scalability (HACMP/ES) because it provides performance and functional advantages over the classic version. Starting with HACMP V5.1, there are no more classic versions. The HACMP terminology was later replaced with PowerHA in Version 5.5 and then PowerHA SystemMirror in Version 6.1.
Starting with PowerHA V7.1, the CAA feature of the operating system is used to configure, verify, and monitor the cluster services. This major change improves the reliability of PowerHA because the cluster service functions now run in kernel space rather than user space. CAA was introduced in AIX 6.1 TL6. At the time of writing, the current release is PowerHA V7.2.1.
1.3.1 PowerHA SystemMirror Version 7.1.1
Released in December 2011, PowerHA V7.1.1 introduced improvements to PowerHA in terms of administration, security, and simplification of management tasks. The following list summarizes the improvements in PowerHA V7.1.1:
Federated security allows cluster-wide single point of control, such as:
 – Encrypted file system (EFS) support
 – Role-based access control (RBAC) support
 – Authentication by using LDAP methods
Logical volume manager (LVM) and C-SPOC enhancements:
 – EFS management by C-SPOC
 – Support for mirror pools
 – Disk renaming inside the cluster
 – Support for EMC, Hitachi, and HP disk subsystems multipathing LUN as a clustered repository disk
 – Capability to display a disk Universally Unique Identifier (UUID)
 – File system mounting feature (JFS2 Mount Guard), which prevents simultaneous mounting of the same file system by two nodes, which can cause data corruption
Repository resiliency
Dynamic automatic reconfiguration (DARE) progress indicator
Application management improvements, such as a new application startup option
When you add an application controller, you can choose the application startup mode. Background startup mode, the default, allows cluster activation to move forward while the application start script runs in the background. With foreground startup mode, cluster activation is sequential: cluster event processing waits for the application start script to complete. If the script ends with a failure (nonzero return code), the cluster activation is also considered to have failed.
New network features, such as defining a network as private, use of netmon.cf file, and more network tunables
 
Note: Additional details and examples of implementing these features are found in IBM PowerHA SystemMirror Standard Edition 7.1.1 for AIX Update, SG24-8030.
1.3.2 PowerHA SystemMirror Version 7.1.2
Released in October 2012, PowerHA V7.1.2 continued to add features and functions:
Two new cluster types (stretched and linked clusters):
 – Stretched cluster refers to a cluster that has its sites defined in the same geographic location and that uses a shared repository disk. Sites at extended distances with only IP connectivity cannot use this cluster type.
 – Linked cluster refers to a cluster with only IP connectivity across sites and is usually for PowerHA Enterprise Edition.
IPv6 support reintroduced
Backup repository disk
Site support that is reintroduced with Standard Edition
PowerHA Enterprise Edition reintroduced:
 – New HyperSwap support added for DS88XX
 – All storage replication options that were supported in PowerHA 6.1 continue to be supported:
 • IBM DS8000 Metro Mirror and Global Mirror
 • SAN Volume Controller Metro Mirror and Global Mirror
 • IBM Storwize v7000 Metro Mirror and Global Mirror
 • EMC SRDF synchronous and asynchronous replication
 • Hitachi TrueCopy and HUR replication
 • HP Continuous Access synchronous and asynchronous replication
 – Geographic Logical Volume Manager (GLVM)
Note: Additional details and examples of implementing some of these features are found in IBM PowerHA SystemMirror 7.1.2 Enterprise Edition for AIX, SG24-8106.
1.3.3 PowerHA SystemMirror Version 7.1.3
Released in October 2013, PowerHA V7.1.3 continued the development of PowerHA SystemMirror by adding further improvements in management, configuration simplification, automation, and performance areas. The following list summarizes the improvements in PowerHA V7.1.3:
Unicast heartbeat
Dynamic host name change
Cluster split and merge handling policies
clmgr command enhancements:
 – Embedded hyphen and leading digit support in node labels
 – Native HTML report
 – Cluster copying through snapshots
 – Syntactical built-in help
 – Split and merge support
CAA enhancements:
 – Scalability up to 32 nodes
 – Support for unicast and multicast
 – Dynamic host name or IP address support
HyperSwap enhancements:
 – Active-active sites
 – One node HyperSwap
 – Auto resynchronization of mirroring
 – Node level unmanaged mode support
 – Enhanced repository disk swap management
PowerHA plug-in enhancements for IBM Systems Director:
 – Restore snapshot wizard
 – Cluster simulator
 – Cluster split/merge support
Smart Assist for SAP enhancements
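Two of the clmgr enhancements listed above, the syntactical built-in help and the native HTML report, might be exercised as follows. This is a hedged sketch; the report file path is a hypothetical example, and option names can vary by release:

```shell
# Built-in syntactical help for a specific operation:
clmgr add node -h

# Generate the native HTML cluster report
# (the output path is a hypothetical example):
clmgr view report cluster TYPE=html FILE=/tmp/cluster_report.html
```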
 
Note: Additional details and examples of implementing some of these features are found in IBM PowerHA SystemMirror for AIX Cookbook, SG24-7739.
1.3.4 PowerHA SystemMirror Version 7.2.0
Released in December 2015, PowerHA V7.2 continued the development of PowerHA SystemMirror by adding further improvements in management, configuration simplification, automation, and performance areas. The following list summarizes the improvements in PowerHA V7.2:
Resiliency enhancements:
 – Integrated support for AIX Live Kernel Update (LKU)
 – Automatic Repository Replacement (ARR)
 – Verification enhancements
 – Exploitation of LVM rootvg failure monitoring
 – Live Partition Mobility (LPM) automation
CAA enhancements:
 – Network Failure Detection Tunable per interface
 – Built-in netmon logic
 – Traffic stimulation for better interface failure detection
Enhanced split-brain handling:
 – Quarantine protection against “sick but not dead” nodes
 – NFS Tie Breaker support for split and merge policies
Resource Optimized failovers by way of the Enterprise Pools (Resource Optimized High Availability (ROHA))
Non-disruptive upgrades
The Systems Director plug-in was discontinued in PowerHA V7.2.0.
1.3.5 PowerHA SystemMirror Version 7.2.1
Released in December 2016, PowerHA V7.2.1 added the following additional improvements:
Verification enhancements, some that are carried over from Version 7.2.0:
 – The disk reserve policy value must not be single_path.
 – Checks the consistency of /etc/filesystems (for example, that mount points exist).
 – LVM PVID checks across LVM and ODM on various nodes.
 – Uses AIX Runtime Expert checks for LVM and NFS.
 – Checks for network errors. If they cross a threshold (5% of packet count receive and transmit), warn the administrator about the network issue.
 – GLVM buffer size checks.
 – Security configuration (password rules).
 – Kernel parameters: Tunables that are related to AIX network, VMM, and security.
Expanded support of resource optimized failovers by way of the enterprise pools (ROHA).
Browser-based GUI, which is called the SystemMirror User Interface (SMUI). The initial release is for monitoring and troubleshooting, not for configuring clusters.
All split/merge policies are now available to both standard and stretched clusters when using AIX 7.2.1.
1.4 High availability terminology and concepts
To understand the functions of PowerHA and to use it effectively, you must understand several important terms and concepts.
1.4.1 Terminology
The terminology that is used to describe PowerHA configuration and operation continues to evolve. The following terms are used throughout this book:
Node An IBM Power Systems server (or LPAR) that runs AIX and PowerHA and is defined as part of a cluster. Each node has a collection of resources (disks, file systems, IP addresses, and applications) that can be transferred to another node in the cluster if the node or a component fails.
Cluster A loosely coupled collection of independent systems (nodes) or logical partitions (LPARs) that are organized into a network for the purpose of sharing resources and communicating with each other.
PowerHA defines relationships among cooperating systems where peer cluster nodes provide the services that are offered by a cluster node if that node cannot do so. These individual nodes are responsible for maintaining the functions of one or more applications in case of a failure of any cluster component.
Client A client is a system that can access the application running on the cluster nodes over a local area network (LAN). Clients run a client application that connects to the server (node) where the application runs.
Topology The basic cluster components: nodes, networks, communication interfaces, and communication adapters.
Resources Logical components or entities that are being made highly available (for example, file systems, raw devices, service IP labels, and applications) by being moved from one node to another. All resources that together form a highly available application or service are grouped in RGs.
PowerHA keeps the RG highly available as a single entity that can be moved from node to node in the event of a component or node failure. RGs can be available from a single node or in the case of concurrent applications, available simultaneously from multiple nodes. A cluster can host more than one RG, thus allowing for efficient use of the cluster nodes.
Service IP label A label that maps to a service IP address and is used for communications between clients and the node. A service IP label is part of an RG, which means that PowerHA can monitor it and keep it highly available.
IP address takeover (IPAT) The process where an IP address is moved from one adapter to another adapter on the same logical network. This adapter can be on the same node or another node in the cluster. If aliasing is used as the method of assigning addresses to adapters, then more than one address can be on a single adapter.
Resource takeover This is the operation of transferring resources between nodes inside the cluster. If one component or node fails because of a hardware or operating system problem, its RGs are moved to another node.
Fallover This represents the movement of an RG from one active node to another node (backup node) in response to a failure on that active node.
Fallback This represents the movement of an RG back from the backup node to the previous node when it becomes available. This movement is typically in response to the reintegration of the previously failed node.
Heartbeat packet A packet that is sent between communication interfaces in the cluster, and is used by the various cluster daemons to monitor the state of the cluster components (nodes, networks, and adapters).
RSCT daemons These consist of two types of processes: topology and group services. PowerHA uses group services, but depends on CAA for topology services. The cluster manager receives event information that is generated by these daemons and takes corresponding (response) actions in case of any failure.
Smart assists A set of high availability agents, called smart assists, are bundled with the PowerHA SystemMirror Standard Edition to help discover and define high availability policies for most common middleware products.
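The IPAT-via-aliasing behavior described above can be illustrated with AIX IP aliasing, which is the mechanism PowerHA itself drives when it moves a service address. This is a hedged sketch; the addresses and interface name are hypothetical examples, and PowerHA performs these steps for you:

```shell
# Add a service IP as an alias on en0; the interface's base address
# stays in place alongside the alias (addresses are hypothetical):
ifconfig en0 alias 192.0.2.50 netmask 255.255.255.0 up

# Remove the alias, as happens when the resource group moves
# to another node:
ifconfig en0 delete 192.0.2.50
```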
1.5 Fault tolerance versus high availability
Based on the response time and response action to system detected failures, the clusters and systems can belong to one of the following classifications:
Fault-tolerant systems
High availability systems
1.5.1 Fault-tolerant systems
Fault-tolerant systems are designed to operate virtually without interruption, regardless of the failure that might occur (except perhaps for a complete site shutdown because of a natural disaster). In such systems, all components, both hardware and software, are at least duplicated.
All components, CPUs, memory, and disks have a special design and provide continuous service, even if one subcomponent fails. Only special software solutions can run on fault-tolerant hardware.
Such systems are expensive and specialized. Implementing a fault-tolerant solution requires much effort and a high degree of customization for all system components.
For environments where no downtime is acceptable (life critical systems), fault-tolerant equipment and solutions are required.
1.5.2 High availability systems
The systems that are configured for high availability are a combination of hardware and software components that are configured to work together to ensure automated recovery in case of failure with minimal acceptable downtime.
In such systems, the software that is involved detects problems in the environment and manages application survivability by restarting the application on the same machine or on another available machine (taking over the identity of the original node).
Therefore, eliminating all single points of failure (SPOF) in the environment is important. For example, if the machine has only one network interface (connection), provide a second network interface (connection) in the same node to take over in case the primary interface providing the service fails.
Another important issue is to protect the data by mirroring and placing it on shared disk areas that are accessible from any machine in the cluster.
The PowerHA software provides the framework and a set of tools for integrating applications in a highly available system. Applications to be integrated in a PowerHA cluster can require a fair amount of customization, possibly both at the application level and at the PowerHA and AIX platform level. PowerHA is a flexible platform that allows integration of generic applications running on the AIX platform, providing for highly available systems at a reasonable cost.
PowerHA is not a fault-tolerant solution and should not be implemented as such.
1.6 Additional PowerHA resources
Here is a list of additional PowerHA resources and descriptions of each one:
Base publications
All of the following PowerHA v7 publications are available at IBM Knowledge Center:
 – Administering PowerHA SystemMirror
 – Developing Smart Assist applications for PowerHA SystemMirror
 – Geographic Logical Volume Manager for PowerHA SystemMirror Enterprise Edition
 – Installing PowerHA SystemMirror
 – Planning PowerHA SystemMirror
 – PowerHA SystemMirror concepts
 – PowerHA SystemMirror for IBM Systems Director
 – Programming client applications for PowerHA SystemMirror
 – Quick reference: clmgr command
 – Smart Assists for PowerHA SystemMirror
 – Storage-based high availability and disaster recovery for PowerHA SystemMirror Enterprise Edition
 – Troubleshooting PowerHA SystemMirror
Videos
Shawn Bodily has several PowerHA-related videos on his YouTube channel.
IBM Redbooks publications
The main focus of each IBM PowerHA Redbooks publication differs a bit, but usually their main focus is covering what is new in a particular release. They generally have more details and advanced tips than the base publications.
Each new publication is rarely a complete replacement for the last. The only exception to this is IBM PowerHA SystemMirror for AIX Cookbook, SG24-7739. It was updated to Version 7.1.3 after replacing two previous cookbooks. It is probably the most comprehensive of all the current IBM Redbooks publications with regard to PowerHA Standard Edition specifically. Although there is some overlap across them, with multiple versions supported, it is important to reference the version of the book that is relevant to the version that you are using.
Figure 1-2 shows a list of relevant PowerHA IBM Redbooks publications. Although it still includes PowerHA 6.1 Enterprise Edition, which is no longer supported, that exact book is still the best reference for configuring EMC SRDF and Hitachi TrueCopy.
Figure 1-2 PowerHA IBM Redbooks publications reference
White papers
 