Introduction to IBM PowerHA SystemMirror for AIX
This chapter introduces IBM PowerHA SystemMirror for newcomers to the solution and serves as a refresher for users who have implemented and used PowerHA SystemMirror for many years.
1.1 What is PowerHA SystemMirror for AIX
PowerHA SystemMirror for AIX (also referred to as PowerHA) is the IBM Power Systems data center solution that helps protect critical business applications from both planned and unplanned outages. One of the major objectives of PowerHA is to keep business services continuously available by providing redundancy that masks individual component failures. PowerHA depends on Reliable Scalable Cluster Technology (RSCT) and Cluster Aware AIX (CAA).
RSCT is a set of low-level operating system components that allow the implementation of clustering technologies, such as IBM Spectrum™ Scale (formerly GPFS™). RSCT is distributed with AIX. On the current AIX release, AIX 7.2, RSCT is Version 3.2.1.0. After installing PowerHA and CAA file sets, the RSCT topology services subsystem is deactivated and all its functions are performed by CAA.
PowerHA Version 7.1 and later relies heavily on the CAA infrastructure that was introduced in AIX 6.1 TL6 and AIX 7.1. CAA provides communication interfaces and monitoring for PowerHA, and cluster-wide command execution through the CAA clcmd distributed command.
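As a brief sketch of cluster-wide execution, clcmd distributes an ordinary AIX command to every node in the CAA cluster and groups the output per node. The node name shown in the comment is a hypothetical example:

```shell
# Run a command on every cluster node through CAA's distributed command.
# Output is grouped under a per-node header, in a form similar to:
#   -------------------------------
#   NODE node1
#   -------------------------------
clcmd date

# The same mechanism works for any query, for example listing physical volumes:
clcmd lspv
```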
PowerHA Enterprise Edition also provides disaster recovery functions such as cross-site mirroring, IBM HyperSwap®, Geographical Logical Volume Mirroring, and many storage-based replication methods. These cross-site clustering methods support PowerHA functions between two geographic sites. For more information, see the IBM PowerHA SystemMirror 7.1.2 Enterprise Edition for AIX, SG24-8106.
For more information about features that are added in PowerHA V7.1.1 and later, see 1.3, “History and evolution” on page 6.
1.1.1 High availability
In today’s complex environments, providing continuous service for applications is a key component of a successful IT implementation. High availability is one of the components that contributes to providing continuous service for application clients by masking or eliminating both planned and unplanned system and application downtime. A high availability solution ensures that the failure of any component of the solution, whether hardware, software, or system management, does not cause the application and its data to become permanently unavailable to the user.
High availability solutions can help to eliminate single points of failure through appropriate design, planning, selection of hardware, configuration of software, control of applications, a carefully controlled environment, and change management discipline.
In short, you can define high availability as the process of ensuring that an application stays up and available for use by relying on duplicated or shared hardware resources that are managed by a specialized software component.
1.1.2 Cluster multiprocessing
In addition to high availability, PowerHA also provides the multiprocessing component. The multiprocessing capability comes from the fact that in a cluster there are multiple hardware and software resources that are managed by PowerHA to provide complex application functions and better resource utilization.
A short definition for cluster multiprocessing might be multiple applications running over several nodes with shared or concurrent access to the data.
Although desirable, the cluster multiprocessing component depends on the application capabilities and system implementation to efficiently use all resources that are available in a multi-node (cluster) environment. This solution must be implemented by starting with the cluster planning and design phase.
PowerHA is only one of the high availability technologies, and it builds on increasingly reliable operating systems, hot-swappable hardware, and increasingly resilient applications, by offering monitoring and automated response.
A high availability solution that is based on PowerHA provides automated failure detection, diagnosis, application recovery, and node reintegration. PowerHA can also provide excellent horizontal and vertical scalability by combining other advanced functions, such as dynamic logical partitioning (DLPAR) and Capacity on Demand (CoD).
1.2 Availability solutions: An overview
Many solutions can provide a wide range of availability options. Table 1-1 lists various types of availability solutions and their characteristics.
Table 1-1 Types of availability solutions

Solution                      Downtime   Data availability       Observations
Stand-alone                   Days       From last backup        Basic hardware and software
Enhanced stand-alone          Hours      Until last transaction  Double most hardware components
High availability clustering  Seconds    Until last transaction  Double hardware and additional software costs
Fault-tolerant                Zero       No loss of data         Specialized hardware and software, and expensive
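The downtime column in Table 1-1 can be made concrete by converting an availability target into allowable downtime per year. The following is an illustrative sketch only; the availability percentages are generic industry examples, not figures taken from the table:

```python
# Convert an availability percentage into maximum allowable downtime per year.
# Illustrative only; the targets below are generic examples.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Return the allowable downtime (minutes per year) for a given availability %."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct:>7}% availability -> "
          f"{downtime_minutes_per_year(pct):8.2f} minutes/year")
```

For example, a 99.999% ("five nines") target allows only about 5.3 minutes of downtime per year, which is why that territory typically requires fault-tolerant rather than clustered solutions.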
High availability solutions, in general, offer the following benefits:
Standard hardware and networking components (can be used with the existing hardware)
Works with nearly all applications
Works with a wide range of disks and network types
Excellent availability at a reasonable cost
The highly available solution for IBM Power Systems offers distinct benefits:
Proven solution with 27 years of product development
Using off-the-shelf hardware components
Proven commitment for supporting your customers
IP version 6 (IPv6) support for both internal and external cluster communication
Smart Assist technology, which enables high availability support for many prominent applications
Flexibility (virtually any application running on a stand-alone AIX system can be protected with PowerHA)
When you plan to implement a PowerHA solution, consider the following aspects:
Thorough high availability (HA) design and detailed planning from end to end
Elimination of single points of failure
Selection of appropriate hardware
Correct implementation (do not take shortcuts)
Disciplined system administration practices and change control
Documented operational procedures
Comprehensive test plan and thorough testing
Figure 1-1 shows a typical PowerHA environment with both IP and non-IP heartbeat networks. Non-IP heartbeat uses the cluster repository disk and an optional storage area network (SAN).
Figure 1-1 PowerHA cluster example
1.2.1 Downtime
Downtime is the period when an application is not available to serve its clients. Downtime can be classified in two categories: planned and unplanned.
Planned
 – Hardware upgrades
 – Hardware or software repair or replacement
 – Software updates or upgrades
 – Backups (offline backups)
 – Testing (Periodic testing is required for good cluster maintenance.)
 – Development
Unplanned
 – Administrator errors
 – Application failures
 – Hardware failures
 – Operating system errors
 – Environmental disasters
The role of PowerHA is to manage the application recovery after the outage. PowerHA provides monitoring and automatic recovery of the resources on which your application depends.
1.2.2 Single point of failure
A single point of failure (SPOF) is any individual cluster component whose failure renders the application unavailable to users.
Good design can remove single points of failure in the cluster: Nodes, storage, and networks. PowerHA manages these components and also the resources that are required by the application (including the application start/stop scripts).
Ultimately, the goal of any IT solution in a critical environment is to provide continuous application availability and data protection. High availability is one building block in achieving the continuous operation goal. High availability depends on the availability of the hardware, the software (the operating system and its components), the application, and the network components.
To avoid single points of failure, use the following items:
Redundant servers
Redundant network paths
Redundant storage (data) paths
Redundant (mirrored and RAID) storage
Monitoring of components
Failure detection and diagnosis
Automated application fallover
Automated resource reintegration
A good design avoids single points of failure, and PowerHA can manage application availability through individual component failures. Table 1-2 lists the cluster objects whose failure can result in loss of application availability. Each cluster object can be a physical or logical component.
Table 1-2 Single points of failure (each entry lists the cluster object and how its SPOF is eliminated)

Node (servers): Multiple nodes.
Power/power supply: Multiple circuits, power supplies, or uninterruptible power supply (UPS).
Network: Multiple networks connected to each node, and redundant network paths with independent hardware between each node and the clients.
Network adapters: Redundant adapters, and other HA features such as EtherChannel or Shared Ethernet Adapters (SEAs) by way of the Virtual I/O Server (VIOS).
I/O adapters: Redundant I/O adapters and multipathing software.
Controllers: Redundant controllers.
Storage: Redundant hardware, enclosures, disk mirroring or RAID technology, or redundant data paths.
Application: Application monitoring, and backup nodes that are configured to acquire the application engine and data.
Sites: Use of more than one site for disaster recovery.
Resource groups: A resource group (RG) is a container of the resources that are required to run the application. The SPOF is removed by moving the RG around the cluster to avoid failed components.
PowerHA also optimizes availability by allowing for dynamic reconfiguration of running clusters. Maintenance tasks such as adding or removing nodes can be performed without stopping and restarting the cluster.
In addition, by using Cluster Single Point of Control (C-SPOC), other management tasks such as modifying storage and managing users can be performed without interrupting access to the applications that are running in the cluster. C-SPOC also ensures that changes that are made on one node are replicated across the cluster in a consistent manner.
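As an illustrative sketch, C-SPOC tasks are typically reached through the SMIT fast path, and many of the same cluster-wide operations are scriptable with the clmgr command. The volume group, node, and disk names below are hypothetical examples:

```shell
# Enter the C-SPOC system management menus (SMIT fast path):
smitty cl_admin

# Many C-SPOC-style operations can also be scripted with clmgr.
# For example, creating a shared volume group across two nodes
# (names and disk are hypothetical; verify syntax on your release):
clmgr add volume_group datavg NODES=node1,node2 PHYSICAL_VOLUMES=hdisk2
```

In both cases, the change is propagated to all cluster nodes in a consistent manner, which is the point of C-SPOC.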
1.3 History and evolution
IBM High Availability Cluster Multi-Processing (HACMP) development started in 1990 to provide high availability solutions for applications running on IBM RS/6000® servers. We do not provide information about the early releases, which are no longer supported or were not in use at the time this publication was written. Instead, we provide highlights about the most recent versions.
Originally designed as a stand-alone product (known as HACMP classic), HACMP adopted the IBM high availability infrastructure known as RSCT when it became available, becoming HACMP Enhanced Scalability (HACMP/ES) because it provides performance and functional advantages over the classic version. Starting with HACMP V5.1, there are no more classic versions. The HACMP terminology was later replaced with PowerHA in Version 5.5 and then PowerHA SystemMirror in Version 6.1.
Starting with PowerHA V7.1, the CAA feature of the operating system is used to configure, verify, and monitor the cluster services. This major change improves the reliability of PowerHA because the cluster service functions now run in kernel space rather than user space. CAA was introduced in AIX 6.1 TL6. At the time of writing, the current release is PowerHA V7.2.1.
1.3.1 PowerHA SystemMirror Version 7.1.1
Released in December 2011, PowerHA V7.1.1 introduced improvements to PowerHA in terms of administration, security, and simplification of management tasks. The following list summarizes the improvements in PowerHA V7.1.1:
Federated security allows cluster-wide single point of control, such as:
 – Encrypted file system (EFS) support
 – Role-based access control (RBAC) support
 – Authentication by using LDAP methods
Logical volume manager (LVM) and C-SPOC enhancements:
 – EFS management by C-SPOC
 – Support for mirror pools
 – Disk renaming inside the cluster
 – Support for EMC, Hitachi, and HP disk subsystems multipathing LUN as a clustered repository disk
 – Capability to display a disk Universally Unique Identifier (UUID)
 – File system mounting feature (JFS2 Mount Guard), which prevents simultaneous mounting of the same file system by two nodes, which can cause data corruption
Repository resiliency
Dynamic automatic reconfiguration (DARE) progress indicator
Application management improvements, such as a new application startup option
When you add an application controller, you can choose the application startup mode. Background startup mode, the default, allows cluster activation to move forward while the application start script runs in the background. With foreground startup mode, cluster activation is sequential: cluster event processing waits for the application start script to complete. If the script ends with a failure (nonzero return code), the cluster activation is also considered to have failed.
New network features, such as defining a network as private, use of netmon.cf file, and more network tunables
 
Note: Additional details and examples of implementing these features are found in IBM PowerHA SystemMirror Standard Edition 7.1.1 for AIX Update, SG24-8030.
1.3.2 PowerHA SystemMirror Version 7.1.2
Released in October 2012, PowerHA V7.1.2 continued to add features and functions:
Two new cluster types (stretched and linked clusters):
 – Stretched cluster refers to a cluster that has its sites defined in the same geographic location and that uses a shared repository disk. Sites at extended distances with only IP connectivity cannot use this cluster type.
 – Linked cluster refers to a cluster with only IP connectivity across sites and is usually for PowerHA Enterprise Edition.
IPv6 support reintroduced
Backup repository disk
Site support that is reintroduced with Standard Edition
PowerHA Enterprise Edition reintroduced:
 – New HyperSwap support added for DS88XX
 – All storage replication options that were supported in PowerHA 6.1 continue to be supported:
 • IBM DS8000 Metro Mirror and Global Mirror
 • SAN Volume Controller Metro Mirror and Global Mirror
 • IBM Storwize v7000 Metro Mirror and Global Mirror
 • EMC SRDF synchronous and asynchronous replication
 • Hitachi TrueCopy and HUR replication
 • HP Continuous Access synchronous and asynchronous replication
 – Geographic Logical Volume Manager (GLVM)
Note: Additional details and examples of implementing some of these features are found in IBM PowerHA SystemMirror 7.1.2 Enterprise Edition for AIX, SG24-8106.
1.3.3 PowerHA SystemMirror Version 7.1.3
Released in October 2013, PowerHA V7.1.3 continued the development of PowerHA SystemMirror by adding further improvements in management, configuration simplification, automation, and performance areas. The following list summarizes the improvements in PowerHA V7.1.3:
Unicast heartbeat
Dynamic host name change
Cluster split and merge handling policies
clmgr command enhancements:
 – Embedded hyphen and leading digit support in node labels
 – Native HTML report
 – Cluster copying through snapshots
 – Syntactical built-in help
 – Split and merge support
CAA enhancements:
 – Scalability up to 32 nodes
 – Support for unicast and multicast
 – Dynamic host name or IP address support
HyperSwap enhancements:
 – Active-active sites
 – One node HyperSwap
 – Auto resynchronization of mirroring
 – Node level unmanaged mode support
 – Enhanced repository disk swap management
PowerHA plug-in enhancements for IBM Systems Director:
 – Restore snapshot wizard
 – Cluster simulator
 – Cluster split/merge support
Smart Assist for SAP enhancements
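Two of the clmgr enhancements listed above, the syntactical built-in help and the native HTML report, might be exercised as follows. This is a hedged sketch; the report file path is a hypothetical example, and option names can vary by release:

```shell
# Built-in syntactical help for a specific operation:
clmgr add node -h

# Generate the native HTML cluster report
# (the output path is a hypothetical example):
clmgr view report cluster TYPE=html FILE=/tmp/cluster_report.html
```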
 
Note: Additional details and examples of implementing some of these features are found in IBM PowerHA SystemMirror for AIX Cookbook, SG24-7739.
1.3.4 PowerHA SystemMirror Version 7.2.0
Released in December 2015, PowerHA V7.2 continued the development of PowerHA SystemMirror by adding further improvements in management, configuration simplification, automation, and performance areas. The following list summarizes the improvements in PowerHA V7.2:
Resiliency enhancements:
 – Integrated support for AIX Live Kernel Update (LKU)
 – Automatic Repository Replacement (ARR)
 – Verification enhancements
 – Exploitation of LVM rootvg failure monitoring
 – Live Partition Mobility (LPM) automation
CAA enhancements:
 – Network Failure Detection Tunable per interface
 – Built-in netmon logic
 – Traffic stimulation for better interface failure detection
Enhanced split-brain handling:
 – Quarantine protection against “sick but not dead” nodes
 – NFS Tie Breaker support for split and merge policies
Resource Optimized failovers by way of the Enterprise Pools (Resource Optimized High Availability (ROHA))
Non-disruptive upgrades
The Systems Director plug-in was discontinued in PowerHA V7.2.0.
1.3.5 PowerHA SystemMirror Version 7.2.1
Released in December 2016, PowerHA V7.2.1 added the following additional improvements:
Verification enhancements, some that are carried over from Version 7.2.0:
 – The disk reserve policy value must not be single_path.
 – Checks the consistency of /etc/filesystems (for example, that mount points exist).
 – LVM PVID checks across LVM and ODM on various nodes.
 – Uses AIX Runtime Expert checks for LVM and NFS.
 – Checks for network errors. If they cross a threshold (5% of packet count receive and transmit), warn the administrator about the network issue.
 – GLVM buffer size checks.
 – Security configuration (password rules).
 – Kernel parameters: Tunables that are related to AIX network, VMM, and security.
Expanded support of resource optimized failovers by way of the enterprise pools (ROHA).
Browser-based GUI, which is called the SystemMirror User Interface (SMUI). The initial release is for monitoring and troubleshooting, not for configuring clusters.
All split/merge policies are now available to both standard and stretched clusters when using AIX 7.2.1.
1.4 High availability terminology and concepts
To understand the functions of PowerHA and to use it effectively, you must understand several important terms and concepts.
1.4.1 Terminology
The terminology that is used to describe PowerHA configuration and operation continues to evolve. The following terms are used throughout this book:
Node An IBM Power Systems server (or LPAR) that runs AIX and PowerHA and is defined as part of a cluster. Each node has a collection of resources (disks, file systems, IP addresses, and applications) that can be transferred to another node in the cluster if the node or a component fails.
Cluster A loosely coupled collection of independent systems (nodes) or logical partitions (LPARs) that are organized into a network for the purpose of sharing resources and communicating with each other.
PowerHA defines relationships among cooperating systems where peer cluster nodes provide the services that are offered by a cluster node if that node cannot do so. These individual nodes are responsible for maintaining the functions of one or more applications in case of a failure of any cluster component.
Client A client is a system that can access the application running on the cluster nodes over a local area network (LAN). Clients run a client application that connects to the server (node) where the application runs.
Topology The basic cluster components: nodes, networks, communication interfaces, and communication adapters.
Resources Logical components or entities that are being made highly available (for example, file systems, raw devices, service IP labels, and applications) by being moved from one node to another. All resources that together form a highly available application or service are grouped in RGs.
PowerHA keeps the RG highly available as a single entity that can be moved from node to node in the event of a component or node failure. RGs can be available from a single node or in the case of concurrent applications, available simultaneously from multiple nodes. A cluster can host more than one RG, thus allowing for efficient use of the cluster nodes.
Service IP label A label that maps to a service IP address and is used for communications between clients and the node. A service IP label is part of an RG, which means that PowerHA can monitor it and keep it highly available.
IP address takeover (IPAT) The process where an IP address is moved from one adapter to another adapter on the same logical network. This adapter can be on the same node or another node in the cluster. If aliasing is used as the method of assigning addresses to adapters, then more than one address can be on a single adapter.
Resource takeover This is the operation of transferring resources between nodes inside the cluster. If one component or node fails because of a hardware or operating system problem, its RGs are moved to another node.
Fallover This represents the movement of an RG from one active node to another node (backup node) in response to a failure on that active node.
Fallback This represents the movement of an RG back from the backup node to the previous node when it becomes available. This movement is typically in response to the reintegration of the previously failed node.
Heartbeat packet A packet that is sent between communication interfaces in the cluster, and is used by the various cluster daemons to monitor the state of the cluster components (nodes, networks, and adapters).
RSCT daemons These consist of two types of processes: topology and group services. PowerHA uses group services, but depends on CAA for topology services. The cluster manager receives event information that is generated by these daemons and takes corresponding (response) actions in case of any failure.
Smart assists A set of high availability agents, called smart assists, are bundled with the PowerHA SystemMirror Standard Edition to help discover and define high availability policies for most common middleware products.
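The IPAT-via-aliasing behavior described above can be illustrated with AIX IP aliasing, which is the mechanism PowerHA itself drives when it moves a service address. This is a hedged sketch; the addresses and interface name are hypothetical examples, and PowerHA performs these steps for you:

```shell
# Add a service IP as an alias on en0; the interface's base address
# stays in place alongside the alias (addresses are hypothetical):
ifconfig en0 alias 192.0.2.50 netmask 255.255.255.0 up

# Remove the alias, as happens when the resource group moves
# to another node:
ifconfig en0 delete 192.0.2.50
```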
1.5 Fault tolerance versus high availability
Based on the response time and response action to system detected failures, the clusters and systems can belong to one of the following classifications:
Fault-tolerant systems
High availability systems
1.5.1 Fault-tolerant systems
Fault-tolerant systems are designed to operate virtually without interruption, regardless of the failure that might occur (except perhaps for a complete site shutdown because of a natural disaster). In such systems, all components, both hardware and software, are at least duplicated.
All components, CPUs, memory, and disks have a special design and provide continuous service, even if one subcomponent fails. Only special software solutions can run on fault-tolerant hardware.
Such systems are expensive and specialized. Implementing a fault-tolerant solution requires much effort and a high degree of customization for all system components.
For environments where no downtime is acceptable (life critical systems), fault-tolerant equipment and solutions are required.
1.5.2 High availability systems
The systems that are configured for high availability are a combination of hardware and software components that are configured to work together to ensure automated recovery in case of failure with minimal acceptable downtime.
In such systems, the software that is involved detects problems in the environment and manages application survivability by restarting the application on the same machine or on another available machine (taking over the identity of the original node).
Therefore, eliminating all single points of failure (SPOF) in the environment is important. For example, if the machine has only one network interface (connection), provide a second network interface (connection) in the same node to take over in case the primary interface providing the service fails.
Another important issue is to protect the data by mirroring and placing it on shared disk areas that are accessible from any machine in the cluster.
The PowerHA software provides the framework and a set of tools for integrating applications in a highly available system. Applications to be integrated in a PowerHA cluster can require a fair amount of customization, possibly both at the application level and at the PowerHA and AIX platform level. PowerHA is a flexible platform that allows integration of generic applications running on the AIX platform, providing for highly available systems at a reasonable cost.
PowerHA is not a fault-tolerant solution and should not be implemented as such.
1.6 Additional PowerHA resources
Here is a list of additional PowerHA resources and descriptions of each one:
Base publications
All of the following PowerHA v7 publications are available at IBM Knowledge Center:
 – Administering PowerHA SystemMirror
 – Developing Smart Assist applications for PowerHA SystemMirror
 – Geographic Logical Volume Manager for PowerHA SystemMirror Enterprise Edition
 – Installing PowerHA SystemMirror
 – Planning PowerHA SystemMirror
 – PowerHA SystemMirror concepts
 – PowerHA SystemMirror for IBM Systems Director
 – Programming client applications for PowerHA SystemMirror
 – Quick reference: clmgr command
 – Smart Assists for PowerHA SystemMirror
 – Storage-based high availability and disaster recovery for PowerHA SystemMirror Enterprise Edition
 – Troubleshooting PowerHA SystemMirror
Videos
Shawn Bodily has several PowerHA-related videos on his YouTube channel.
IBM Redbooks publications
The main focus of each IBM PowerHA Redbooks publication differs a bit, but usually their main focus is covering what is new in a particular release. They generally have more details and advanced tips than the base publications.
Each new publication is rarely a complete replacement for the last. The only exception to this is IBM PowerHA SystemMirror for AIX Cookbook, SG24-7739. It was updated to Version 7.1.3 after replacing two previous cookbooks. It is probably the most comprehensive of all the current IBM Redbooks publications with regard to PowerHA Standard Edition specifically. Although there is some overlap across them, with multiple versions supported, it is important to reference the version of the book that is relevant to the version that you are using.
Figure 1-2 shows a list of relevant PowerHA IBM Redbooks publications. Although it still includes PowerHA 6.1 Enterprise Edition, which is no longer supported, that exact book is still the best reference for configuring EMC SRDF and Hitachi TrueCopy.
Figure 1-2 PowerHA IBM Redbooks publications reference
White papers
 