Running Server Clusters

Cluster Administrator (Cluadmin.exe) provides the graphical interface for managing, monitoring, and configuring server clusters. Its command-line counterpart is Cluster.exe. Both tools use the Cluster API to manage the Cluster service.

The Cluster Service and Cluster Objects

The Cluster service is responsible for all aspects of server cluster operation and also maintains the cluster database. The Cluster service uses objects to control the physical and logical units within the cluster. Many types of cluster objects are defined, including those pertaining to the following components:

  • Cluster networks

  • Cluster interfaces

  • Nodes

  • Cluster resources

  • Resource types

  • Groups

Cluster objects have properties that define their behavior within the cluster. The Cluster API contains the control codes and management functions needed to manage the object through the Cluster service. As shown in Figure 18-12, each node in a cluster runs an instance of the Cluster service (Clussvc.exe), the Cluster Network Driver (Clusnet.sys), and the Cluster Disk Driver (Clusdisk.sys).

Figure 18-12. Overview of cluster administration

The Cluster Network Driver is responsible for the following activities:

  • Providing reliable, guaranteed communication between nodes

  • Monitoring network paths between nodes

  • Routing cluster messages

  • Detecting communication failure

Each node's Cluster Network Driver periodically exchanges messages called heartbeats with other active nodes. The heartbeat is a UDP packet that is sent between cluster nodes. If a node fails to respond to a heartbeat message, the Cluster Network Driver on the node that detects the failure notifies the Cluster service.

Each node's Cluster Disk Driver is responsible for maintaining exclusive ownership of shared disks. Only the node that owns the physical disk resource can access the disk. All other nodes cannot access the disk resource. The Cluster Disk Driver also is responsible for replacing reservations on disks for the local system.
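The exclusive-ownership rule can be pictured with a small Python model. This is only a sketch of the idea, not how Clusdisk.sys is implemented: the class name SharedDiskSketch and its methods are hypothetical, and a real reservation is placed on the disk hardware rather than held in a variable.

```python
from typing import Optional


class SharedDiskSketch:
    """Hypothetical model of a shared disk that one node at a time may own."""

    def __init__(self) -> None:
        self.owner: Optional[str] = None  # name of the node holding the reservation

    def reserve(self, node: str) -> bool:
        """Place or renew a reservation; fails if another node owns the disk."""
        if self.owner is None or self.owner == node:
            self.owner = node
            return True
        return False

    def release(self, node: str) -> None:
        """Give up the reservation, but only if this node actually holds it."""
        if self.owner == node:
            self.owner = None

    def can_access(self, node: str) -> bool:
        """Only the owning node may access the disk."""
        return self.owner == node


disk = SharedDiskSketch()
first = disk.reserve("node1")   # node1 becomes the owner
second = disk.reserve("node2")  # rejected while node1 holds the reservation
```

In this model node2 can take ownership only after node1 releases the disk, which mirrors the rule that all non-owning nodes are locked out.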

The Cluster Heartbeat

The Cluster service transmits heartbeat messages on a dedicated network adapter, called the cluster adapter, to other computers in the server cluster. The number of nodes in the server cluster determines how these additional network adapters are connected. With a four-node cluster and standard Ethernet cabling, the dedicated network adapters are normally connected to a dedicated hub or switch. For redundancy, communications can be transmitted over multiple networks as well.

As the name implies, the heartbeat is used to track the condition of each node in the cluster. If the Cluster service doesn't receive a heartbeat from a server in the cluster within a specified time, the service assumes the server has failed and initiates failover. The Cluster service uses the concept of virtual servers to specify groups of resources that fail over together. Failover occurs when a clustered resource fails on one server and another server takes over management of the resource. When the failed resource is restored, the original server is able to regain control of the resource and come back online. The process of returning to service is called failback.
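The timeout rule described above can be sketched in Python. This is a minimal illustration, not the Cluster service's actual logic: the HeartbeatMonitor class, its method names, and the 5-second timeout are all assumptions made for the example.

```python
import time
from typing import Dict, List, Optional


class HeartbeatMonitor:
    """Hypothetical tracker for the last heartbeat received from each node."""

    def __init__(self, timeout_seconds: float) -> None:
        self.timeout_seconds = timeout_seconds
        self.last_seen: Dict[str, float] = {}

    def record_heartbeat(self, node: str, now: Optional[float] = None) -> None:
        """Note that a heartbeat arrived from the given node."""
        self.last_seen[node] = time.monotonic() if now is None else now

    def failed_nodes(self, now: Optional[float] = None) -> List[str]:
        """Return the nodes whose last heartbeat is older than the timeout."""
        current = time.monotonic() if now is None else now
        return sorted(
            node
            for node, seen in self.last_seen.items()
            if current - seen > self.timeout_seconds
        )


# Explicit timestamps keep the example deterministic.
monitor = HeartbeatMonitor(timeout_seconds=5.0)  # timeout value is an assumption
monitor.record_heartbeat("node1", now=0.0)
monitor.record_heartbeat("node2", now=4.0)
suspects = monitor.failed_nodes(now=6.0)  # node1 is 6 s stale, node2 only 2 s
```

Here only node1 has exceeded the timeout, so it is the node for which failover would be initiated.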

The Cluster Database

The heartbeat isn't the only traffic transmitted between cluster nodes. The nodes also exchange synchronization and management data. Most management information is stored in the cluster database. This database contains information on the configuration of the cluster and the resources it uses.

The cluster database contains information on all physical and logical elements in the cluster, referred to as cluster objects, as well as configuration data. The Cluster service maintains the database by using global updates and periodic checkpointing. Global updates are used to replicate changes across all nodes. Any changes that the Cluster service fails to replicate to all nodes are logged to a recovery log. These changes are synchronized at a subsequent checkpoint.

The Cluster Quorum Resource

Every cluster has a single resource that is responsible for maintaining the recovery logs. This resource is called the quorum resource. The quorum resource writes information on all cluster database changes to the recovery logs, ensuring that the cluster configuration and state data can be recovered. The importance of the quorum resource is evident in any failover situation. Consider the following scenario:

  1. All four nodes in a cluster are using the quorum resource when node 1 fails. Nodes 2, 3, and 4 continue to operate, and node 2 takes over the resources of the failed node.

  2. Node 2 writes configuration changes to the recovery logs.

  3. Node 2 fails before node 1 comes back online. Nodes 3 and 4 take over the resources of the failed nodes.

  4. Shortly afterward, node 1 comes back online and must update its private copy of the cluster database with the changes made by node 2.

  5. The Cluster service uses the quorum resource's recovery logs to synchronize changes and perform the configuration updates. Node 1 is then able to rejoin and regain control of its resources.
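The scenario above can be traced with a short Python model. It is a sketch under stated assumptions, not the real recovery log format: the classes QuorumLogSketch and NodeSketch, the commit helper, and the key "groupA.owner" are all hypothetical. Each node remembers how many log entries it has applied, so a returning node replays only what it missed.

```python
from typing import Dict, List, Tuple


class QuorumLogSketch:
    """Hypothetical append-only recovery log kept by the quorum resource."""

    def __init__(self) -> None:
        self.entries: List[Tuple[str, str]] = []

    def append(self, key: str, value: str) -> None:
        self.entries.append((key, value))


class NodeSketch:
    """Hypothetical node holding a private copy of the cluster database."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.database: Dict[str, str] = {}
        self.applied = 0   # number of log entries already applied locally
        self.online = True

    def catch_up(self, log: QuorumLogSketch) -> int:
        """Replay log entries this node has not applied; return how many."""
        missed = log.entries[self.applied:]
        for key, value in missed:
            self.database[key] = value
        self.applied = len(log.entries)
        return len(missed)


def commit(log: QuorumLogSketch, nodes: List[NodeSketch], key: str, value: str) -> None:
    """Log a change, then apply it on every node that is currently online."""
    log.append(key, value)
    for node in nodes:
        if node.online:
            node.catch_up(log)


log = QuorumLogSketch()
nodes = [NodeSketch(f"node{i}") for i in range(1, 5)]
node1, node2, node3, node4 = nodes

node1.online = False                          # step 1: node 1 fails
commit(log, nodes, "groupA.owner", "node2")   # step 2: node 2 logs its change
node2.online = False                          # step 3: node 2 fails
commit(log, nodes, "groupA.owner", "node3")   # nodes 3 and 4 take over
node1.online = True                           # step 4: node 1 comes back
replayed = node1.catch_up(log)                # step 5: replay missed changes
```

After the replay, node 1's private copy matches the copies on nodes 3 and 4 even though node 2, which made the first change, is still offline.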

The only standard cluster resource that can act as a quorum resource is the Physical Disk resource. Developers can create their own quorum resource types, provided that resources of those types do the following:

  • Enable a single node to gain physical control of and maintain control of the resource

  • Provide physical storage that can be accessed by any node in the cluster

  • Use NTFS

The Cluster Interface and Network States

The network adapter used to transfer cluster management and state data is referred to as the cluster adapter. Traffic between nodes in the cluster is transmitted over the cluster network, which is typically a private network used only by the cluster nodes. To determine failure, the Cluster service tracks the status of the cluster adapter interface and the cluster network.

The cluster adapter interface states are shown in Table 18-2. Administrators can use the CLUSTER NETINTERFACE command or Cluster Administrator to check the interface state.

Table 18-2. Cluster Adapter Interface States

Network Interface State | Description

Up | The normal operation state. The interface is active and can communicate with all other interfaces on the network (except those that are Failed or Unavailable).

Unknown | The state cannot be determined at this time.

Unavailable | The interface is disabled for cluster use or the node associated with the network interface is down.

Unreachable | The node cannot communicate through the interface. The reason is unknown.

Failed | The node associated with the interface is active but cannot communicate through its interface. The Cluster service has isolated the error to the interface as determined by failure to receive heartbeats from the node and receipt of hardware failure notifications from an adapter that supports Network Driver Interface Specification (NDIS).

The network states are shown in Table 18-3. Administrators can use the CLUSTER NETWORK command or Cluster Administrator to check network state.

Table 18-3. Cluster Network States

Network State | Description

Up | The normal operation state. The network is functioning normally.

Unknown | The state cannot be determined at this time.

Unavailable | The network is disabled for cluster use or all the nodes attached to the network are inactive.

Partitioned | The network has partially failed. Some active nodes cannot communicate with one another over the network.

Down | The network has failed. None of the active nodes can communicate with another using the network.

When a network interface enters the Failed state, the Cluster service triggers failover of all IP Address resources that use the network interface. The Cluster service does not do this when a network interface is unreachable. When the interface is unreachable, the Cluster service cannot isolate the problem in a way that is sufficient to implement a recovery policy. Additionally, if the interface is in the Unavailable state, the Cluster service assumes the node is down.
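The way the Cluster service reacts to each interface state can be summarized as a lookup. The Python below is only a sketch: the InterfaceState enum and the recovery_action function are hypothetical names, the returned strings are illustrative, and the handling of the Up and Unknown states is an assumption because the text does not describe a reaction to them.

```python
from enum import Enum


class InterfaceState(Enum):
    """Interface states from Table 18-2 (the enum itself is hypothetical)."""
    UP = "Up"
    UNKNOWN = "Unknown"
    UNAVAILABLE = "Unavailable"
    UNREACHABLE = "Unreachable"
    FAILED = "Failed"


def recovery_action(state: InterfaceState) -> str:
    """Map an interface state to the reaction described in the text."""
    if state is InterfaceState.FAILED:
        # The fault is isolated to the interface, so recovery can proceed.
        return "fail over IP Address resources"
    if state is InterfaceState.UNAVAILABLE:
        return "assume node is down"
    if state is InterfaceState.UNREACHABLE:
        # The cause is unknown, so no recovery policy is applied.
        return "no failover: fault not isolated"
    # Assumption: Up and Unknown trigger no action.
    return "no action"


actions = {state.value: recovery_action(state) for state in InterfaceState}
```

The distinction that matters is between Failed and Unreachable: only the former gives the Cluster service enough information to act.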

The cluster network should normally be in the Up state. When in the Up state, the cluster network is working normally and all active nodes are communicating. If the network enters the Partitioned state, it means one or more of the nodes is having communication problems or has recently failed. The Down state indicates the cluster network has failed and isn't functioning. In the Down state, nodes cannot communicate with each other over this network.
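One way to picture the difference between Up, Partitioned, and Down is to classify a network by which pairs of active nodes can still reach each other. The Python function below is a sketch built on that assumption. The name network_state and its inputs are hypothetical, the Cluster service's real detection logic is not described here, and the Unknown state is not modeled.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, Set


def network_state(active_nodes: Iterable[str], working_links: Set[FrozenSet[str]]) -> str:
    """Classify a cluster network from pairwise reachability of active nodes."""
    nodes = sorted(set(active_nodes))
    if not nodes:
        return "Unavailable"  # no attached node is active
    pairs = [frozenset(pair) for pair in combinations(nodes, 2)]
    if not pairs:
        return "Up"  # assumption: a lone active node has nothing to contradict Up
    reachable = sum(1 for pair in pairs if pair in working_links)
    if reachable == len(pairs):
        return "Up"           # every active pair can communicate
    if reachable == 0:
        return "Down"         # no active pair can communicate
    return "Partitioned"      # some pairs can communicate, others cannot


all_nodes = ["node1", "node2", "node3"]
full_mesh = {
    frozenset({"node1", "node2"}),
    frozenset({"node1", "node3"}),
    frozenset({"node2", "node3"}),
}
healthy = network_state(all_nodes, full_mesh)
split = network_state(all_nodes, {frozenset({"node1", "node2"})})
dead = network_state(all_nodes, set())
```

With three active nodes, losing every link to node3 leaves one working pair out of three, which this model reports as Partitioned.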
