IBM Spectrum Scale RAID technical overview
In this chapter, we present an overview of IBM Spectrum Scale RAID, which is the core technology that is used in IBM Spectrum Scale Erasure Code Edition.
We start with a definition of terms, followed by a discussion of declustered RAID and the key software concepts and components that differentiate ECE from other software-defined storage technologies.
This chapter includes the following topics:
3.1 Definitions of IBM Spectrum Scale RAID
3.2 Software RAID
3.3 End-to-end checksum and data versions
3.4 Integrity Manager
3.5 Disk hospital
3.6 Storage hardware software interface
3.7 IBM Spectrum Scale RAID software component layout
3.8 Startup sequence for recovery group and log groups
3.9 Recovery of recovery group and log groups
3.10 ECE read and write strategies
3.1 Definitions of IBM Spectrum Scale RAID
The following words and phrases are associated with IBM Spectrum Scale RAID:
Disk: A block storage device, including NVMe drives, solid-state drives (SSDs), and hard disk drives (HDDs).
Storage Server: An IBM Spectrum Scale cluster node that features several disks that are available to it, and serves abstractions that are based on those disks.
Pdisk: An abstraction of a disk that encompasses all the physical paths and properties of the disk.
Track: A RAID stripe, also a full GPFS file system block.
Recovery group (RG): A recovery group is a collection of pdisks and servers. File system NSDs called VDisks can be created within a recovery group and can be configured with various levels of data protection, including tolerance and correction of disk errors, and tolerance and recovery of disk and server failures. A recovery group is also referred to as an ECE building block.
Server set: All the servers within a recovery group, which is also known as an ECE node class.
Declustered array (DA): A declustered array is a subset of the pdisks within a recovery group that all share similar characteristics, such as size and speed. A recovery group might contain multiple declustered arrays, which cannot overlap (that is, a pdisk must belong to exactly one declustered array).
VDisk: An erasure code-protected virtual NSD that is partitioned among the pdisks of a declustered array of a recovery group, and served by one of the recovery group servers.
Log home VDisk: A special VDisk that is used to quickly store the recovery group’s internal transaction log, such as event log entries, updates to VDisk configuration data, and certain data write operations. It is often created from a declustered array with fast devices, such as NVMe drives or SSDs.
Log group (LG): A subset of the VDisks within a recovery group that all share a single transaction log on one log home VDisk. It is the smallest unit of failure recovery in a recovery group. All the VDisks in the same log group must fail over and recover together. A recovery group can contain multiple log groups, which cannot overlap (that is, a VDisk must belong to exactly one log group). All of the VDisks in a log group are served by one of the recovery group servers.
Root log group: A special log group that allocates resources, hosts VDisk configuration data, and responds to commands for the entire recovery group. The root log group contains only a log home VDisk, which is used to ensure that the VDisk configuration data is updated atomically.
mmvdisk: The command suite for simplified IBM Spectrum Scale RAID administration.
VDisk set: A VDisk set is a collection of VDisks with identical sizes and attributes, one in each log group across one or more recovery groups. VDisk sets are externally managed according to the conventions of the mmvdisk command. With VDisk sets, the mmvdisk command creates IBM Spectrum Scale file systems that are striped uniformly across all log groups.
NSD: The abstraction of a file system disk that is used by IBM Spectrum Scale. A VDisk NSD is an IBM Spectrum Scale NSD built from an IBM Spectrum Scale ECE recovery group.
File system: An IBM Spectrum Scale file system is striped across a collection of NSDs.
Recovery group configuration manager (RGCM): RGCM assigns log groups to servers, manages recovery and failover, and directs Spectrum Scale clients to the node currently serving a specific VDisk NSD.
3.2 Software RAID
The IBM Spectrum Scale RAID software in ECE uses local serial-attached SCSI (SAS) or NVMe drives. Because RAID functions are handled by the software, ECE does not require an external RAID controller or acceleration hardware.
3.2.1 RAID codes
IBM Spectrum Scale RAID in ECE supports two-fault-tolerant and three-fault-tolerant RAID codes. The two-fault-tolerant codes are 8 data plus 2 parity (8+2p), 4 data plus 2 parity (4+2p), and 3-way replication. The three-fault-tolerant codes are 8 data plus 3 parity (8+3p), 4 data plus 3 parity (4+3p), and 4-way replication. Figure 3-1 shows example RAID tracks that consist of data and parity strips.
Figure 3-1 RAID tracks
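The choice among these codes trades fault tolerance against storage efficiency. The following Python sketch (an illustration only, not part of ECE) lists, for each supported code, the number of strips per track and the fraction of the raw capacity that holds user data:

# Illustrative sketch: strips per track and storage efficiency of the
# RAID codes supported by ECE (user data strips / total strips written).
codes = {
    "8+2p": (8, 2),
    "4+2p": (4, 2),
    "3-way replication": (1, 2),   # 1 data strip plus 2 replica strips
    "8+3p": (8, 3),
    "4+3p": (4, 3),
    "4-way replication": (1, 3),   # 1 data strip plus 3 replica strips
}

for name, (data, redundancy) in codes.items():
    total = data + redundancy
    print(f"{name:>18}: {total} strips per track, tolerates {redundancy} "
          f"failures, efficiency {data / total:.0%}")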
3.2.2 Declustered RAID
IBM Spectrum Scale RAID distributes data and parity information across node failure domains to tolerate unavailability or failure of all pdisks in a node. It also distributes spare capacity across nodes to maximize parallelism in rebuild operations.
IBM Spectrum Scale RAID implements end-to-end checksums and data versions to detect and correct the data integrity problems of traditional RAID.
Figure 3-2 on page 22 shows a simple example of declustered RAID. The left side shows a traditional RAID layout that uses seven drives: three 2-way mirrored RAID volumes and a dedicated spare disk. The right side shows the equivalent declustered layout, which still uses seven drives. Here, the blocks of the three RAID volumes and the spare capacity are scattered over all seven disks.
Figure 3-2 Declustered array versus 1+1 array
Figure 3-3 shows a significant advantage of the declustered RAID layout over the traditional RAID layout after a drive failure. With the traditional RAID layout on the left side of Figure 3-3, the system must copy the surviving replica of the failed drive to the spare drive, reading from only one drive and writing to only one drive. However, with the declustered layout that is shown on the right of Figure 3-3, the affected replicas and the spare space are distributed across all six surviving disks. This configuration allows the rebuild to read from and write to all surviving disks, which greatly increases rebuild parallelism.
Figure 3-3 Array rebuild operation
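To make the rebuild fan-out concrete, the following toy Python sketch (not ECE code) scatters the tracks of three 2-way mirrored volumes across a 7-disk declustered array, fails one disk, and shows that the rebuild reads are spread across all surviving disks rather than coming from a single mirror partner:

import random

DISKS = 7               # toy array matching Figure 3-2: seven drives
VOLUMES = 3             # three 2-way mirrored RAID volumes
TRACKS_PER_VOLUME = 70  # arbitrary number of tracks per volume

random.seed(1)
placement = []          # (volume, {disk holding replica 1, disk holding replica 2})
for volume in range(VOLUMES):
    for _ in range(TRACKS_PER_VOLUME):
        placement.append((volume, set(random.sample(range(DISKS), 2))))

failed_disk = 3         # simulate the failure of one disk
affected = [disks for _, disks in placement if failed_disk in disks]
read_sources = {(disks - {failed_disk}).pop() for disks in affected}

print(f"{len(affected)} tracks lost a replica")
print(f"rebuild reads come from disks {sorted(read_sources)}")
# With a traditional 1+1 layout, the rebuild would read from exactly one
# surviving mirror and write to exactly one dedicated spare disk; here the
# reads (and the writes into distributed spare space) use all six survivors.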
A second advantage of the declustered RAID technology that is used by IBM Spectrum Scale ECE (and in IBM ESS) is that it minimizes the worst-case number of critical RAID tracks in the presence of multiple disk failures. ECE can then deal with restoring protection to critical RAID tracks as a high priority, while giving lower priority to RAID tracks that are not considered critical.
For example, consider an 8+3p RAID code on an array of 100 pdisks. In both the traditional and declustered layouts, the probability that a specific RAID track is critical is 11/100 * 10/99 * 9/98 (about 0.1%). However, when a track is critical in the traditional RAID array, all tracks in the volume are critical, whereas with declustered RAID, only about 0.1% of the tracks are critical. By prioritizing the rebuild of more critical tracks over less critical tracks, ECE quickly gets out of critical rebuild and then can tolerate another failure.
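The 0.1% figure can be checked with simple arithmetic. The following sketch reproduces it for an 8+3p code, whose tracks have 11 strips on distinct pdisks, on an array of 100 pdisks with three failed drives:

# Probability that a specific 8+3p track (11 strips on distinct pdisks)
# has strips on all three failed pdisks of a 100-pdisk array.
strips, pdisks, failures = 11, 100, 3

probability = 1.0
for i in range(failures):
    probability *= (strips - i) / (pdisks - i)   # 11/100 * 10/99 * 9/98

print(f"Probability that a given track is critical: {probability:.4%}")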
ECE adapts these priorities dynamically; if a “non-critical” RAID track is used and more drives fail, this RAID track’s rebuild priority can be escalated to “critical”.
A third advantage of declustered RAID is that it makes it possible to support any number of drives in the array and to dynamically add and remove drives from the array. Adding a drive in a traditional RAID layout (except in the case of adding a spare) requires significant data reorganization and restriping. In a declustered array, however, only targeted data movement is needed to rebalance the array across the added drive.
3.2.3 Fault-tolerance
When a VDisk set is created, the user selects one of the supported two-fault-tolerant or three-fault-tolerant Reed-Solomon or replicated erasure codes. This choice determines the number of failures of each type that the system can tolerate.
A simultaneous hard failure of individual disks that exceeds the VDisk fault tolerance can result in data loss. The failure of too many nodes results only in temporary data unavailability (assuming the contents of the disks in the failed nodes are not lost).
To ensure fault-tolerant data access, IBM Spectrum Scale Erasure Code Edition places the strips of RAID tracks across failure boundaries. This placement allows the system to survive concurrent failures of storage-rich servers or disks.
The placement algorithm is aware of the hardware grouping of disks within individual storage servers and attempts to segregate the individual strips of a RAID track across as many servers and disks as possible. For example, if a VDisk was created with 4-way replication, each replica of the VDisk’s 4-way track can be placed on a separate storage server. If a storage server fails, the surviving replicas on other servers ensure continuity of service.
Figure 3-4 on page 24 shows a sample track placement for a VDisk that uses the RAID redundancy code 4+3P (four data strips and three parity strips). Strips 1 - 4 are data strips and strips 5 - 7 are parity strips of the track. The system balances the strips across the servers such that each server is guaranteed to hold at least one strip, only two servers hold two strips, and no server holds more than two strips. ECE guarantees 1 node plus 1 disk drive fault tolerance in this configuration.
 
Note: When mixing VDisk sets of different fault tolerances within the same ECE building block, the availability of all VDisks in the building block can be limited by the VDisk set with the lowest fault tolerance. For example, suppose that the building block consists of 12 nodes. One VDisk set uses a three-fault-tolerant code while another uses a two-fault-tolerant code. If three pdisks that are spread across nodes fail simultaneously, the VDisks with the two-fault-tolerant code might report data loss while the three-fault-tolerant VDisks survive. However, if three nodes fail instead, both the two-fault-tolerant and three-fault-tolerant VDisks become unavailable until at least one of the nodes comes back up.
Figure 3-4 4+3P track strips across servers
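The balanced placement that Figure 3-4 depicts can be approximated with a simple round-robin assignment. The following sketch (an illustration only, not the ECE placement algorithm) assumes five storage servers and shows that no server receives more than two of the seven strips:

# Round-robin sketch: place the 7 strips of a 4+3P track (strips 1 - 4 are
# data, strips 5 - 7 are parity) across 5 storage servers.
servers = 5
strips = [f"data strip {i}" for i in range(1, 5)] + [f"parity strip {i}" for i in range(5, 8)]

placement = {server: [] for server in range(1, servers + 1)}
for index, strip in enumerate(strips):
    placement[(index % servers) + 1].append(strip)

for server, held in placement.items():
    print(f"server {server}: {held}")
# Each server holds at least one strip and at most two, so losing one
# server (at most 2 strips) plus one more drive (1 strip) leaves at least
# 4 of the 7 strips, which is enough to reconstruct the data.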
By segregating the strips of each track across as wide a set of disk groups as possible, ECE ensures that the loss of any set of disk groups up to the fault tolerance of the RAID redundancy code is survivable.
Figure 3-5 on page 25 shows an example of the same configuration after the loss of a server, before a rebuild operation. In this example, the loss of server 2 makes strips 2 and 5 unavailable. These unavailable strips are rebuilt with the help of the surviving data and parity strips. The fault-tolerant placement of individual strips across multiple servers ensures that at least four strips survive.
Figure 3-5 4+3P track strips across servers after one server failure
3.3 End-to-end checksum and data versions
IBM Spectrum Scale Erasure Code Edition protects all data that is written to disk, and data that is passing over the network between Spectrum Scale client nodes and ECE storage nodes with strong (64-bit) checksums. If on-disk data becomes corrupted, ECE detects the corruption, uses the erasure code to compute the correct data, and repairs the corrupted on-disk data. If data is corrupted over the network between nodes, ECE detects the corruption and retransmits the data.
In addition to checksums, ECE records a version number with the modified on-disk data whenever on-disk data is modified. It also tracks that version number in the VDisk metadata.
If a disk write is silently dropped, ECE detects that the data version does not match the expected value when reading the data back from disk, and uses the erasure code to compute the correct data and repair the on-disk data.
Many other RAID solutions rely only on T10 DIF, with its weaker 16-bit checksum, and generally cannot detect dropped writes. In addition to its strong checksums and version numbers, ECE also uses T10 DIF when it is available.
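The following sketch shows the principle of combining a 64-bit checksum with a data version; it is not the actual ECE implementation, and the checksum algorithm shown is only a stand-in. On read, both values are verified against the VDisk metadata, and a mismatch triggers reconstruction from the erasure code and repair of the on-disk copy:

import hashlib

def checksum64(data: bytes) -> int:
    # Illustrative 64-bit checksum; ECE's actual algorithm is not shown here.
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

strip = b"user data block"
metadata = {"checksum": checksum64(strip), "version": 7}   # kept in VDisk metadata

def verify_on_read(on_disk_data: bytes, on_disk_version: int) -> bool:
    # Returns True only if the strip read back from disk is trustworthy.
    if checksum64(on_disk_data) != metadata["checksum"]:
        return False     # corrupted on disk: reconstruct from the erasure code
    if on_disk_version != metadata["version"]:
        return False     # dropped write: stale data, reconstruct and repair
    return True

print(verify_on_read(strip, 7))           # True: data and version match
print(verify_on_read(strip, 6))           # False: a write was silently dropped
print(verify_on_read(b"corrupted!", 7))   # False: on-disk corruption detected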
 
3.4 Integrity Manager
The Integrity Manager is a software component of IBM Spectrum Scale RAID that maintains data resiliency. It dynamically adjusts data layout to maintain fault tolerance and routinely verifies data correctness on disk drives. The layout adjustment operations are split into “Rebuild” and “Rebalance” phases, while data integrity is verified during the “Scrub” phase. These phases run sequentially.
Consider the following points:
Rebuild is responsible for data migration when pdisks fail or become unavailable. It migrates data to spare space that is distributed over the other disks of the array to restore fault tolerance. When creating a declustered array, ECE specifies the minimum amount of space in the array to reserve as spare space for rebuild. Unlike other RAID solutions that designate complete drives as spares, ECE distributes spare space equally among all disks in the array to maximize rebuild parallelism.
 
Note: The user can increase the spare space beyond the ECE default value to set aside extra space to withstand more drive failures while maintaining fault tolerance after rebuild completes.
Rebalance migrates data in a declustered array to balance data among the pdisks. When a failed pdisk in the array is replaced or when pdisks are added to the array, rebalance moves data into the newly added space. Similarly, if disks were unavailable long enough for rebuild to migrate their data to other disks and the unavailable disks later come back online, rebalance migrates data back to those disks.
Scrub is a background task that slowly cycles through all VDisks in a declustered array and verifies the on-disk data and parity information. The purpose of scrub is to find and correct defects in cold data before enough of these defects accumulate to exceed the fault tolerance of the VDisks. Scrub runs at low priority relative to file system I/O so that it does not affect performance. When file system I/O is light, scrub paces itself so that by default it takes two weeks to complete a scrub cycle on each declustered array.
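For a rough sense of how gentle this pacing is, the following sketch estimates the average background read rate per pdisk that is needed to complete one scrub cycle in the default two weeks; the per-disk capacity is an arbitrary example value, not an ECE default:

# Rough estimate of the per-pdisk scrub read rate for a 14-day cycle.
capacity_per_pdisk_tib = 8        # example value only
cycle_days = 14                   # default scrub cycle length

bytes_per_pdisk = capacity_per_pdisk_tib * 2**40
seconds_per_cycle = cycle_days * 24 * 3600
rate_mib_per_second = bytes_per_pdisk / seconds_per_cycle / 2**20

print(f"about {rate_mib_per_second:.1f} MiB/s of scrub reads per pdisk")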
3.5 Disk hospital
The disk hospital monitors the health of physical disk drives. It analyzes errors that are reported by the operating system, repairs medium errors, measures disk error rates and performance, power-cycles drives to repair some connectivity problems, and decides when a disk must be replaced. A human is needed only to replace the drive after the hospital determines that the drive is defective.
The disk hospital features the following main responsibilities:
Analyze errors that are reported by the operating system and determine whether they are connectivity problems, disk medium errors, or other disk problems.
Facilitate correction of medium errors.
Monitor SMART data and react to SMART trips.
Measure long-term uncorrectable read error rate.
Measure disk performance and identify slow disks.
Determine when a drive is defective and prepare the drive for replacement.
If the operating system reports an I/O error against a physical disk, the disk hospital puts the disk into a “diagnosing” state and begins a series of tests. While the disk is being diagnosed, ECE reconstructs reads from parity and defers writes by marking strips “stale”. Stale strips are automatically readmitted if the disk is placed back into service.
If other drive failures prevent reconstruction, or if more than one strip of a track would be marked stale, ECE waits for the hospital to finish its diagnosis before issuing I/Os to the disk.
If the disk hospital finds disk medium errors, it repairs the errors by using the RAID layer to reconstruct the data and then overwrites the affected disk blocks with the reconstructed data.
Most modern drives include built-in self-monitoring, analysis, and reporting technology (SMART). The disk hospital polls these drives for SMART predicted failures (also called SMART trips) after any error, and at least every 24 hours. If the drive reports an impending failure, the disk hospital places the drive into the “failing” state, drains all data from the disk to the distributed spare space, and prepares the drive for replacement.
The disk hospital uses a patented algorithm to measure the uncorrectable read error rate of drives (also called the bit error rate). If the error rate exceeds the manufacturer’s specified rate, the hospital puts the pdisk into failing state and prepares it for replacement.
On every I/O operation, the disk hospital collects performance information. If a few drives exhibit poor performance compared to the average for the array over tens of thousands of I/O requests, the hospital puts the under-performing drives into “slow” state, and prepares them for replacement.
If the disk hospital finds that disk blocks can no longer be written, it puts the disk into “read-only” state. If the disk hospital finds that the disk suffered a complete internal failure, it puts the disk into “dead” state. In both cases, the disk hospital prepares the disk for replacement.
The disk hospital is careful not to mark drives bad in response to communication problems, such as a defective cable. Such problems can affect many drives at once, which can easily exceed the fault tolerance of the VDisks.
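A highly simplified sketch of these decisions follows. The state names match the prose in this section, but the thresholds (other than comparison against the manufacturer’s specified error rate) and the structure of the checks are assumptions for illustration, not the real disk hospital logic:

def hospital_verdict(responding: bool, writable: bool, smart_trip: bool,
                     uncorrectable_rate: float, rated_uncorrectable_rate: float,
                     latency_vs_array_average: float) -> str:
    # Returns the pdisk state a simplified "hospital" would choose.
    if not responding:
        return "dead"        # complete internal failure: prepare for replacement
    if not writable:
        return "read-only"   # blocks can no longer be written
    if smart_trip:
        return "failing"     # drive predicts its own failure: drain its data
    if uncorrectable_rate > rated_uncorrectable_rate:
        return "failing"     # exceeds the manufacturer's specified error rate
    if latency_vs_array_average > 10.0:
        return "slow"        # far slower than the array average (threshold assumed)
    return "ok"

print(hospital_verdict(True, True, False, 1e-16, 1e-15, 1.2))   # ok
print(hospital_verdict(True, True, True, 1e-16, 1e-15, 1.2))    # failing
print(hospital_verdict(True, True, False, 1e-16, 1e-15, 25.0))  # slow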
3.6 Storage hardware software interface
ECE interfaces with platform-specific disk bays to determine the physical locations of pdisks and control power and indicator lights. It provides this interface by using the following external commands:
The tslsenclslot command is used to inventory and determine the status of disk slots.
The tsctlenclslot command is used to control the lights and power of disk slots. The supported slot lights are power, identify, replace, and fail.
Both of these programs are accompanied by a corresponding platform dependent backend implementation, which allows ECE to integrate with various storage hardware and storage management interfaces.
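To illustrate the split between the generic slot commands and the platform-dependent backend, the following hypothetical Python skeleton shows the kind of operations such a backend must provide. The class and method names are invented for this sketch and do not reflect the real backend interface:

SUPPORTED_LIGHTS = ("power", "identify", "replace", "fail")

class ExampleSlotBackend:
    # Hypothetical backend that maps abstract slot operations onto one
    # vendor's enclosure management interface.
    def list_slots(self):
        # Would enumerate enclosures and slots through the platform's
        # management API and report location and occupancy for each slot.
        return [{"slot": "enclosure0/slot3", "occupied": True}]

    def set_light(self, slot: str, light: str, on: bool) -> None:
        if light not in SUPPORTED_LIGHTS:
            raise ValueError(f"unsupported light: {light}")
        # Would issue the vendor-specific call to toggle the indicator.
        print(f"{slot}: {light} light turned {'on' if on else 'off'}")

backend = ExampleSlotBackend()
print(backend.list_slots())
backend.set_light("enclosure0/slot3", "identify", True)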
3.7 IBM Spectrum Scale RAID software component layout
As shown in Figure 3-6, a recovery group is composed of a set of storage-rich servers with symmetrical configurations. All available disk drives in these servers (except system boot drives) belong to this recovery group.
Figure 3-6 IBM Spectrum Scale RAID software component layout
Multiple recovery groups can exist in the same IBM Spectrum Scale cluster and file system. The disk drives are grouped into different declustered arrays according to their characteristics, with each declustered array consisting of a set of matching drives. Usually, the disk space in a recovery group is evenly divided into different VDisk sets and the VDisks are evenly grouped into multiple user log groups.
A log home VDisk and one or more user VDisks (used as IBM Spectrum Scale NSDs) exist in each log group. Each server features two user log groups in the normal case. During server failure, the two log groups that are hosted by the failing server are automatically moved to two different available nodes.
The root log group is a special log group that allocates resources and hosts VDisk configuration data for the whole recovery group. From this perspective, when the term recovery group is used, it also refers to the root log group. The root log group is lightweight and typically does not consume significant system resources. The recovery group configuration manager (RGCM) is responsible for balancing the log groups across different nodes in the recovery group server set during cluster startup and failure recovery.
3.8 Startup sequence for recovery group and log groups
RGCM is a software component that tracks and assigns management responsibilities for all recovery groups and their associated log groups. RGCM is always on the cluster manager node. RGCM is responsible for the following startup operations for each recovery group in the cluster:
During ECE cluster startup, designate a node within the ECE cluster and start a recovery group on this node.
During ECE cluster startup, coordinate with each recovery group and assign an owner node to each of their log groups.
Monitor and handle any failures of the recovery groups and log groups. If a failure occurs, reassign the owner nodes and reschedule recovery for the failed recovery groups or log groups.
Track ownership of the recovery group and the log groups among the nodes within the ECE cluster.
How RGCM starts the recovery group and associated log groups is shown in Figure 3-7.
Figure 3-7 RGCM roles during cluster startup
The nodes that host the recovery group and RGCM can also serve log groups. For simplicity, the example that is shown in Figure 3-7 does not display the log groups for these nodes.
As a recovery group starts, it scans its configuration and determines that log groups 1 and 2 must be started. The recovery group sends requests to the RGCM instance to designate nodes to serve the log groups.
As each log group starts, it activates its associated VDisks and brings the corresponding NSD disks online. The mmvdisk command can be used to list all log groups, list the log groups that are served by each node, and examine the VDisk status within each log group.
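The following toy sketch illustrates the result of this startup sequence for one recovery group. It assumes the usual layout of two user log groups per server (see 3.7) and simply deals the log groups out round-robin, which is an approximation of the balancing that RGCM performs, not its actual algorithm:

# Toy sketch of startup assignment for one recovery group: the root log
# group runs with the recovery group, and user log groups are spread
# round-robin so that each server normally serves two of them.
servers = ["nodeA", "nodeB", "nodeC", "nodeD"]
user_log_groups = [f"LG{i}" for i in range(1, 2 * len(servers) + 1)]

assignment = {server: [] for server in servers}
assignment[servers[0]].append("root LG")   # node designated to run the recovery group

for index, log_group in enumerate(user_log_groups):
    assignment[servers[index % len(servers)]].append(log_group)

for server, served in assignment.items():
    print(f"{server}: {served}")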
3.9 Recovery of recovery group and log groups
Recovery group and log group failures can result from a server crash, daemon crash, or from temporary loss of network or disk access on the designated recovery group or log group server. If a recovery group failure occurs, RGCM is notified of the failure, selects a node, and reschedules the recovery for the failed recovery group.
If a log group failure occurs, the corresponding recovery group is notified of the failure and coordinates with RGCM to select a new node to serve that log group. The recovery procedure for recovery groups and log groups is also used when bringing an ECE node down for planned maintenance and for rebalancing the number of log groups on each ECE storage node.
Figure 3-8 shows the handling of a recovery group failure on node B, and the following recovery on node C.
Figure 3-8 Recovery group failure and recovery
Figure 3-9 shows the handling of the failure of log group 1 on node C, and the following recovery on node B.
Figure 3-9 Log group failure and recovery
3.10 ECE read and write strategies
This section describes various strategies the ECE RAID layer uses to perform reads and writes that are received from the file system layer. Read operations in ECE are relatively straightforward. However, write operations use four different strategies, depending on the size of the operation and the amount of data that is cached in the RAID track.
By using different strategies for each case, ECE RAID minimizes latency or the total number of physical I/O operations that must be done to complete the operation.
3.10.1 Reads
When the ECE RAID layer receives a read request from the file system, it first checks if the requested blocks are cached. If the blocks are cached, ECE returns them. If they are not cached, ECE reads the data strips for the requested block from the appropriate pdisks and verifies the data checksums and version information. If no problems are found, ECE returns the aggregated data strips to the file system layer.
If checksum errors, failed pdisks, or stale strips are detected, or if some of the required data strips are not available, ECE reads other data and parity strips and reconstructs the data. If the number of unreadable strips exceeds the VDisk fault tolerance, the reconstruction cannot be done and ECE returns a read error back to the file system.
 
3.10.2 Full track writes
Full track writes (also known as full block writes) are the most efficient case for ECE. In this strategy, ECE allocates a new free physical RAID track (ptrack), computes the parity strips from the data, and writes the data and parity strips to the new track. Then, it logs a record to the VDisk metadata log that changes the track location from the old ptrack to the new ptrack and frees the old ptrack.
By writing to an unused ptrack on disk and then changing the track location, the update of the data and parity strips remains atomic because no modifications are committed to the track until the VDisk metadata update is written to the log.
3.10.3 Promoted full track writes
If a write modifies most of a track, ECE reads the unmodified portion of the track and performs a full track write.
3.10.4 Medium writes
If the percentage of the track that is modified is greater than or equal to the fast write limit and less than the nsdRAIDMediumWriteLimitPct parameter (default 50 percent), the write is performed as a medium write. This operation is also called an “in-place” write.
This case is tricky because the updates to data and parity must be made atomic, even though the in-place updates cannot be done atomically. The process includes the following steps:
1. Read the contents of the portions of the track that are to be modified; that is, the old data and parity.
2. Execute a parity update operation to calculate the new parity strips from the old data strips, old parity strips, and new data to be applied.
3. Write the intended changes to a redo log that is known as the atomic update log (AU log).
4. Apply the changes to the track.
The medium write case adds several overheads in that it must read the old data, and write the new data to the atomic update log and to the track. It also must write to the VDisk metadata log. However, this method is used when these overheads are less than those of a promoted full track write.
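The incremental parity calculation in step 2 can be illustrated with simple XOR parity. ECE’s Reed-Solomon codes apply the same idea (new parity derived from the old data, old parity, and new data) using Galois-field arithmetic rather than plain XOR, so the following is only a simplified illustration:

# XOR-parity illustration of the in-place parity update:
# P_new = P_old XOR D_old XOR D_new (no need to read the unmodified strips).
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

d1_old = b"\x10" * 4                     # strip being modified (old contents)
d2 = b"\x22" * 4                         # unmodified strip (not read)
p_old = xor(d1_old, d2)                  # old parity over the data strips

d1_new = b"\x5a" * 4                     # new data for the modified strip
p_new = xor(xor(p_old, d1_old), d1_new)  # incremental parity update

assert p_new == xor(d1_new, d2)          # matches a full recomputation
print(p_new.hex())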
3.10.5 Small writes
ECE uses the small write strategy if the size of the write in bytes is less than the small write limit. In the small write case, the data is written to the “fast write” log. After the data is successfully committed to the log, ECE returns success back to the file system. Because the log is placed on fast storage, this operation has low latency, which makes small writes fast.
The write is then flushed back to the RAID track in background by using the promoted full track write strategy or the medium write strategy, depending on what percentage of the affected track was modified at the time of the flush.
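Taken together, the choice among the four strategies can be sketched roughly as follows. The nsdRAIDMediumWriteLimitPct default of 50 percent comes from the description above; the other limits and the handling of writes below the fast write limit are assumptions for illustration only:

def choose_write_strategy(bytes_written: int, fraction_of_track_modified: float,
                          small_write_limit_bytes: int = 64 * 1024,   # assumed value
                          fast_write_limit_fraction: float = 0.05,    # assumed value
                          medium_write_limit_fraction: float = 0.50   # nsdRAIDMediumWriteLimitPct default
                          ) -> str:
    # Rough sketch of the strategy selection; the limits are illustrative.
    if fraction_of_track_modified >= 1.0:
        return "full track write"
    if bytes_written < small_write_limit_bytes:
        return "small write (fast write log)"
    if fraction_of_track_modified >= medium_write_limit_fraction:
        return "promoted full track write"
    if fraction_of_track_modified >= fast_write_limit_fraction:
        return "medium (in-place) write"
    return "small write (fast write log)"    # below the fast write limit (assumed)

print(choose_write_strategy(8 * 2**20, 1.0))     # full track write
print(choose_write_strategy(16 * 1024, 0.002))   # small write (fast write log)
print(choose_write_strategy(2 * 2**20, 0.25))    # medium (in-place) write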
3.10.6 Deferred writes and stale strips
In all four of the write cases, if an affected pdisk is in a transiently unavailable state (such as “diagnosing”), ECE can defer the write by marking the unavailable strip “stale”. If the pdisk later becomes available, a “readmit” process applies the deferred writes, which brings the pdisk up-to-date with the RAID track. This optimization allows ECE to achieve high performance with low latency, even when transient disk problems are occurring.
If more than one pdisk in the RAID track is transiently unavailable, ECE must mark multiple strips stale. In this case, it waits for the pdisks to stabilize before completing the write.
If the number of stably unavailable strips in the track exceeds the VDisk fault tolerance such that parity strips cannot be computed, the write fails and returns an error back to the file system.
In addition, if some of these pdisks are in the “missing” state (that is, unavailable but with the expectation that they eventually become available), ECE resigns service of the log group and periodically attempts recovery while waiting for the missing pdisks to become available.
 