Planning an ECE installation
This chapter provides guidelines for planning an ECE installation.
This chapter includes the following topics:
4.1, "Sizing considerations"
4.2, "Precheck tools"
4.3, "Erasure code selection"
4.4, "Spare space allocation"
4.5, "Network planning"
4.6, "IBM Spectrum Scale node roles"
4.7, "Cluster Export Services"
4.8, "System management and monitoring"
4.9, "Other IBM Spectrum Scale components"
4.10, "Running applications"
4.11, "File and Object Solution Design Studio tool"
4.1 Sizing considerations
The minimum hardware and network requirements for ECE storage servers are documented in the IBM Spectrum Scale Knowledge Center. However, to size an IBM Spectrum Scale ECE deployment for your workloads, the following factors must be considered:
Required overall capacity, and requirements for expected capacity growth
Required I/O performance
Required redundancy and failure tolerance
Physical space and power
Cost
The following hardware choices influence how a system operates:
Drive type (NVMe, SSD, or HDD)
Node hardware
Number of nodes
Network interconnect
As described in Chapter 1, “Introduction to IBM Spectrum Scale Erasure Code Edition” on page 1, ECE can be deployed in many ways to optimize capacity and performance. NVMe and HDDs can be paired in the same file system, or ECE can be combined with ESS to provide the optimal capacity and performance.
In the following sections, we explain other factors that must be considered when planning the installation of a system.
4.2 Precheck tools
The following open source precheck tools must be run and return passing results to ensure your ECE configuration features a supported hardware and network configuration:
SpectrumScale_ECE_OS_READINESS
SpectrumScale_ECE_OS_OVERVIEW
SpectrumScale_NETWORK_READINESS
These tools can be found in the parent public GitHub repository.
4.2.1 SpectrumScale_ECE_OS_READINESS helper tool
The SpectrumScale_ECE_OS_READINESS helper tool checks the following attributes of your target hardware to confirm that the minimum hardware requirements are met for each server:
CPU architecture, sockets, and cores
Memory capacity and DIMMs
Operating system levels
Software that is installed on the system
Network NIC and link speed to confirm that it is one of the supported models
Storage adapter to confirm that it is one of the supported models
HDD and SSD capacities, and confirmation that the drives are SAS drives in JBOD mode
Whether the NVMe CLI software is installed, and NVMe drive capacities
Write cache settings on drives and adapters, to confirm that volatile write caching is disabled
Operating system sysctl settings
For more information about and to download the SpectrumScale_ECE_OS_READINESS helper tool, see this GitHub web page.
After the tool is run, a JSON file is generated that includes information that was collected about this system. The file name is the IP address that is passed to the precheck tool.
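If you want to review the collected data manually, the generated file can be inspected with a few lines of Python. The following is a minimal sketch; the file name shown (with a .json suffix) and the printed keys are assumptions for illustration, not the tool's documented schema:

import json

# Load the JSON summary that the readiness tool writes. The file name is
# the IP address that was passed to the tool; the .json suffix here is an
# assumption for this example.
with open("10.0.0.1.json") as f:
    report = json.load(f)

# Print every recorded attribute so that a failing check (CPU, memory,
# NIC, drives, and so on) is easy to spot.
for check, value in sorted(report.items()):
    print(f"{check}: {value}")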
If you find issues with the tool, open a defect in the public repository. Also, note that this open source helper tool is provided without warranty and is not part of the ECE product from IBM.
Although this helper tool informs you about whether the server passes, the final authority of whether your hardware is suitable for an ECE cluster is the hardware requirements that are documented in IBM Knowledge Center.
4.2.2 SpectrumScale_ECE_OS_OVERVIEW helper tool
The SpectrumScale_ECE_OS_OVERVIEW helper tool is a standalone tool that reviews and consolidates the information in the JSON files that are generated by the SpectrumScale_ECE_OS_READINESS tool. It checks for homogeneity of the systems, with the assumption that all ECE nodes belong to the same recovery group.
Run this tool only if all of the nodes that you are testing pass the individual tests of SpectrumScale_ECE_OS_READINESS.
It is recommended that you always install ECE by using the IBM Spectrum Scale installation toolkit. When this toolkit is used, it runs the SpectrumScale_ECE_OS_READINESS and SpectrumScale_ECE_OS_OVERVIEW tools.
The following checks are run by SpectrumScale_ECE_OS_OVERVIEW:
All nodes included in the overall test passed the individual test
All nodes have the same:
 – CPU architecture
 – Number of sockets
 – Number of cores per socket
 – Number of DIMMs (a failure raises a warning only)
 – Amount of physical memory
 – Network interface
 – Network link speed
 – SAS card model
If the nodes include NVMe drives, all have the same number of drives and capacity
If the nodes include SAS SSD drives, all have the same number of drives and capacity
If the nodes include HDD drives, all have the same number of drives and capacity
At least one fast device (NVMe or SSD) is available per node
At least 12 drives of matching size, evenly distributed across all nodes, with at least six drives of each drive type
No more than 512 drives total are used (a failure raises a warning only)
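Conceptually, the homogeneity checks are simple set comparisons across the per-node data. The following Python sketch illustrates the idea with hypothetical node summaries; the real tool works from the JSON files that SpectrumScale_ECE_OS_READINESS generates:

# Sketch of the homogeneity idea: every attribute that must match across
# the recovery group should have exactly one distinct value. These node
# summaries are hypothetical stand-ins for the readiness JSON files.
nodes = {
    "node1": {"cpu_arch": "x86_64", "sockets": 2, "cores_per_socket": 8,
              "memory_gib": 256, "link_speed_gbps": 25, "nvme_drives": 2},
    "node2": {"cpu_arch": "x86_64", "sockets": 2, "cores_per_socket": 8,
              "memory_gib": 256, "link_speed_gbps": 25, "nvme_drives": 2},
}

MUST_MATCH = ["cpu_arch", "sockets", "cores_per_socket",
              "memory_gib", "link_speed_gbps", "nvme_drives"]

for attr in MUST_MATCH:
    values = {summary[attr] for summary in nodes.values()}
    status = "OK" if len(values) == 1 else "MISMATCH"
    print(f"{attr}: {status} ({values})")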
For more information about how to run this tool and to download the tool, see this GitHub web page.
4.2.3 SpectrumScale_NETWORK_READINESS tool
Another standalone open source tool that was introduced with ECE helps ensure that your planned network meets the ECE network key performance indicators (KPIs). For more information about the required network KPIs that you must pass before installing a supported ECE configuration, see IBM Knowledge Center.
For more information about and to download this tool, see this GitHub web page.
This tool checks across all nodes that are to be part of the ECE cluster for certain metrics that are defined as KPIs and others that are not required, but can be beneficial to know. The tool uses IBM Spectrum Scale’s nsdperf tool to measure bandwidth between nodes, fping to measure ICMP latency, and other network metrics.
 
Note: Passwordless SSH must be configured on all nodes for root user before this tool is run.
The current version of the tool includes the following checks of your ECE network:
Average ICMP latency from each node to the rest of the nodes is less than 1.0 msec on a run test of at least 500 seconds (part of the KPI)
Maximum ICMP latency from each node to the rest of the nodes is less than 2.0 msec on a run test of at least 500 seconds (part of the KPI)
Standard deviation ICMP latency from each node to the rest of the nodes is less than 0.33 msec on a run test of at least 500 seconds (part of the KPI)
Minimum ICMP latency from each node to the rest of the nodes is less than 1.0 msec on a run test of at least 500 seconds (not part of the KPI)
Single node to the rest of the nodes bandwidth is more than 2000 MBps on a run test of at least 20 minutes (part of the KPI)
Half of the nodes to the other half of the nodes bandwidth is more than 2000 MBps on a run test of at least 20 minutes (part of the KPI)
The difference in bandwidth between the best performing node and the worst performing node is less than 20% (part of the KPI)
Multiple packet statistics per node (not part of the KPI)
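To illustrate how the latency KPIs are evaluated, the following Python sketch applies the thresholds that are listed above to a set of ICMP round-trip samples. The sample values are invented; in practice, the tool gathers them with fping over a run of at least 500 seconds:

import statistics

# Illustrative round-trip samples (milliseconds) from one node to its
# peers. The thresholds below are the KPI values listed in this section.
samples_ms = [0.41, 0.39, 0.44, 0.40, 0.52, 0.38, 0.43]

avg = statistics.mean(samples_ms)
worst = max(samples_ms)
stdev = statistics.stdev(samples_ms)

print(f"average {avg:.2f} ms -> {'PASS' if avg < 1.0 else 'FAIL'}")
print(f"maximum {worst:.2f} ms -> {'PASS' if worst < 2.0 else 'FAIL'}")
print(f"std dev {stdev:.3f} ms -> {'PASS' if stdev < 0.33 else 'FAIL'}")
# Minimum latency is also reported, but it is not part of the KPI.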
For an ECE installation to be supported, it must be verified that the KPIs were fulfilled before the ECE software was installed.
4.3 Erasure code selection
IBM Spectrum Scale ECE supports four different erasure codes: 4+2P, 4+3P, 8+2P, and 8+3P, in addition to 3-way and 4-way replication. Choosing an erasure code involves considering several factors, which are described next.
Data protection and storage utilization
Minimizing the risk of data loss because of multiple failures and minimizing disk rebuilds can be done by using 4+3P or 8+3P encoding at the expense of extra storage overhead. Table 4-1 lists the percentage of total capacity that is available after RAID protection for the various protection types.
Table 4-1 Total capacity (percent) available after erasure code protection for various protection types
Protection type        Usable capacity
4-way replication      ~25%
3-way replication      ~33%
4+3P                   ~57%
4+2P                   ~67%
8+3P                   ~73%
8+2P                   ~80%
 
Note: These storage efficiency numbers are calculated after allocating spare space and space for storing ECE configuration data and ECE transaction logs (log home VDisks).
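The percentages in Table 4-1 follow directly from the ratio of data strips to total strips, as the following Python sketch shows (the spare and log space that is noted above further reduces the real figures):

# Usable capacity is data strips divided by total strips. Replication is
# expressed here as 1 data copy plus N-1 redundant copies.
codes = {"4-way replication": (1, 3), "3-way replication": (1, 2),
         "4+3P": (4, 3), "4+2P": (4, 2), "8+3P": (8, 3), "8+2P": (8, 2)}

for name, (data, parity) in codes.items():
    print(f"{name}: ~{100 * data / (data + parity):.0f}% usable")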
Erasure code and file system block size
Restrictions are imposed on what file system block sizes can be used with each erasure code, depending on the device media type. The allowed file system block sizes for each erasure code are listed in Table 4-2.
Table 4-2 Allowed file system block sizes for each Erasure Code
Media type           4+2P             4+3P             8+2P                  8+3P
HDD                  1M, 2M, 4M, 8M   1M, 2M, 4M, 8M   1M, 2M, 4M, 8M, 16M   1M, 2M, 4M, 8M, 16M
SSD (NVMe or SAS)    1M, 2M           1M, 2M           1M, 2M, 4M            1M, 2M, 4M
RAID Rebuild
IBM Spectrum Scale RAID runs intelligent rebuilds that are based on the number of failures on the set of strips that make up any data block in a VDisk. For example, with 8+2P protection, if one failure occurs, IBM Spectrum Scale RAID rebuilds the missing data strip. Because data is still protected, this rebuild process occurs in the background and has little effect on file system performance.
If a second failure occurs, IBM Spectrum Scale RAID recognizes that another failure results in data loss. It then begins a critical rebuild to restore data protection. This critical rebuild phase results in performance degradation until at least one level of protection can be restored. Critical rebuild is efficient and fast with less data movement required compared to traditional RAID systems, which minimizes any impacts on system performance.
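The following Python sketch is a simplified model of this behavior: while fewer strips have failed than the number of parity strips, the stripe is rebuilt in the background; when the failure count reaches the parity count, the rebuild becomes critical:

# Simplified model of the rebuild logic described above.
def rebuild_priority(failed_strips: int, parity_strips: int) -> str:
    if failed_strips == 0:
        return "healthy"
    if failed_strips > parity_strips:
        return "data loss"
    if failed_strips == parity_strips:
        return "critical rebuild"        # last level of protection gone
    return "background rebuild"          # data is still protected

for failures in range(4):
    print(f"8+2P with {failures} failure(s): {rebuild_priority(failures, 2)}")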
Nodes in a recovery group
The number of nodes in a recovery group can also affect erasure code selection. A recovery group can contain 4 - 32 nodes, and the level of fault tolerance that can be supported is affected by the number of nodes in the recovery group.
The level of fault tolerance that is provided for each erasure code and the number of recovery group nodes is listed in Table 4-3.
Table 4-3 Recommended Recovery Group Size for each Erasure Code
Number of nodes   4+2P                       4+3P                8+2P                          8+3P
4                 Not recommended (1 node)   1 node + 1 device   Not recommended (2 devices)   Not recommended (1 node)
5                 Not recommended (1 node)   1 node + 1 device   Not recommended (1 node)      Not recommended (1 node)
6 - 8             2 nodes                    2 nodes [1]         Not recommended (1 node)      1 node + 1 device
9                 2 nodes                    3 nodes             Not recommended (1 node)      1 node + 1 device
10                2 nodes                    3 nodes             2 nodes                       2 nodes
11+               2 nodes                    3 nodes             2 nodes                       3 nodes

For configurations that are marked Not recommended, the value in parentheses is the fault tolerance that the configuration would provide.
 
Note: For 7 or 8 nodes, 4+3P is limited to two nodes by recovery group descriptors rather than by the erasure code.
If we consider a 4-node recovery group with 4+2P protection for a specific data block, each node contains one strip of data.
In addition, for each stripe (the data strips plus parity strips for the data block), two nodes contain one strip of parity data. A failure of a node that contains parity and data results in a double failure for that stripe of data, which causes that stripe to become critical and results in performance degradation during the critical rebuild phase. However, in a 6-node recovery group with the same 4+2P protection, a single node failure results in only one failure to the RAID array.
Although the number of failures that can be tolerated in a smaller recovery group is the same as the number of failures in a larger recovery group, the amount of data that is critical and must be rebuilt for each failure is less for a larger recovery group. For example, with an 8+3P array on an 11-node recovery group, three node failures affect all of the data in the file system.
On a 30-node recovery group, three node failures affect only approximately 10 percent of the data on the file system. Also, the critical rebuild completes more quickly because the rebuild work is distributed across a larger number of remaining nodes.
When planning the erasure code type, also consider future expansion of the cluster and storage utilization. The erasure code of a VDisk cannot be changed after the VDisk is created, and larger stripe widths feature better storage utilization.
A 4+3P code uses 57 percent of total capacity for usable data; an 8+3P code uses 73 percent. Therefore, rather than creating a 9-node cluster with 4+3P and expanding it in the future, an 11-node cluster that uses 8+3P might be more cost-effective. In some cases, the use of a non-recommended erasure code might be tolerable if the cluster size is planned to be increased soon and the risks are clearly understood.
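A quick calculation shows why the larger stripe width can be more cost-effective. The following Python sketch compares the two options under an assumed building block of 10 drives of 10 TiB per node; spare space and log space are ignored here:

# Usable capacity before spare and log space, assuming identical drives.
def usable_tib(nodes, drives_per_node, drive_tib, data, parity):
    raw = nodes * drives_per_node * drive_tib
    return raw * data / (data + parity)

# Hypothetical building block: 10 x 10 TiB drives per node.
print(f"9 nodes, 4+3P : {usable_tib(9, 10, 10, 4, 3):.0f} TiB usable")
print(f"11 nodes, 8+3P: {usable_tib(11, 10, 10, 8, 3):.0f} TiB usable")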
4.4 Spare space allocation
When a recovery group and declustered arrays are created by using the mmvdisk command, a default amount of spare space is allocated in each array that is based on the number of drives in the array. Spare space is listed in terms of disk size, but no dedicated spare drives are in an array. The space is distributed across all of the disks in the declustered array.
When a drive in an array fails, the spare space is used to rebuild the data on the failed drive. After spare space is exhausted, no room exists to rebuild and an array cannot return to a fully fault tolerant state if more drive failures occur. Replacing a failed drive returns the spare space back into the array.
In certain cases, it might be useful to increase the spare space in a declustered array. For example, in some data centers, it might not be possible to replace failed hardware in a timely manner. Increasing spare space in these cases reduces the usable space on the system, but it can improve the manageability and increase availability of the system.
The spare space is the minimal disk space that is guaranteed for failed drive rebuild. However, if more free disk space is available that was not yet used by user VDisks, this space can also be used in rebuild because data reliability is the first priority at this time.
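The tradeoff is straightforward to quantify. The following Python sketch shows how reserving more spare space in a hypothetical 48-drive declustered array reduces the raw space that remains for VDisks; the actual defaults are chosen by mmvdisk based on the number of drives in the array:

# Spare space is expressed in "disk worths" of capacity, distributed
# across all drives in the declustered array. Numbers are illustrative.
drives, drive_tib = 48, 10
for spare_disks in (2, 4, 6):
    usable_raw = (drives - spare_disks) * drive_tib
    print(f"{spare_disks} spares -> {usable_raw} TiB raw left for VDisks "
          f"(before erasure code overhead)")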
4.5 Network planning
The network is the backbone of any IBM Spectrum Scale deployment, which is especially true for ECE. Because ECE spreads data across multiple nodes, a significant amount of node-to-node traffic occurs between all of the nodes in a recovery group.
Consider the case of writing 8 MiB of data from a client to a VDisk in a recovery group with 11 storage nodes where 1 strip of data or parity is written to each node. In this case, the client writes 8 MiB of data to the node that serves the VDisk.
The node calculates parity on the 8 MiB of data; with 8+3P protection, 3 MiB of parity is added, so 11 MiB of data must be written to disk. One 1 MiB strip is written to the local node, and the remaining 10 MiB is distributed as strips to the 10 other nodes in the recovery group and is transferred over the network.
As a result, 8 MiB of data can generate 18 MiB of network traffic with 8+3P parity (8 MiB from the client to the VDisk server and 10 MiB to other nodes in the recovery group). In addition, before we can complete the write, we must complete two network transfers. Therefore, latency and bandwidth are key to the performance of the file system.
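The following Python sketch reproduces this arithmetic; the figures match the example above:

# Network bytes generated by a client write to a VDisk server in an
# 11-node recovery group with 8+3P protection.
data_mib = 8                        # client write
stripe_nodes = 11                   # 8 data + 3 parity strips, one per node
strip_mib = data_mib / 8            # 1 MiB per strip

client_to_server = data_mib                        # 8 MiB over the network
server_to_peers = (stripe_nodes - 1) * strip_mib   # 10 strips leave the node
print(f"total network traffic: {client_to_server + server_to_peers} MiB")
# -> 18.0 MiB, in two dependent transfers, so latency matters twice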
Overall network bandwidth must accommodate both the client-to-ECE traffic and the node-to-node storage traffic that ECE requires. Latency must be as low as possible, especially for NVMe solutions, where network latency can interfere with storage performance.
If a node runs any services or applications other than ECE that have large network requirements (such as Cluster Export Services [CES]), it is highly recommended to use a separate network for these other services. Ensuring that application traffic does not interfere with the IBM Spectrum Scale and ECE communications gives the best performance and reliability.
The network should be designed to be tolerant of failures and outages. Various techniques can be used, including LACP bonding for Ethernet, and multi-interface for InfiniBand. A good starting point is a pair of network switches and dual network ports in each ECE server to provide redundancy.
4.6 IBM Spectrum Scale node roles
When configuring an IBM Spectrum Scale cluster, manager and quorum nodes must be defined. When choosing these nodes in a cluster with ECE, some other considerations must be addressed.
Quorum nodes
IBM Spectrum Scale uses a cluster mechanism called quorum to maintain data consistency if a node failure occurs.
Quorum operates on a simple majority rule, meaning that a majority of quorum nodes in the cluster must be accessible before any node in the cluster can access a file system. This quorum logic keeps any nodes that are cut off from the cluster (for example, by a network failure) from writing data to the file system.
When nodes fail, quorum must be maintained for the cluster to remain online. If quorum is not maintained, IBM Spectrum Scale file systems unmount across the cluster until quorum is reestablished, at which point file system recovery occurs. For this reason, it is important that the set of quorum nodes be carefully considered.
IBM Spectrum Scale can use one of two methods for determining quorum:
Node quorum
Node quorum is the default quorum algorithm for IBM Spectrum Scale. Quorum is defined as one plus half of the defined quorum nodes in the IBM Spectrum Scale cluster. No default quorum nodes exist; you must specify which nodes have this role.
Node quorum with tiebreaker disks
Tiebreaker disks can be used in shared-storage configurations to preserve quorum.
Because clusters that run ECE do not typically use shared storage, we normally configure several quorum nodes. It is best to configure an odd number of quorum nodes, with 3, 5, or 7 nodes being the typical numbers used.
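The following Python sketch makes the majority rule concrete and shows why odd quorum counts are preferred; an even count raises the majority threshold without increasing the number of tolerated failures:

# Majority rule: one plus half of the defined quorum nodes must be
# reachable, so the cluster survives the loss of floor((q - 1) / 2).
for q in (3, 4, 5, 7):
    majority = q // 2 + 1
    print(f"{q} quorum nodes: majority is {majority}, "
          f"tolerates {q - majority} quorum-node failure(s)")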
If a cluster spans multiple failure domains, such as racks, power domains, or network domains, it is best to allocate quorum nodes from each failure domain to maintain availability. The number of quorum nodes (along with the erasure code selection) determines the maximum number of nodes that can simultaneously fail in the cluster.
It is best to allocate quorum nodes as nodes that do not require frequent restarts or downtime. If possible, avoid nodes that run intensive compute or network loads because these loads can delay quorum messages. This issue becomes more important as clusters grow larger and the number of quorum messages increases.
Finally, quorum nodes are used to maintain critical configuration data that is stored on the operating system disk in the /var file system. To preserve access to this data, it is best to ensure that any workloads on the quorum node do not overly stress the disk that stores the /var file system. The /var file system must be on persistent local storage for each quorum node.
Manager nodes
When defining an IBM Spectrum Scale cluster, we define one or more manager nodes. Manager nodes are used for various internal tasks. For each file system, one manager node is designated as the file system manager. This node is responsible for certain tasks, such as file system configuration changes, quota management, and free space management.
In addition, manager nodes are responsible for token management throughout the cluster. Because of the extra load on manager nodes, it is recommended to not run tasks on a manager node that are time sensitive, require real-time response, or that might excessively use the system CPU or cluster network. Any tasks that might slow the IBM Spectrum Scale file system daemon affect the overall response of the file system throughout the cluster.
For large clusters of 100 or more nodes, or clusters where the maxFilesToCache parameter is modified from the default, it is necessary to consider the memory use on manager nodes for token management. Tokens are used to maintain locks and consistency when files are opened in the cluster. The number of tokens in use depends on the number of files that each node can open or cache and the number of nodes in the cluster. For large clusters (generally, 512 nodes or more), it might be beneficial to have dedicated nodes that are responsible for the manager role.
To determine the overall token memory that is used in a system, an approximation is to examine the maxFilesToCache (default 4000) and maxStatCache (default 1000) values for all nodes. Each token uses approximately 512 bytes of memory on a token manager node. For example, a 20-node cluster that uses the default values uses (4000 + 1000) tokens * 20 nodes * 512 bytes/token = approximately 49 MB of memory. This memory is distributed across all manager nodes because all manager nodes share the role of token management. If four manager nodes are used in our example, each manager node is responsible for just over 12 MB of tokens. For fault tolerance, it is best to leave room for a manager node to fail, so we can assume just over 16 MB of memory required.
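The same estimate, expressed as a short Python calculation:

# Token-memory estimate from the text: tokens scale with
# (maxFilesToCache + maxStatCache) per node, at roughly 512 bytes each
# on the manager nodes that share the token management role.
max_files_to_cache = 4000
max_stat_cache = 1000
nodes = 20
managers = 4

total_bytes = (max_files_to_cache + max_stat_cache) * nodes * 512
print(f"total token memory   : {total_bytes / 2**20:.0f} MB")    # ~49 MB
print(f"per manager (of 4)   : {total_bytes / managers / 2**20:.1f} MB")
print(f"with one manager down: {total_bytes / (managers - 1) / 2**20:.1f} MB")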
On small or midsize clusters, the default token memory settings should be adequate. However, in some cases it can be beneficial to increase maxFilesToCache on nodes to hundreds of thousands or even millions of files. In these cases, it is important to calculate the extra memory requirement and to ensure that the nodes have enough memory beyond the ECE requirements to perform token management tasks.
For optimal and balanced performance, we recommend that you have a uniform workload on each ECE storage node to the degree possible. For this reason, we recommend that all nodes in the recovery group be manager nodes, or none of the nodes be manager nodes.
In storage clusters that are composed of only ECE storage nodes, all nodes are manager nodes. In a large cluster, or a cluster with more than one ECE recovery group, the manager nodes can be on the nodes in one recovery group or on separate nodes altogether.
4.7 Cluster Export Services
In Chapter 2, “IBM Spectrum Scale Erasure Code Edition use cases” on page 13, we describe high-speed file serving by combining ECE with CES. As discussed in that section, CES nodes can be deployed independently of ECE nodes or on the same nodes as ECE.
If CES and ECE are deployed on the same nodes, it is important to size the nodes appropriately. It is recommended to use separate networks for CES and ECE, with protocol traffic on one network and IBM Spectrum Scale and ECE traffic on a separate network. The IBM Spectrum Scale network must always have at least as much available bandwidth as the protocol network.
Memory and CPU requirements on these nodes also increase. The IBM Spectrum Scale FAQ contains the requirements for a CES node. At the time of this writing, it is recommended for a CES node to have at least 64 GB of RAM if serving a single protocol and 128 GB of RAM if serving multiple protocols. This memory is in addition to the memory that is required by ECE. For example, if the ECE planning guide recommends 128 GB of RAM, a node must have 256 GB of RAM to support ECE services in addition to running multiple CES protocols.
CES services also require more CPU resources on the nodes. Plan for additional CPU cores to run the CES services.
It is recommended to have uniform workload on each ECE storage node. If you choose to run CES protocols on ECE storage nodes, all nodes in the recovery group must be configured to be CES nodes. If you have multiple recovery groups, you might configure CES on only one set of recovery group nodes.
 
Note: SMB protocol access is limited to 16 CES nodes. If enabling CES with SMB on ECE storage nodes, this configuration limits the recovery group size to 16.
4.8 System management and monitoring
IBM Spectrum Scale provides the following components to assist with monitoring and managing a cluster:
GUI and RESTful API for management and monitoring
Performance monitoring tools
IBM Call Home for problem data collection and transmittal
It is highly recommended to configure all three of these components to assist in the management and monitoring of a cluster. It is recommended to run these services on a node that is not a part of an ECE building block.
For clusters that contain 100 nodes or less, all three of these components typically can run on a single node that is configured to be a part of the cluster. The Installation Toolkit can be used to configure and deploy this node. For more information about sizing this node, see IBM Spectrum Scale Knowledge Center. For larger clusters, helper nodes can be deployed to assist with collecting and distributing messages.
4.9 Other IBM Spectrum Scale components
IBM Spectrum Scale has various other components that provide configuration, monitoring, auditing, and access to the system. These components can be deployed on a system that is running IBM Spectrum Scale ECE. However, in most cases, they must be deployed on nodes that are not a part of an ECE building block. These limitations are in place to ensure that system performance is not affected by another component.
The IBM Spectrum Scale Knowledge Center contains the latest limitations on component interaction.
4.10 Running applications
It is important to consider where applications run in the cluster. Running applications on dedicated client nodes that are attached to ECE storage nodes allows compute capacity or storage capacity to be added independently of each other. Also, running on separate nodes provides the most consistent storage and application performance. The tradeoff is that more hardware is required.
 
Note: Running applications on nodes that are providing ECE storage must be done with care. Because an ECE node can provide storage services to an entire cluster, an application that is competing with ECE for resources on a node can affect operations for an entire cluster.
An ECE server must be sized appropriately when running any applications. CPU and memory must be added, and where possible, shared resources (such as network adapters) must be kept separate, with one adapter dedicated to application use and another dedicated to IBM Spectrum Scale and ECE traffic.
When sizing CPU and memory, ensure that enough exists for IBM Spectrum Scale and the application. For example, if ECE requires 256 GB of RAM based on workload sizing and an application requires 256 GB of RAM, ensure that the node has at minimum 512 GB of RAM installed. CPU cores also must be increased as needed.
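The sizing rule is purely additive, as the following Python sketch shows; the component values are the ones from the example above:

# Memory for colocated services is stacked on top of the ECE requirement,
# never shared between components.
requirements_gb = {
    "ECE (from workload sizing)": 256,
    "application": 256,
}
for component, gb in requirements_gb.items():
    print(f"{component:>28}: {gb} GB")
print(f"{'minimum node memory':>28}: {sum(requirements_gb.values())} GB")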
Applications must run in an environment that limits resource contention with IBM Spectrum Scale services. Linux cgroups or containers, such as Docker, provide a way to limit the CPU and memory use of applications that run on an ECE storage node.
For more information about running an application on an ECE building block, contact IBM Support.
4.11 File and Object Solution Design Studio tool
If you are unsure as to whether ECE storage is the best solution for your use case, tools are available to help you.
The File and Object Solution Design Studio (also known as FOSDE) assists you in choosing from various IBM storage solutions, such as Spectrum Scale, ESS, Cloud Object Storage (COS), and Spectrum Discover. It also helps to design the solution, including sizing and deployment suggestions that are based on your input.
After the solution is ready, the tool helps you to estimate the performance and provides tips and helpful materials.
For more information, see this web page (log in required).
 