Planning for Hardware Needs

Sound hardware strategy helps increase system availability while reducing total cost of ownership and improving recovery times. Windows Server 2003 is designed and tested for use with high-performance hardware, applications, and services. To ensure that hardware components are compatible, choose only components that are listed in the Windows Server Catalog (http://www.microsoft.com/windows/catalog/server/) or that are on the Hardware Compatibility List (HCL) (http://www.microsoft.com/hcl/).

Note

All components on the HCL undergo rigorous testing in the Microsoft Windows Hardware Quality Labs (WHQL). The initial testing takes 14 days, with a 7-day retest for firmware revisions, service pack updates, and other minor revisions. Once a configuration is certified, the hardware vendor must maintain it through updates and resubmit it for testing and certification. The program requirements and the tight coordination with vendors greatly improve the reliability and availability of Windows Server 2003.

You should standardize on a hardware platform with standardized components. Standardization accomplishes the following:

  • Reduces the amount of training needed for support

  • Reduces the amount of testing needed for upgrades

  • Requires fewer spare parts because subcomponents are the same

  • Improves recovery time because problems are easier to troubleshoot

Standardization isn't meant to restrict the datacenter to a single type of server. In an n-tier environment, standardization typically means choosing a standard server configuration for the front-end servers, a standard server configuration for middle-tier business logic, and a standard server configuration for back-end data services. The reason for this is that Web servers, application servers, and database servers all have different resource needs. For example, although a Web server might need to run on a dual-processor system with limited hardware RAID control and 1 gigabyte (GB) of random access memory (RAM), a database server might need to run on an eight-way system with dual-channel RAID control and 64 GB of RAM.
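One way to make these tiered standards concrete is to record them in a short, machine-readable form that support staff and procurement can share. The following is a minimal sketch; the tier names, specification values, and the meets_standard helper are illustrative assumptions, not published standards.

# Hypothetical hardware standards for an n-tier environment.
# Values are illustrative examples only, not published specifications.
HARDWARE_STANDARDS = {
    "front_end_web": {"cpu_sockets": 2, "ram_gb": 1, "raid": "limited hardware RAID"},
    "middle_tier_app": {"cpu_sockets": 4, "ram_gb": 8, "raid": "hardware RAID"},
    "back_end_db": {"cpu_sockets": 8, "ram_gb": 64, "raid": "dual-channel RAID"},
}

def meets_standard(tier: str, cpu_sockets: int, ram_gb: int) -> bool:
    """Check whether a proposed server meets or exceeds its tier standard."""
    std = HARDWARE_STANDARDS[tier]
    return cpu_sockets >= std["cpu_sockets"] and ram_gb >= std["ram_gb"]

print(meets_standard("back_end_db", cpu_sockets=8, ram_gb=64))  # True

Keeping the standards in one reviewable place makes it easier to update them when new equipment is introduced and old equipment becomes unavailable.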

Standardization isn't meant to limit the organization to a single hardware specification either. Over the life of the datacenter, new equipment will be introduced and old equipment likely will become unavailable. To keep up with the pace of change, new standards and specifications should be implemented when necessary. These standards and specifications, like the previous ones, should be published and made available to support personnel.

Redundancy and fault tolerance must be built into the hardware design at all levels to improve availability. You can improve hardware redundancy by using the following components:

  • Clusters Clusters provide failover support for critical applications and services.

  • Standby systems Standby systems provide backup systems in case of total failure of a primary system.

  • Spare parts Spare parts ensure replacement parts are available in case of failure.

  • Fault-tolerant components Fault-tolerant components improve the internal redundancy of the system.

Storage devices, network components, cooling fans, and power supplies all can be configured for fault tolerance. For storage devices, you should be sure to use multiple disk controllers, hot-swappable drives, and redundant drive arrays. For network components, you should look well beyond the network adapter and also consider whether fault tolerance is needed for routers, switches, firewalls, load balancers, and other network equipment.

A standard process for deploying hardware must be defined and distributed to all support personnel. The standard process must include the following:

  • Hardware compatibility and integration testing

  • Hardware support training for personnel

  • Predeployment planning

  • Step-by-step hardware deployment checklists

  • Postdeployment monitoring and maintenance

The following checklist summarizes the recommendations for designing and planning hardware for high availability:

  • Choose hardware that is listed on the HCL.

  • Create and enforce hardware standards.

  • Use redundant hardware whenever possible.

  • Use fault-tolerant hardware whenever possible.

  • Provide a secure physical environment for hardware.

  • Define a standard process for deploying hardware.

If possible, add these recommendations to the preceding checklist:

  • Use fully redundant internal networks from servers to border routers.

  • Use direct peering to major tier-1 telecommunications carriers.

  • Use redundant external connections for data and telephony.

  • Use direct connection with high-speed lines.

Planning for Support Structures and Facilities

The physical structures and facilities supporting your server room are critically important. Without adequate support structures and facilities, even well-designed systems are prone to failure. The primary considerations for support structures and facilities have to do with the physical environment of the servers. These considerations also extend to the physical security of the server environment.

Just as hardware and software have availability requirements, so do support structures and facilities. Factors that affect the physical environment are as follows:

  • Temperature and humidity

  • Dust and other contaminants

  • Physical wiring

  • Power supplies

  • Natural disasters

  • Physical security

Temperature and humidity should be carefully controlled at all times. Processors, memory, hard drives, and other physical components operate most efficiently when they are kept cool; between 65 and 70 degrees Fahrenheit is the ideal temperature in most situations. Equipment that overheats can malfunction or cease to operate altogether. Servers should have multiple redundant internal fans to ensure these and other internal hardware devices are kept cool.

Tip

You should pay particular attention to fast-running processors and hard drives. These components can overheat and require additional cooling fans, even when the surrounding environment is cool.

Humidity should be kept low to prevent condensation, but the environment shouldn't be dry. A dry climate can contribute to static electricity problems. Antistatic devices and static guards should be used in most environments.

Dust and other contaminants can cause hardware components to overheat or short out. Servers should be protected from these contaminants whenever possible. You should ensure an air filtration system is in place in the server room or hosting facility that is used. The regular preventive maintenance cycle on the servers should include checking servers and their cabinets for dust and other contaminants. If dust is found, the servers and cabinets should be carefully cleaned.

Few things affect the physical environment more than wiring and cabling. All electrical wires and network cables should be tested and certified by qualified technicians. Electrical wiring should be configured to ensure that servers and other equipment have adequate power available for peak usage times. Ideally, multiple dedicated circuits are used to provide power.

Improperly installed network cables are a common cause of communications problems. Network cables should be tested to ensure their operation meets manufacturer specifications. Redundant cables should be installed to ensure availability of the network. All wiring and cabling should be labeled and well maintained. Whenever possible, use cable management systems and tie wraps to prevent physical damage to wiring.

Ensuring servers and their components have power is also important. Servers should have hot-swappable, redundant power supplies. Being hot swappable ensures that a power supply can be replaced without turning off the server. Redundancy ensures that if one power supply fails, the other continues to deliver power to the server. Be aware that having multiple power supplies doesn't by itself mean that a server or hardware component has redundancy. Some hardware components require multiple power supplies just to operate; in that case, an additional (third or fourth) power supply is needed to provide redundancy.

The redundant power supplies should be plugged into separate power strips, and these power strips should be plugged into separate local uninterruptible power supply (UPS) units if other backup power sources aren't available. Some facilities have enterprise UPS units that provide power for an entire room or facility. If this is the case, redundant UPS systems should be installed. To protect against long-term outages, gas- or diesel-powered generators should be installed. Most hosting and colocation facilities have generators. But having a generator isn't enough; the generator must be rated to support the peak power needs of all installed equipment. If the generator cannot support the installed equipment, brownouts or temporary outages will occur.

Tip

Protect equipment against earthquakes

To protect against earthquakes, server racks should have seismic protection. Seismic protection should be extended to other components and to wiring. All cables should be securely attached at both ends and, whenever possible, should be latched to something other than the server, such as a server rack.

Caution

A fire-suppression system should be installed to protect against fire. Dual gas-based systems are preferred, because the systems do not harm hardware when they go off. Water-based sprinkler systems, on the other hand, can destroy hardware.

In addition, access controls should be used to restrict physical access to the server room or facility. Use locks, key cards, access codes, or biometric scanners to ensure only designated individuals can gain entry to the secure area. If possible, use surveillance cameras and maintain recorded tapes for at least a week. When the servers are deployed in a hosting or colocation facility, ensure locked cages are used and that fencing extends from the floor to the ceiling.

The following checklist summarizes the recommendations for designing and planning structures and facilities:

  • Maintain temperature at 65 to 70 degrees Fahrenheit.

  • Maintain low humidity (but not dry).

  • Install redundant internal cooling fans.

  • Use an air filtration system.

  • Check for dust and other contaminants periodically.

  • Install hot-swappable, redundant power supplies.

  • Test and certify wiring and cabling.

  • Use wire management to protect cables from damage.

  • Label hardware and cables.

  • Install backup power sources, such as UPS and generators.

  • Install seismic protection and bracing.

  • Install dual gas-based fire-suppression systems.

  • Restrict physical access by using locks, key cards, access codes, and so forth.

  • Use surveillance cameras and maintain recorded tapes (if possible).

  • Use locked cages, cabinets, and racks at offsite facilities.

  • Use floor-to-ceiling fencing with cages at offsite facilities.

Planning for Day-to-Day Operations

Day-to-day operations and support procedures must be in place before deploying mission-critical systems. The most critical procedures for day-to-day operations involve the following activities:

  • Monitoring and analysis

  • Resources, training, and documentation

  • Change control

  • Problem escalation procedures

  • Backup and recovery procedures

  • Postmortem after recovery

  • Auditing and intrusion detection

Monitoring is critical to the success of business system deployments. You must have the necessary equipment to monitor the status of the business system. Monitoring allows you to be proactive in system support rather than reactive. Monitoring should extend to the hardware, software, and network components but shouldn't interfere with normal systems operations—that is, the monitoring tools chosen should require limited system and network resources to operate.

Note

Keep in mind that too much data is just as bad as not collecting any data. The monitoring tools should gather only the data required for meaningful analysis.

Without careful analysis, the data collected from monitoring is useless. Procedures should be put in place to ensure personnel know how to analyze the data they collect. The network infrastructure is a support area that is often overlooked. Be sure you allocate the appropriate resources for network monitoring.
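As a rough illustration of lightweight monitoring that gathers only the data needed for analysis, the sketch below polls a small set of hosts and records nothing more than a timestamp, the target, availability, and response time. The host names, ports, and timeout are hypothetical; a real deployment would use the organization's chosen monitoring tools rather than an ad hoc script.

import socket
import time
from datetime import datetime

# Hypothetical hosts and ports to monitor; substitute real systems.
TARGETS = [("web01.example.com", 80), ("db01.example.com", 1433)]
TIMEOUT_SECONDS = 3

def check(host: str, port: int) -> dict:
    """Record only what analysis needs: timestamp, target, status, latency."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=TIMEOUT_SECONDS):
            status = "up"
    except OSError:
        status = "down"
    latency_ms = round((time.monotonic() - start) * 1000, 1)
    return {"time": datetime.utcnow().isoformat(), "target": f"{host}:{port}",
            "status": status, "latency_ms": latency_ms}

if __name__ == "__main__":
    for host, port in TARGETS:
        print(check(host, port))

Because each check produces only one small record, this style of polling consumes limited system and network resources and yields data that is easy to analyze.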

Resources, training, and documentation are essential to ensuring you can manage and maintain mission-critical systems. Many organizations cripple the operations team by staffing it minimally. Minimally staffed teams have slower response times and limited effectiveness. The organization must take the following steps:

  • Staff for success rather than for minimal coverage.

  • Conduct training before deploying new technologies.

  • Keep the training up-to-date with what's deployed.

  • Document essential operations procedures.

Every change to hardware, software, and the network must be planned and executed deliberately. To do this, you must have established change control procedures and well-documented execution plans. Change control procedures should be designed to ensure that everyone knows what changes have been made. Execution plans should be designed to ensure that everyone knows the exact steps that were or should be performed to make a change.

Change logs are a key part of change control. Each piece of physical hardware deployed in the operational environment should have a change log. The change log should be stored in a text document or spreadsheet that is readily accessible to support personnel. The change log should show the following information (a minimal sketch of such a log follows the list):

  • Who changed the hardware

  • What change was made

  • When the change was made

  • Why the change was made
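The following is a minimal sketch of a change log kept as a spreadsheet-friendly CSV file. The file name, field names, and example entry are illustrative assumptions; any format that captures who, what, when, and why will serve.

import csv
from datetime import datetime
from pathlib import Path

LOG_FILE = Path("change_log.csv")  # hypothetical location; one log per server
FIELDS = ["who", "what", "when", "why"]

def record_change(who: str, what: str, why: str) -> None:
    """Append one change entry, creating the file with a header if needed."""
    new_file = not LOG_FILE.exists()
    with LOG_FILE.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow({"who": who, "what": what,
                         "when": datetime.now().isoformat(timespec="minutes"),
                         "why": why})

record_change("jsmith", "Replaced failed power supply in SRV01",
              "Redundant supply reported failure during nightly check")

Because the log is a plain CSV file, support personnel can open it directly in a text editor or spreadsheet during troubleshooting.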

Tip

Establish and follow change control procedures

Change control procedures must take into account the need for both planned changes and emergency changes. All team members involved in a planned change should meet regularly and follow a specific implementation schedule. No one should make changes that aren't discussed with the entire implementation team.

You should have well-defined backup and recovery plans. The backup plan should specifically state the following information:

  • When full, incremental, differential, and log backups are used

  • How often and at what time backups are performed

  • Whether the backups must be conducted online or offline

  • The amount of data being backed up as well as how critical the data is

  • The tools used to perform the backups

  • The maximum time allowed for backup and restore

  • How backup media is labeled, recorded, and rotated

Backups should be monitored daily to ensure they are running correctly and that the media is good. Any problems with backups should be corrected immediately. Multiple media sets should be used for backups, and these media sets should be rotated on a specific schedule. With a four-set rotation, there is one set for daily, weekly, monthly, and quarterly backups. By rotating one media set offsite, support staff can help ensure that the organization is protected in case of a disaster.
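The four-set rotation described above can be expressed as a simple date calculation. The sketch below assumes one backup per night and picks which media set (daily, weekly, monthly, or quarterly) a given date belongs to; the exact scheduling rules shown are assumptions for illustration, not a prescribed rotation.

from datetime import date

def media_set_for(day: date) -> str:
    """Pick the media set for a given backup date (illustrative rules only)."""
    if day.month in (1, 4, 7, 10) and day.day == 1:
        return "quarterly"        # first day of each quarter
    if day.day == 1:
        return "monthly"          # first day of other months
    if day.weekday() == 4:        # Friday
        return "weekly"
    return "daily"

for d in (date(2003, 4, 1), date(2003, 5, 1), date(2003, 5, 2), date(2003, 5, 5)):
    print(d, media_set_for(d))

Documenting the rotation rules this explicitly makes it easy to verify that the correct set is mounted each night and that the offsite set is swapped on schedule.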

The recovery plan should provide detailed step-by-step procedures for recovering the system under various conditions, such as procedures for recovering from hard disk drive failure or troubleshooting problems with connectivity to the back-end database. The recovery plan should also include system design and architecture documentation that details the configuration of physical hardware, application logic components, and back-end data. Along with this information, support staff should provide a media set containing all software, drivers, and operating system files needed to recover the system.

Note

One thing administrators often forget about is spare parts. Spare parts for key components, such as processors, drives, and memory, should also be maintained as part of the recovery plan.

You should practice restoring critical business systems using the recovery plan. Practice shouldn't be conducted on the production servers. Instead, the team should practice on test equipment with a configuration similar to the real production servers. Practicing once a quarter or semiannually is highly recommended.

You should have well-defined problem escalation procedures that document how to handle problems and emergency changes that might be needed. Many organizations use a three-tiered help desk structure for handling problems:

  • Level 1 support staff form the front line for handling basic problems. They typically have hands-on access to the hardware, software, and network components they manage. Their main job is to clarify and prioritize a problem. If the problem has occurred before and there is a documented resolution procedure, they can resolve the problem without escalation. If the problem is new or not recognized, they must understand how, when, and to whom to escalate it.

  • Level 2 support staff include more specialized personnel who can diagnose a particular type of problem and work with others to resolve it, such as system administrators and network engineers. They usually have remote access to the hardware, software, and network components they manage. This allows them to troubleshoot problems remotely and to send out technicians once they've pinpointed the problem.

  • Level 3 support staff include highly technical personnel who are subject matter experts, team leaders, or team supervisors. The level 3 team can include support personnel from vendors as well as representatives from the user community. Together, they form the emergency response or crisis resolution team that is responsible for resolving crisis situations and planning emergency changes.

All crisis situations and emergencies should be responded to decisively and resolved methodically. A single person on the emergency response team should be responsible for coordinating all changes and executing the recovery plan. This same person should be responsible for writing an after-action report that details the emergency response and resolution process used. The after-action report should analyze how the emergency was resolved and what the root cause of the problem was.

In addition, you should establish procedures for auditing system usage and detecting intrusion. In Windows Server 2003, auditing policies are used to track the successful or failed execution of the following activities (a simple planning sketch follows the list):

  • Account logon events Tracks events related to user logon and logoff

  • Account management Tracks those tasks involved with handling user accounts, such as creating or deleting accounts and resetting passwords

  • Directory service access Tracks access to the Active Directory directory service

  • Object access Tracks system resource usage for files, directories, and objects

  • Policy change Tracks changes to user rights, auditing, and trust relationships

  • Privilege use Tracks the use of user rights and privileges

  • Process tracking Tracks system processes and resource usage

  • System events Tracks system startup, shutdown, restart, and actions that affect system security or the security log
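The audit categories above can be captured in a short policy plan that documents which success and failure events the organization intends to track before anyone touches the servers. The selections below are illustrative assumptions, not recommended settings, and the sketch is a planning document only; it does not call into the operating system.

# Hypothetical audit policy plan; True means the event type is tracked.
AUDIT_PLAN = {
    "Account logon events":     {"success": True,  "failure": True},
    "Account management":       {"success": True,  "failure": True},
    "Directory service access": {"success": False, "failure": True},
    "Object access":            {"success": False, "failure": True},
    "Policy change":            {"success": True,  "failure": True},
    "Privilege use":            {"success": False, "failure": True},
    "Process tracking":         {"success": False, "failure": False},
    "System events":            {"success": True,  "failure": True},
}

def summarize(plan: dict) -> None:
    """Print the plan in a form that can be reviewed and signed off."""
    for category, setting in plan.items():
        audited = [k for k, v in setting.items() if v] or ["none"]
        print(f"{category}: audit {', '.join(audited)}")

summarize(AUDIT_PLAN)

A written plan like this also gives the incident response team a baseline to compare against when they suspect that auditing settings have been tampered with.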

You should have an incident response plan that includes priority escalation of suspected intrusion to senior team members and provides step-by-step details on how to handle the intrusion. The incident response team should gather information from all network systems that might be affected. The information should include event logs, application logs, database logs, and any other pertinent files and data. The incident response team should take immediate action to lock out accounts, change passwords, and physically disconnect the system if necessary. All team members participating in the response should write a postmortem that details the following information:

  • What date and time they were notified and what immediate actions they took

  • Who they notified and what the response was from the notified individual

  • What their assessment of the issue is and the actions necessary to resolve and prevent similar incidents

The team leader should write an executive summary of the incident and forward this to senior management.

The following checklist summarizes the recommendations for operational support of high-availability systems:

  • Monitor hardware, software, and network components 24/7.

  • Ensure monitoring doesn't interfere with normal systems operations.

  • Gather only the data required for meaningful analysis.

  • Establish procedures that let personnel know what to look for in the data.

  • Use outside-in monitoring any time systems are externally accessible.

  • Provide adequate resources, training, and documentation.

  • Establish change control procedures that include change logs.

  • Establish execution plans that detail the change implementation.

  • Create a solid backup plan that includes onsite and offsite tape rotation.

  • Monitor backups and test backup media.

  • Create a recovery plan for all critical systems.

  • Test the recovery plan on a routine basis.

  • Document how to handle problems and make emergency changes.

  • Use a three-tier support structure to coordinate problem escalation.

  • Form an emergency response or crisis resolution team.

  • Write after-action reports that detail the process used.

  • Establish procedures for auditing system usage and detecting intrusion.

  • Create an intrusion response plan with priority escalation.

  • Take immediate action to handle suspected or actual intrusion.

  • Write postmortem reports detailing team reactions to the intrusion.

Planning for Deploying Highly Available Servers

You should always create a plan before deploying a business system. The plan should show everything that must be done before the system is transitioned into the production environment. Once in the production environment, the system is deemed operational and should be handled as outlined in the section entitled "Planning for Day-to-Day Operations" earlier in this chapter.

The deployment plan should include the following items:

  • Checklists

  • Contact lists

  • Test plans

  • Deployment schedules

Checklists are a key part of the deployment plan. The purpose of a checklist is to ensure that the entire deployment team understands the steps they need to perform. Checklists should list the tasks that must be performed and designate individuals to handle the tasks during each phase of the deployment—from planning to testing to installation. Prior to executing a checklist, the deployment team should meet to ensure that all items are covered and that the necessary interactions among team members are clearly understood. After deployment, the preliminary checklists should become a part of the system documentation and new checklists should be created any time the system is updated.
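One way to keep a deployment checklist consistent across phases is to track each task with its phase, owner, and completion state. The structure below is a minimal sketch with hypothetical task names and owners; a real checklist would be far more detailed and would be preserved as system documentation after deployment.

from dataclasses import dataclass

@dataclass
class ChecklistItem:
    phase: str       # e.g., "planning", "testing", "installation"
    task: str
    owner: str
    done: bool = False

# Hypothetical example items for illustration only.
checklist = [
    ChecklistItem("planning", "Confirm hardware is on the HCL", "jsmith"),
    ChecklistItem("testing", "Complete burn-in testing in the lab", "alee"),
    ChecklistItem("installation", "Rack and cable servers per diagram", "mdavis"),
]

def open_items(items):
    """Return tasks that still block the transition to production."""
    return [i for i in items if not i.done]

for item in open_items(checklist):
    print(f"[{item.phase}] {item.task} -> {item.owner}")

Reviewing the open items as a team before execution helps surface the interactions among team members that the checklist alone doesn't show.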

The deployment plan should include a contact list. The contact list should provide the name, role, telephone number, and e-mail address of all team members, vendors, and solution provider representatives. Alternative numbers for cell phones and pagers should be provided as well.

The deployment plan should include a test plan. An ideal test plan has several phases. In Phase I, the deployment team builds the business system and support structures in a test lab. Building the system means accomplishing the following tasks:

  • Creating a test network on which to run the system

  • Putting together the hardware and storage components

  • Installing the operating system and application software

  • Adjusting basic system settings to suit the test environment

  • Configuring clustering or network load balancing as appropriate

The deployment team can conduct any necessary testing and troubleshooting in the isolated lab environment. The entire system should undergo burn-in testing to guard against faulty components. If a component is flawed, it usually fails in the first few days of operation. Testing doesn't stop with burn-in. Web and application servers should be stress tested. Database servers should be load tested. The results of the stress and load tests should be analyzed to ensure the system meets the performance requirements and expectations of the customer. Adjustments to the configuration should be made to improve performance and optimize for the expected load.
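As a rough illustration of stress and load testing, the sketch below issues concurrent HTTP requests against a lab Web server and reports response times. The URL, client count, and request count are assumptions, and production-grade testing would rely on dedicated load-testing tools rather than a script like this.

import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

TEST_URL = "http://testweb01.example.com/"  # hypothetical lab server
CONCURRENT_CLIENTS = 20
REQUESTS_PER_CLIENT = 10

def client_run(_client_id: int) -> list:
    """Issue sequential requests and record each response time in seconds."""
    timings = []
    for _ in range(REQUESTS_PER_CLIENT):
        start = time.monotonic()
        try:
            with urllib.request.urlopen(TEST_URL, timeout=5) as resp:
                resp.read()
        except OSError:
            continue  # count only successful requests
        timings.append(time.monotonic() - start)
    return timings

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=CONCURRENT_CLIENTS) as pool:
        results = [t for timings in pool.map(client_run, range(CONCURRENT_CLIENTS))
                   for t in timings]
    if results:
        print(f"requests: {len(results)}, avg: {sum(results)/len(results):.3f}s, "
              f"max: {max(results):.3f}s")

Comparing the average and maximum response times against the customer's performance requirements shows whether the configuration needs further tuning before Phase II.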

In Phase II, the deployment team tests the business system and support equipment in the deployment location. The team conducts tests similar to those in Phase I, but in the real-world environment. Again, the results of these tests should be analyzed to ensure the system meets the performance requirements and expectations of the customer. Afterward, adjustments should be made to improve performance and optimize as necessary. The team can then deploy the business system.

In Phase III, after deployment, the team should perform limited, nonintrusive testing to ensure the system is operating normally. Once this testing is completed, the team can use the operational plans for monitoring and maintenance.

The following checklist summarizes the recommendations for predeployment planning of mission-critical systems:

  • Create a plan that covers the entire testing to operations cycle.

  • Use checklists to ensure the deployment team understands the procedures.

  • Provide a contact list for the team, vendors, and solution providers.

  • Conduct burn-in testing in the lab.

  • Conduct stress and load testing in the lab.

  • Use the test data to optimize and adjust the configuration.

  • Provide follow-on testing in the deployment location.

  • Follow a specific deployment schedule.

  • Use operational plans once final tests are completed.
