Chapter . High Availability

Twenty-Four Seven

Network availability (commonly referred to as “high availability”) is the design and measurement of a network in terms of the accessibility of its services: the network must be available before anyone can reach the services on it. High availability refers to the goal of keeping the network available all of the time; “nonstop networking services” is another term for the same goal. Network uptime describes how much of the time the network is available.

Designing networks for high availability

  • Prevents financial loss

  • Prevents productivity loss

  • Reduces reactive support costs

  • Improves customer satisfaction and loyalty

Businesses measure their network downtime in terms of average cost per hour. For example, if a portion of a credit-card transaction network goes down such that businesses are unable to swipe credit cards for sales, the credit-card company might end up losing millions of dollars per hour.

Systems such as ATMs, web services, and automated check-in machines at airports require constant availability. If they are down, they cannot conduct business, and revenue suffers.

Common terms for discussing availability are “24x7x365” and “five 9s.” The phrase 24x7x365 refers to keeping the network up 24 hours a day, 7 days a week, 365 days a year (366 in leap years). This demand reflects several trends:

  • Businesses are international. While people in the US sleep, their coworkers in Australia and Japan conduct business, and they need access to network resources.

  • Web presence lets companies keep their shops “open” 24 hours a day.

Five 9s refers to the measurement of availability as a percentage: 99.999 percent. This measurement implies that the network is available 99.999 percent of the time (and not available for 0.001 percent of the time). This type of measurement made sense in the mainframe world (where its use began), in which it measured a set of hosts. However, today’s networks are distributed and consist of hundreds or thousands of devices.

In terms of availability, the following table shows how the measurement translates into downtime per year.
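The arithmetic behind such downtime figures is simple enough to sketch in a few lines of Python. The function name and the availability levels chosen here are illustrative assumptions, not values taken from the original table, and the sketch assumes a 365-day year.

```python
# Downtime per year implied by an availability percentage.
# Assumption: a non-leap year of 365 days.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the minutes of downtime per year allowed at this availability."""
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Illustrative levels only: two, three, four, and five 9s.
    for percent in (99.0, 99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(percent)
        print(f"{percent}% available: {minutes:.2f} minutes down per year")
```

Under these assumptions, each additional 9 cuts the allowed downtime by a factor of ten: roughly 5,256 minutes (about 3.65 days) at 99 percent, 525.6 minutes at 99.9 percent, 52.56 minutes at 99.99 percent, and 5.26 minutes at 99.999 percent.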

Five 9s availability means the network is unavailable for a total of about 5 minutes per year (0.001 percent of 525,600 minutes is roughly 5.26 minutes). Yeesh! How do you design a network such that devices are always working?

First, what contributes to the unavailability of the network?

  • Human error

  • Failed devices

  • Bugs

  • Power outages

  • Service provider outages

  • Natural disasters

  • Backhoes

  • Acts of war or terror

  • Upgrades, scheduled maintenance, or hardware replacements

Notice that most of these examples are unplanned and thus generally outside the control of the network administrator. Human error tends to be the leading cause of network outages. It is the design of the network that determines whether the network stays available during these planned and unplanned outages.

Practices for Avoiding Downtime

Constant uptime is out of reach as long as any of the following factors exist:

  • Single points of failure

  • Outages resulting from hardware and software upgrades

  • Long recovery times for reboots or switchovers

  • Lack of tested spare hardware on site

  • Long repair times due to a lack of troubleshooting guides or a lack of training

  • Extreme environmental conditions

  • Redundancy failure (failure not detected, redundancy not implemented)

  • High probability of double failures

  • Long convergence time for rerouting traffic around a failed trunk or router in the core
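The effect of single points of failure and double failures on the list above can be illustrated with the standard series and parallel availability formulas. This is a minimal sketch, not the book's own method: the function names and the 99.9 percent figures are hypothetical, and the parallel formula assumes that component failures are independent.

```python
from math import prod


def series_availability(availabilities):
    """Every component must work (a chain of single points of failure):
    multiply the individual availabilities."""
    return prod(availabilities)


def parallel_availability(availabilities):
    """Any one component is enough (redundant devices):
    1 minus the probability that all of them are down at once.
    Assumption: failures are independent of each other."""
    return 1.0 - prod(1.0 - a for a in availabilities)


if __name__ == "__main__":
    router = 0.999  # hypothetical availability of one router
    link = 0.999    # hypothetical availability of one link

    # A path through one router and one link is weaker than either part.
    print(series_availability([router, link]))

    # Two routers in parallel fail together only in a double failure.
    print(parallel_availability([router, router]))
```

With these hypothetical numbers, the series path comes out at about 99.8 percent, while the redundant pair reaches about 99.9999 percent. The independence assumption is the catch: if both routers share a power feed or a software bug, a double failure is far more likely than the formula suggests.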

Because outages do occur, the goal of network administrators is to keep each outage as short as possible.
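One way to see why shorter outages matter is the standard steady-state availability formula, which relates mean time between failures (MTBF) to mean time to repair (MTTR). These terms and the numbers below are not from the text; they are assumptions used only to illustrate the point.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up.

    mtbf_hours: mean time between failures (average uptime per failure)
    mttr_hours: mean time to repair (average outage length)
    """
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    mtbf = 10_000.0  # hypothetical: one failure every 10,000 hours of uptime

    # Same failure rate, different repair times.
    for mttr in (4.0, 1.0, 0.25):
        print(f"MTTR {mttr} h: availability {availability(mtbf, mttr):.6f}")
```

In this sketch the failure rate never changes; availability improves only because each outage is repaired faster, which is the lever that spare hardware, troubleshooting guides, and training pull on.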

The following design practices increase network availability.

Concept

Example

Hardware redundancy

You achieve redundancy with redundant hardware, processors, and line cards; devices acting in parallel; and the ability to hot-swap cards without interrupting the device’s operation (online insertion and removal, or OIR).

Software availability features

Availability features include Hot Standby Router Protocol (HSRP), nonstop forwarding, Spanning Tree Protocol (STP), line-card switchovers, fast route processor switchovers, and nondisruptive upgrades.

Network and server redundancy

Redundant data centers mirror each other so that if one data center (with its servers, databases, and networking gear) becomes unavailable, the network automatically reroutes traffic to a redundant data center with minimal data loss.

Link and carrier availability

Link and carrier availability comes from multihomed servers, multiple link connections between switches and routers, and subscriptions to several different service providers.

Clean implementation, cable management

You can take steps to minimize the chances of human error. Cleanly implementing a network (by labeling cables, tying cables down, using simple network designs and up-to-date network diagrams, etc.) helps prevent human error.

Backup power and temperature management

  • Using uninterruptible power supplies (UPSs) on primary network and server equipment ensures that when the power goes out, you have an alternative power source to keep the devices operational. UPSs vary: some provide enough power to keep devices running for days or weeks, while others provide just enough for the devices to ride through brief power interruptions.

  • Keeping devices in temperature-controlled rooms (as opposed to hot boiler rooms or the cold outdoors) ensures that extreme temperature and moisture do not contribute to an outage.

Network monitoring

Monitoring the network, servers, and devices allows network administrators to detect problems or outages quickly, which minimizes network downtime. The goal is to detect problems before they affect the network’s ability to pass traffic. Admins typically use network-management software to monitor the network as well as to detect trends.

Reduction of network complexity

Selecting a simple, logical, and repetitive network design over a complex one simplifies troubleshooting and network growth. It also reduces the chance of human error. This step includes using standard released software, well-tested features (as opposed to bleeding-edge technology), and good design sense.

Change control management

Change-control management is the process of introducing changes to the network in a controlled and monitored way. This step includes testing changes before moving them onto the production network, researching software upgrades for known bugs, making a back-out plan in case a change causes a failure, and making one change at a time.

Training

Nothing is more important than a properly trained staff. This step significantly reduces human error by eliminating mistakes made out of ignorance.
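The network monitoring practice described above can be sketched as a small reachability probe plus a summary of its results. This is only an illustration of the idea, not a substitute for network-management software: the function names, the three-failure alert threshold, and the example address are all assumptions.

```python
import socket


def tcp_probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and DNS errors
        return False


def summarize(results, alert_after: int = 3):
    """Summarize a history of probe results (True = reachable).

    Returns (measured availability, alert flag). The alert flag is set
    when the most recent `alert_after` probes all failed; the threshold
    of 3 is an arbitrary assumption.
    """
    if not results:
        raise ValueError("no probe results")
    measured = sum(results) / len(results)
    recent = results[-alert_after:]
    alert = len(recent) == alert_after and not any(recent)
    return measured, alert


# Example use with a hypothetical device address (192.0.2.1 is reserved
# for documentation):
#   history.append(tcp_probe("192.0.2.1", 22))
#   measured, alert = summarize(history)
```

Requiring several consecutive failures before alerting is a design choice that trades a little detection speed for fewer false alarms from a single dropped probe.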
