Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Chapter . High Availability

Twenty-Four Seven

Network availability (commonly referred to as “high availability”) is the design and measurement of a network in terms of the accessibility of the network services. The network must be available to access the services on it. High availability simply refers to the goal to keep the network available all of the time. Nonstop networking services is another term. Network uptime describes the availability of the network.

Designing networks for high availability

Prevents financial loss
Prevents productivity loss
Reduces reactive support costs
Improves customer satisfaction and loyalty

Businesses measure their network downtime in terms of average cost per hour. For example, if a portion of a credit-card transaction network goes down such that businesses are unable to swipe credit cards for sales, the credit-card company might end up losing millions of dollars per hour.

Devices such as ATM machines, web services, and automated check-in machines at airports require constant availability. If these machines are down, then they cannot conduct business and revenue is affected.

Common terms for discussing availability are “24x7x365” and “five 9s.” The phrase 24x7x365 refers to keeping the network up 24 hours a day, 7 days a week, 365 days a year (366 in leap years). This demand reflects several trends:

Businesses are international. While people in the US sleep, their coworkers in Australia and Japan conduct business, and they need access to network resources.
Web presence lets companies keep their shops “open” 24 hours a day.

Five 9s refers to the measurement of availability in terms of a percentage: 99.999 percent. This measurement implies that the network is available 99.999 percent of the time (and not available for .001 percent of the time). This type of measurement made sense in the mainframe world (where its use began) in which it measured a set of hosts. However, today’s networks are distributed and consist of hundreds and thousands of devices.

In terms of availability, the following table shows how the measurement translates into downtime per year.

Five 9s availability means the network is unavailable for a total of 5 minutes per year. Yeesh! How do you design a network such that devices are always working?

First, what contributes to the unavailability of the network?

Human error
Failed devices
Bugs
Power outages
Service provider outages
Natural disasters
Backhoes
Acts of war or terror
Upgrades, scheduled maintenance, or hardware replacements

Notice that most of these examples are unplanned and thus generally outside the control of the network administrator. Human error tends to be the leading cause of network outages. It is the design of the network that allows (or prohibits) the network to be available during these planned and unplanned network outages.

Practices for Avoiding Downtime

Reaching any sort of constant uptime does not happen if the following factors exist:

Single points of failure
Outages resulting from hardware and software upgrades
Long recovery times for reboots or switchovers
Lack of tested spare hardware on site
Long repair times due to a lack of troubleshooting guides or a lack of training
Excessive environmental conditions
Redundancy failure (failure not detected, redundancy not implemented)
High probability of double failures
Long convergence time for rerouting traffic around a failed trunk or router in the core

Because outages do occur, the goal of the network administrators is to reduce the outage to as short a time as possible.

The following design practices increase network availability.

Concept	Example
Hardware redundancy	You achieve redundancy with redundant hardware, processors, and line cards; devices acting in parallel; and the ability to hot-swap cards without interrupting the device’s operation (online insertion and removal, or OIR).
Software availability features	Availability features include Hot Standby Router Protocol (HSRP), nonstop forwarding, spanning trees, line-card switchovers, fast route processor switchovers, and nondisruptive upgrades.
Network and server redundancy	Redundant data centers mirror each other so if one data center (with its servers, databases, and networking gear) becomes unavailable, the network automatically reroutes to a redundant data center with minimal data loss.
Link and carrier availability	Carrier availability comes from multihoming servers, multiple link connections between switches and routers, and subscriptions to several different service providers.
Clean implementation, cable management	You can take steps to minimize the chances of human error. Cleanly implementing a network (by labeling cables, tying cables down, using simple network designs and up-to-date network diagrams, etc.) helps prevent human error.
Backup power and temperature management	Using uninterruptible power supplies (UPSs) on primary network and server equipment ensures that when the power goes out, you have an alternative power source to keep the devices operational. UPSs vary in that they can provide enough power to keep devices running for days or weeks or just enough power for the devices to keep running during quick power surges. Keeping devices in temperature-controlled rooms (as opposed to hot boiler rooms or the cold outdoors) ensures that extreme temperature and moisture do not contribute to an outage.
Network monitoring	Monitoring the network, servers, and devices allows network administrators to detect problems or outages quickly, which minimizes network downtime. The goal is to detect problems before they affect the network’s ability to pass traffic. Admins typically use network-management software to monitor the network as well detect trends.
Reduction of network complexity	Selecting a simple, logical, and repetitive network design over a complex one simplifies troubleshooting and network growth. It also reduces the chances of human errors. This step includes using standard released software, well-tested features (as opposed to bleeding-edge technology), and good design sense.
Change control management	Change-control management is the process of introducing changes to the network in a controlled and monitored way. This step includes testing changes before moving them onto the production network, researching software upgrades for known bugs, making a back-out plan in case a change causes a failure, and making one change at a time.
Training	Nothing is more important than a properly trained staff. This step significantly reduces human error by eliminating mistakes made out of ignorance.

At-A-Glance—High Availability

<division> <title>Why Should I Care About High Availability?</title>

A highly available network means that a network and its applications are both operational and accessible at all times. As more businesses use networking to conduct day-to-day business, networking becomes a critical to business. To put this in perspective, look at the cost of a network outage. The following numbers reflect the cost of one hour of downtime for various business functions:

ATM fees: $14,000
Package shipping: $28,000
Teleticket sales: $69,000
Airline sales: $89,500
Catalog sales: $90,000
Credit-card authorization: $2.6 million
Brokerage operations: $6.24 million

Designing a network for high availability does the following:

Prevents financial loss
Prevents productivity loss
Reduces reactive support costs
Improves customer satisfaction and loyalty

</division><division> <title>What Affects Network Availability?</title>

The following three types of errors are the most common causes of network failures:

Operational errors account for 40 percent of network failures; they are usually the result of poor change-management processes or a lack of training and documentation.
Network failures account for 30 percent of network failures, and they include single points of failure.
Software failures account for 30 percent of network failures. They can be caused by software crashes, unsuccessful switchovers, or latent-code failures.

</division><division> <title>How Do You Measure Availability?</title>

The two most common methods for measuring availability are “number of 9s” and defects per million (DPM). Number of 9s refers to the measurement of availability in terms of a percentage. For example, five 9s implies that the network is available 99.999 percent of the time (and not available for .001 percent of the time). Although this measurement is still common, it is really a hold-over from the mainframe world, which measured only the availability of the mainframe hosts. Modern networks, however, are distributed and consist of hundreds and thousands of devices. In this case, DPM is a more realistic measurement. DPM is the number of defects per million hours of operations.

Availability	DPM	Downtime per Year
99.000%	10,000	3 days, 15 hours, 36 minutes
99.900%	1000	8 hours, 46 minutes
99.990%	100	53 minutes
99.999%	10	5 minutes

</division><division> <title>Best Practices</title>

Hardware redundancy means redundant hardware, processors, line cards, and links. You should design the network such that critical hardware (e.g. core switches) has no single points of failure. Hardware availability also allows you to hot-swap cards or other devices without interrupting the device’s operation (online insertion and removal).

</division><division> <title>Reduction of Network Complexity</title>

Although some redundancy is good (and necessary), overdoing it can cause more problems than it solves. Selecting a simple, logical, and repetitive network design over a complex one simplifies the availability to troubleshoot and grow the network. There is a trade-off between expenses and risk. A good design maintains the proper balance between the two extremes.

</division><division> <title>Software Availability</title>

Software availability refers to both reliability-based protocols, such as Spanning Tree Protocol (STP) and Hot Standby Router Protocol (HSRP), and reliable code and nondisruptive upgrades.

STP, HSRP, and other protocols provide instructions to the network and to components of the network on how to behave in the event of a failure. Failure could be a power outage, a hardware failure, a disconnected cable, or any number of things. These protocols provide rules to reroute packets and reconfigure paths. Convergence is the process of applying these rules to the resolution of any such network errors. A converged network is one that, from a user standpoint, has recovered from a failure and can process instructions and requests.

You should thoroughly test and use software in a real (quarantined) or simulated real environment before putting it on the network. Avoid “bleeding-edge” or inadequately tested code. You should also follow procedures for introducing new or updated code. Shutting down the network, loading new code, and hoping it all works is usually a bad idea. You should first introduce new code on segmented, noncritical parts of the network. Plan for the worst case when loading new code.

</division><division> <title>Link and Carrier Availability</title>

Another component in building highly available networks is understanding your service provider’s plans and policies for network availability. For business-critical applications, it might be worthwhile to purchase a secondary service from an additional service provider. You can sometimes use this second link for load sharing.

</division><division> <title>Clean Implementation and Cable Management</title>

This best practice might seem like a waste of time when first implementing a network, but disorganized cabling and poor implementation can increase the probability of network disasters and hinder their timely resolution.

You can save time, grief, and money by taking some simple steps, such as labeling cables, tying cables down, using simple network designs, and keeping up-to-date network diagrams.

</division><division> <title>Network Monitoring</title>

Monitoring the network servers and devices allows network administrators to detect problems or outages quickly, which contributes to minimizing network downtime. The goal is to detect problems before they affect the network’s ability to pass traffic. You typically use network-management software to monitor the network.

</division><division> <title>Network and Server Redundancy</title>

Redundant data centers mirror each other such that if one data center (with its servers, databases, and networking gear) becomes unavailable, the network automatically reroutes to a redundant data center with minimal data loss.

</division><division> <title>Training</title>

Nothing is more important than a properly trained staff. Up-to-date, comprehensive training can significantly reduce failures. Human error will always exist, but you can limit that exposure through documentation, good design practices, and training. Human error can be forgiven; ignorance cannot.

</division><division> <title>Change-Control Management</title>

Always expect the worst when first installing upgrades. It saves you time and trouble in the long run. Introduce all changes to the network in a controlled way. This method includes testing changes before moving them onto the production network, researching software upgrades for known bugs, and following a backup plan in case the change causes a failure or doesn’t implement correctly.

</division>

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for High Availability

Create new playlist

Sign In

Sign Up

Chapter . High Availability

Twenty-Four Seven

Practices for Avoiding Downtime

Table of Contents for
High Availability