Chapter 40. Disaster Planning

Ask three different people what their idea of a disaster is and you'll probably get three different answers. For most administrators, the term disaster probably means any scenario in which one or more essential system services cannot operate and the prospects for quick recovery are less than hopeful—that is, a disaster is something a service reset or system reboot won't fix.

To ensure that operations can be restored as quickly as possible in a given situation, every network needs a clear disaster recovery plan. In this chapter, I'm not going to mince words and try to explain why you need to plan for disasters. Instead, I'm going to focus on what you need to do to get ready for the inevitable, because worst-case scenarios can and do happen. I'm also going to discuss predisaster preparation procedures.

Preparing for a Disaster

Chapter 17 went into detail about planning for highly available, scalable, and manageable systems. Many of the same concepts go into disaster planning. Why? Because, at the end of the day, disaster planning involves implementing plans that ensure the availability of systems and services. Remember that part of disaster planning is applying some level of contingency planning to every essential network service and system. You need to implement problem escalation and response procedures. You also need a standing problem-resolution document that describes in great detail what to do when disaster strikes.

Developing Contingency Procedures

You should identify the services and systems that are essential to network operations. Typically, this list will include the following components:

  • Network infrastructure servers running Active Directory, Domain Name System (DNS), Dynamic Host Configuration Protocol (DHCP), Terminal Services, and Routing and Remote Access Service (RRAS)

  • File, database, and application servers, such as servers with essential file shares or those that provide database or e-mail services

  • Networking hardware, including switches, routers, and firewalls

Use Chapter 17 to help you develop plans for contingency procedures in the following areas:

  • Physical security: Place network hardware and servers in a locked, secure access facility. This could be an office that is kept locked or a server room that requires a passkey to enter. When physical access to network hardware and servers requires special access privileges, you prevent a lot of problems and ensure that only authorized personnel can get access to systems from the console.

  • Data backup: Implement a regular backup plan that ensures that multiple datasets are available for all essential systems and that these backups are stored in more than one location. For example, if you keep the most current backup sets on-site in the server room, you should rotate another backup set to off-site storage. That way, if disaster strikes, you are more likely to be able to recover operations.

  • Fault tolerance: Build redundancy into the network and system architecture. At the server level, you can protect data using a redundant array of independent disks (RAID) and guard against component failure by having spare parts at hand. These precautions protect servers at a very basic level. For essential services such as Active Directory, DNS, and DHCP, you can build in fault tolerance by deploying redundant systems using techniques discussed throughout this book. These same concepts can be applied to network hardware components such as routers and switches.

  • Recovery: Every essential server and network device should have a written recovery plan that details step by step what to do to rebuild and recover it. Be as detailed and explicit as possible and don't assume that the readers know anything about the system or device they are recovering. Do this even if you are sure that you'll be the one performing the recovery; you'll be thankful for it, trust me. Things can and do go wrong at the worst times, and under pressure you might forget some important detail in the recovery process. You might also be unavailable to recover the system for some reason.

  • Power protection: Power-protect servers and network hardware using an uninterruptible power supply (UPS) system. Power protection will help safeguard servers and network hardware from power surges and dirty power. Power protection will also help prevent data loss and allow you to power down servers in an appropriate fashion through manual or automatic shutdown.
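The data backup rule above (multiple backup sets, stored in more than one location, with at least one off-site) is easy to check mechanically. The following Python sketch is only an illustration: the `BackupSet` record, the `backup_gaps` function, and the server names are hypothetical, and a real inventory would come from your backup software or asset records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupSet:
    system: str      # essential system the backup belongs to
    location: str    # where the media or copy is stored
    off_site: bool   # True if stored away from the server room


def backup_gaps(essential_systems, backup_sets):
    """Return a dict mapping each system to the list of problems found."""
    gaps = {}
    for system in essential_systems:
        sets = [b for b in backup_sets if b.system == system]
        locations = {b.location for b in sets}
        problems = []
        if len(sets) < 2:
            problems.append("fewer than two backup sets")
        if len(locations) < 2:
            problems.append("backups stored in only one location")
        if not any(b.off_site for b in sets):
            problems.append("no off-site backup set")
        if problems:
            gaps[system] = problems
    return gaps


# Hypothetical inventory, for illustration only.
systems = ["dc01", "sql01"]
inventory = [
    BackupSet("dc01", "server room", off_site=False),
    BackupSet("dc01", "off-site vault", off_site=True),
    BackupSet("sql01", "server room", off_site=False),
]

for system, problems in backup_gaps(systems, inventory).items():
    print(system, "->", "; ".join(problems))
```

In this made-up inventory, dc01 passes and sql01 is flagged because its only backup set sits in the server room. Running a check like this on a schedule is one way to notice when an off-site rotation has been skipped.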

Implementing Problem Escalation and Response Procedures

As part of planning, you need to develop well-defined problem escalation procedures that document how to handle problems and emergency changes that might be needed. You need to designate an incident response team and an emergency response team. Although the two teams could consist of the same team members, the teams differ in fundamental ways.

  • Incident response team: The incident response team's role is to respond to security incidents, such as the suspected cracking of a database server. This team is concerned with responding to intrusion, taking immediate action to safeguard the organization's information, documenting the security issue thoroughly in an after-action report, and then fixing the security problem so that the same type of incident cannot recur. Your organization's security administrator or network security expert should have a key role in this team.

  • Emergency response team: The emergency response team's role is to respond to service and system outages, such as the failure of a database server. This team is concerned with recovering the service or system as quickly as possible and allowing normal operations to resume. Like the incident response team, the emergency response team needs to document the outage thoroughly in an after-action report, and then, if applicable, propose changes to improve the recovery process. Your organization's system administrators should have key roles in this team.

Creating a Problem Resolution Policy Document

Over the years, I've worked with and consulted for many organizations, and I've often been asked to help implement information technology (IT) policy and procedure. In the area of disaster and recovery planning, there's one policy document that I always use, regardless of the size of the company I am working with. I call it the problem resolution policy document.

The problem resolution policy document has the following six sections:

  • Responsibilities: The overall responsibilities of IT and engineering staff during and after normal business hours should be detailed in this section. For an organization with 24/7 operations, such as a company with a public World Wide Web site maintained by internal staff, the after-hours responsibilities section should be very detailed and let individuals know exactly what their responsibilities are. Most organizations with 24/7 operations will designate individuals as being "on call" 7 days a week, 365 days a year, and in that case, this section should detail what being "on call" means, and what the general responsibilities are for an individual on call.

  • Phone roster: Every system and service that you've identified in your planning as essential should have a point of contact. For some systems, you'll have several points of contact. Consider, for example, a database server. You might have a system administrator who is responsible for the server itself, a database administrator who is responsible for the database running on the server, and an integration specialist responsible for any integration components running on the server.

    Note

    The phone roster should include both on-site and off-site contact numbers. Ideally, this means that you'll have the work phone number, cell phone number, and pager number of each contact. It should be the responsibility of every individual on the phone roster to ensure that contact information is up to date.

  • Key contact information: In addition to a phone roster, you should have contact numbers for facilities and vendors. The key contacts list should include the main office phone numbers at branch offices and data centers, and contact numbers for the various vendors that installed infrastructure at each office, such as the building manager, Internet service provider (ISP), electrician, and network wiring specialist. It should also include the support phone numbers for hardware and software vendors and the information you'll be required to give in order to get service, such as customer identification number and service contract information.

  • Notification procedures: Problems get resolved only after the right people have been notified. This section should outline the notification procedures and the primary point of contact in case of outage. If many systems and services are involved, notification and primary contacts can be divided into categories. For example, you may have an external systems notification process for your public Internet servers and an internal systems notification process for your intranet services.

  • Escalation: When problems aren't resolved within a specific timeframe, there should be clear escalation procedures that detail whom to contact and when. For example, you might have level 1, level 2, and level 3 points of contact, with level 1 contacts being called immediately, level 2 contacts being called when issues aren't resolved in 30 minutes, and level 3 contacts being called when issues aren't resolved in 60 minutes.

    Note

    You should also have a priority system in place that dictates what types of incidents or outages take precedence over others. For example, you could specify that service-level outages, such as those that involve the complete system, have priority over an isolated outage involving a single server or application, but that suspected security incidents have priority over all other issues.

  • Post-action reporting: Every individual involved in a major outage or incident should be expected to write a post-action report. This section details what should be in that report. For example, you would want to track the notification time, actions after notification, escalation attempts, and other items that are important to improving the process or preventing the problem from recurring.
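The escalation timing and priority ordering described in this list can be written down as simple rules, which helps when you want a monitoring or paging tool to apply them consistently. The Python sketch below uses the example thresholds from the text (level 1 immediately, level 2 at 30 minutes, level 3 at 60 minutes) and the example priority order from the note. The names `ESCALATION_THRESHOLDS`, `Priority`, `levels_to_contact`, and `order_by_priority`, and the sample issues, are hypothetical; treat this as one possible encoding, not a prescribed one.

```python
from enum import IntEnum

# (minutes unresolved, escalation level), taken from the example in the text.
ESCALATION_THRESHOLDS = [(0, 1), (30, 2), (60, 3)]


class Priority(IntEnum):
    """Lower value means the issue is handled first."""
    SECURITY_INCIDENT = 1   # suspected security incidents outrank everything
    SERVICE_OUTAGE = 2      # outage affecting the complete system
    ISOLATED_OUTAGE = 3     # single server or application


def levels_to_contact(minutes_unresolved):
    """Return every escalation level that should have been called by now."""
    return [level for minutes, level in ESCALATION_THRESHOLDS
            if minutes_unresolved >= minutes]


def order_by_priority(open_issues):
    """Sort (description, Priority) pairs so the most urgent comes first."""
    return sorted(open_issues, key=lambda issue: issue[1])


# Hypothetical open issues, for illustration only.
issues = [
    ("mail server down", Priority.ISOLATED_OUTAGE),
    ("suspected intrusion on sql01", Priority.SECURITY_INCIDENT),
    ("site-wide DNS failure", Priority.SERVICE_OUTAGE),
]

print(levels_to_contact(45))
for description, priority in order_by_priority(issues):
    print(priority.name, "-", description)
```

Keeping the thresholds in one table means that when the policy document changes, say from 30 minutes to 20, only that table needs to be updated.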

Every IT group should have a general policy with regard to problem resolution procedures, and this policy should be detailed in a problem resolution policy document or one like it. The document should be distributed to all relevant personnel throughout the organization, so that every person who has some level of responsibility for ensuring system and service availability knows what to do in case of an emergency. After you implement the policy, you should test it to help refine it so that the policy will work as expected in an actual disaster.
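One inexpensive test of the policy document is to confirm that the phone roster actually covers every essential system and that each system has at least one contact who can be reached off-site. The Python sketch below assumes a roster stored as a dictionary of contact records with `work`, `cell`, and `pager` fields, matching the kinds of numbers mentioned earlier. The function name `roster_gaps`, the data layout, and all names and numbers are hypothetical.

```python
def roster_gaps(essential_systems, roster):
    """Return a dict of systems whose roster entry is missing or incomplete.

    roster maps a system name to a list of contact dicts. A contact counts
    as reachable off-site if it has a cell or pager number.
    """
    gaps = {}
    for system in essential_systems:
        contacts = roster.get(system, [])
        if not contacts:
            gaps[system] = "no point of contact"
        elif not any(c.get("cell") or c.get("pager") for c in contacts):
            gaps[system] = "no off-site contact number"
    return gaps


# Hypothetical roster, for illustration only.
essential = ["dc01", "sql01", "fw01"]
roster = {
    "dc01": [{"name": "A. Admin", "role": "system administrator",
              "work": "555-0100", "cell": "555-0101", "pager": None}],
    "sql01": [{"name": "D. Base", "role": "database administrator",
               "work": "555-0102", "cell": None, "pager": None}],
}

for system, problem in roster_gaps(essential, roster).items():
    print(system, "->", problem)
```

Here sql01 is flagged because its only contact has a work number and nothing else, and fw01 is flagged because nobody is listed for it at all.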
