The overall objective of a BIA is to identify the impact of outages. More specifically, the goal is to identify the critical functions that can affect the organization. After the critical functions have been identified, the critical resources that support these functions can be identified.
Each resource has a maximum acceptable outage (MAO) and an impact if it fails. The ultimate goal is to identify the recovery requirements. FIGURE 12-3 shows these overall steps: input is gathered from process owners and experts to help identify the CBFs and the critical resources that support them, the impact and MAO of the resources are identified, and the recovery requirements are determined from the MAO.
An indirect objective of the BIA is to justify funding. After the recovery requirements in the BIA have been identified, controls to support these requirements in the BCP are identified. If the impact of an outage is high, spending money to prevent it is likely to be cost-effective.
NIST SP 800-34 Rev. 1 includes a diagram similar to that in FIGURE 12-4. It shows the relationship between costs and the time of an outage. The line labeled Cost of Disruption indicates that the cost of a disruption is very low immediately after an outage occurs. However, as the outage time increases, the cost of the disruption also increases. The other line identifies the Cost to Recover from an outage. To be able to recover from an outage almost immediately, the costs are high. However, if a longer outage time is acceptable, the costs to recover are lower.
For example, if a website that sells products online fails for 60 seconds, the cost of the disruption would be very low, whereas, if the outage lasts for days, the cost of the disruption would be very high. Controls can be implemented that allow the website to recover immediately from a 60-second outage, but the cost would be very high. In contrast, spending very little money on recovery controls would result in a longer outage time.
These considerations help to identify the optimum cost point. At this point, the company spends the minimum amount on recovery controls that still keeps the cost of disruption acceptably low.
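The trade-off between the two lines can be sketched numerically. The following is a minimal illustration only: the cost functions, the dollar figures, and the choice to treat the optimum as the outage time with the lowest combined cost are all assumptions made for this sketch, not values or definitions taken from NIST SP 800-34 or from the figure.

```python
# Hypothetical cost curves. The shapes and dollar figures are illustrative
# assumptions, not values from NIST SP 800-34 or from this chapter.

def cost_of_disruption(hours: float) -> float:
    """Loss caused by the outage itself; grows as the outage gets longer."""
    return 500.0 * hours


def cost_to_recover(hours: float) -> float:
    """Spend on recovery controls needed to restore service within `hours`.

    Near-immediate recovery is expensive; tolerating a longer outage is cheaper.
    """
    return 40_000.0 / (hours + 1.0)


def optimum_outage_hours(candidates):
    """Return the candidate outage time with the lowest combined cost (assumed definition)."""
    return min(candidates, key=lambda h: cost_of_disruption(h) + cost_to_recover(h))


# Evaluate outage times from 0 to 48 hours in half-hour steps.
candidates = [h / 2 for h in range(0, 97)]
best = optimum_outage_hours(candidates)
print(f"Lowest combined cost at about {best} hours of outage")
```

With real data, the two functions would be replaced by estimates gathered during the BIA; the search itself stays the same.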
The following sections cover the objectives of a BIA in more detail.
To an IT specialist, the CBFs are not always apparent. For example, a security expert may not know the CBFs of a website. To him or her, the web server would be an obvious component, but there are others.
Interviewing or surveying the experts can help in gaining insight into all the components that support the web server, and identifying the underlying steps of CBFs is often useful. For example, the following list details the steps of an online website purchase:
Identifying the CBFs first, and then the critical IT services and infrastructure that support them, is a top-down approach. A bottom-up approach, such as starting with a server and trying to determine what functions it supports, would likely miss important elements.
In this example, the CBFs are:
With this information, the critical resources can be identified.
Critical resources are those that are required to support the CBFs. Once the CBFs have been identified, they can be analyzed to determine the critical resources for each.
The example of the website shows how to identify critical resources from the CBFs. One of the website CBFs is the customer’s ability to access the website. The following IT resources are required to support this function:
This isn’t the only way the process could be designed. The product database could be hosted on a server in the DMZ so that data could be retrieved more quickly, and the customer database could be hosted separately in the internal network. The DMZ could be designed differently too. Many design possibilities exist, which is why asking the experts is important. They will know how the process is configured.
The second CBF is the web server’s ability to access the database server. The database server hosts both product and customer information. The customer information is used when a customer makes a purchase and to target advertising for the returning customer. The following IT resources are required to support this function:
The third CBF is the order processing application. It needs to receive orders from the database server and be able to track the order until delivery. The following IT resources are required to support this function:
In many instances, the critical resources will overlap. In other words, a critical resource required for one function may also be required for another function. For example, the web server is required for two of the functions, and facility support (e.g., power, heating, and air-conditioning) is required for all of them.
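One way to make the overlap visible is to record the resources for each CBF and count how often each resource appears. The sketch below is illustrative only: the web server and facility support come from the example above, while the remaining resource names and the function labels are hypothetical placeholders.

```python
from collections import Counter

# Hypothetical mapping of CBFs to resources. "web server" and
# "facility support" come from the text; the other names are placeholders.
cbf_resources = {
    "customer accesses website": {"web server", "internet access", "facility support"},
    "web server accesses database": {"web server", "database server", "facility support"},
    "order processing": {"order application server", "database server", "facility support"},
}


def resource_usage(mapping):
    """Count how many CBFs depend on each resource."""
    return Counter(r for resources in mapping.values() for r in resources)


def shared_by_all(mapping):
    """Return the resources that every CBF requires."""
    return set.intersection(*mapping.values())


usage = resource_usage(cbf_resources)
print(usage["web server"])           # number of CBFs that need the web server
print(shared_by_all(cbf_resources))  # resources common to all CBFs
```

A resource with a high count is a natural candidate for closer attention, because its failure affects several functions at once.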
A resource can be listed one time for all the functions or with each function. In the case of IT resources, listing each of the resources with each of the functions is the better idea. Generic resources, however, are sometimes easier to list once. For example, all IT resources require facility support, which could be listed one time as follows:
Once the CBFs and IT resources that support them have been identified, the next item to consider is the MAO and impact. The MAO is also referred to as the maximum tolerable period of disruption (MTPD). The MAO helps to determine which CBFs need to be recovered and restarted as soon as possible after a disaster, identifies the specific resources needed to restart the CBF, and helps to determine how soon these systems need to be recovered.
The other consideration in this process is the impact on the business. The impact is ultimately monetary, but it doesn’t need to be expressed as money. Instead, it is often expressed as a relative value, such as high, medium, or low, or as a number, such as 1 through 4.
Once the impact level has been identified, it can be matched with an MAO. TABLE 12-1 shows an example of how impact value levels can be defined in an organization. Each level is matched to the MAO to identify how long the system can be down before the impact is felt.
IMPACT VALUE LEVEL | MAXIMUM ACCEPTABLE OUTAGE AND IMPACT |
---|---|
Level 1: Business functions must be available during all business hours. Online systems must be available 24 hours a day, seven days a week. | Two hours: Any outage will have an almost immediate impact on the business. |
Level 2: Business processes can survive without the business function for a short time. | One day: If the outage lasts more than a day, it will have an impact on the business. |
Level 3: Business processes can survive without the business functions for one or more days. | Three days: The outage won’t have an impact on the business if the outage lasts as long as three days. |
Level 4: Business processes can survive without the business functions for extended periods. | One week: The outage won’t have a significant impact on the business unless it lasts longer than a week. |
When calculating the MAO for an organization, both direct and indirect costs must be considered.
The MAO values are assigned internally by each organization, which means that the values and recovery objectives used by one organization can be completely different from those used by another organization.
The direct costs are usually easier to calculate than the indirect costs. Some of these costs are readily apparent, and others are not. The following list shows some of the direct costs:
Identifying indirect costs is more difficult than identifying direct costs, but they must be identified because their value also affects the impact value. The following list shows some of the indirect costs that need to be considered:
The recovery requirements establish the time frame in which systems must be recoverable and identify the data that must be recovered. For example, some data loss might be acceptable, whereas other data loss is not.
Two primary terms related to recovery requirements are recovery time objective (RTO) and recovery point objective (RPO). Whereas the RTO applies to systems or functions, the RPO applies only to data. More specifically, the RPO addresses data housed in databases.
The RTO is the time in which the system or function must be recovered and would be equal to or less than the MAO. For example, if the MAO is one hour, the RTO needs to be one hour or less.
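The relationship between the two values is a simple comparison. The sketch below restates it in code; the function name `rto_is_valid` is a hypothetical helper introduced for illustration, and the one-hour MAO comes from the example above.

```python
from datetime import timedelta


def rto_is_valid(rto: timedelta, mao: timedelta) -> bool:
    """An RTO is acceptable only if it is equal to or less than the MAO."""
    return rto <= mao


mao = timedelta(hours=1)
print(rto_is_valid(timedelta(minutes=45), mao))  # True: recovery finishes inside the MAO
print(rto_is_valid(timedelta(hours=1), mao))     # True: equal to the MAO is still acceptable
print(rto_is_valid(timedelta(hours=2), mao))     # False: recovery would exceed the MAO
```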
The RPOs identify the maximum amount of data loss an organization can accept, which is the acceptable data latency. For example, a database may record hundreds of sales transactions a minute. The organization may need to recover this data up to the moment of failure, which would be an expensive process, but, because each transaction represents revenue, the cost is justified. On the other hand, another database may import data only once a week; to ensure nothing is lost, the only data that would need to be restored would be the data that had been added since the last import, which is a much less expensive process.
RTO can also be thought of as time critical and RPO as mission critical. The RTO identifies the time when the system is restored, and the RPO identifies data that is mission critical. Some processes must be restored in a timely manner, which requires a short RTO. Restoration of other processes can be delayed as long as all the data is recovered.
Although lower RTOs and RPOs are achievable, they are much more expensive. Therefore, when interviewing stakeholders, connecting the cost with the RTOs and RPOs is important. For example, ensuring that a database is recoverable up to the moment of failure is possible, even after major disasters, such as earthquakes, but would require a separate site, separate servers, and immediate data replication, all of which are expensive. Once stakeholders recognize the costs, they may decide on different objectives.
After the MAO has been identified, identifying the recovery time objectives becomes easy. TABLE 12-2 adds a column, the recovery objective, to TABLE 12-1 and shows that the recovery objective is directly related to the MAO.
IMPACT VALUE LEVEL | MAXIMUM ACCEPTABLE OUTAGE | RECOVERY OBJECTIVE |
---|---|---|
Level 1: Business functions must be available during all business hours. Online systems must be available 24 hours a day, seven days a week. | Two hours: Any outage will have an almost immediate impact on the business. | Two hours or less: Functions in this category must be recovered within two hours. |
Level 2: Business processes can survive without the business function for a short time. | One day: If the outage lasts more than a day, it will have an impact on the business. | 24 hours or less: Functions in this category must be recovered within 24 hours. |
Level 3: Business processes can survive without the business functions for one or more days. | Three days: The outage won’t have an impact on the business if the outage lasts as long as three days. | 72 hours or less: Functions in this category must be recovered within 72 hours. |
Level 4: Business processes can survive without the business functions for extended periods. | One week: The outage won’t have a significant impact on the business unless it lasts longer than a week. | Seven days or less: Functions in this category must be recovered within one week. |
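The table lends itself to a simple lookup. The sketch below stores the MAO for each impact value level and derives the latest acceptable recovery time for an outage. The names `MAO_BY_LEVEL` and `recovery_deadline` are hypothetical, and the sketch assumes the recovery objective is set at its upper bound, equal to the MAO.

```python
from datetime import datetime, timedelta

# MAO per impact value level, as listed in the table. The recovery objective
# is assumed here to equal the MAO, which is the longest value it may take.
MAO_BY_LEVEL = {
    1: timedelta(hours=2),
    2: timedelta(days=1),
    3: timedelta(days=3),
    4: timedelta(weeks=1),
}


def recovery_deadline(level: int, outage_start: datetime) -> datetime:
    """Return the latest time by which a function at this level must be recovered."""
    return outage_start + MAO_BY_LEVEL[level]


start = datetime(2024, 1, 1, 9, 0)
for level in sorted(MAO_BY_LEVEL):
    print(level, recovery_deadline(level, start))
```

Because each organization assigns its own values, the dictionary is the only part that would need to change.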
Looking at impact value level 4, the question could be asked: if an organization can do without a function for up to a week, why include it in the BIA at all? This level can be thought of as encompassing minor but desirable functions. Although the organization won’t fail without them, it operates with fewer problems when they are available. For example, an organization may not use Internet access for mission-critical tasks, but having Internet access may make it easier for employees to perform other jobs.
The RPO isn’t calculated directly from the MAO. Instead, personnel will need to be interviewed to determine what data loss is acceptable, which would vary in different types of databases. Commonly, acceptable data loss is measured in minutes, such as 15 minutes.
A database used to record sales can’t accept much data loss because every minute of data loss represents lost sales revenue. On the other hand, other databases may not change as much, and their changes may be manually reproduced. If there aren’t many changes and they can easily be reproduced, more data loss can be accepted. For example, a database is manually updated about five times a week, and the updates have a paper trail that shows what needs to be reproduced. Therefore, data loss of a week could easily be accepted in the database. Because the updates have a paper trail, the database can be restored and the updates reproduced.
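Once an RPO has been agreed on, it can be checked against how often the data is actually protected. The following sketch assumes periodic backups (or replication cycles), where a failure just before the next cycle loses one full interval of data. The function names and the backup intervals are hypothetical; the 15-minute and one-week RPOs echo the examples above.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """With periodic backups, the worst case loses one full interval of changes."""
    return backup_interval


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """True if the worst-case data loss stays within the acceptable data loss."""
    return worst_case_data_loss(backup_interval) <= rpo


# Sales database: only 15 minutes of data loss is acceptable.
sales_rpo = timedelta(minutes=15)
print(meets_rpo(timedelta(days=1), sales_rpo))     # False: nightly backups lose too much
print(meets_rpo(timedelta(minutes=5), sales_rpo))  # True: five-minute cycles fit the RPO

# Manually updated database with a paper trail: a week of loss is acceptable.
manual_rpo = timedelta(weeks=1)
print(meets_rpo(timedelta(days=1), manual_rpo))    # True: nightly backups are more than enough
```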