Chapter 3
Cloud Platform and Infrastructure Security

The cloud security professional must understand the architecture and infrastructure that supports each of the cloud delivery models. The infrastructure includes the physical components of the cloud, the services they provide, and the communication infrastructure that allows us to connect with the cloud. Each part of the infrastructure has specific security needs and shared security responsibilities.

In the shared security model of cloud computing, it is easy, but incorrect, to assume that security is the sole responsibility of the cloud service provider (CSP). Security practitioners need to understand the unique security requirements of cloud computing and the technologies, such as virtualization, that make the cloud possible. There are important differences between cloud security and security for on-premises systems, including how responsibility is divided between the CSP and the customer. A security professional must be clear on what the CSP provides with respect to security and which items are the customer’s responsibility to protect. In general, the data put into a cloud environment will always be the responsibility of the customer, but other aspects of securing the system will vary based on the cloud delivery model used.

Finally, the cloud security professional must understand how the cloud can support the organization's objectives. This may include crucial functions such as business continuity and disaster recovery, scalable architecture to keep up with demand, and cost savings that make entirely new business models financially viable for some organizations.

Comprehend Cloud Infrastructure and Platform Components

There are infrastructure components that are common to all cloud service delivery models. These components are all physically located with the CSP, but many are accessible via the network. In the shared responsibility security model, both the customer and the CSP share security responsibilities for the elements of the cloud environment. Those responsibilities are discussed with each component.

Physical Environment

The physical environment is composed of the server rooms, data centers, and other physical locations of the CSP. The identity of the CSP may vary by the cloud deployment model, as follows:

  • A private cloud is often built by a company on its own premises, such as an IT group that runs cloud services supporting multiple business units or departments. If this occurs, the organization is the CSP, and the units or departments are customers; though they are all part of the same organization, costs associated with cloud services are usually tracked in the same manner as using a public cloud. This allows the organization to be very granular in identifying resource usage. Private clouds can also be virtual, and major CSPs including AWS, Azure, and Google Cloud offer the ability to use their infrastructure in an isolated virtual private cloud (VPC).
  • In a community cloud, a member of the community hosts the space and equipment needed and is responsible for physical security. The community member hosting the cloud is the CSP. Similar to VPCs, a community cloud may be built using a private section of a commercial CSP, and the CSP is responsible for physical security.
  • In public clouds, the CSP is the company providing the service, such as AWS, Microsoft, Google, IBM, and so on. This commercial vendor is the CSP and is responsible for the physical security of their cloud infrastructure.

The physical environment is under the sole control of the CSP, and it is their responsibility to provide all physical security. This includes responsibility for monitoring, auditing, and maintaining a secure environment. Security risks that affect cloud infrastructure are the same as those affecting any data center, regardless of whether they are on-premises or provided by a third party. These include physical security of the equipment, access control to the facility, environmental controls like heating and air conditioning, and utilities at the site like water and electricity.

CSPs utilize common controls to address these risks. For physical security these include standard measures such as locks, security personnel, lights, fences, visitor check-in procedures, and the like. Identity and access management (IAM) for CSP technical personnel will require tools including a single sign-on (SSO) provider, logging of user actions, and multifactor authentication (MFA) to provide robust access controls and logging.

CSPs will address controls for data confidentiality and integrity in a similar manner to cloud customers, but with much broader controls. For example, ensuring that communication lines are not physically compromised might involve locating telecommunications equipment inside a controlled area of the CSP's building or campus. This makes it easier to guard against and detect any physical tampering, such as wiretapping or cutting communication lines.

Network and Communications

Physical networking gear is housed in facilities controlled by the CSP, and it is the CSP's responsibility to physically secure those devices and to apply logical security controls such as proper configuration. Any components housed at the customer's facility (on-premises) are the responsibility of the customer. The largest area of concern is the public Internet that exists between these two. Since the Internet is a shared resource, the responsibility for securing data transiting the Internet falls to the individuals using it, so both the CSP and cloud customer must identify and implement ways of securing data in transit. This will most often involve the use of cryptography.

For this reason, the CSP must support secure protocols like HTTPS, and the customer must use the secure protocols. Encrypting data prior to transmission is also an effective method for preserving confidentiality and may involve the customer encrypting the data before it leaves the customer's network or using a VPN to securely transmit data over untrusted networks like the Internet. At each end of the transmission pipeline, both the customer and the CSP are responsible for the firewalls and other systems needed to maintain secure communications. Some organizations may find the use of dedicated communication lines a more effective way of protecting their data in transit, avoiding the use of untrusted networks.
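
As a simple illustration of these responsibilities, the following is a minimal Python sketch of a customer application transmitting data to a cloud API over TLS. The endpoint URL, payload, and token are hypothetical placeholders; the point is that the customer's code must use the secure protocol the CSP supports and must not disable certificate verification.

import requests

CLOUD_ENDPOINT = "https://api.example-csp.com/v1/records"  # hypothetical endpoint

def upload_record(record: dict, api_token: str) -> int:
    # verify=True (the default) makes requests validate the server's TLS
    # certificate chain; never disable verification in production.
    response = requests.post(
        CLOUD_ENDPOINT,
        json=record,                                   # payload is sent over HTTPS only
        headers={"Authorization": f"Bearer {api_token}"},
        timeout=10,
        verify=True,
    )
    response.raise_for_status()                        # surface transport or server errors
    return response.status_code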

In the shared security model, the CSP provides tools for secure computing, logging, encryption, and so on. However, customers do retain responsibility for activating and configuring these features to meet their needs. In all service delivery models, it is the responsibility of the customer to connect to and transmit data to the cloud service securely.

Consider, for example, that you are a company shipping a product. You have many ways in which you can accomplish this.

  • You could create your own fleet, leasing aircraft from a supplier but providing your own pilots and cargo personnel. In this case, it is similar to an infrastructure as a service (IaaS) environment. The supplier is responsible for the aircraft provided, but not how you use the aircraft or the safety and security of your product.
  • In a different approach, the supplier may provide the aircraft and pilots, and you handle the storage and security of the cargo. You remain responsible for the safety and security of your product, and the supplier provides a safe aircraft operated by qualified personnel, equivalent to a platform as a service (PaaS) environment.
  • In the simplest method, you drop off the package with the supplier. Once it's in their possession, the supplier is responsible for the security and delivery of the package. They are not responsible for the manner in which the product is packaged or the condition of the product when it is dropped off with the service. Although they make a best effort to deliver the package in good condition, it is possible for the package to be damaged, so compensating controls like insurance are recommended. This is similar to using a software as a service (SaaS) environment.

This package delivery example is a useful metaphor for the security of data in transit with the IaaS, PaaS, and SaaS delivery models. Each model has potential benefits in terms of the amount of flexibility offered, which is offset by associated costs.

  • In an IaaS service model, the customer is responsible for configuring the environment, for enforcing company policies on the use of systems just as if the systems were on-premises, and for the connection to the CSP. The CSP is responsible for the technology provided but not for how it is used, and the cost of that responsibility remains with the customer.
  • In a PaaS model, the CSP is responsible for the physical components, the internal network, and the tools provided. The customer is responsible for the proper use of those tools and the connection to the CSP. Services provided by the CSP are likely to be cheaper than what the customer could do in-house, due to the shared resources, but the customer also gives up some control.
  • In a SaaS model, the customer remains responsible for access to the cloud service in a secure manner—using appropriate technologies to connect with and transmit data to and from the service securely. Once the data is in the CSP, the customer has less control over it, but the shared nature of the cloud means these services are available at significantly lower cost.

Compute

The compute resources are the infrastructure components that deliver virtual machines (VMs), disk, processor, memory, and network resources. These resources are owned by and under the direct control of the CSP. The security issues are the same for these resources in a cloud environment as they are on-premises, with the additional challenge of multitenancy.

The CSP in every delivery and service model remains responsible for the maintenance and security of the physical components. These are owned and supported by the CSP. The customer in every delivery and service model remains responsible for their data and their users. Between the physical components, there is a vast array of software and other components. Who is responsible for each of these remaining parts varies by service and delivery model and sometimes by the CSP. The contract between the customer and the CSP should spell out the responsibilities for each part of the cloud environment. Typical responsibilities will be described for each service model (IaaS, PaaS, and SaaS).

In an IaaS environment, the CSP provides the hardware components and may provide networking, virtualization, and virtualization operating systems. The CSP is responsible for the security, versioning, and maintenance of any software components it provides, such as virtualization and operating system (OS) software.

When the CSP provides the virtualization and OS software, some of the software configuration may be left to the customer, such as controlling access to the VMs the customer creates in the cloud. When this is the case, the customer is responsible for the security implications of the configuration they choose, and in IaaS the customer has the most responsibility for security. Other than the hardware components and system software provided, the customer remains responsible for all other security for the tools they install, software they develop, and, of course, all identity and access management, customer records, and other data. These responsibilities include patching and versioning of the software installed or developed by the customer and the security of data at rest and in motion.

In a PaaS environment, the CSP takes on more responsibility for providing services, and the corresponding security requirements. In addition to the responsibilities of IaaS, the CSP takes on security responsibility for services including operating systems, database management systems (DBMSs), or whatever other platforms are being offered in the cloud environment. While the customer has some ability to configure these services, all services provided by the CSP will usually be maintained by the CSP, including patching and versioning. The customer is responsible for the configuration and use of all systems and services provided as well as the patching and versioning of any software they install or develop on the platform. The customer is always responsible for the security of their data and users. The contract with the CSP should address these issues.

In a SaaS environment, the customer is usually responsible for only the customization of the SaaS service, as well as the security of their users and the data used or produced by the service. The CSP is responsible for the security of all other compute resources.

Virtualization

There are two types of hypervisors that provide virtualization. These are Type 1 hypervisors, also known as bare-metal hypervisors, and Type 2 hypervisors, also known as hosted hypervisors. Among the major CSPs the Type 1 hypervisor is more common, such as the Xen hypervisor provided by AWS and the Hyper-V hypervisor provided in Azure.

Type 1 hypervisors run directly on a physical server and its associated hardware components, rather than running on top of a traditional OS. In this scenario, the hypervisor provides the OS functionality, with VMs running atop the hypervisor. The hypervisor and virtualization environment are managed with a management console separate from the hypervisor. VMs running on a Type 1 hypervisor can move between physical machines, and when done properly, it is invisible to the end user. In addition to Hyper-V and the XenServer hypervisors, other common Type 1 hypervisors include VMware vSphere with ESX/ESXi and Oracle VM.

Type 2 hypervisors are more complex, as they run on top of a traditional OS. At the bottom is the hardware, and running on top of that is a host OS, such as Windows, macOS, or Linux. The hypervisor runs as a program on top of the OS; examples include Oracle VM VirtualBox, VMware Workstation Pro/VMware Fusion, Parallels Desktop, and Windows Virtual PC. These are not usually used for enterprise solutions because they are less efficient than Type 1 hypervisor environments, but Type 2 hypervisors are frequently used for individual needs or environments like testing where absolute efficiency is not as important. Management of the VMs is built into the Type 2 hypervisor, and usually includes options like specifying individual VM compute, storage, and memory parameters.

The security of the hypervisor is always the responsibility of the CSP. The virtual network and virtual machine may be the responsibility of either the CSP or the customer, depending on the cloud service model as described in the earlier “Compute” section.

Hypervisor security is critical. If unauthorized access to the hypervisor is achieved, the attacker can access every VM on the system and potentially obtain the data stored on each VM. A VM escape is a type of attack in which a malicious user can break the isolation between VMs running on a hypervisor by gaining access outside their assigned VM. In both Type 1 and Type 2 hypervisors, the security of the hypervisor is critical to avoid hypervisor takeover or VM escape. In a Type 2 hypervisor, host OS security is also important, as a breach of the host OS can potentially allow takeover of the hypervisor and all associated VMs as well. Proper IAM and other controls limiting access to those with both proper credentials (authentication) and a business need (authorization) can protect your systems and data. In a cloud computing multitenant model, there is another challenge: once an attacker gains access to a server hosting multiple VMs, they can potentially compromise systems or data belonging to every customer running workloads on that server. Security of the hypervisor is the responsibility of the CSP, and a robust set of controls should be in place to provide customers with assurance that their virtualized systems are safe.

CSP hypervisor security includes preventing physical access to the servers and limiting both local and remote access to the hypervisor. The access that is permitted must be logged, and hypervisor maintenance or administrative access should be thoroughly monitored. The CSP must also keep the hypervisor software current and updated with patch management procedures.

The virtual network between the hypervisor and the VM is also a potential attack surface. The responsibility for security in this layer is often shared between the CSP and the customer. In a virtual network, you have virtual switches, virtual firewalls, virtual IP addresses, etc. The key to security is to isolate each virtual network so that only permitted traffic is able to move between virtual networks. This isolation will reduce the possibility of attacks being launched from the physical network or from the virtual network of other tenants, preventing attacks such as VLAN hopping, while preserving desired system communications such as access by a web server to application servers in a different VLAN. The security of the network virtualization hardware and software is the responsibility of the CSP, while proper VLAN configuration for a specific application is the responsibility of the customer.

A control layer between the real and virtual devices such as switches and firewalls and the VMs can be created through the use of software-defined networking (SDN). In AWS, when you create a virtual private cloud (VPC), the software-defined networking creates the public and private networking options (subnets, routes, etc.). To provide security to the software-defined network, you will need to manage both certificates and communication between the VM management plane and the data plane. This includes authentication, authorization, and encryption.
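
The following is a minimal sketch of how software-defined networking is exercised programmatically, using the boto3 library against AWS. The region, CIDR ranges, and naming are illustrative assumptions; production networks are normally defined through reviewed infrastructure-as-code templates rather than ad hoc API calls.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")   # illustrative region

# The VPC is the software-defined boundary for this logical network.
vpc = ec2.create_vpc(CidrBlock="10.0.0.0/16")
vpc_id = vpc["Vpc"]["VpcId"]

# A private subnet within the VPC; with no internet gateway attached,
# it has no route to the public Internet by default.
subnet = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.1.0/24")
print("Created", vpc_id, "with subnet", subnet["Subnet"]["SubnetId"])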

The security methods for a virtual network are not that much different from physical networks. However, certain tools may be better suited to certain tasks in a virtual environment, and some legacy tools may not function at all. For example, some network monitoring tools require a device plugged into the mirror or span port of a switch so they can analyze traffic. However, a virtualized network does not provide such a capability, so these types of tools will not provide any security benefit. Using security tools designed specifically for virtual environments or even tools designed explicitly for cloud security are recommended.

The final attack surface in the virtualization space is the VM itself. Responsibility for VM security may be shared, but it is usually the responsibility of the customer in an IaaS model. In the PaaS model the security of the VM is the responsibility of the CSP if the platform is also the responsibility of the CSP, as in a hosted database platform. The CSP is responsible for securing the VM used to provide the hosted database in this model, while the customer would be responsible for secure configuration of the database. If the customer is creating VMs on top of a virtualization platform, then security of any VMs created remains the responsibility of the customer. In a SaaS model, the VM is created and used by the CSP to provide a service, and the responsibility of VM security rests on the CSP.

Storage

Various types of cloud storage technologies are discussed in Domain 1, “Cloud Concepts, Architecture, and Design.” Cloud storage has a number of potential security issues, and the importance of the shared responsibility model with respect to cloud storage cannot be overstated. Data spends most of its life at rest, so understanding who is responsible for securing cloud storage is a key undertaking. At a basic level, the CSP is responsible for physical protection of data centers and the storage infrastructure they contain, while the customer is responsible for the logical security and privacy of data they store in the CSP's environment. The CSP is responsible for the security patches and maintenance of data storage technologies and other data services they provide, while the customer is responsible for properly configuring and using the storage tools.

CSPs provide a set of controls and configuration options for secure use of their storage platforms. The customer is responsible for assessing the adequacy of these controls and properly configuring and using the controls available. These controls can include how the data is accessed, such as over the public Internet, via a VPN, or only internally from other cloud VLANs. The customer is also responsible for ensuring adequate protection for the data at rest and in motion based on the capabilities offered by the CSP. For example, the CSP may provide the ability to encrypt data when it is written to the storage device. If the encryption is done using a customer-provided key, then keys must be securely generated and stored to safeguard the data. Similarly, many cloud storage tools can be configured for either public or private access. Failure to properly configure secure storage using available controls is the fault of the customer.
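
To make this concrete, here is a minimal sketch of a customer exercising available storage controls in AWS via boto3, assuming a hypothetical bucket name: public access is blocked and server-side encryption is required by default. The exact controls and defaults vary by CSP and service, so treat this as an illustration of the customer's configuration responsibility rather than a complete hardening guide.

import boto3

s3 = boto3.client("s3")
BUCKET = "example-customer-data"   # hypothetical bucket name

# Block all forms of public access at the bucket level.
s3.put_public_access_block(
    Bucket=BUCKET,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)

# Require server-side encryption by default for newly written objects.
s3.put_bucket_encryption(
    Bucket=BUCKET,
    ServerSideEncryptionConfiguration={
        "Rules": [
            {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "aws:kms"}}
        ]
    },
)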

In a cloud environment, you lose control of the physical medium where your data is stored, but you retain responsibility for the security and privacy of that data. These challenges include the inability to securely wipe physical storage and the possibility of a tenant being allocated storage space that was previously allocated to you. This creates the possibility of fragments of your data files existing on another tenant's allocated storage space. You retain responsibility for this data and cannot rely on the CSP to securely wipe the physical storage areas.

Compensating controls for the lack of physical control of the storage medium include only storing data in an encrypted format and retaining control of the keys needed to decrypt the data. This permits crypto shredding when data is no longer needed, rendering any recoverable fragments useless.
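
The following sketch, using the Python cryptography library, shows the idea in miniature: data is only ever stored in encrypted form, and destroying the customer-held key is what renders any leftover fragments useless. Key management in practice would involve a hardware security module or key management service rather than an in-memory variable.

from cryptography.fernet import Fernet

# Key generated and retained by the customer, outside the CSP.
key = Fernet.generate_key()
cipher = Fernet(key)

# Only ciphertext is ever written to cloud storage.
ciphertext = cipher.encrypt(b"customer record destined for cloud storage")

# Crypto shredding: when the data is no longer needed, destroy every copy
# of the key. Any fragments of ciphertext remaining on shared physical
# media can no longer be decrypted.
del cipher
key = None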

Management Plane

The management plane provides the tools (web interface and APIs) necessary to configure, monitor, and control your cloud environment. It is separate from and works with the control plane and the data plane. If you have control of the management plane, you have control of the cloud environment; similar to system administration tools, it is a high-value target of attackers and should be tightly controlled.

Control of the management plane is essential and starts with limiting and controlling access. An attacker who can gain access to the management plane will have free rein over your cloud environment. Since virtualized infrastructure is all accessible through management plane tools like a cloud console, the impact of a compromised account or malicious access is significant.

The most important account to protect is root, or any named account that has administrative/superuser functionality. The start of this protection is the enforcement of a strong password policy. The definition of a strong password is an evolving question that has recently been addressed by updated guidance from NIST in SP 800-63, Digital Identity Guidelines. The most important factor in a secure password is length (longer is better, so passphrases are preferred over passwords), along with secure use practices like not sharing passwords or reusing them across sites or services. Human factors include making passphrases easy to remember but difficult to guess, which discourages bad habits like reuse.
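
A minimal sketch of a length-focused check in the spirit of SP 800-63 follows; the minimum length and the breached-password list are illustrative assumptions, not values taken from the guidance.

MIN_LENGTH = 15                                   # illustrative policy choice for privileged accounts
BREACHED = {"password123", "qwertyuiopasdf"}      # stand-in for a real compromised-password corpus

def passphrase_acceptable(passphrase: str) -> bool:
    if len(passphrase) < MIN_LENGTH:
        return False                              # too short: length is the dominant factor
    if passphrase.lower() in BREACHED:
        return False                              # known-compromised values are rejected
    return True

print(passphrase_acceptable("correct horse battery staple"))   # True
print(passphrase_acceptable("P@ssw0rd!"))                       # False, despite the symbols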

A strong password policy needs to be coupled with other measures for the critical root and administrative accounts due to the broad access they grant. MFA should be implemented for these accounts and may take the form of a hardware token that is stored securely or other methods like an authenticator app on a trusted device like the user's smartphone. In general, software solutions add some protection, but not as much as the hardware solutions. Widespread MFA is also relatively new, and novel attacks against it are still developing, so any solution should be routinely reviewed as part of risk assessments to ensure that it is still providing adequate security. SMS was considered an acceptable method for MFA, but attackers found simple ways to capture the codes being sent, which rendered the protection meaningless.
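
As an illustration of the authenticator-app approach, here is a minimal sketch using the pyotp library for time-based one-time passwords (TOTP). In a real deployment the secret would be enrolled in the user's authenticator app and verification would happen server-side during login; the values here are generated inline purely for demonstration.

import pyotp

secret = pyotp.random_base32()          # shared secret enrolled on the user's device
totp = pyotp.TOTP(secret)

code = totp.now()                       # six-digit code the "device" would display
print("Fresh code accepted:", totp.verify(code))        # True within the time window
print("Guessed code accepted:", totp.verify("000000"))  # almost certainly False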

Role-based access control (RBAC) or access groups are another method to limit access to these sensitive accounts. Using RBAC or access groups makes management of these groups and permissions important. If rights are not deleted when an employee changes positions or employment, access can become too broad very quickly. Another step is to limit access to users connecting from a known on-premises network or through a VPN, if remote work is required. This approach, however, is rapidly being replaced by zero trust network architecture (ZTNA), which does away with trusted networks in favor of verifying every user request. This minimizes risks associated with trusted insiders or malicious users who have gained unauthorized access to secure networks.

Another method for limiting access is to use attribute-based access control (ABAC), also called policy-based access control. Using this method, a variety of attributes can be used with complex Boolean expressions to determine access. Typical attributes such as username are one part of the policy, which can also include attributes like the usual time a user logs in, location they log in from, or even details of the device they are connecting from. ABAC can be used to ensure that only authorized users connecting from a corporate-managed endpoint are allowed access. If the user is accessing a system from a new geographic location, ABAC can be configured to require additional proof of the user's identity, such as an MFA prompt. Many social networks implement such a feature to prevent stolen devices from gaining access to social media accounts—even if the authorized user has traveled and taken a device with them, they must prove their identity by answering security questions since the device is not logging in from the normal, expected location.
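
The sketch below shows how such a policy might be expressed in code: access is granted only when a Boolean expression over user, device, and context attributes evaluates to true, with an MFA step-up required for logins from an unexpected location. The attribute names and the policy itself are illustrative assumptions.

from datetime import time

def abac_allow(request: dict) -> bool:
    business_hours = time(7, 0) <= request["login_time"] <= time(19, 0)
    managed_device = request["device_managed"] is True
    expected_location = request["country"] in request["usual_countries"]
    # Step-up authentication: an unexpected location is allowed only with MFA.
    location_ok = expected_location or request["mfa_passed"]
    return request["role"] == "admin" and managed_device and business_hours and location_ok

login = {
    "role": "admin",
    "device_managed": True,
    "login_time": time(9, 30),
    "country": "DE",
    "usual_countries": {"US"},
    "mfa_passed": True,
}
print(abac_allow(login))   # True: unusual location, but the MFA step-up was satisfied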

Each of these methods can make accessing critical root or administrative accounts more difficult for both legitimate users and malicious users alike. How tightly you lock down these accounts is in direct proportion to the value of the information and processes in your cloud. As with all security measures, it requires a balance to create as much security as possible while maintaining reasonable business access.

Root and administrative accounts are typically the only accounts with access to the management plane. The end user may have some limited access to the service offering tools for provisioning, configuring, and managing resources, but should not be allowed access to manage the entire cloud environment. The degree of control will be determined by each business, but end users will normally be restricted from accessing the management plane. The separation of management and other workforce users makes the creation of separate accounts for development, testing, and production an important method of control.

In instances where the management functions are shared between the customer and the CSP, careful separation of those functions is necessary to provide proper authorization and control. In a Cisco cloud environment, the management plane protection (MPP) tool is available. AWS provides the AWS Management Console.

These are some of the methods that can be used to protect the cloud management plane. A layered defense is important, and the amount of work used to protect the management plane is, in the end, a business decision. The cloud security professional must be aware of the methods available for protection in order to be a trusted advisor to the business in this matter. Major CSPs including AWS and Azure publish security reference architectures, which detail best practices for utilizing each cloud's particular services and offerings. This includes security configurations, identity and access controls, and best practices like creating separate management plane accounts for production, testing, and staging environments.

Design a Secure Data Center

Designing a secure data center can be challenging, as it must account for physical siting; environmental, logical, and physical controls; and communication needs. In a cloud environment, many of these traditional concerns are the responsibility of the CSP or cloud vendor, as they have physical control and ownership of the data center and the physical infrastructure. The customer may be able to review the physical, environmental, and logical controls of the underlying infrastructure of the vendor in limited cases but usually has no direct control over them.

The distributed nature of cloud computing environments makes security oversight even more challenging. Major CSPs like AWS, Google, Microsoft, and IBM run data centers all over the world. Even if the CSP allowed auditing of their physical infrastructure, customers would need to travel the world and audit hundreds of data centers! Most CSPs do not allow individual customers to perform audits, but they do provide third-party audit reports like SOC 2 Type II and the Cloud Security Alliance STAR Level 2 report. Customers can review these reports to gain assurance regarding the security controls implemented by the CSP.

Cloud customers have the ability to create a logical data center within the cloud environment. A logical data center is a construct, much like a container, where the customer designs the services, data storage, and connectivity within their instance of the cloud service, and it is effectively a virtual private cloud inside the public cloud infrastructure. The physical mapping of this design to the underlying architecture is not controlled by the customer as it would be in an on-premises data center; however, some CSP offerings do provide additional controls like geographic selection to prevent data from leaving specific countries.

Logical Design

The logical design of a data center is an abstraction. In designing a logical data center, the customer utilizes software and services provided by the CSP, and a chief concern in provisioning such a data center is implementing security controls appropriate to the data that will be handled in the cloud. The needs of a data center include access management, monitoring for compliance and regulatory requirements, patch management, log capture and analysis, and secure configuration of all services used.

In a logical data center design, a perimeter needs to be established with IAM and monitoring of all attempts to access the data. Access control can be accomplished through various IAM methods, including authentication and authorization, security groups, and VPCs, as well as the management consoles. Each major CSP offers services equivalent to software firewalls, traffic monitoring, and security monitoring like intrusion detection systems (IDS), which can be implemented to monitor data center activities and alert on potentially malicious behavior.
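
A minimal sketch of one such perimeter control follows, using boto3 to create a software firewall (security group) that admits HTTPS only from a corporate VPN range. The VPC ID and CIDR block are hypothetical placeholders.

import boto3

ec2 = boto3.client("ec2")

sg = ec2.create_security_group(
    GroupName="logical-dc-web",
    Description="HTTPS from corporate VPN only",
    VpcId="vpc-0123456789abcdef0",                 # hypothetical VPC
)

ec2.authorize_security_group_ingress(
    GroupId=sg["GroupId"],
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 443,
        "ToPort": 443,
        "IpRanges": [{"CidrIp": "203.0.113.0/24", "Description": "corporate VPN egress range"}],
    }],
)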

All services used should have a standard configuration that is reviewed and approved for use by the organization. This configuration is determined by the business and specifies how each approved cloud service is to be configured and can be used, and it is similar to hardening guides or configuration baselines in on-premises environments. Using a standard pattern/configuration makes administering and maintaining cloud services simpler and provides a consistent level of security. Deviations from approved configurations should be approved through an exception process that includes analysis of the risk presented by the deviation and any additional mitigations required. Secure baseline configurations can provide a more secure environment for the data center, and it is essential to monitor these configurations to detect drift. If unexpected changes occur, they should generate an alert and be investigated.
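
Configuration drift monitoring can be as simple as comparing the deployed settings of each service against the approved baseline and alerting on any difference, as in this minimal sketch. The setting names and baseline values are illustrative; major CSPs also offer managed services that perform this comparison continuously.

BASELINE = {
    "encryption_at_rest": "aws:kms",
    "public_access": False,
    "logging_enabled": True,
}

def detect_drift(deployed: dict) -> dict:
    """Return every setting that differs from the approved baseline."""
    return {
        key: {"expected": expected, "actual": deployed.get(key)}
        for key, expected in BASELINE.items()
        if deployed.get(key) != expected
    }

deployed_config = {"encryption_at_rest": "aws:kms", "public_access": True, "logging_enabled": True}
drift = detect_drift(deployed_config)
if drift:
    print("ALERT - configuration drift detected:", drift)   # flags public_access for investigation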

Connections to and from the logical data center must be secured to provide data in transit security. This may be achieved by VPNs, Transport Layer Security (TLS), or other secure transmission methods. With an increasingly remote and mobile workforce, remote access is increasingly important, and a careful logical design can help to provide a secure data center environment while also enabling a remote workforce to accomplish tasks regardless of location.

Tenant Partitioning

Multitenant models make cloud computing more affordable but create some security and privacy concerns. If the walls between tenants are breached, then your data is at risk. Multitenancy is not a new concept, as many business centers physically housed multiple tenants, and colocation data centers supported multiple customers. Both of these examples offered access controls designed to isolate individual tenants, but the risk remains that another tenant could breach the security controls and gain access to physical offices or computer systems belonging to other tenants. Both the provider and the tenant share responsibility for implementing, enforcing, and monitoring controls that address these unique risks in a multitenant environment.

In a cloud computing environment, the separation between tenants can be purely logical, unlike a colocation facility where physically separate server racks are provisioned for different customers. The vendor provides some basic security services, as well as maintenance and other services. Tenant partitioning is typically achieved through the use of access controls in the virtualized environment. Each tenant is able to access their own virtualized resources, but not the resources of other tenants.

Data that is placed into the shared environment is still the responsibility of the cloud customer. Compensating controls like encryption can be used to mitigate the risk of another tenant gaining access to your organization's virtual resources. Additional access monitoring and robust IAM tools provide additional mitigation of this multitenant risk.

In a physical business center, if you lock up your servers and all of your data but leave the keys in a desk drawer or with the business center owner, security is lessened. In the same way, the security provided by encryption is improved if the customer securely maintains their own encryption keys external to the cloud vendor. A well-architected encryption implementation can render attacks against physical hardware or hypervisors less effective, as the attacker gains access to encrypted data but not the keys needed to read it.

Access Control

When creating a logical data center, controlling access is a primary concern. A single point of access makes access control simpler and facilitates monitoring, but any single point can become a failure point as well. If you have a physical data center with multiple doors and windows, securing the data center is more difficult, and this is no different in a logical data center. However, having only a single door with no emergency exits is a recipe for disaster in the event the primary door is inaccessible.

One method of access control is to federate a customer's existing IAM system with the customer's cloud resources. Depending on the sophistication of the customer's IAM and their chosen CSP, this can be a simple and secure way to extend existing IAM policies and tools to new cloud services. This choice allows the customer to control access more directly, and it simplifies oversight into what users have access to and how they are using that access. Many cloud services also offer other options for existing IAM solutions, such as SAML integration that allows per-system integration with an IAM tool rather than full federation.

Another method to prevent cross-connection between cloud and on-premises resources is to use identity as a service (IDaaS) to provide access to a company's cloud services. Gartner refers to this as SaaS-provided IAM or simply SaaS IAM. An IDaaS solution has the benefit of providing a service that is tailored to cloud resources and services, and many IDaaS solutions are adding features that allow integration into legacy environments for organizations that have on-premises infrastructure that is still in use. A SaaS IAM may be included with another CSP service, such as Login With Google as part of Google Workspace SaaS tools, or it may be a stand-alone SaaS tool itself.

Regardless of whether a customer's current IAM system can be used, a well-educated workforce is a critical part of any access control system. Users who understand the fundamentals of access security, such as long passphrases and never reusing passphrases, are a critical element in preventing unauthorized access. This vital line of defense is best built by offering security training and reinforcing best practices through awareness materials.

Physical Design

Physical design is the responsibility of the owner of the cloud data center. This is generally the cloud vendor or CSP. Physical design considerations are the same as for on-premises data centers, but the CSP's task is generally more complex due to the size and requirement for multitenant access.

Location

Physical siting of a data center can limit some risks related to disasters or disruptions. Areas commonly impacted by natural disasters, civil unrest, utility interruptions, or similar problems should be avoided if possible. This is not always a possibility, as many large population centers are also located in areas prone to earthquakes, tornadoes, wildfires, etc. Despite these risks, organizations and people still expect access to services, so cloud data center designers must find ways to mitigate these risks while also locating data centers in areas close to their customers.

Multiple data center locations are recommended when designing an on-premises data center but can be so costly that organizations do not utilize them. Because of the size and scale of operations a CSP is likely to run, multiple data centers are a cost-effective way to mitigate risks related to disasters. No location is immune to all disasters, so locations that have different disaster risk profiles will increase the availability of cloud resources. Each location should also have stable power/utilities and access to multiple communication paths when possible.

Buy or Build

Many business decisions come down to a fundamental question: is it better to build something custom to the organization or buy something premade by another organization? Build to suit is generally more expensive but allows the organization to build exactly what is needed, whether it is a data center, headquarters building, or even software. By contrast, buying a ready-made facility or commercial software is usually cheaper but may not meet all the organization's needs.

Most cloud customers have made the decision to buy their cloud services instead of building their own on-premises data centers and IT service components. If the cost savings generated by cloud services are greater than the costs associated with data security and regulatory compliance, then cloud computing is a sound business decision. If public cloud options are inadequate for the data sensitivity, then a private or community cloud might be a middle-ground option.

CSPs building public clouds generally opt to build data center facilities due to the scale of their operations. When considering a single data center, build versus buy is a legitimate question, but finding existing facilities to support a global network of cloud computing data centers is unlikely. Community or private clouds may be hosted out of an existing data center or colocation facility due to the cost savings of sharing some infrastructure with other tenants. Ultimately, the build-versus-buy decision requires well-documented requirements and a decision process designed to identify whether suitable existing facilities exist and can be purchased. If existing facilities do not exist or do not meet the organization's needs, then building is the best course of action.

Environmental Design

The environmental design, like the physical location of a data center, is the responsibility of the CSP or cloud vendor, with the exception of the private cloud deployment model. Environmental design primarily impacts the availability of cloud resources, such as adequate cooling necessary to keep electrical equipment running. For the cloud customer, reviewing the basic design of a vendor or CSP's environmental controls can be part of the risk analysis of the vendor. Similar to physical security controls, this review may not be performed in person, instead relying on a third-party audit of the CSP's environmental design.

Environmental design is in many cases concerned with the utility services needed to keep computing equipment running, as well as providing for habitation by support staff. This includes the electricity needed to power the hardware comprising the cloud, as well as lighting needed by support personnel working in the facility. Some data centers will make use of water for cooling, while others will use air handling to provide cooling for electronics. Water and climate control are also required by the human occupants of the building, and guarding against leaks is a primary concern wherever water is needed for human occupancy but could cause damage to electrical equipment.

Heating, Ventilation, and Air Conditioning

Appropriate heating, ventilation, and air conditioning (HVAC) support is a requirement for data centers that provide cloud computing services. HVAC concerns will be dependent on the physical location and construction of the facility. A facility in Canada, which has very cold temperatures, will need to consider heating and ventilation more carefully than air conditioning (AC). Similarly, a data center near the equator will have greater concerns for cooling and will likely require more robust AC. Environmental concerns may even be a key consideration for the physical location of a facility—several major CSPs have placed data centers in locations with naturally cold climates, where cold outside air can be used without the need for additional cooling or refrigeration.

HVAC needs should be well understood and used as a requirement when choosing the site for CSPs building a data center or for customers reviewing potential cloud vendors. An HVAC failure can reduce the availability of computing resources much the same as an electrical or communication disruption. Because of the importance of HVAC in a CSP data center, customer reviews of the CSP should include the adequacy and redundancy of HVAC systems. In the event of an environmental failure, moving systems between data centers is possible if an application has been properly architected. This can provide a compensating control if a CSP's HVAC capacity is insufficient or suffers an outage.

A number of documents can help assess HVAC concerns. If a CSP has a SOC 2 Type II report with a review of the availability criteria, the report should contain details about the configuration and redundancy of HVAC systems. Because of the confidential nature of information contained in a SOC 2 Type II report, most CSPs will require a nondisclosure agreement (NDA) prior to sharing. A routine review of the most current SOC 2 report is a critical part of a cloud customer's due diligence for their cloud service vendor. Other documents that may assist in determining a CSP's environmental design sufficiency include business continuity and disaster recovery plans.

Multivendor Pathway Connectivity

Connectivity is critical in a cloud computing environment. The ability for a customer to remain connected to cloud resources requires planning, and the network-accessible nature of cloud services makes communication pathways a potential single point of failure (SPOF). While it is not a perfect network, the Internet rarely suffers complete failures, so in most cases a network issue is likely caused by the CSP or cloud customer's network service provider. Network connectivity provided by multiple vendors is a proactive way to mitigate the risk of losing network connectivity.

The cloud customer should consider multiple paths for communicating with their cloud vendor. In an era of an increasingly dispersed workforce, often working from locations separate from the business network, strategies to keep the workforce connected to the Internet and the cloud vendors must be developed and tested.

In a legacy on-premises network, this often required the organization to acquire network connectivity from multiple Internet service providers (ISPs), but the network-accessible characteristic of cloud computing has made this task easier. A system that users access from their workstations over the Internet can just as easily be accessed by users working from home. If a facility loses connectivity, users may be able to simply work from home. This can even be used for higher-security environments, such as those that require a VPN; as long as users can access the Internet, they can still access the cloud services.

Cloud providers must also deploy strategies that support multiple connectivity options. If the CSP cannot access the Internet, then all customers will be blocked from accessing their cloud services, leading to a denial of service. Best practice for CSPs or data centers is dual-entry, dual-provider for high availability. Physically separated cabling, providing connectivity to two or more ISPs, should enter the facility in two physically separated locations. This way if one vendor loses connectivity, the data center is still reachable, and if an accident affects one side of the building, the facility still has network connectivity from another location.

Design Resilient

Resilient designs are engineered to respond positively to changes or disturbances, such as natural disasters or man-made disturbances. A data center that loses connectivity each time a rainstorm happens is not resilient, while a data center designed with local weather conditions in mind will be able to withstand storms and continue providing service even when severe weather occurs. A-frame chalets are an example of resilient design: in areas of heavy snowfall a flat roof accumulates snow and will eventually collapse under the weight. A pitched roof allows gravity to pull the snow down without causing the roof to collapse, ensuring that the structure can withstand the anticipated weather conditions.

Data center and office facility designs can be made without considering local conditions. While functional, these structures will require additional maintenance and may be rendered unusable in the event of severe weather or man-made disasters. Resilient-design buildings adopt construction principles that make them less likely to be severely damaged or unusable. This includes locating critical systems like HVAC and electrical in areas where they will not be damaged by anticipated disasters such as flooding.

In 2012 Hurricane Sandy caused significant damage to the East Coast of the United States, and many buildings in New York City lost both primary power and backup power due to flooding. Backup generators used to provide emergency power were located in the basements of buildings, and floodwaters caused by the hurricane naturally gathered in these basements because they were below ground level—generators with combustion engines require oxygen and generally do not work underwater. A resilient design should take into account the likelihood of flooding and either locate backup generators away from likely flood areas or provide a method of preventing floodwater from submerging the generators.

Resilient design takes into account multiple physical and environmental aspects, including electrical power, HVAC, human occupation, and design practices like natural heating and cooling alternatives that minimize reliance on electric-powered HVAC. These principles can be applied at a single building, at a community scale, or even at a regional scale. The Resilient Design Institute publishes strategies and principles for resilient design; more information can be found at resilientdesign.org.

Analyze Risks Associated with Cloud Infrastructure and Platforms

All data centers have risk, whether they are organization-controlled, provided by a colocation facility, or hosted by a CSP. Organizations moving from on-premises to cloud hosting must properly consider risks associated with such a move, and organizations that are cloud-native must ensure that they implement risk assessment and management strategies that account for the unique circumstances of cloud infrastructure. Utilizing cloud-based resources is no less risky than hosting your own infrastructure, but with proper controls in place and a thorough cost-benefit analysis, organizations can find both cost savings and increased security in the cloud. For many organizations, a hybrid environment is likely the best solution, as it provides greater control over high-value assets while also providing cost savings.

Risk Assessment

The process of risk management is fundamental to information security, since the entire practice involves mitigating and managing risks to data and information systems. Cloud-based systems can pose a challenge to risk management for a variety of reasons. Many organizations assume that the CSP is in charge of all security concerns and that customers have no responsibility.

This is untrue, as demonstrated by the explicit shared responsibility guidance published by most CSPs: they are responsible for some things like physical and environmental security, but the customer ultimately owns the risk to their data and must manage it accordingly. Doing so can pose another challenge, as cloud providers are third-party vendors to most organizations. Assessing and mitigating risks posed by third parties requires modified risk management practices, and the customer must also be proactive in addressing their responsibilities under the shared responsibility model.

Identification

Identifying risks is the first step in managing them. Many risk management frameworks follow an asset-based approach, which begins with the organization identifying its critical assets. The definition of an asset can vary by organization, but in general an asset is anything essential to or deemed valuable by the organization, such as computer hardware, data, people, business processes, critical suppliers and vendors, and the like. These assets should be documented in an inventory.

Once assets are identified, security practitioners and risk managers can then begin to identify potential causes of disruption to the assets. This can be done as a brainstorming activity, with participants asking “What could go wrong?” Any event that could disrupt the confidentiality, integrity, or availability of data and systems is a potential risk, and risks can be categorized by their source, such as deliberate man-made actions, natural disasters, and errors or omissions.

Several risk frameworks exist that provide processes and procedures for designing and implementing a risk management framework. This includes the process of inventorying assets and identifying risks. Example risk frameworks include ISO 31000:2018, Risk Management; NIST SP 800-37, Guide for Applying the Risk Management Framework to Federal Information Systems; and IT governance frameworks like COBIT. These frameworks provide a structured methodology for risk management, as well as libraries of common risks and threats to help first-time assessors.

Risks specific to cloud environments should be identified when making the decision to use a cloud service. As previously discussed, some common risks include giving up physical control over the infrastructure used to process your data, and the possibility of less customization available when using SaaS. These risks should be monitored over the lifetime of the cloud service; if the risks change significantly, it might be necessary to move to an alternative CSP or bring the system back to on-premises hosting.

Analysis

Analyzing identified risks continues the conversation started by “What could go wrong?” and seeks to answer two questions: “What will the impact be if that goes wrong, and how likely is it to happen?” Risks are typically measured by these two criteria: likelihood of occurrence and impact to the organization.

For example, a transient network outage is highly likely, as most networks experience minor issues that can disrupt traffic. The impact will be determined by the importance of what that outage disrupts: users checking routine emails will be able to continue work until the network is restored. The impact to an emergency call center that relies on network connectivity for calls is higher, as people will be blocked from accessing critical resources. Risk management decisions will be based on these measurements—the email users do not require an expensive, high-availability network, but that mitigation would be justified for the call center.
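
A minimal sketch of this kind of qualitative scoring follows: each identified risk is rated for likelihood and impact on a 1-to-5 scale and ranked by the product of the two. The ratings shown are illustrative, not prescriptive.

risks = [
    {"name": "Transient network outage (routine email users)", "likelihood": 5, "impact": 1},
    {"name": "Transient network outage (emergency call center)", "likelihood": 5, "impact": 5},
    {"name": "Regional data center loss", "likelihood": 1, "impact": 4},
]

# Higher scores indicate risks that justify more expensive mitigations,
# such as a high-availability network for the call center.
for risk in sorted(risks, key=lambda r: r["likelihood"] * r["impact"], reverse=True):
    score = risk["likelihood"] * risk["impact"]
    print(f"{score:>2}  {risk['name']}")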

Analysis of a CSP or cloud solution and the associated risks involves many departments. These include business units, vendor management, privacy, and information security. The new risks with a cloud solution are mostly associated with privacy and information security, though operational concerns should also be addressed. There are some key issues when conducting a risk assessment for a CSP or cloud solution.

One risk to address is authentication. Will the cloud solution provide authentication services, or is the customer responsible for providing that solution? Using the CSP's authentication solution may be simpler for operations, but the organization is giving up control over a key piece of its access management capability, including configuration choices and monitoring capabilities.

If the customer provides their own IAM system, it may be accomplished through a SaaS IAM solution or through federation with the customer's on-premises IAM manager. Each solution has pros and cons. For example, a SaaS IAM system is most likely designed specifically for the cloud, since it is a cloud-based platform itself. This can mean easier integration with disparate cloud service providers but again involves giving up control of a key system to a service provider. Federating an on-premises IAM with the cloud provides more control for the customer but can introduce compatibility issues for legacy IAM systems that are not cloud-native. Operationally, the cost of running the IAM must also be considered, and a SaaS IAM may get support from the finance team since it has cost advantages compared to on-premises infrastructure.

Data security is always a concern when utilizing a third-party service provider. How a vendor encrypts data at rest is important, including the strength of the cryptography used and access controls that prevent unauthorized access by cloud service personnel or other tenants. Security of the data in transit between the organization and the CSP is also critical. In addition, the vendor's third parties can introduce additional risk, so adequate supply chain risk management (SCRM) is required. Data remains the responsibility of the customer even when stored on a vendor's system.

Assessing risks posed by a vendor's policies and processes is a crucial part of SCRM. This includes the vendor's privacy policy, incident response process, cookie policies, information security policy, etc. You are no longer assessing only your organizational policies but also the policies of the organizations with whom you are doing business, since they have access to your data and systems. For some key systems, it is important to assess the support provided for incident response. This includes an assessment of logging support, vulnerability scans of the vendor, application vulnerability scans, and external assessments of the vendor being considered or currently used.

Most CSPs do not allow direct auditing of their operations, due to the number of customers they support. Instead, they provide standardized reports and assurance material regarding their security practices, such as a SOC 2 report, ISO 27001 certification, and more specialized reports showing controls in place for specific types of regulated data. These may include HIPAA for U.S. healthcare data; ISO 27017 and 27018, which deal specifically with cloud services and processing of PII in cloud computing; and FedRAMP for U.S. federal government cloud services. Reviewing these materials allows the organization to analyze the risk and effectiveness of any vendor-supplied controls.

Common Cloud Risks

One risk that has been discussed is the organization losing ownership and full control over system hardware assets. While this change can be difficult, with careful selection of CSPs and the development of SLAs and other contractual agreements, these concerns can be addressed. In addition, the service model used affects the amount of control that is retained, so the organization may be able to balance cost savings with risk by building a system on top of IaaS or PaaS, rather than utilizing a SaaS solution.

Regardless of which deployment or service model is used, there are some risks that are common to all cloud computing environments. These risks, and common mitigation strategies, can include the following:

  • Geographic dispersion of the CSP data centers: The disaster risk profile may be unknown to the customer and may be very different from their own. For example, the CSP data center may be in an area subject to hurricanes, while the customer is not. If the cloud service is properly architected, a disruption at one data center should not cause a complete outage, and it is incumbent on the customer to verify the resilience and continuity controls in place at the CSP.
  • Downtime: CSPs must be network accessible in order for customers to utilize the service. Resilience for network disruptions can be built in multiple ways, such as multivendor connectivity to individual data centers, and the availability of multiple regions or zones should a data center suffer an outage.
  • Compliance: Some data types are regulated, and the use of cloud services can be problematic. For example, privacy data in some jurisdictions cannot be transferred to other countries, so a cloud solution with globally distributed data centers is not appropriate for that type of data. Major CSPs have compliance-focused service offerings, and some cloud services can be architected to avoid issues like transfer of data outside specific geographies.
  • General technology risk: Cloud systems are not immune to standard security issues like system abuse or cyberattacks. CSPs can be large and attractive targets to attackers due to the ability to cause significant damage to a wide number of organizations that utilize the cloud. The CSP's cyber defenses should be documented and tested, and customers should require details of these activities in the form of audits and reports from activities like penetration testing or red team exercises.

Cloud Vulnerabilities, Threats, and Attacks

The primary vulnerability in the cloud is that it is an Internet-based model. Anyone with access to the Internet has the potential to attack your CSP, your cloud vendors, or your own organization. Even organizations that deploy their own dedicated connectivity to a cloud provider could be at risk if the CSP's public-facing infrastructure comes under attack. A private cloud is unlikely to face such an attack, but this protection comes at a higher cost than public cloud services.

Any attack on your CSP or cloud vendor may be unrelated to you as an organization, so typical threats considered when assessing risks may not be adequate to address the threat model of a cloud environment. Threat actors may be targeting the CSP or another tenant of the CSP, or the CSP might be vulnerable to location-based threats due to weather or geological conditions in another part of the world. Customers of the CSP may simply be collateral damage; a denial-of-service attack against the CSP becomes an attack against all the cloud customers as well.

Risks can come from the other tenants as well. If the protections keeping customer data separate fail, your data may be exposed to another tenant, who could be a competitor or bad actor. Even if the exposed data is not used maliciously, it is still a breach and may trigger legal or regulatory issues. Encryption is the best protection, with customer-managed keys that make any exposed data worthless to outsiders. The customer may also consider not storing their most sensitive data in the cloud; a hybrid cloud approach that keeps the most sensitive data on private infrastructure can be an effective mitigation.

There can be an additional risk from the cloud vendor. Employees of cloud vendors have been known to exfiltrate customer data for their own purposes. Contractual language may provide the only remedy should this occur, though CSPs do design their services to minimize these risks—otherwise no customers would trust them with sensitive data or systems. Prevention becomes the best defense. As with other tenants, encryption with the customer managing their own keys (separate from the cloud) prevents data exposure.

The Cloud Security Alliance publishes “Top Threats to Cloud Computing: Egregious Eleven,” a list of common threats and risks to cloud computing, along with recommended mitigation strategies that cloud customers can implement. The full list is available here: cloudsecurityalliance.org/artifacts/top-threats-to-cloud-computing-egregious-eleven.

Risk Mitigation Strategies

There are several approaches to risk mitigation in cloud environments. The start of security is with the selection of a CSP, and a set of documented requirements and comparison of CSP offerings against those requirements is a key due diligence activity. Choosing a CSP that is incapable of meeting the organization's security and operational needs is inherently risky, so selecting a qualified CSP is an essential first step.

Once the CSP is selected, the next step is designing and architecting systems and services. Security should be considered at every step and designed in from the beginning. CSPs offer a wide range of services, so well-documented requirements can once again be useful. If encryption of data at rest is important, then the customer must choose a cloud storage offering with adequate encryption capabilities to start with and then properly configure the service to meet the organization's needs.

The next risk mitigation tool is encryption, which is an essential countermeasure for data security when, as is the case in cloud computing, data is outside the organization's direct control. Encryption should be enabled for all data at rest and data in motion. Most CSP offerings include encryption services; depending on the type of service, some configuration action may be required by the customer. Data in transit must be protected using TLS, IP security (IPSec), VPNs, or another encrypted transmission method, while data at rest can often be encrypted by the CSP's storage platforms using customer-controlled keys. In addition, limiting the ingress/egress points in each cloud service can provide monitoring capabilities to ensure that data security measures are adequately applied.
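As a concrete illustration of configuring CSP-provided encryption, the following sketch uses Python and boto3 to require default server-side encryption with a customer-managed KMS key on an S3 bucket. The bucket name and key alias are hypothetical, and this is a minimal example rather than a complete hardening procedure; equivalent settings exist in other CSPs' storage services, and the important point is that the customer must explicitly verify and configure them.

```python
import boto3

s3 = boto3.client("s3")

# Require default server-side encryption with a customer-managed KMS key
# (the bucket name and key alias below are hypothetical).
s3.put_bucket_encryption(
    Bucket="example-data-bucket",
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "alias/example-data-key",
                },
                "BucketKeyEnabled": True,
            }
        ]
    },
)
```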

Each major CSP provides the ability to manage your secure configuration, monitor changes to cloud services, and track usage. For example, AWS provides Inspector, CloudWatch, CloudTrail, and other tools to assist in managing your cloud environment. However, just like other tools, it is incumbent on the customer to properly use them for implementation and monitoring of security. If you log all system activity but never review it, then suspicious activity will go unnoticed.
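To show what properly using such tooling can look like, here is a minimal sketch that queries recent console logins recorded by AWS CloudTrail with boto3. The event name and look-back window are illustrative assumptions; a real monitoring program would feed queries like this into a regular review process or a SIEM rather than ad hoc scripts.

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudtrail = boto3.client("cloudtrail")

# Pull console login events from the last 24 hours for review.
response = cloudtrail.lookup_events(
    LookupAttributes=[
        {"AttributeKey": "EventName", "AttributeValue": "ConsoleLogin"}
    ],
    StartTime=datetime.now(timezone.utc) - timedelta(days=1),
    MaxResults=50,
)

for event in response["Events"]:
    print(event["EventTime"], event.get("Username", "unknown"), event["EventName"])
```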

Plan and Implementation of Security Controls

The risks associated with cloud computing can be mitigated with the proper selection of controls. This is the same approach used in traditional risk management tailored to the risks associated with cloud computing. Some controls will be designed to compensate for the loss of direct control when using cloud computing, while others will be similar to the measures used for managing risk throughout a supply chain. In both cases, the organization is giving up some direct involvement in return for cost savings and other operational efficiency, so the security program must ensure that these business benefits are not outweighed by security risks.

Physical and Environmental Protection

The location housing the physical servers in a cloud environment must ensure adequate protection against physical threats, like natural disasters, as well as provide necessary environmental controls to ensure that systems remain operational, such as power and HVAC. These controls are usually the purview of the CSP, meaning that a third party is responsible for them, though in a community or private cloud delivery model the organization consuming cloud services may have responsibility.

The primary consideration is the site location, as it will have an impact on both physical and environmental protections. A data center along the waterfront in an area subject to regular hurricanes or flooding is likely to be routinely damaged by these threats. If possible, a less susceptible location should be chosen; otherwise, additional locations and redundant system architecture may be required. For large CSPs, this risk is easily compensated for; the globally dispersed network of data centers utilized by a modern CSP results in data centers that are unlikely to be affected by the same threats.

Once facilities are constructed, there are additional physical and environmental requirements that must be addressed. Cloud data centers share requirements with traditional colocation providers or individual data centers. These include the ability to restrict physical access at multiple points, ensuring a clean and stable power supply, adequate utilities like water and sewer, and the availability of an adequate workforce.

A cloud data center also has significant network capability requirements. All data centers require network capabilities, but a CSP typically serves more customers than other data centers. More than one ISP and redundant cabling into the facility may improve the reliability of connectivity.

The customer has no control over the physical location of a cloud data center, except for a private or community cloud where the customer is also a service provider. Customers of public clouds do have some control over where their data is stored and processed, as most CSPs offer services that can be restricted by geography. To the extent possible, the customer should review the location of cloud data centers and be aware of the cloud vendor's business continuity and disaster recovery plans, as a CSP outage will lead to a loss of system availability. The ability of the cloud vendor or CSP to respond to disasters directly affects the ability of the cloud customer to serve their customers. Ensuring that cloud applications are properly architected to be resilient against loss of a single cloud data center is also an effective control against physical and environmental risks.

System, Storage, and Communication Protection

Properly securing information systems can be a difficult task due to the sheer number of elements that make up a system. Breaking systems down into components and applying security controls at this level can make the overall task more manageable. Information systems commonly comprise processes and operational technology like servers and applications, storage media where data resides, and communication pathways used to place information into the system and retrieve it when needed.

One source for controls is NIST Special Publication 800-53, “Security and Privacy Controls for Information Systems and Organizations,” which contains a family of controls specific to systems and communications. Section 3.18, “System and Communications Protection (SC),” is a comprehensive set of guidelines for addressing system- and communications-specific risks, and similar controls can be found in ISO and other control frameworks. The NIST SC control family includes 51 specific controls, including the following:

  • Policy and Procedures: This is a primary control, as policies and procedures serve as a foundation and guide to other security activities. They establish requirements for system protection, and define the purpose, scope, roles, and responsibilities needed to achieve it.
  • Separation of System and User Functionality: This is a core control. Separation of duties is a fundamental security principle, and separating system administration functions from regular end user functions can prevent users from altering and misconfiguring systems and communication processes.
  • Security Function Isolation: Separating security-specific functions from other roles is another example of separation of duties. Security functions can include configuring data security controls like encryption and logging configuration, as well as separate hardware-based security modules for high-security systems.
  • Denial-of-Service Protection: A DOS attack is a threat to all information systems and specifically those that rely on network connectivity as cloud services do. Preventing a DOS attack involves dealing with bandwidth and capacity issues and detecting attacks. Most CSPs offer DOS mitigation as a service, and there are also dedicated providers like Akamai and Cloudflare.
  • Boundary Protection: This control deals with both ingress and egress protections. This includes preventing malicious traffic from entering the network, as well as preventing malicious traffic from leaving your network, protecting against data loss (exfiltration), and applying protection mechanisms like routers, gateways, or firewalls to isolate sensitive system components.
  • Cryptographic Key Establishment and Management: Cryptography provides a number of security functions including confidentiality, integrity, and nonrepudiation. Ensuring that keys are securely generated, stored, distributed, and destroyed is crucial, especially in cloud environments where physical control is lost over storage media and information system components.

Cloud computing has a shared security model, and which controls are the responsibility of the CSP and which are the responsibility of the customer should be carefully reviewed and understood. For example, the CSP should have a strategy to maintain service availability in the event of a DOS attack, but organizations can also choose services or architect applications to utilize an alternate cloud region or CSP if one is not available.

The cloud customer can also implement controls for system and communication protection, like redundant connectivity in the event of a network outage. This can be achieved via multiple ISPs serving an office building, or users might be assigned cellular hot spots that can be used when wired communications are unavailable. In addition to availability considerations, data confidentiality and integrity in transit must be addressed. Encryption tools like TLS or a VPN can be used to provide confidentiality. Hashing can be implemented to detect unintentional data modifications, and additional security measures like digital signatures or hash-based message authentication code (HMAC) can be used to detect intentional tampering.
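The integrity mechanisms mentioned above can be illustrated with Python's standard library: an HMAC tag computed by the sender and verified by the receiver detects both accidental corruption and deliberate tampering. The key handling shown is deliberately simplified; in practice the shared key would be provisioned through a key management service.

```python
import hashlib
import hmac
import secrets

# Simplified key handling: a real deployment would obtain this key from a
# key management service rather than generating it inline.
key = secrets.token_bytes(32)
message = b"payload sent to the cloud service"

# Sender computes an HMAC tag over the message.
tag = hmac.new(key, message, hashlib.sha256).hexdigest()

# Receiver recomputes the tag and compares using a constant-time check;
# any modification of the message causes verification to fail.
expected = hmac.new(key, message, hashlib.sha256).hexdigest()
print(hmac.compare_digest(tag, expected))  # True when the message is intact
```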

Protection of business data remains a critical issue, and since data spends a significant amount of time on storage media, this is a crucial area of focus. Encryption is, once again, a primary method of providing security for data at rest and is also a key area of shared responsibility. The CSP is responsible for maintaining adequate hardware security as well as configuring storage media to utilize encryption.

Cloud customers must choose and configure storage solutions to meet their needs, such as disabling public access to cloud storage. For added protection, client-controlled encryption may be used, and this requires the customer to securely provision, store, and manage encryption keys. Public cloud storage may be inappropriate for very high-security systems or highly sensitive data, so architecting a hybrid cloud with private data storage can be a viable option.
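A minimal sketch of these two customer responsibilities, blocking public access to a storage bucket and applying client-controlled encryption before upload, is shown below using boto3 and the cryptography package. The bucket name and object key are hypothetical, and key storage is reduced to a comment.

```python
import boto3
from cryptography.fernet import Fernet

s3 = boto3.client("s3")
BUCKET = "example-sensitive-bucket"  # hypothetical bucket name

# Disable all forms of public access to the bucket.
s3.put_public_access_block(
    Bucket=BUCKET,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)

# Client-controlled encryption: the object is encrypted before it leaves
# the customer's environment, so the CSP never handles the key or plaintext.
key = Fernet.generate_key()  # store in a customer-managed vault, never in code
ciphertext = Fernet(key).encrypt(b"highly sensitive record")
s3.put_object(Bucket=BUCKET, Key="records/record-001.enc", Body=ciphertext)
```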

Identification, Authentication, and Authorization in Cloud Environments

Organizations with existing IAM solutions may choose to extend those solutions to newly acquired cloud systems. Organizations may also choose to acquire a SaaS IAM, sometimes known as identity as a service (IDaaS), and this is appropriate for organizations making a clean start with cloud computing services rather than migrating. This is a business decision that must account for existing technology, future growth, and the availability of adequate solutions that meet the organization's needs.

Some CSPs offer their own IAM solution that integrates with other services offered by the CSP. This includes AWS IAM for AWS cloud environments, and Azure AD in the Microsoft Azure cloud. Use of these solutions is simple if all cloud services are running in the respective CSP's environment, but they may be less optimized for other services. The decision to use such an IAM should balance the benefits of efficiency with potential risks like vendor lock-in, single point of failure if all critical services are in the same CSP, and the security capabilities of the CSP's IAM solution.

IAM practices can be generally described by the acronym IAAA, which stands for Identification, Authentication, Authorization, and Accountability. IAM systems should provide capabilities in each of these areas, and some cloud-specific considerations for each are as follows:

  • Identification: Users assert an identity by providing something unique, such as a username or employee ID number. An organization with different IAM solutions for cloud and on-premises environments may face difficulty with visibility into user activity but may also encounter compatibility issues between legacy on-premises IAM and cloud environments. Modern IDaaS tools were built with broad compatibility in mind, and they may provide the best solution to bridge on-premises and cloud.
  • Authentication: Users must prove the identity they assert, typically by providing authentication material like a password, PIN, or biometric scan. Common authentication material like passwords can be managed with an IAM solution, and additional options like multifactor authentication (MFA) should be considered where additional security is required. IDaaS providers such as Microsoft, Okta, and Google offer MFA options like smartphone apps that generate one-time codes (a minimal sketch of how such codes are generated follows this list).
  • Authorization: Users who are successfully identified and authenticated are granted access to the resources for which they are authorized. Managing authorization across cloud systems can be challenging, as different clouds implement different permission models and require unique administration. For example, an organization that utilizes Google Workspace for collaboration, Dropbox for file sharing, and Salesforce for customer management will need to administer user authorization in three separate tools. Solutions like a cloud access security broker (CASB) can be implemented to centralize this control and provide a meta-layer for administering user access, which the CASB then configures automatically on the various target services.
  • Accountability: Users who take action on a system are held accountable for following policies and procedures. Accountability is typically enforced with adequate logging and monitoring of system activity, which is covered in a later section on audit mechanisms.
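The MFA codes referenced in the Authentication item are typically time-based one-time passwords (TOTP, RFC 6238). The following sketch, using only the Python standard library, shows how an authenticator app derives a code from a shared secret; the secret shown is a placeholder.

```python
import base64
import hashlib
import hmac
import struct
import time

def totp(secret_b32: str, interval: int = 30, digits: int = 6) -> str:
    """Generate a time-based one-time password (RFC 6238)."""
    key = base64.b32decode(secret_b32, casefold=True)
    counter = int(time.time()) // interval           # 30-second time step
    msg = struct.pack(">Q", counter)                 # 8-byte big-endian counter
    digest = hmac.new(key, msg, hashlib.sha1).digest()
    offset = digest[-1] & 0x0F                       # dynamic truncation
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % (10 ** digits)).zfill(digits)

# Placeholder secret; in practice it is provisioned when the user enrolls
# their authenticator app and is stored by the IAM or IDaaS platform.
print(totp("JBSWY3DPEHPK3PXP"))
```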

In any IAM deployment, user education is important. End users often take actions that can weaken IAM controls, such as reusing or sharing passwords. An employee may use the same username and password on all work-based systems. If a vendor or CSP system is compromised, then other corporate systems will be similarly compromised since attackers have valid access credentials. The cloud vendor may not notify you of a compromise of the IAM for an extended period of time, if they are even aware of a compromise. This puts your business systems at further risk.

Cloud applications and platforms are, by definition, accessible from anywhere. While this supports a globally distributed workforce, it can also make identifying suspicious user behavior more challenging. In a legacy environment all users access an application hosted in a data center from their workstations in an organization-controlled facility, but a cloud-based SaaS application can have users logging in from various countries as they travel. This makes it more difficult to identify potentially malicious behavior, since a user logging in from a different country is no longer inherently suspicious activity.

More flexible authentication capabilities provide one answer to these challenges. For example, users who log in every day from the same IP address and browser are likely authentic, correctly authorized users. By contrast, a user who logs in from Argentina one day, China the next day, and Canada the day after could be a traveling salesperson, or it could indicate a user whose credentials have been stolen, and attackers are attempting to use the stolen credentials from various countries. Conditional authentication policies, such as requiring MFA when a user logs in from a different country or workstation, are a common feature of cloud computing platforms. They strike a balance between usability and security; requiring MFA at every login is an inconvenience to users, but it is warranted for users whose activity is highly irregular.
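A conditional (step-up) authentication policy of the kind described above can be reduced to a simple decision: challenge for MFA whenever the login context deviates from the user's baseline. The sketch below is illustrative only; real platforms evaluate many more signals, such as device posture, IP reputation, and travel feasibility, and the field names here are assumptions.

```python
from dataclasses import dataclass

@dataclass
class LoginContext:
    username: str
    country: str
    device_id: str

# Hypothetical baseline of each user's recent, known-good login context.
known_context = {"asmith": LoginContext("asmith", "US", "laptop-4411")}

def requires_mfa(attempt: LoginContext) -> bool:
    """Require MFA when the login deviates from the user's baseline."""
    baseline = known_context.get(attempt.username)
    if baseline is None:
        return True  # no baseline yet: always challenge
    return (attempt.country != baseline.country
            or attempt.device_id != baseline.device_id)

print(requires_mfa(LoginContext("asmith", "AR", "laptop-4411")))  # True: new country
print(requires_mfa(LoginContext("asmith", "US", "laptop-4411")))  # False: matches baseline
```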

Audit Mechanisms

It can be more difficult to audit the systems and processes of a CSP, just as with any vendor that is outside the organization's direct control. A customer will not have broad access to the physical security of cloud data centers, vendor networks, or activity logs maintained by the CSP. This is due to both logistics and the presence of other tenants. Most CSPs operate many geographically dispersed data centers and serve hundreds or thousands of customers. The logistics of customers performing physical security control audits make the practice virtually impossible, and similar challenges exist when performing system or network audits.

Broad access to the vendor network is also a security risk. If you have access to the network, you may intercept privileged information belonging to another customer, violating the boundaries needed in a multitenant environment. If the CSP provides one organization with access to data that violates these boundaries, other organizations will trust the CSP less with their sensitive data.

Some cloud services provide access to log data, some do not provide access, and many provide limited access to organization-specific data that does not violate the confidentiality of other tenants. These limited logs can be useful but may not provide all the needed details if a security incident occurs. Different service models will provide different levels of logging ability; for example, IaaS allows the customer to build virtually all the logging infrastructure they need, while SaaS logs may only provide information about user actions and customer-specific application details like patch deployments.

Log Collection

One common problem with logging is the sheer volume of data collected. Setting thresholds is important to ensure that only relevant events are captured, which can cut down on the overwhelming amount of data. Cloud services will offer different controls over what information is logged, but at a minimum, security-relevant events such as the use of or changes to privileged accounts should be logged.

Unusual system activity should also be logged, but this will be very organization-specific since normal activity varies across organizations. Determining which events to log can be difficult, but there are resources that provide useful guidance. NIST SP 800-53 contains a family of controls related to audit logging, and controls such as AU-3, Content of Audit Records, specify the information to capture in each audit record. OWASP also has a cheat sheet for logging that is generally applicable to any system, though it is focused on application-level logging. It can be useful as a guide when configuring SaaS logging capabilities and is found here: cheatsheetseries.owasp.org/cheatsheets/Logging_Cheat_Sheet.html.
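The guidance in AU-3 boils down to capturing who did what, when, from where, and with what outcome. A minimal structured-logging sketch in Python follows; the field names and the sample event are illustrative, not a prescribed schema.

```python
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("audit")
logging.basicConfig(level=logging.INFO, format="%(message)s")

def audit_event(event_type, user, source_ip, resource, outcome):
    """Emit an audit record with the kinds of details AU-3 calls for:
    event type, timestamp, source, subject identity, and outcome."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "event_type": event_type,
        "user": user,
        "source_ip": source_ip,
        "resource": resource,
        "outcome": outcome,
    }
    logger.info(json.dumps(record))

# Illustrative example: a privileged role assumption.
audit_event("privileged_role_assumed", "asmith", "203.0.113.7",
            "billing-admin-role", "success")
```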

A log aggregator can ingest the logs from all on-premises and cloud resources for review in the security operations center (SOC). These logs must be monitored daily, with privileged account alerts receiving immediate attention. Unreviewed logs still hold value for incident response and audit purposes, but without regular and consistent review, the full value of log collection is never realized.

Correlation

Log data is generated to provide an account of past activity. Log data generated across multiple systems, such as an IAM platform, user workstation, and various cloud applications, can make it difficult to find relationships due to the volume of data. Correlation refers to the ability to discover relationships between two or more events. Data points in logs can be correlated to gain important insight into system activity, such as establishing a baseline of normal activity and detecting anomalous events.

A user logging into their workstation at a normal time of day—during regular business hours—and accessing a cloud-based webmail client is a normal event. That same user logging in from a mobile device in a different country might be normal, if the user frequently travels, or it might indicate that the user's credentials were compromised. Correlating the information in logs, such as baseline normal activity and other user activity, can allow for the detection of potential security incidents.
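Correlation can be as simple as joining login events from different sources by user and looking for combinations that violate a baseline, such as logins from two countries within a window too short for travel. The sketch below works on already-parsed events with illustrative field names and dates; in practice this correlation is done at scale by dedicated platforms, as described next.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Simplified, already-parsed events from two sources (an IAM platform and a
# SaaS application); field names and values are illustrative.
events = [
    {"user": "asmith", "time": datetime(2024, 5, 1, 9, 5),  "source": "iam",  "country": "US"},
    {"user": "asmith", "time": datetime(2024, 5, 1, 9, 40), "source": "saas", "country": "CN"},
    {"user": "bjones", "time": datetime(2024, 5, 1, 10, 0), "source": "iam",  "country": "US"},
]

def impossible_travel(events, window=timedelta(hours=2)):
    """Correlate events per user and flag logins from different countries
    occurring within a window too short for physical travel."""
    by_user = defaultdict(list)
    for e in sorted(events, key=lambda e: e["time"]):
        by_user[e["user"]].append(e)
    alerts = []
    for user, evts in by_user.items():
        for a, b in zip(evts, evts[1:]):
            if a["country"] != b["country"] and b["time"] - a["time"] < window:
                alerts.append((user, a["country"], b["country"], b["time"]))
    return alerts

print(impossible_travel(events))  # flags asmith's US -> CN logins 35 minutes apart
```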

Log centralization, correlation, and other activities like detection and alerting are often performed by a security information and event management (SIEM) platform. These tools and capabilities are discussed in more detail in Domain 5, “Cloud Security Operations.”

Packet Capture

Packet capture, sometimes shortened to pcap, refers to the capture of network communication packets. These can be used to monitor activity on a network, detect malicious network activity like data exfiltration, and can be useful in detecting certain types of network attacks like worms propagating between systems. Packet capture can be performed by running applications like Wireshark on a computer attached to a network or by copying data that is sent through a network device like a router or switch.

Since cloud environments are broadly network accessible and are maintained by third parties, packet capture can be difficult or impossible. If users are not connected to a centrally managed network, then packets sent between their workstations and the cloud are difficult to capture without specialized software running on the device they are using. Similarly, the cloud environment may not provide any facility for capturing packets. This is often the case with SaaS applications that run in a web browser: the user's network traffic is encrypted from their workstation to the CSP's network, so even if the packets are captured, the data is essentially worthless for monitoring purposes.

If packet capture is necessary in a system, some CSPs provide tools that allow packet capture functionality to a limited degree. If a customer is utilizing a hybrid cloud or mix of on-premises and cloud systems, packet capture is possible in customer-controlled networks. To achieve the same functionality in a CSP's public cloud, tools such as the following might be used:

  • Amazon provides AWS VPC Traffic Mirroring. This feature allows a customer to mirror the traffic of any AWS network interface in a VPC they have created and to capture that traffic for analysis by the customer's security team, using packet analysis tools such as CloudShark. It creates what is essentially a virtual network tap and can capture network traffic within the customer's virtual private cloud. Non-VPC traffic cannot be monitored, as this could lead to a breach of other tenants' data.
  • Microsoft provides a similar capability with Azure Network Watcher. This tool allows packet capture of traffic to and from a customer's VMs. Packet capture can be started in a number of ways. This toolset also allows for packet capture to be automatically triggered by certain conditions, reducing the amount of noise that investigators must navigate.
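For the customer-controlled networks mentioned earlier, conventional capture tools still apply. A minimal sketch using the Scapy library is shown below as one illustrative option (Wireshark or tcpdump would serve equally well); it requires administrative privileges on the capturing host.

```python
# Minimal on-premises capture sketch using Scapy (pip install scapy).
# Capture within a CSP's network instead requires the provider features
# described in the list above.
from scapy.all import sniff

def summarize(packet):
    # Print a one-line summary of each captured packet.
    print(packet.summary())

# Capture 100 packets of HTTPS traffic on the default interface.
sniff(filter="tcp port 443", prn=summarize, count=100)
```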

The specific tools available and the use of these tools will change over time. For security purposes, the ability to capture packets in the cloud services used by the customer can be important. This is a requirement that the security team should thoroughly document prior to a cloud migration, and the organization should evaluate cloud services to ensure that they meet these needs.

Plan Disaster Recovery and Business Continuity

The cloud has transformed both disaster recovery (DR) and business continuity (BC) by providing the ability to easily and cost-effectively operate in geographically distant locations and by providing greater hardware and data redundancy. All of this leads to lower recovery time objectives (RTOs) and recovery point objectives (RPOs) at price points organizations could not achieve previously. While DR and BC are still important, this change has also enabled organizations to plan for cyber resilience in new ways—rather than making plans for what happens if a data center is destroyed, cloud applications can be architected to remain online even if multiple data centers, system components, or network segments fail.

Business Continuity/Disaster Recovery Strategy

Business continuity planning is concerned with ensuring that the organization is able to continue functioning after an adverse event has occurred. The result of the planning process is a business continuity plan (BCP), sometimes called a continuity of operations plan (COOP). This is a proactive risk mitigation strategy, as the BCP contains likely scenarios that could affect the organization and provides guidance on how the organization should respond.

Not all the processes an organization executes will be covered in a BCP or COOP. Some processes are essential to the organization's continued functioning, such as the ability to handle payments or communicate critical information to employees, while others can be considered “nice to have,” such as internal communications about company social events. The business impact analysis (BIA) is used to determine which processes are critical and which are not. The impact of specific systems and processes is measured by the BIA, and any that are deemed critical to the organization's functioning must be prioritized in an emergency situation.

For example, many financial services firms are required to plan for pandemic readiness assuming that some percentage of staff are not available due to illness and that even healthy staff may be restricted from working in the office to contain the pandemic spread. In a scenario like this, the organization must have adequate remote work capabilities, such as a VPN, and necessary supplies for workers to utilize when working remotely. This ensures that critical functions of the firm, such as handling financial transactions, can continue even with reduced staffing levels. Cloud services can be easily integrated into this plan, as the broad network accessibility they provide means that workers can easily log in and work from any location.

Other common BCP scenarios include risks like natural disasters, loss of utilities, and civil unrest or war. If needed, the BCP is activated by formal declaration of a disaster, and the organization switches into contingency operation mode following the guidelines established in the BCP. The goal of the BCP is to maintain critical business processes and operations, which may involve alternate work locations, alternative business processes, or even the use of cloud environments in place of on-premises data centers if those facilities are unusable.

While a BCP keeps business operations going after a disaster, the disaster recovery plan (DRP) works on returning business operations to normal. Not all incidents will require the activation of both plans. A major ISP failure might activate the BCP and send workers to an alternate building while the ISP restores connectivity. In contrast, a natural disaster like a tornado that destroys an office or data center will require activation of both the BCP and DRP.

If a tornado renders a primary data center unusable, procedures in the BCP are used to handle shifting to an alternate processing facility, such as temporary use of cloud computing. The DRP is activated to handle the process of cleaning up and restoring or replacing the destroyed facility. DRP operations are concluded when the infrastructure and facilities are returned to their normal, operational state; the BCP operations are concluded when the organization's processes return to the pre-incident state using the restored or rebuilt infrastructure.

A cloud data center that is affected by a natural disaster will likely activate multiple BCPs and DRPs. The cloud provider will obviously activate both plans to deal with the interruption to their service; one key element of the BCP is communicating incident status to relevant parties. In this case, the CSP must communicate to all customers impacted by the incident. In turn, the customers may activate their BCPs or DRPs as needed to handle the interruption to their services, such as migrating applications to another cloud region or possibly even another CSP for the duration of the outage.

The cloud supports the BCP/DRP strategy. The high availability of cloud services provides strong business continuity and can serve as a core part of an organization's BCP strategy. In effect, the cloud allows the business to continue regardless of disasters affecting the customer's business facilities. The high availability of cloud services also impacts DRP strategy, as the cloud can provide resilient services at cost-effective price points. Customers may focus more on resilient services that survive a disaster rather than on processes to recover from disasters. As always when discussing security, BC and DR strategies should prioritize the life, health, and safety of humans over systems and data.

The customer is responsible for determining how to recover in the case of a disaster and may choose to implement backups, or utilize multiple availability zones, load balancers, or other techniques to provide disaster recovery and resilience. A CSP can help support recovery objectives by ensuring that two availability zones never share a single data center; otherwise, one disaster could affect both zones at once. CSPs can also provide monitoring that serves as an early warning system, such as detecting a spike in traffic or a utility issue that could lead to an incident. This gives customers the ability to take preventive actions like shifting to other availability zones or activating alternate processes.

Business Requirements

Business-critical systems require more recoverability than is often possible with local resources and corporate data centers. Designing high availability or resilient systems in such an environment requires significant resources and is cost-prohibitive for many organizations. Cloud computing provides options to support high availability, scalable computing, and reliable data retention and data storage at significantly reduced cost.

Although it is cheaper to build highly resilient systems in the cloud, there are still costs associated with such an architecture. To justify system and architecture resiliency decisions, there are three important metrics that the organization should consider. These metrics are outlined in the following sections and include the RTO, or how long the organization can afford to be down; the RPO, or how much data the organization can afford to lose; and the recovery service level (RSL), which measures how much computing power (0 to 100 percent) is needed for production systems during a disaster. This does not usually include compute needs for development, test, or other environments, as these are usually nonessential during a disaster.

Recovery Time Objective

The amount of time the organization is willing to do without a system can be used to establish the recovery time objective (RTO). The organization may only be able to function for a matter of minutes without a mission-critical system, so adequate resources and designs must be implemented to ensure that disruptions are resolved swiftly. Such a system would have a very short RTO. Less critical systems, like email or collaboration tools, can go offline for hours or days before the organization suffers permanent harm, so these systems will have a longer RTO.

All systems should be categorized based on their criticality to the business, and this is usually achieved by performing a business impact analysis (BIA). Once prioritized, all systems should have a maximum tolerable downtime (MTD) established, which represents the absolute limit of time the organization can live without the system. The RTO should be less than the MTD; otherwise, a system disruption will have significant negative consequences.

At the end of the day, RTO is a business decision and not an IT decision. The role of IT is to support the business with options and costs. When conducting a BIA, many system owners will rank their systems as critical or essential but then reevaluate that ranking once the BC and DR costs of a short RTO are presented. Business leaders need this information to make decisions about system design, architecture, and recovery strategies, and once a decision is made, IT is responsible for implementing the resources needed to recover within the defined RTO.

Recovery Point Objective

The RPO is a measure of the amount of data that the business is willing to lose if a disaster or other system disruption occurs. RPO is most often used to guide the design and operation of a backup strategy, and it can be expressed as a number of transactions or an amount of time. If the organization can afford to lose the last 24 hours of data but no more, then backups must occur at least daily. A weekly backup would fail the 24-hour RPO, because up to seven days of data could be lost if a disaster occurs.
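The relationship is simple arithmetic: the interval between backups bounds the data that can be lost, so a schedule satisfies the RPO only if that interval does not exceed it. A tiny sketch of that check:

```python
def meets_rpo(backup_interval_hours: float, rpo_hours: float) -> bool:
    """The worst-case data loss equals the gap between backups, so the
    schedule meets the RPO only if that gap does not exceed the RPO."""
    return backup_interval_hours <= rpo_hours

print(meets_rpo(24, 24))       # daily backups vs. 24-hour RPO -> True
print(meets_rpo(24 * 7, 24))   # weekly backups vs. 24-hour RPO -> False
```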

Backup frequency and cost are generally correlated; more frequent backups require more resources for data storage and compute power. As with RTO, many business leaders might want the most robust backup strategy possible, but the associated costs must be weighed against the value of the data. Manual weekly or monthly backups of nonessential systems may be appropriate given their value. Regulated data like PII or financial information needed for public disclosure might carry regulatory fines if reasonable data backups are not performed, so the cost of noncompliance is greater than the backup costs.

RPO can also be measured by the number of transactions the organization can afford to lose. These do not necessarily have to be financial transactions, but instead refer to data transactions. A user enters information into a web application and hits Save; the steps needed to process and store that data constitute a transaction. A low RPO might state that only the current in-process transaction can be lost, which means that all previous transactions either are backed up or can be re-created.

Cloud storage solutions can make such a low RPO cost-effective, because they are designed for high data durability. Even if there is a system failure, the data has been duplicated and is stored in multiple locations for recovery. Alternatively, application-level controls might be implemented to allow transactions to be recovered if an error occurs. One such approach is journaling transactions, which creates a separate journal of activity performed. If a database or storage media is destroyed or corrupted, the actions in the journal can be replayed to reconstitute the data and meet the RPO.
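Transaction journaling can be sketched in a few lines: every transaction is appended to a durable journal before it is applied, and after a failure the journal is replayed to reconstitute state up to the last committed entry. The file name and record format below are illustrative.

```python
import json

JOURNAL = "transactions.journal"  # illustrative append-only journal file

def record(transaction: dict) -> None:
    """Append each transaction to the journal before applying it."""
    with open(JOURNAL, "a") as f:
        f.write(json.dumps(transaction) + "\n")

def replay() -> dict:
    """Rebuild application state by replaying the journal after a failure."""
    state = {}
    with open(JOURNAL) as f:
        for line in f:
            txn = json.loads(line)
            state[txn["key"]] = txn["value"]
    return state

record({"key": "order-1001", "value": "shipped"})
record({"key": "order-1002", "value": "pending"})
print(replay())  # {'order-1001': 'shipped', 'order-1002': 'pending'}
```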

Recovery Service Level

RSL measures the compute resources needed to keep production environments running during a disaster. This measure, from 0 to 100 percent, gives an indication of the amount of computing used by production environments when compared to other environments like development, test, and QA. These nonproduction environments can be shut down during a disaster, with the goal of reducing unnecessary resource utilization.

During a disaster, the focus is on keeping key production systems and business processes running until the disaster is largely resolved and other activities can return to pre-disaster levels. The customer organization should identify these service levels and ensure that any BC or DR actions provision enough capacity to meet the RSL. For example, shifting an on-premises workload that runs on 100 web servers to a cloud environment provisioned with only 10 web server VMs may appear inadequate. However, if 90 of those servers support development and testing activities, then the BCP and DRP should specify that only the 10 production servers are migrated in the event of a disaster. Testing and development activities can follow ad hoc procedures to recover, or those activities may be suspended until the normal environment is recovered or rebuilt.

Creation, Implementation, and Testing of Plan

There are three parts to any successful BCP/DRP, and each is vital to the success of the organization. These include the creation, implementation, and ongoing maintenance and testing of the plans.

Because of their inherent differences, BCPs and DRPs require different roles and responsibilities within the business. Although frequently combined and considered a single practice area, the two plans require different sets of resources and focus on different activities. If the business does not continue to function, which is the goal of the BCP, there is no reason to have a DRP, as there will be nothing to return to normal.

The plans should each be developed with knowledge of the other, including key integration points, joint responsibilities between the BC and DR teams in the event of a disaster, and definitions of both common and unique resources. These plans will include many of the same members of the workforce and serve the same ultimate goal of helping the organization continue to function during and after a disaster, but they are separate plans.

Plan Creation

Performing a BIA is the first step in any comprehensive BCP. The BIA identifies the impact of process/system disruption and helps determine time-sensitive activities and process dependencies. It also identifies critical business processes and their supporting systems, data, and infrastructure. The BIA is essential to determining requirements and resources for achieving the RTO and RPO necessary for business continuity.

Systems can be grouped in a number of ways, such as by identifying critical processes, important processes, and support processes, along with the systems and data that support these processes. Critical business processes are those that impact the continued existence of the organization. Email, for example, is rarely a critical business process. For a company like Amazon, inventory, purchasing, and payment may be critical processes, and the systems that support them are also critical to the continued functioning of the company.

Along with the selection of critical processes is a prioritization of systems. If you can bring up only one system at a time, which is first? Some systems will not work until other systems are back online, such as a single sign-on (SSO) required for users to access other systems. In that case, the support system recovery must be prioritized before the processes that depend on it can be recovered. All critical processes and systems must be prioritized over other processes.
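Recovery ordering is essentially a dependency problem: a system cannot come online before the systems it depends on, such as the SSO example above. The sketch below uses Python's standard graphlib module to derive a valid recovery sequence from a hypothetical dependency map.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each system lists the systems that must be
# restored before it can come back online.
dependencies = {
    "payments":     {"sso", "inventory-db"},
    "inventory":    {"inventory-db"},
    "sso":          set(),
    "inventory-db": set(),
}

# static_order() yields a sequence in which every system appears only
# after all of its dependencies.
recovery_order = list(TopologicalSorter(dependencies).static_order())
print(recovery_order)  # e.g., ['sso', 'inventory-db', 'payments', 'inventory']
```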

Important processes should be prioritized after critical processes are recovered. These are processes that are a value-add and can have an impact on the organization's profitability or smooth operation but are not absolutely essential for survival. Once again, these must have a prioritization or ordering.

After the important processes are online, other useful and beneficial processes are brought back online. It is possible that a business decision may be made to not restore some processes until restoration of normal operations is achieved. In the event of a disaster, an employee social calendar app is unlikely to be useful, so restoring this system before the disaster is concluded is not helpful.

Processes based on legacy systems can be difficult to return to pre-disaster configurations. Unless a legacy system is essential to the business, the technical debt of continuing to run it may make it expendable, and it may be worthwhile to consider replacing and retiring such systems prior to an emergency. Legacy systems often have legacy hardware and software requirements that are not easily restored after a disaster.

The challenges associated with legacy system recovery are an operational consideration and may be compounded by security risks like lack of support for SSO or MFA. These risks should be accounted for as part of risk assessment activities, and the proper mitigation may be a migration to a newer platform. One advantage of a cloud-based BCP/DRP is the expectation of a modern and up-to-date infrastructure. While a legacy system could potentially be hosted in the cloud, the move to a cloud platform may provide the opportunity to modernize.

The creation of the DRP often follows the BCP. Knowing the critical business processes and the supporting infrastructure needed provides a roadmap to returning the system to pre-disaster operations.

Both the BCP and DRP should be created considering a range of potential disasters. History is a good starting point for environmental disasters, and government agencies such as FEMA and private/public partnerships like InfraGard can provide guidance on the likely disasters that need to be considered in any geographic location. The CSP is also a source for BCP/DRP planning. Recommendations and solutions for surviving specific business disruptions and returning to normal operations are available from major CSPs, and security reference architectures published by the CSPs highlight BC and DR best practices and recommendations.

The creation and implementation of a DRP or BCP is a lengthy process. It is simplified by having the system criticality and prioritization determined prior to beginning plan development. The identification, criticality, and prioritization of business processes, systems, and data are a necessary first step to creating a complete and functional plan. If this prerequisite step is omitted, the creation of the plan can be disruptive. Often, every business unit and all process owners view their processes, systems, and data as important, if not essential. It is important for senior leadership to make these decisions in advance so that each group knows their place within the plan for bringing the organization back online.

BCP/DRP Implementation

Identifying key personnel is a crucial step in creating and implementing a BCP or DRP. The executive sponsor provides key resources for creating the plan. Implementation of the BCP/DRP can include identifying alternate facilities, contracts for services, and training of key personnel, and this process requires the resources the executive sponsor brings.

Implementing BCP or DRP processes may necessitate utilizing cloud computing for critical services, which takes advantage of the cloud's high availability features like multiple availability zones, automatic failover, and even direct connection to a CSP. These choices come with costs that must be considered. The cost of high availability in the cloud is generally less than a company trying to achieve high availability on its own, and in many cases the cost of building resiliency is far less than the cost of a business interruption. By identifying the critical business processes, a business can also avoid the cost of implementing high availability for noncritical systems. In many conversations on availability with business process owners, it becomes clear that everyone wants high availability, but justifying the associated costs is a good way to identify those systems that do not actually need it.

Critical business processes can be supported through component/data center redundancy and may run as an active-active configuration. This allows near instantaneous continuation of critical processes. Important but noncritical business processes may use the less expensive active-passive configuration. This allows a more rapid restoration of services when a specific region or zone becomes unavailable. Less important business processes may be run in a single region or zone and may take more time to return to service, and this configuration usually operates at a much lower cost.

Methods to provide high availability and resiliency continue to evolve. The ability to automate monitoring and deployment through orchestration and other methods supports high availability in the cloud and automated failure recovery that is faster than manual recovery options. New tools and methods will continue to be developed that should lead to increasingly resilient cloud environments at attractive prices.
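At its simplest, automated failure recovery pairs a health check with a failover decision. The sketch below illustrates the idea for an active-passive deployment with hypothetical health endpoints; real orchestration would also update DNS records or load-balancer targets and would typically be driven by the CSP's own tooling.

```python
import urllib.request

# Hypothetical health-check endpoints for an active-passive deployment.
REGIONS = {
    "primary":   "https://app.primary.example.com/health",
    "secondary": "https://app.secondary.example.com/health",
}

def healthy(url: str) -> bool:
    """Return True if the region's health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=3) as resp:
            return resp.status == 200
    except OSError:
        return False

def select_active_region() -> str:
    """Fail over to the secondary region when the primary is unreachable."""
    return "primary" if healthy(REGIONS["primary"]) else "secondary"

print(select_active_region())
```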

BCP/DRP Testing

A BCP and DRP should be tested at least annually. This test should involve the key personnel needed for all disasters. In addition, many tests will include a roleplaying scenario that involves members of the workforce who have responsibilities under the BCP or DRP. This training is useful in the event of a disaster, as employees will be familiar with processes and able to execute them more quickly and effectively in stressful situations.

There are some basic scenarios that apply to most if not all organizations. It is not necessary to test all of them each year, but a robust plan should test all likely scenarios over time. A well-tested plan will function even for an unexpected disaster. Common disaster scenarios include the following:

  • Data breach
  • Data loss
  • Power outage or loss of other utilities
  • Network failure
  • Environmental or natural disasters (e.g., fire, flooding, tornado, hurricane, or earthquake)
  • Civil unrest or terrorism
  • Pandemics

The plan should test the most likely scenarios first and can be tested in a number of ways. The maturity of the plan and the people implementing the plan will determine the type of testing that takes place. There are a variety of standard test methods.

Tests should be both scheduled and unscheduled. Particularly for new plans and organizations immature in the BCP/DRP space, a scheduled test ensures that key personnel are available and will begin the maturation process. Tests that have the potential of being very disruptive should also be scheduled to minimize disruption.

Eventually, some tests should be conducted without prior warning. A surprise test is not the same as an unscheduled one: only high-level approvers and a few key individuals are aware of the upcoming test, which gives a more realistic result since members of the recovery team respond as if an actual disruption took place. The high-level approval is essential in case some amount of unexpected business disruption occurs. Executive approval includes a review of potential disruption and benefits so that a business decision can be made by those responsible. Key personnel who are part of the test planning can be positioned in advance of the test in order to monitor performance metrics and to call a halt to testing if a genuine disruption occurs as a result.

The simplest test method is a tabletop exercise. A tabletop is usually performed in a conference room or other location around the “table,” where key personnel are presented with a scenario and then work through the plan verbally. This is usually the first step for a new plan and can identify missing pieces in the plan or steps that need greater detail. The next step may be a walk-through where key personnel move to the appropriate locations and verbally verify the steps, sometimes even performing some parts of the plan. While developing a plan, regular tabletops and walk-throughs can help flesh out a more robust plan, identify incorrect assumptions, and even test logistics. A tabletop or walk-through can also be useful for new members of the team to assist them in learning their responsibilities under the plan.

A more substantial test is a simulation. Like a fire drill or a shelter-in-place activity, a disaster is simulated, and the plan is exercised while normal operations continue. A simulation is more robust and detailed than a walk-through, with personnel performing certain steps simulating their response to a disaster, which may include actual work like moving to an alternate facility or spinning up cloud resources that would need to be provisioned in the event of a disaster.

The next level of testing is a parallel test. Care must be taken not to disrupt normal business operations when performing a parallel test, which requires key personnel and workforce members to perform the steps needed in case of a disaster. More than simulating the steps, they actually perform the steps to ensure that they can accomplish the critical business processes using the actions documented in the plan. Processing happens in parallel with the still-running primary systems, and parallel testing can be useful to ensure that systems are able to handle current workloads or produce expected outputs. The output of the parallel test can be compared to outputs from actual production environments to identify any gaps.

The most robust level of testing is a full cutover test. In this test, the actual steps to be taken in a disaster are performed. The primary system is disconnected, and the business attempts to run critical functions. This is a high-risk test, as the critical functions may fail, so only the most mature organizations and BC/DR teams should attempt a full cutover test.

Summary

Moving to the cloud is a business decision, often driven by the financial change of shifting from capital expense to operating expense. Cloud environments offer native high availability and resiliency at a price that is attractive to businesses of all types. They also provide capabilities not available to many organizations, such as active-active configuration without the need to fully build out multiple data centers. To achieve these benefits, a customer must have an understanding of the key infrastructure pieces of a cloud environment, architect systems that take advantage of these features, and configure the features correctly. The customer must also understand the shared security responsibility between the CSP and the customer. Finally, a customer must understand how to properly configure and use the cloud resources they purchase to ensure that appropriate controls are in place to secure the process and data.
