Chapter 10. Planning your business continuity strategy

In this chapter, you will learn about:

  • Planning for your business continuity needs

  • Exploring the backup and restore features native to SharePoint

  • Avoiding service disruptions

  • Implementing various business continuity techniques

  • Implementing network load balancing

There are many factors that must be evaluated during the planning phase of your Microsoft SharePoint 2013 farm, and one of the most crucial aspects is ensuring that the implementation continues to operate when unforeseen issues plague the environment. Up until this point, you may have started to formulate a design of how everything should be structured, but does your design include protecting your investment?

Protecting SharePoint is often overlooked. One possible reason for this may be that there isn’t a feature in SharePoint that you can enable that automatically protects your implementation. You can visit Central Administration and perform full-farm or granular-level backups, but these features don’t promote the continuity of business. They are more or less an option for disaster recovery.

Disaster recovery (DR) certainly has its place within how you are going to plan your SharePoint environments, and this chapter covers some of its features throughout. A DR plan is great for when you need to recover a deleted item within SharePoint, but it doesn’t protect SharePoint from service disruptions. First, how does SharePoint fit into your organization? More specifically:

  • What is the role of SharePoint in your organization?

  • Do you have plans to integrate it with your accounting software or other aspects of your business?

  • Do you consider SharePoint to be mission critical?

  • Have you performed a business impact analysis (BIA) on your environment?

In this chapter, you’ll learn what needs to be protected and how to plan for this, as well as what the business continuity management (BCM) objectives are and how they might define an organization’s tolerance toward data or service loss. You’ll then use that information to explore options for mitigating risk.

Planning for your business continuity needs

The only sure way to be successful in protecting your data is to properly plan what you are going to protect and then how you are going to protect it. The first step is to perform a BIA to determine which business processes are critical to your business and then identify the core components that those business systems rely on. In SharePoint, everything comes down to one piece: the data. This could simply be the database, but in some cases it may include Remote BLOB Storage (RBS) shares.

Understanding what to protect

The databases in SharePoint are the most common protection points and are considered the principal vehicles for service and recovery of service. If you lose a SharePoint server that is servicing the Web role, rebuilding your farm may be difficult if you have custom solutions improperly applied to the farm or have custom web.config files in your web applications, but overall, the data is still available and the farm can be rebuilt. Your objective here is to think outside of the single-server environment to ensure that your system can withstand a failure and that you stay within your BCM objectives.

Applying BCM objectives

BCM objectives are the ultimate requirements guide for how the SharePoint implementation should be built, but they may be difficult to obtain and stay within budget. As you’ll see later in this chapter, each layer of protection comes at a cost. It is up to you to know what the technologies are and how they work; but ultimately, it is up to the business to decide if what they are asking for is worth it. Once the business has dictated their needs, a service-level agreement (SLA) can be created and then honored. An SLA is a negotiated agreement between a service provider and a customer. Often, the service provider may actually be an internal IT group and the customers will be various lines of business.

While the SLA may specify several topics, the focus of this chapter is on the tolerable amount of service loss and the tolerable amount of data loss. These are often referred to as Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO). To simplify these terms even further, examine the following:

  • RTO. This specifies the acceptable amount of time that the organization can be without the system. This can range from seconds to days. For many small organizations, this is 8 hours.

  • RPO. This specifies the acceptable amount of data that the organization can lose. For many small organizations, this may be 24 hours.

When stakeholders determine the value that is to be associated with both the RTO and RPO, they are naturally going to determine that zero is the best value. In their mind, they should never lose any data, and it should always be accessible. The problem is that this is an unachievable metric, and the closer you can get to zero, the more expensive it will be.

As you progress through this chapter, you will learn why these zero value-based objectives are unachievable as you gain a greater understanding of each technology. Your task is then to select the technologies that will get you as close as possible to the expectations and still keep the solution within budget. Once the stakeholders understand the costs involved, a few minutes of downtime may be acceptable.

In your requirements documentation, you will want to capture what the goal is. It could be something as simple as, “I can lose 10 minutes worth of data, and all of my data can be inaccessible for 60 minutes.”

Keep in mind that your RTOs and RPOs need to be realistic; potential service disruptions will happen, causing a break in availability. Availability is defined with five classes, as shown in Table 10-1.

Table 10-1. Five classes

Availability          Uptime     Downtime
Mission Critical      99.999%    5 minutes per year
Business Vital        99.99%     53 minutes per year
Mission Important     99.9%      9 hours per year
Data Important        99%        3.6 days per year
Data Not Important    90%        36 days per year

If you are building a Mission Critical environment, you have 5 minutes of allowable downtime per year. This doesn’t leave any time for administering the system with patches or upgrades; even database snapshots will bring you over the 5 minutes if not properly discussed in the plan.

You should also keep in mind the amount of time it takes to replace a SAN/RAID, perform a database backup, or carry out some of the BCM methodologies that will be discussed shortly. Your job is to keep the agreed uptime as low as possible, while the client’s job is to have it as high as possible. Can you really afford to offer 99.999 percent?
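If you want to sanity-check these numbers yourself, the downtime allowance is simply the unavailable fraction of a year. The following Windows PowerShell sketch reproduces the values in Table 10-1 (the class names and percentages are taken directly from the table):

# Minutes in a non-leap year
$minutesPerYear = 365 * 24 * 60
# Availability classes and uptime percentages from Table 10-1
$classes = @{"Mission Critical" = 99.999; "Business Vital" = 99.99; "Mission Important" = 99.9; "Data Important" = 99; "Data Not Important" = 90}
foreach ($class in $classes.Keys)
{
    # Allowable downtime is the unavailable percentage applied to a full year
    $downtime = $minutesPerYear * (100 - $classes[$class]) / 100
    "{0}: {1:N0} minutes of allowable downtime per year" -f $class, $downtime
}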

Exploring the features native to SharePoint

SharePoint 2013 provides you with several backup and restore options, as shown in Figure 10-1.

The backup and restore options available through Central Administration.
Figure 10-1. The backup and restore options available through Central Administration.

In this section, you’ll learn about the backup and restore features that are native to SharePoint 2013, including the following:

  • Native backup and restore. Provides for full and differential backups

  • Site-collection backup. Provides the ability to back up a specific site collection

  • Exporting a site or list. Provides the ability to export a site or list

  • Recovering data from an unattached content database. Allows for the recovery of data from a SharePoint content database without having to attach the content to a web application

  • Recycle Bin. Provides an easy way for users and site collection administrators to recover recently deleted items

  • Read-only improvements. Provides a better experience to the user while the site is in read-only mode

The features in this section are targeted more towards DR, meaning that they help you recover lost data.

Native backup and restore

On the surface, there aren’t any apparent changes to the native backup and restore functionality that is included in the product, but Microsoft has made some internal improvements. The farm backup supports both a full and differential backup. The biggest selling point, and what makes it unique, is the ability to back up the configuration, SQL Server databases, the file system hierarchy, and the state of the search index simultaneously.

Backup

To perform a backup through the UI, click Backup And Restore | Perform A Backup. This will allow you to select the components that you wish to back up, as shown in Figure 10-2.

The full farm backup option gives you the ability to select components granularly.
Figure 10-2. The full farm backup option gives you the ability to select components granularly.

Once you select the components that you wish to back up, click OK. You will now be presented, as shown in Figure 10-3, with the Backup Type options (full or differential), Back Up Only Configuration Settings, and the location where your backup file will be stored.

The second step allows you to save to a specified location.
Figure 10-3. The second step allows you to save to a specified location.

Note

The farm account will need to have access to the backup location.

If you prefer to use Windows PowerShell, the backup farm cmdlet would look like the following:

C:\PS>Backup-SPFarm -Directory \\file_server\share\Backup -BackupMethod full

You may also add the -ConfigurationOnly parameter to back up only the configuration database, but this parameter is intended for inactive farms. If you would like to take a backup of only the configuration data from the active farm, you should use the Backup-SPConfigurationDatabase cmdlet.
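For example, a configuration-only backup taken from a live farm might look like the following (the server, database, and share names are placeholders for illustration):

C:\PS>Backup-SPConfigurationDatabase -Directory \\file_server\share\Backup -DatabaseServer SQL01 -DatabaseName SharePoint_Config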

You can use the Backup and Restore Job Status link to check in on the progress. Once it is complete, you will then be able to verify the files in your specific backup location. You will need to ensure that you have enough disk space for your backups, so incorporate this into your file system or storage area network (SAN) plans. It is worth noting that SharePoint doesn’t provide a way to schedule backup jobs, but you can use the Windows Task Scheduler to call Windows PowerShell scripts. Any users that plan on using the Windows PowerShell cmdlets for backing up or restoring farm components must be added to the SharePoint_Shell_Access role for the specified database.
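As a rough sketch of that approach, you might save the backup command in a script file and register a nightly task that calls it; the script path, share, schedule, and account below are assumptions for illustration only:

# Contents of C:\Scripts\FarmBackup.ps1 (hypothetical path)
Add-PSSnapin "Microsoft.SharePoint.PowerShell" -EA 0
Backup-SPFarm -Directory \\file_server\share\Backup -BackupMethod Full

# Register a nightly task that runs the script (run once from an elevated prompt)
schtasks /Create /TN "SharePoint Farm Backup" /SC DAILY /ST 01:00 /RU CONTOSO\spFarm /RP * /TR "powershell.exe -File C:\Scripts\FarmBackup.ps1"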

Note

The Backup-SPFarm cmdlet may appear to complete without issues, but this does not mean that the backup itself was successful. You should always view the spbrtoc.xml file located in the specified backup directory for any possible errors.
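If you want to script that check, the table of contents file can be read as XML; the element names used here (SPHistoryObject, SPErrorCount, SPWarningCount) reflect the typical layout of spbrtoc.xml, so verify them against your own file:

# Flag any backup or restore runs that reported errors or warnings
[xml]$history = Get-Content "\\file_server\share\Backup\spbrtoc.xml"
$history.SPBackupRestoreHistory.SPHistoryObject |
    Where-Object { [int]$_.SPErrorCount -gt 0 -or [int]$_.SPWarningCount -gt 0 } |
    Select-Object SPStartTime, SPErrorCount, SPWarningCount, SPBackupDirectory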

Restore

Once a backup is performed, a restore follows a similar path. To perform a restore through the UI, click Backup And Restore | Perform A Restore (see Figure 10-4).

The restore feature allows you to select from a specified location.
Figure 10-4. The restore feature allows you to select from a specified location.
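If you prefer to script the restore, the Restore-SPFarm cmdlet reads from the same backup directory; the share name below is a placeholder, and the -ShowTree switch is used first to confirm what the most recent backup package contains before overwriting anything:

C:\PS>Restore-SPFarm -Directory \\file_server\share\Backup -ShowTree
C:\PS>Restore-SPFarm -Directory \\file_server\share\Backup -RestoreMethod Overwrite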

Site-collection backup

The backup and restore functionality at the site-collection level gives you a little more granular control over your operations. Besides being a great way to restore content, these operations allow for content migration. You can access the Site-collection Backup feature by clicking Backup And Restore | Perform A Site Collection Backup. This will give you the option to select the site collection to back up along with the storage location of the file, as shown in Figure 10-5. This feature only takes backups and does not offer a UI for restores. In order to restore a backup, you will need to use the Restore-SPSite cmdlet.

The Site-collection backup does not have a UI for restores.
Figure 10-5. The Site-collection backup does not have a UI for restores.

Note

The farm account will need to have access to the backup location.

As with many features in SharePoint 2013, there are ways to achieve the same tasks through Windows PowerShell. In the case of backup and restore, there are two important cmdlets: Backup-SPSite and Restore-SPSite.

Backup-SPSite

The Backup-SPSite cmdlet allows you to granularly back up a site collection to a specific path. In the following example, the cmdlet is backing up a site collection located at http://server_name/sites/site_name to the C:\Backup\site_name.bak file:

C:\PS>Backup-SPSite http://server_name/sites/site_name -Path C:\Backup\site_name.bak

Restore-SPSite

The Restore-SPSite cmdlet is the only way to restore a site collection, regardless of whether it was done through the UI or using the Backup-SPSite cmdlet. In the following example, the cmdlet is restoring a site collection located at http://server_name/sites/site_name using the C:\Backup\site_name.bak file:

C:\PS>Restore-SPSite http://server_name/sites/site_name -Path C:\Backup\site_name.bak

Exporting a site or list

Now that you have the ability to back up and restore content, it is time to review exporting and importing data. You may have noticed the wording has changed for these operations. In the previous sections, you were backing up and restoring the farm or site collection, but in this section, you’re going to be exporting and importing. Although this may seem like a small wording difference, there are significant implications. Specifically, the difference is all about fidelity. There are several things that are not exported, and thus are not available for import. These settings include workflows, alerts, audit logs, personalization settings, and recycle bin items. Other things that may cause issues are dependencies on other lists, so keep this in mind as you’re moving content around.

To export a site or list, click Backup And Restore | Export A Site Or List. The UI should resemble Figure 10-6. You will have the option to select the site or list, with the following results:

  • If you select a site, you will be able to select whether to include security for that site.

  • If you select a list, you will also be able to select the type of versions you would like to export.

The Export A Site Or List functionality allows you to export security.
Figure 10-6. The Export A Site Or List functionality allows you to export security.

You can export a site or list through the UI, but you must use the Import-SPWeb cmdlet for the import.

Export-SPWeb

If you wish to export a site or a list, you have the option to use Central Administration or the Export-SPWeb cmdlet. The following example will export a site at http://site to a file located at C:\Backups\export.cmp:

C:\PS>Export-SPWeb -Identity http://site -Path "C:\Backups\export.cmp"

Import-SPWeb

To import a site or list, you must use the Import-SPWeb cmdlet. The following example will import a site to http://site from a file located at C:\Backups\export.cmp:

C:\PS>Import-SPWeb http://site -Path C:\Backups\export.cmp -UpdateVersions Overwrite

Note

The help documentation for the Import-SPWeb and Export-SPWeb cmdlets erroneously makes references to “Web Application,” suggesting that these cmdlets support an entire web application. This is not correct. These cmdlets support only the export and import of sites and lists.

Recovering data from an unattached content database

In Microsoft Office SharePoint Server (MOSS) 2007, when users needed to grab content out of a copy of a SharePoint content database, they had to create a new web application, attach a database to it, and then browse for the content. In SharePoint 2010, recovering data from an unattached content database was introduced. This feature allows you to point to a content database and then extract what you need without all of the hassle. In Figure 10-7, you can see that you have the ability to select the database and then browse for content, back up a site collection, or export a site or list.

Recovering data from an unattached content database allows for easy access to old data.
Figure 10-7. Recovering data from an unattached content database allows for easy access to old data.

Restoring data using the Recycle Bin

The Recycle Bin feature provides a safety net for when items are deleted from SharePoint. The Recycle Bin feature is configurable at the SharePoint web application layer and is available via the UI, Windows PowerShell, or the application programming interface (API). The Recycle Bin feature comprises two locations: the Recycle Bin and the Site Collection Recycle Bin. It is important to know how each of these comes into play. Most deleted objects (documents, list items, lists, folders, and files) are placed into the Recycle Bin; this area is available to the information worker, who can recover items here without assistance from the site collection administrator. By default, items that are placed in the Recycle Bin are kept there for 30 days. This value is configurable in Central Administration at the web application level, as shown in Figure 10-8. The item will remain in the Recycle Bin until the item expires, the item is restored, or the user deletes the file. If the user deletes the file, it will then be placed in the Site Collection Recycle Bin, where it will remain until the end of the expiration period.

The Recycle Bin is configurable for each web application.
Figure 10-8. The Recycle Bin is configurable for each web application.
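The same settings can also be scripted against the web application object. The following is a minimal sketch (the URL is a placeholder, and the values mirror what the Central Administration page exposes):

# Configure the Recycle Bin settings on a single web application
$webApp = Get-SPWebApplication -Identity http://www.contoso.local
$webApp.RecycleBinEnabled = $true
$webApp.RecycleBinRetentionPeriod = 30        # days an item is retained
$webApp.SecondStageRecycleBinQuota = 50       # second stage size, as a percentage of the live site quota
$webApp.Update()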

If a SharePoint website is deleted, it will go directly to the Site Collection Recycle Bin, where only someone with site collection administrator rights can restore the item. The item will remain there until it is restored, manually deleted, it expires, or the quota limit is reached, at which point it is permanently deleted.

The Recycle Bin feature is often described in terms of stages, and this is true in the UI as well. This can be a bit misleading, as items do not have to pass through both areas. Items that are deleted by the user and complete their expiration period in the Recycle Bin will not be moved to the Site Collection Recycle Bin. SharePoint websites go directly to the Site Collection Recycle Bin, and a confirmation message is shown to the user, as shown in Figure 10-9. If a site collection is deleted, there is no Recycle Bin to move the item into, but it can be recovered using the Restore-SPDeletedSite cmdlet.

The Recycle Bin sends sites to the Site Collection Recycle Bin.
Figure 10-9. The Recycle Bin sends sites to the Site Collection Recycle Bin.
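As a quick illustration of the Restore-SPDeletedSite cmdlet mentioned above (the site path is a placeholder), a deleted site collection can be located and restored like this:

C:\PS>Get-SPDeletedSite | Where-Object { $_.Path -eq "/sites/marketing" } | Restore-SPDeletedSite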

Note

The Gradual Site Delete timer job is used to gradually remove content from a site to ensure there are no SQL locks from an instant delete. This has no effect on the site until it leaves the Recycle Bin.

Read-only improvements

SharePoint 2010 introduced improvements to the information worker’s experience with read-only sites, but the read-only state was not obvious. The write features that were available to Site Owners and Members would be removed, and they would have similar experiences to those who had strictly read permissions. Users who expect the ability to save data to the database may be left wondering why the system no longer allows them to do so. SharePoint 2013 has expanded on the read-only feature by showing a message bar stating: “We apologize for any inconvenience, but we’ve made the site read-only while making some improvements,” as shown in Figure 10-10.

SharePoint 2013 has a read-only status message.
Figure 10-10. SharePoint 2013 has a read-only status message.

Further information may be given to the user by setting the ReadOnlyMaintenanceLink property that is available through the web application object. This property can be configured either by using the SharePoint API or through Windows PowerShell. The ReadOnlyMaintenanceLink property can be set to any working webpage. If the page is local within SharePoint, only the relative URL is required, as shown in Example 10-1.

Example 10-1. Configure the ReadOnlyMaintenanceLink
C:\PS>Add-PSSnapin "Microsoft.SharePoint.PowerShell" -EA 0
C:\PS>$webApp = Get-SPWebApplication -Identity http://www.contoso.local
C:\PS>$webApp.ReadOnlyMaintenanceLink = "/SitePages/ReadOnly.aspx"
C:\PS>$webApp.Update()

Once the ReadOnlyMaintenanceLink property has been set, a “More information” link will be appended to the read-only status message that will send the user to a page that may have detailed information about the maintenance operation, as shown in Figure 10-11.

The ReadOnlyMaintenanceLink allows for more detailed messages.
Figure 10-11. The ReadOnlyMaintenanceLink allows for more detailed messages.

If the “More Information” link is no longer required, simply setting the property to an empty string will put the status message back to its original format:

C:\PS>$webApp = Get-SPWebApplication -Identity http://www.contoso.local
C:\PS>$webApp.ReadOnlyMaintenanceLink = ""
C:\PS>$webApp.Update()

Avoiding service disruptions

Now that the RPOs and RTOs have been defined, it is time to plan the level at which you will protect your implementation from service disruption. The options for protecting your data are endless. People plan out everything from stretched farms to Microsoft SQL Server backups. This section highlights the pros and cons of several options, but first, you will learn more about fault domains at an abstract level.

What do you plan to protect against? You can scale out the best SharePoint farm in the world, with multiple servers hosting your environment so that if one server goes down, service continues, but if all of your servers are on the same rack, you can easily have a service disruption when you lose power or network connectivity to the rack. So now we add a second rack, as shown in Figure 10-12. Let’s think outside of the rack for a minute, because you can easily have a power outage that affects the entire building, or even an entire city. If this happens, is this acceptable? It may very well be.

SharePoint server components should span multiple racks to allow for fault tolerance at the rack level.
Figure 10-12. SharePoint server components should span multiple racks to allow for fault tolerance at the rack level.

If you are powering an intranet for a small organization and you lose power to the building, it is safe to say that everyone loses power and everyone walks outside and hangs out while waiting for power to be restored. In this day and age, the small organization where everyone is on site is quickly going away. More and more companies are starting to support remote access, so the same power disruption can now affect more people than just those inside the building. Will the same levels of tolerance be acceptable for a large Internet company? Think about the revenue that could potentially be lost if a large online company lost power to their data center for even a few hours.

What about natural and artificial disasters? If a primary data center is in California and they have an earthquake, how long can your system be down before it causes a problem? Do you then plan to have multiple data centers? If so, what risks are you mitigating?

Review Figure 10-13. If California has an earthquake, it is probably safe to say that it will not affect Colorado. If the data center in Colorado gets hit by a tornado at the same time as your earthquake in California, you are having a bad day—but a third data center in Minnesota might do the trick. At this point, you must ask yourself if data center redundancy is going to the extreme. How often do people lose entire data centers? Are the benefits worth the cost? What fault domains are most vital to your implementation?

In order to protect against large-scale disasters, it may be necessary to have multiple data centers.
Figure 10-13. In order to protect against large-scale disasters, it may be necessary to have multiple data centers.

Without the database, there is no SharePoint farm. If you can protect the databases, the argument can be made that your farm can be recoverable. If you lose your data, no trick in the world will help restore your farm. If you lose a server that is hosting SharePoint, it may be painful to reconfigure some of the web applications, but the data is still intact.

With this in mind, the most common protection point would be the databases. In the next section, you will investigate the options you have to protect your databases.

Implementing various business continuity techniques

How do you take what you’ve learned so far and put those ideas into action? Some of the requirements for SharePoint 2013 that have changed from the infrastructure perspective are the supported operating systems and SQL Server versions. Here are the minimum versions that are supported for the new version of SharePoint:

  • Windows Server 2008 R2 with SP1

  • Windows Server 2012 RTM

  • SQL Server 2008 R2 SP1

  • SQL Server 2012

Note

Prior to choosing the version of SQL Server, be aware that some features may only work with SQL Server 2012.

Microsoft offers several technologies to help you protect your databases. In this section, you will learn about failover clustering, database mirroring, log shipping, and AlwaysOn Availability groups.

The first three may be familiar to you, at least at a high level, but the SQL Server team has delivered new features in SQL Server 2012 that are beneficial in this space. Each of these concepts should be in your mind as you are trying to find the right solution.

Failover clustering

Failover clustering is a software-based solution that was introduced in the days of Microsoft Windows NT Server 4.0 and SQL Server 6.5 Enterprise edition. Since then, the technology has been given the opportunity to mature and is now a proven method of handling hardware issues. The RPO and RTO are extremely low, but technically not zero. While the data that is stored on the shared storage device will remain intact, in-flight transactions will be lost. There will be a small disruption in service while the failover occurs from one node to the next, which may average between 30 seconds and 2 minutes.

Failover clustering consists of a series of nodes (computers) that protect a system from hardware failure. In Windows Server 2008 Enterprise edition, you can support up to 16 nodes. This has been dramatically increased to 64 nodes in Windows Server 2012. To cluster SQL Server, you must cluster both the server operating system and SQL Server. The nodes monitor the health of the system and can step in when issues are detected; more about this shortly.

Note

SQL Server Standard edition supports two nodes for failover clustering.

Failover clustering is typically done with a SAN, but keep in mind that the SAN may be a single point of failure.

You can get around this scenario by configuring a multisite cluster (also known as a geocluster). SQL Server 2012 Enterprise has been released with enhancements for geoclusters, but it should be noted that SharePoint has some strict support requirements for communications between the SharePoint servers and the SQL Server machines, so if there is any measurable distance between the geocluster sites, it may not be supported. You will learn about this further in the next section when you approach the topic of stretched farms.

How it works

Let’s go back and visit how the cluster works. As discussed previously, a failover cluster is a set of independent computers or nodes that increase the availability of software, which in this case is SQL Server. Each node in the cluster has the ability to cast one vote to determine how the cluster will run. The voting components, which may or may not be restricted to nodes, use a voting algorithm that determines the health of the system; this is referred to as a quorum. There are a few variations on how the quorum works: node majority, node + disk majority, node + file share majority, and no majority.

  • Node majority. In this configuration, each node is a voting member and the server cluster functions with the majority vote. This is preferable with an odd number of nodes.

  • Node + disk majority. In this configuration, each node is a voting member as in node majority, but a disk-based witness also has a vote. The cluster functions with the majority vote. This is preferable for single-site clusters with an even number of nodes. The disk majority witness is included to break a possible tie.

  • Node + file share majority. This configuration is similar to the disk majority in that each node has a vote, but it uses a file share witness rather than a disk-based witness. This is the preferred configuration for geoclusters with an even number of nodes. The file share witness is included to break a possible tie.

  • No majority (disk only). This configuration can function with a single node and a quorum disk; the quorum disk, rather than the node count, determines whether the cluster runs. This configuration is one of the most common for SQL Server implementations.

If a quorum detects a communication issue, failover will occur and another node will step in.
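If you want to view or change the quorum configuration from Windows PowerShell, the FailoverClusters module provides cmdlets for this; in the sketch below, the cluster name and witness share are placeholders:

# View the current quorum model, then switch to node + file share majority
Import-Module FailoverClusters
Get-ClusterQuorum -Cluster SQLCLUSTER01
Set-ClusterQuorum -Cluster SQLCLUSTER01 -NodeAndFileShareMajority \\witness_server\quorum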

Note

For more information on how a quorum works, see http://technet.microsoft.com/en-us/library/cc730649(v=ws.10).aspx.

The failover cluster can be configured in a wide variety of ways. For the purpose of this discussion, you will focus on the following configurations: Active-Passive, Active-Active, and Active-Passive-Active.

Active-Passive

In the Active-Passive configuration, one node is active, while the second one is passive or inactive. The biggest complaint about this type of configuration is that the inactive node is idle—in other words, it’s not performing any work. If the active server should fail, failover would occur, making the second server active and the first server inactive. At any given moment, only one server is available for activity.

Active-Active

In the Active-Active configuration, both nodes are active and are available for activity. The issue with this type of configuration is that in the event of a node failure, the surviving node must have enough resources to handle the load of both servers. This may leave the environment in a reduced capacity.

Active-Passive-Active

The Active-Passive-Active configuration improves on the two previous designs. In Active-Passive, 50 percent of the hardware is idle and is perceived to be “wasted.” In Active-Active, both servers are working, but if a failure happens, a reduced workload is probable. In Active-Passive-Active, or additional variations, a smaller percentage of the servers are idle. This configuration could have any number of active servers that are supported by any number of inactive ones. If you have a single inactive node and two active nodes fail, a reduced workload is probable.

Disadvantages

Now that we have highlighted the advantages of failover clustering and how it works, it is time to discuss some of the disadvantages of the technology. You may have already realized this, but failover clustering only solves the high availability of your data and doesn’t give you a way to recover data in case of a cluster failure. You will need to look at other technologies for DR.

Another disadvantage is that all of the drivers, firmware, and BIOS versions must match on each of the nodes in the cluster.

Finally, while there are a lot of resources on Microsoft TechNet that can guide you through the process, configuring failover clustering can still offer challenges.

Database mirroring

Database mirroring is a software-based solution that enables you to add a layer of resiliency to highly available architectures—at a granular level not available to failover clustering. While failover clustering is configured at the SQL instance layer, database mirroring is a solution that is configured at the database level.

Database mirroring has been achievable for several product versions now, but it wasn’t until SharePoint Server 2010 that it was integrated into the product. In MOSS 2007, database mirroring required database connection strings in both an unmanaged SQL layer and in managed code. The administrator was required to use SQL aliases across the SharePoint servers, and the database principals had to maintain node majority, meaning all of the principal databases were on the same instance and had to fail over together. Now with database mirroring as part of the product, the failover database is specified when the SharePoint content web application is being created, and the complexity of the configuration has been diminished. The following Windows PowerShell script demonstrates how to set the failover service instance:

$server = "Contoso_SQL_1"
$databaseName = "WSS_Content_Contoso"
# Locate the content database by name and register the failover (mirror) server
$db = Get-SPDatabase | Where-Object {$_.Name -eq $databaseName}
$db.AddFailoverServiceInstance($server)
$db.Update()

Similar to failover clustering, database mirroring will ensure no stored data loss, but it has many advantages over failover clustering. The first is the perceived hardware underutilization. As previously discussed, in a failover cluster, you generally have an inactive node. If you do not, then the active nodes will need to support the additional load, in which case a performance hit would be expected. In database mirroring, a given server may host the principal for one database and a mirror for another. This allows a configuration to use the hardware without having to fail over to it.

Database mirroring also makes use of a redundant I/O path, whereas the failover cluster does not. What this means is that if a dirty write affects the principal, it will not affect the mirror—more about this in the next section. Furthermore, when the primary server detects a corrupt page on disk, the mirror can be used to repair it automatically without any intervention. Finally, database mirroring is not dependent on shared storage, so it is not subject to the same single point of failure as the failover cluster is.

The hardware on the failover clusters must also maintain the same drivers, firmware, and BIOS versions, whereas database mirroring is not dependent on the server binaries to be in sync. Database mirroring also supports sequential upgrade of the instances that are participating in the mirroring session.

How it works

Database mirroring maintains two copies of a single database on separate SQL Server instances. A transaction is applied to the primary or principal instance and then replayed on the mirror. This relationship is known as a database-mirroring session. Since these are using two distinct I/O paths, dirty writes or torn pages that affect one instance will not affect the other.

Note

Database mirroring is only supported with the full recovery model.

The principal instance is responsible for handling requests to the client, while the mirror acts as a hot or warm standby server.

  • Hot standby. A method of redundancy in which the principal server and mirror server run simultaneously and the data is identical on both systems.

  • Warm standby. A method of redundancy in which the mirror server runs in the background of the principal server. The mirror server is updated at regular intervals, which means that there are times when the data is not identical on both systems. There is a possibility of data loss while using this method.

Database mirroring works at the physical log record, which means that for every insert, update, or delete operation on the principal, the same operation is executed against the mirror. These transactions will run with either asynchronous or synchronous operations, as described next:

  • Asynchronous. Transactions on the principal commit without waiting for the mirror server to write the transactions to disk. Since the principal does not wait, this operation maximizes performance.

  • Synchronous. Transactions occur on both the principal and mirror. There is longer transaction latency under this operation, but it ensures data consistency.

Database mirroring supports both manual and automatic failover. In order to take advantage of the automatic failover functionality, a witness server must be in place. The witness server can be any instance of SQL Server, including SQL Express, since it does not serve any of the databases. An example of this is shown in Figure 10-14. This same witness can reside over multiple mirroring sessions and acts as part of the quorum for the mirror. The witness is not a single point of failure, as the principal and mirror participate in the quorum as well. In order for the mirror to become the principal, it must be able to communicate with one other server.

A witness server can preside over multiple mirroring sessions.
Figure 10-14. A witness server can preside over multiple mirroring sessions.

There are three mirroring operation modes: high-availability mode, high-safety mode, and high-performance mode. These modes operate as follows:

  • High-availability mode runs synchronously. This requires a witness, and the database is available whenever a quorum exists and will provide automatic failover.

  • High-safety mode runs synchronously. This ensures data consistency, but at the risk of hurting performance. With respect to the SharePoint server, when the primary database instance sends the transaction to the mirror, it will not send the confirmation back to the SharePoint server until the mirror confirms the transaction. To see an overview of the process flow, refer to Figure 10-15.

  • High-performance mode runs asynchronously. Since the transaction does not wait for the mirror to commit the log, it is faster but is also at risk for data loss. With respect to the SharePoint server, when the primary database instance sends the transaction to the mirror, it also sends a confirmation to the SharePoint server and is not delayed by the additional write.

Diagram of data flow for database mirroring.
Figure 10-15. Diagram of data flow for database mirroring.
In high-safety mode

The client writes data to the principal SQL Server database (steps 1 and 2). The principal writes the data to the transaction log and the data is committed (steps 3 to 6). The principal sends the transaction to the mirror (steps 7 and 8) and waits for the mirror instance to commit. The mirror instance writes the data to the transaction log and the data is committed (steps 9 to 12), and the mirror confirms with the principal (steps 13 and 14). The principal confirms the transaction back to the client (steps 15 and 16).

In high-performance mode

The client writes data to the principal SQL Server database (steps 1 and 2). The principal writes the data to the transaction log and the data is committed (steps 3 to 6). The principal sends the transaction to the mirror (steps 7 and 8). The principal confirms the transaction back to the client (steps 15 and 16) without waiting for the mirror. The mirror instance writes the data to the transaction log and the data is committed (steps 9 to 12), and the mirror confirms with the principal (steps 13 and 14).

Note

The mirror database is in a recovery mode and is not accessible by anything but SQL Server. Therefore, you cannot do reporting operations on the mirror. If you require this type of functionality, review the Log shipping section later in this chapter.

Disadvantages

Now that you have seen the advantages of database mirroring and how it works, it is time to discuss some of the disadvantages of the technology. Database mirroring is not supported by all SharePoint databases. This is constantly evolving, so check the latest guidance to find out which databases are supported.

Many organizations change the recovery model on their databases to Simple. This is primarily done to keep the logs from accumulating on the server when other backup methodologies are used. Database mirroring requires the full recovery model and therefore will not work on databases configured this way. Some databases, like the Usage And Health Service database, may use the simple recovery model for performance, as shown in Figure 10-16.

The Usage And Health database utilizes the simple recovery model by default.
Figure 10-16. The Usage And Health database utilizes the simple recovery model by default.
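Before planning a mirroring session, it is worth confirming which databases are using which recovery model. A simple way to do this (assuming the sqlps module from SQL Server 2012 is available; the instance name is a placeholder) is:

# List each database and its recovery model on the SharePoint SQL Server instance
Import-Module sqlps -DisableNameChecking
Invoke-Sqlcmd -ServerInstance "SQL01" -Query "SELECT name, recovery_model_desc FROM sys.databases ORDER BY name"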

Log shipping

Log shipping is a software-based solution that increases database availability. Similar to database mirroring, log shipping is implemented on a per-database level. Log shipping offers several benefits over database mirroring. In database mirroring, there is only one mirror, or secondary, database. In log shipping, the primary database sends backups of the transaction log to any number of secondary databases. You may also remember that in database-mirroring sessions, the mirror database is not readable. Log shipping supports read-only secondary copies that are suitable for reporting.

Finally, one of the greatest benefits of log shipping is that geographic redundancy is approachable. Geographic redundancy and the concept of stretched farms will be discussed in a little more detail in the Putting it all together section at the end of the chapter.

How it works

Log shipping is nothing more than a backup, copy, and restore of the SQL transaction logs from one server to another. The steps include:

  1. Back up transaction log from the primary database.

  2. Copy the transaction log to the secondary server(s).

  3. Restore the transaction log on the secondary server(s).

All three of these operations run at specified intervals. It is important to note that in order for the transaction logs to be restored on the secondary server(s), all users must be disconnected from the secondary database while the restore runs; as a result, transactions accumulate between restore intervals and the secondary lags behind the primary. Because of this, log shipping is used more for disaster recovery, as opposed to high availability.
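To make the three jobs concrete, the following sketch performs a single backup-copy-restore cycle by hand using the SQL Server 2012 cmdlets; in practice the log shipping wizard (or its stored procedures) creates scheduled jobs to do this, and the server names, shares, and database name here are placeholders:

# Requires the sqlps module from SQL Server 2012
Import-Module sqlps -DisableNameChecking

# 1. Back up the transaction log on the primary server
Backup-SqlDatabase -ServerInstance "SQL01" -Database "WSS_Content_Contoso" -BackupAction Log -BackupFile "\\file_server\logship\WSS_Content_Contoso.trn"

# 2. Copy the log backup to the secondary server
Copy-Item "\\file_server\logship\WSS_Content_Contoso.trn" "\\SQL02\logship\"

# 3. Restore the log on the secondary, leaving it able to accept further restores
Restore-SqlDatabase -ServerInstance "SQL02" -Database "WSS_Content_Contoso" -RestoreAction Log -BackupFile "\\SQL02\logship\WSS_Content_Contoso.trn" -NoRecovery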

Note

The SQL Server Agent service has to be running for log shipping to work.

The history and status of the log shipping jobs are saved locally by the log shipping jobs. The primary server will keep the history and status of the backup operation, while the history and status of the copy and restore operations will be stored on each of the secondary servers. Once log shipping is configured, you can implement a monitor server that will consolidate the history and status of all of the servers and give you a single location to review them. Review Figure 10-17 for the data flow.

The monitor server consolidates the history and status from the other servers.
Figure 10-17. The monitor server consolidates the history and status from the other servers.

The log shipping data flow in Figure 10-17 demonstrates the following:

  1. The backup job runs on the primary server and stores the files on the backup share.

  2. The copy job runs on the secondary server(s), copying the files from the backup share, and then the restore job runs.

  3. The monitor server consolidates the history and status from the other servers.

Disadvantages

Now that you have seen the advantages of log shipping and how it works, it is time to discuss some of the disadvantages of the technology. Log shipping does not support automatic failover from the primary server to the secondary server. The data on the secondary server(s) may not be up to date, depending on the last time the transactions were restored. Log shipping is not a methodology for high availability and has a much higher RTO/RPO than database mirroring and failover clustering.

AlwaysOn Availability

HADRON (High Availability Disaster Recovery—AlwaysOn) or AlwaysOn Availability is a feature of SQL Server 2012. It is considered to be the next generation of high-availability solutions. While AlwaysOn Availability may have similarities with other technologies, it isn’t built on any existing technologies and was designed from the ground up.

At first glance, you will undoubtedly see similarities with the previously mentioned technologies. It offers the high availability of database mirroring, along with the power of geoclustering of log shipping.

How it works

The first thing to take into account is that AlwaysOn Availability does require nodes to be configured using Windows Failover Clustering, which is a feature of Windows Server (review Figure 10-18). This may give you the impression that configuring AlwaysOn Availability is complicated, but it is rather straightforward, and the wizards included for both the Windows Server clustering features and HADRON make it very easy to set up. While clustering is required at the server level, SQL Server 2012 itself is not configured for clustering.

HADRON requires failover clustering at the Windows Server level.
Figure 10-18. HADRON requires failover clustering at the Windows Server level.

Once Windows Failover Clustering is configured, SQL Server 2012 will have new AlwaysOn options available from the SQL Server service properties window in SQL Server Configuration Manager, as shown in Figure 10-19. Simply select Enable AlwaysOn Availability Groups and click OK.

Select AlwaysOn Availability Groups from the SQL Server Configuration Manager.
Figure 10-19. Select AlwaysOn Availability Groups from the SQL Server Configuration Manager.

Enabling AlwaysOn Availability Groups creates an AlwaysOn High Availability folder under the SQL Server instance that allows for the management of Availability Groups. Availability Groups are collections of databases and the SQL Server instances (replicas) that host them, which are available for database operations by specifying only the Availability Group Listener. Availability Group Listeners will be discussed shortly.

To create a new Availability Group, you have the option of creating one manually or by using a wizard. The wizard will display all of the databases that exist in the instance but only allow you to select the ones that meet the prerequisites, as shown in Figure 10-20. It will then allow you to specify replicas for the databases. Replicas are basically the various nodes that were included in the Windows Failover Cluster, but are individual SQL Server instances. These replicas may be configured for automatic failover, synchronous or asynchronous operations, or even a readable secondary database. If you remember from the previous section, database mirroring did not support readable secondary databases; so this is a huge advantage over database mirroring.

Note

Database mirroring has been marked as deprecated in SQL Server 2012, which means that while it is supported in SQL Server 2012, it may not be a feature in the next version of SQL Server.

You can select only databases that meet the prerequisites.
Figure 10-20. You can select only databases that meet the prerequisites.

The replicas are accessed by the Availability Group via the Availability Group Listener. The listener is configured in SQL Server and is a virtual network name that points to the active node in the Availability Group for that particular database. Once the Availability Group Listener has been specified, the following data synchronization options will be available:

  • Full. Starts data synchronization by performing full database and log backups for each selected database. These databases are restored to each secondary and joined to the Availability Group.

  • Join only. Starts data synchronization where you have already restored database and log backups to the database. These databases are restored to each secondary and joined to the availability group.

  • Skip initial data synchronization. Choose this option if you want to perform your own database and log backups of each primary database.

If database mirroring hasn’t been previously configured, select the Full option. This greatly reduces the amount of work to configure AlwaysOn Availability. Once the wizard completes the data synchronization, one database will be synchronized while the secondary is restoring. It is possible to have other states, depending on how the replicas have been configured, as shown in Figure 10-21.

One database will be synchronized while the secondary database will be restoring.
Figure 10-21. One database will be synchronized while the secondary database will be restoring.

If you are familiar with database mirroring in SharePoint, you’ll notice that the configuration of AlwaysOn Availability is simpler since you specify the AlwaysOn Availability Group as the main database and the failover database is not needed.

Note

SQL Server aliases should be used in your configurations. This reduces the overhead of converting your existing availability configuration to AlwaysOn.

One of the key advantages of AlwaysOn Availability is the lack of dependency on shared storage. It also utilizes a separate I/O path so each database has the benefit of torn page detection. AlwaysOn also provides the benefits of log shipping and allows for geoclustering and multi-subnet clustering.

If your implementation has database mirroring configured, it is easy to convert to AlwaysOn Availability by following these steps:

  1. Remove or break the mirror at the SQL Server instance level.

  2. Create an Availability Group.

  3. Create an Availability Group Listener.

  4. Select the synchronization preferences.

  5. Change the aliases on the SharePoint boxes to point to the Availability Group Listener instead of the principal database server.
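For reference, a rough scripted equivalent of steps 2 and 3, using the SQL Server 2012 PowerShell cmdlets, might look like the following; all server, group, listener, and database names, along with the endpoint URLs and IP address, are placeholders, and the databases must already meet the prerequisites:

# Run from the primary replica; requires the sqlps module
Import-Module sqlps -DisableNameChecking
$primary   = New-SqlAvailabilityReplica -Name "SQL01" -EndpointUrl "TCP://SQL01.contoso.local:5022" -AvailabilityMode SynchronousCommit -FailoverMode Automatic -AsTemplate -Version 11
$secondary = New-SqlAvailabilityReplica -Name "SQL02" -EndpointUrl "TCP://SQL02.contoso.local:5022" -AvailabilityMode SynchronousCommit -FailoverMode Automatic -AsTemplate -Version 11
# Create the Availability Group and then add a listener for client connections
New-SqlAvailabilityGroup -Name "SPAG01" -Path "SQLSERVER:\SQL\SQL01\DEFAULT" -AvailabilityReplica @($primary, $secondary) -Database "WSS_Content_Contoso"
New-SqlAvailabilityGroupListener -Name "SPAG01Listener" -StaticIp "192.168.1.50/255.255.255.0" -Path "SQLSERVER:\SQL\SQL01\DEFAULT\AvailabilityGroups\SPAG01"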

Disadvantages

The configuration of AlwaysOn Availability is easier than SQL failover clustering, database mirroring, and log shipping, and it offers all of the positives of each of those technologies. The disadvantage of the technology is that it requires SQL Server 2012 Enterprise edition. While database mirroring is available in SQL Server 2012 Standard, that technology has been deprecated in SQL Server 2012 and may not be carried forward into the next version.

Implementing network load balancing

There are many network load balancing options available, most of which come at a cost. Microsoft introduced a free IP load balancer with Windows NT Server 4.0, Enterprise edition, and it continues to exist in the server operating systems that support SharePoint 2013.

The network load balancer (NLB) that comes with Windows Server is a feature that can be turned on to allow requests to be directed at a series of nodes. In the SharePoint world, this is done on the SharePoint servers that are being used to support web requests from the users. If one of the servers in the NLB cluster goes down, the requests will be directed at the next available server in the cluster.

When you create a new cluster, you will specify a virtual cluster IP address, subnet mask, and a cluster mode, as shown in Figure 10-22.

Cluster settings.
Figure 10-22. Cluster settings.

The cluster mode supports the following unicast and multicast methods:

  • Unicast. The cluster adapters for the hosts are assigned the same unicast media access control (MAC) address. This requires a second network adapter to be installed for non-NLB (peer-to-peer) communications between the cluster hosts. There is also a possibility of switch flooding if the cluster is connected to a switch, because all of the incoming packets are sent to all of the ports on the switch.

  • Multicast. The cluster adapters for the hosts retain their individual unicast MAC addresses, and they are all assigned an additional multicast MAC address. Unlike the unicast method, communications between the cluster hosts are not affected because each cluster host retains its unique MAC address.

Once the cluster has been created, server nodes are added and will be assigned a host priority, as seen in Figure 10-23. The host priority specifies the node that will handle cluster traffic that is not governed by port rules. If this node fails or goes offline, traffic will then be handled by the server with the next lowest priority. In an active/passive mode (single host mode), the server with priority 1 is the active node.

Network Load Balancing Manager.
Figure 10-23. Network Load Balancing Manager.
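If you would rather script the cluster than use the Network Load Balancing Manager, the NetworkLoadBalancingClusters module offers equivalent cmdlets; the host names, interface name, and addresses below are placeholders, and the NLB feature must already be installed on both web servers:

# Create the cluster on the first web server and then join the second
Import-Module NetworkLoadBalancingClusters
New-NlbCluster -HostName "SP01" -InterfaceName "Ethernet" -ClusterName "SPWEB" -ClusterPrimaryIP 192.168.1.200 -SubnetMask 255.255.255.0 -OperationMode Multicast
Add-NlbClusterNode -HostName "SP01" -NewNodeName "SP02" -NewNodeInterface "Ethernet"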

Putting it all together

In this chapter, you were introduced to several major topics that all begin with understanding one thing: SLAs. All too often, companies try to incorporate BCM methodologies without understanding what their users both expect and require. Prior to implementing a strategy, one should make sure that there is a mutual understanding between the stakeholders, users, and infrastructure teams of what needs to be protected, how it is going to be protected, who is responsible for putting those measures in place, and who is responsible for testing them. In this section, you are going to expand the focus on what you have learned by creating a HA/DR plan for Contoso.

Contoso requirements

Contoso has a new chief technology officer (CTO) and has decided that its SharePoint 2013 farm should be classified as “business vital.” (Review the classifications mentioned earlier in the chapter to see how this relates to uptime.) The SharePoint implementation is surfacing LOB data using Business Data Connectivity Services (BCS), Secure Store Service (SSS), and Enterprise Search. While the CTO understands that it will be difficult and expensive to build a solution that provides five nines of support, he wants it to be as close as possible. The Enterprise Search service is absolutely critical, as is the LOB system; furthermore, it would be ideal if the databases that weren’t being used to support SharePoint actively could be used for other operations. Consultants need this data to perform their jobs and if SharePoint goes down, the company is crippled. Since the data is critical, it is imperative that no saved data is lost; furthermore, the data should have redundant I/O paths in case of corrupted data. The CTO heard that SQL Server 2012 has improved support for geoclustering, and he wants to leverage as much as possible from the product. Contoso has a worldwide presence and needs to operate in the unlikely event of a national disaster; the company has data centers in San Diego, California, and Denver, Colorado. If something should happen to the San Diego office, Denver is to be used to support the SharePoint implementation.

The SharePoint implementation will be supporting up to 500 concurrent users, and a single hardware failure should not cause a discontinuation of service. In the event of a local disruption in California, the CTO expects the Denver infrastructure to be available for requests within 30 minutes. SharePoint patches and upgrades should not cause a disruption of services from Sunday night through Friday night; the company has planned outage periods when these services should be performed. During these patches, a status message should inform the users of when they can expect full access to the system.

SharePoint items should be able to be recovered by the site collection administrator for 45 days, and in the event that this time expires, the item should be recoverable for up to 6 months by the SharePoint farm administrator. All SharePoint backups will be done by the spAdmin account, and they will be stored in a network share directory. It would be nice if there was a way to automate farm backups.

Key factors

Based on the requirements in the previous section, what factors do you see that give shape to a solution? While there may be more than one solution, here are the key concepts that you will need to plan for to have a successful implementation:

  • Contoso has a worldwide presence and needs to operate in the unlikely event of a national disaster; there are data centers in San Diego, California, and Denver, Colorado. The latency between San Diego and Denver exceeds the supported 2-millisecond round-trip limit.

  • Contoso cannot lose any saved data, and all of its data can be inaccessible for 30 minutes.

  • The data should have redundant I/O paths in case of corrupted data.

  • The user accounts that need to perform backup and restore operations will require the SharePoint_Shell_Access role for each database that will be part of the job.

  • A landing page for site maintenance will need to be created to give the information workers details about the maintenance period.

  • The ReadOnlyMaintenanceLink for each web application will need to be configured.

  • The system should support up to 500 concurrent users, and a single hardware failure should not cause a discontinuation of service.

  • The Enterprise Search service is absolutely critical, as are the LOB systems.

  • The Recycle Bin will need to be modified slightly to support the 45-day item restore.

Solution

The CTO has made it clear that he would like to have a geoclustered (stretched-farm) environment and that there are two data centers: one in San Diego and one in Denver. The biggest issue is that the latency between these two cities exceeds the Microsoft support limit of 2 milliseconds. This does not mean that both data centers can’t be utilized; it just means that in case of a local disruption, all of the servers must switch over. So long as the SharePoint servers are in close proximity to the SQL Server machines, having multiple data centers will fit into the equation. Contoso will use DNS to switch between the primary and DR farms. This will have to be continuously monitored, as the switch needs to happen within 30 minutes to stay within the constraints of the SLA. Due to the need for the second farm to handle 500 concurrent users, a hardware or software load balancer will be needed. In a mission-critical environment, it is recommended that a hardware load balancer be used, but where budget is a constraint, Windows NLB will satisfy the requirement. An example of the proposed solution is illustrated in Figure 10-24.

A high-availability network with a secondary location.
Figure 10-24. A high-availability network with a secondary location.

Contoso does not want to lose any saved data. This automatically brings to mind Windows failover clustering, database mirroring, or HADRON as a solution. Due to the need to have redundant I/O paths, this would eliminate failover clustering, as this is only a feature of database mirroring and HADRON. The need to duplicate data from one environment (San Diego) to another (Denver) either calls for an implementation of log shipping or the use of HADRON. In this case, the CTO is on board with SQL Server 2012, so HADRON offers a great solution. If this was a budget-constrained environment, the combination of database mirroring and log shipping may be more appropriate.

There is only one user that will be performing SharePoint Farm backups. This user will need to be added to the SharePoint_Shell_Access role by using the Add-SPShellAdmin cmdlet. The network share directory will need to give both the SharePoint timer job account (Farm Account) and the SQL Server service account full access to this location. The spAdmin user account will not need permission, as this job happens under the context of the other two accounts.
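A minimal sketch of granting that role to the backup account might look like the following (the account name comes from the scenario; whether you target every database or only the ones included in the backup job is a design choice):

C:\PS>Get-SPDatabase | ForEach-Object { Add-SPShellAdmin -UserName "CONTOSO\spAdmin" -Database $_ }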

A landing page will need to be created that updates the information workers on the status of any service disruptions. This can be done within the Site Pages library of SharePoint, as shown in Figure 10-11. Windows PowerShell will then be used to set the ReadOnlyMaintenanceLink property.

The high impact of several of the SharePoint service applications would require that these services are running on more than one farm. Depending on the service and how it taxes the host server, it may be appropriate to have the services running on all four SharePoint servers. The Enterprise Search service application should be running in some form or another on each of the servers. There are multiple components that will need to be planned for.

To complete the solution, a working knowledge of the Recycle Bin and SharePoint cmdlets will be important. The default setting of the Recycle Bin’s first stage is 30 days. This will need to be changed to 45 days, after which the item is no longer retrievable without going to an unattached copy of the database or other granular copy. The SharePoint backup cmdlets can be used to create a Windows PowerShell script file (.ps1) that can be executed via a Windows Scheduled Task. This would allow for farm backups that are capable of protecting much more than the databases.

There are some additional benefits to this particular implementation. The second farm could be used as a testing area for upgrades, but this would put the company’s SLAs at risk should a fault domain failure require the secondary farm to spring into action.

As you can see, there are many different options to weigh when trying to determine the best solution. The most important factor is to get an SLA that is achievable and then use the advantages and disadvantages of the various technologies to help identify the solution that is right for you.
