All about SharePoint best practice... ask me how
SharePoint DR planning: some practical options to consider [ Posted on: 26-September-2008 ]

One part of SharePoint deployments that sometimes falls by the wayside is planning and providing a viable, practical option for recovering data in the event of a server failure. In most cases, when a solution architecture is proposed with SharePoint, you need to consider the options below in order to formulate a workable backup and recovery plan. Not only should you plan these options, you should also test them, noting the steps required and the time an actual recovery takes, so that your team knows what to expect should the need arise.

Depending on the type of solution being deployed and its potential impact on the business, you may need to add availability as part of your solution architecture. This post, however, shows how you can create and test a mock recovery plan for a small server farm with 2 web front end servers and a dedicated SQL server. A recovery plan typically needs to cover content recovery (usually via the built-in Recycle Bin), site recovery (accidental site deletions by site administrators) and, the focus of this post, disaster recovery. Disaster recovery in this context means losing one of the content databases, or all of the databases, related to the SharePoint deployment.

For example, assume that your solution includes regular SQL Server backups of your content databases. You can then log ship the databases to a secondary server or a network location. Log shipping is one of the more practical and less complex options, and you should consider making it part of your deployment. TechNet has detailed documentation on how to set up log shipping on your SQL server and what you need to consider: TechNet > Configuring Log Shipping (SQL Server 2005)

This post uses a simple and practical DR plan that provides basic DR capability. The emphasis is purely on recovery, not on high availability, which usually means more complexity and higher cost. Depending on your deployment, you should present both options along with their pros and cons. My view is that simple DR is better than no DR, and there is simply no excuse for not having such a plan in place from day one of your deployment.

What databases can I recover and what's the process?

The most important databases in your farm are the content databases of your SharePoint deployment. In this post I will highlight the steps needed to recover your content database(s). First of all, let's establish the overall process of the 'mock' fail over scenario. Remember that this is just a simulation to ensure that all the required steps are noted, documented and followed through. In a real world scenario the steps outlined will result in a period during which your users cannot access the data. As I said before, this is not about high availability but about recovering data. The idea of the 'mock' fail over is to establish how long the process takes and to provide a realistic estimate of the downtime. Most importantly, it prepares your system administrators to act on a tested plan.

In this scenario I am not going to consider the Configuration, SSP or Search databases. Typically, when you plan for DR, you should have a standby web front end server pre-configured to match your production server as closely as possible. This usually means that you will have installed WSS or MOSS and created your web applications. For fail over you would also have a SQL instance on standby to which you can attach your log shipped database(s).

Consider the following diagram.

[Diagram: DRSolution]

This is a somewhat simplistic view of what your DR plan may look like. In this scenario you have a live (production) SharePoint farm in Wellington (the capital of New Zealand, for those not from New Zealand). The DR farm is located in Auckland, which is situated at the top of the North Island. As mentioned previously, the focus of this post is the steps needed for the recovery, not how to set up such a deployment.

Overall process which covers the DR steps
  • The primary SQL server (PRD-SQL) and SharePoint WFE server (PRD-SPWFE) are powered down or reach a non-operational state in Wellington
  • An incident is raised via system operational procedures to invoke DR measures and start the DR process
  • The fail over SQL server (log ship destination) and secondary SharePoint WFE server are brought online by a system administrator in Auckland or via remote administration from Wellington
  • The log shipped content database is made available via backup/restore on the Auckland SQL server by the system administrator performing the DR operation
  • The WSS content database is added as the current active content database to the new web application on the Auckland DR web application server (*steps below)
  • Log shipping is set up in the reverse direction so that you can later recover back into Wellington
  • A DNS redirect is made so that DR-SPWFE serves content
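
Since the point of the mock fail over is to learn how long each step takes, it can help to time the drill as you go. The following is only a minimal sketch, not part of the original plan: the class name `DrillLog`, the function `run_drill` and the step labels are hypothetical, and the labels simply paraphrase the list above.

```python
import time
from contextlib import contextmanager

# Step labels paraphrase the DR process list; adjust them to your own runbook.
DR_STEPS = [
    "Raise incident and invoke DR",
    "Bring DR SQL server and DR-SPWFE online",
    "Restore log shipped content database",
    "Attach content database to DR web application",
    "Set up reverse log shipping",
    "Redirect DNS to DR-SPWFE",
]


class DrillLog:
    """Records how long each step of a DR drill takes (hypothetical helper)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self.durations = {}  # step name -> elapsed seconds

    @contextmanager
    def step(self, name):
        """Time the body of a `with` block and store it under `name`."""
        start = self._clock()
        try:
            yield
        finally:
            self.durations[name] = self._clock() - start

    def total_minutes(self):
        return sum(self.durations.values()) / 60.0

    def report(self):
        lines = [f"{name}: {secs / 60.0:.1f} min"
                 for name, secs in self.durations.items()]
        lines.append(f"Total downtime: {self.total_minutes():.1f} min")
        return "\n".join(lines)


def run_drill(steps=DR_STEPS, wait=input):
    """Walk an operator through the steps, timing each until Enter is pressed."""
    log = DrillLog()
    for name in steps:
        with log.step(name):
            wait(f"Press Enter when '{name}' is complete... ")
    return log
```

Calling `print(run_drill().report())` during a drill gives a per-step breakdown that you can file with the DR documentation and compare between rehearsals.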

Account pre-requisites for restore and recovery

For the DR plan to be effective you will need the following rights set up on the destination (Auckland DR) farm. Basically, the setup accounts used in your Wellington farm and in Auckland should be the same; if they are different, the Auckland accounts should have the following applied. I have previously posted about setup accounts and why they are important when deploying SharePoint.

  • On the DR farm the Central Administration Application Pool account should be a member of the dbcreator and securityadmin fixed server roles on the SQL server
  • All application pool accounts and the search services and default content access accounts should have SQL Server logins in the DR farm, although they are not assigned to SQL Server fixed server or fixed database roles
  • Members of the Farm Administrators SharePoint group should also have SQL Server logins and should be members of the same roles as the Central Administration application pool account

Steps to recover content database(s) from the log ship destination and restore the DB to the DR server in Auckland

Assuming that you are now operating the Auckland servers and have a copy of your database restored to the SQL server in the DR farm, you can now attach the content database to the standby web application server (DR-SPWFE). You can do this via the Central Administration Interface or the STSADM command line. Personally I prefer the STSADM command line.

Delete the existing database (which is pre-configured without any site collections):

STSADM -o deletecontentdb -url <http://WebSiteName:port> -databasename <ContentDatabaseName> -databaseserver <OldPrincipalServer>

Add the database recovered from the Wellington server:

STSADM -o addcontentdb -url <http://BackupServerName:port> -databasename <ContentDatabaseName> -databaseserver <NewPrincipalServer>
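
If you have several content databases, typing these commands by hand under pressure is error prone. As a rough sketch only, the Python helper below builds the command lines ahead of time so they can be reviewed and pasted into the runbook. The function name and every server, URL and database name in the example are hypothetical, and the sketch assumes the values contain no spaces (nothing is quoted). It only produces text; it does not run STSADM.

```python
def build_stsadm_commands(web_app_url, placeholder_db, recovered_dbs, db_server):
    """Return STSADM command lines for swapping content databases on the DR farm.

    Assumption: one empty placeholder database is removed first, then each
    recovered database is added, all on the same DR SQL server.
    """
    commands = [
        f"STSADM -o deletecontentdb -url {web_app_url} "
        f"-databasename {placeholder_db} -databaseserver {db_server}"
    ]
    for name in recovered_dbs:
        commands.append(
            f"STSADM -o addcontentdb -url {web_app_url} "
            f"-databasename {name} -databaseserver {db_server}"
        )
    return commands


def example():
    # Hypothetical names for illustration; substitute your own.
    return build_stsadm_commands(
        web_app_url="http://DR-SPWFE:80",
        placeholder_db="WSS_Content_Empty",
        recovered_dbs=["WSS_Content_1", "WSS_Content_2"],
        db_server="DR-SQL",
    )
```

Printing the result of `example()` one line at a time gives a ready-made command list for the drill, which also makes it easier to time the attach step consistently between rehearsals.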

Alternatively you can follow the steps via Central Administration

  1. Go to SharePoint 3.0 Central Administration
  2. On the Central Administration site, click Application Management
  3. On the Application Management page, in the SharePoint Web Application Management section, click Content databases
  4. On the Manage Content Databases page, click the name of the content database that has failed over
  5. On the Manage Content Database Settings page, in the Remove Content Database section, select the Remove content database check box, and then click OK
  6. On the Manage Content Databases page, click Add a content database
  7. Enter the information for the database you just removed, but replace the information in Database Server with the name of the new principal server

Repeat steps 4-7 for each database that has failed over.

The diagram below outlines the reverse scenario.

[Diagram: DRSolutionReverse]

The most time-consuming part is typically the full restore of the databases. The larger your content databases, the longer it will take. Ideally you would have applied quota templates to your site collections, or set up multiple content databases, so that you don't end up with a DB larger than 50 GB. I'd like to hear from anyone who has actually followed through such a plan in a simulated setup to determine how long it typically takes to get back online. The best I could do was 24 minutes from the time DR was invoked to being back online, for a very similar setup with 3 content databases of about 20 GB in size.
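
To keep an eye on the 50 GB guideline and get a first guess at restore time, something like the following sketch may be useful. It is an illustration under stated assumptions: the function names are hypothetical, the database sizes must come from your own farm, and the restore throughput is a figure you have to measure in your own drill, since it depends entirely on your hardware and network.

```python
SIZE_LIMIT_GB = 50  # guideline mentioned in this post


def oversized_databases(sizes_gb, limit_gb=SIZE_LIMIT_GB):
    """Return the names of content databases larger than the limit, sorted."""
    return sorted(name for name, size in sizes_gb.items() if size > limit_gb)


def estimate_restore_minutes(size_gb, measured_gb_per_minute):
    """Rough linear estimate: size divided by a throughput you measured yourself."""
    if measured_gb_per_minute <= 0:
        raise ValueError("measured_gb_per_minute must be positive")
    return size_gb / measured_gb_per_minute
```

A linear estimate like this only covers the restore itself. The other steps in the process (bringing servers online, attaching databases, DNS changes) add to the total, which is why timing a full rehearsal remains the more reliable number.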

Happy DR 'ing :-)

Posted by Chandima Kulathilake
Tags: Administration, Deployment, Planning, SharePoint 2007