Hai Tao's Blog: High Availability (HA), Disaster Recovery (DR) & HADR Solution

High availability (HA) answers the question 'What do I do if a single machine fails?' It means having a machine that can take over immediately when the main machine has a problem, with little downtime and no loss of data.

HA is a measure of a system's ability to remain accessible when a system component fails. Generally, HA is implemented by building multiple levels of fault tolerance and/or load-balancing capabilities into a system.

Disaster recovery (DR), on the other hand, answers 'What do I do if a disaster (fire, flood, war, the ISP going bankrupt, whatever) hits the whole data center?' It is something intended to take over in the event of a disaster at the main site.

DR is the process by which a system is restored to a previous acceptable state, after a natural or man-made disaster.

While both increase overall availability, a notable difference is that with HA there is generally no loss of service, whereas with DR there is usually a slight loss of service while the DR plan is executed and the system is restored. HA refers to retaining the service and DR to retaining the data.

HA and DR strategies should strive to address any non-functional requirements, such as performance, system availability, fault tolerance, data retention, business continuity, and user experience. It is imperative that the selection of the appropriate HA and DR strategy be driven by business requirements. For HA, determine any service level agreements expected of your system. For DR, use measurable characteristics, such as Recovery Time Objective (RTO) and Recovery Point Objective (RPO), to drive your DR plan.
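To make a service level agreement concrete, it can help to turn an availability percentage into a downtime budget. The following is a minimal Python sketch, not taken from any of the referenced documents; the function names are hypothetical, and it assumes availability is defined as uptime divided by total time over a non-leap year.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year


def allowed_downtime_hours(availability_percent, period_hours=HOURS_PER_YEAR):
    """Downtime budget implied by an availability target over a period."""
    return (1 - availability_percent / 100) * period_hours


def achieved_availability(uptime_hours, downtime_hours):
    """Availability actually delivered, as a percentage of total time."""
    return 100 * uptime_hours / (uptime_hours + downtime_hours)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        budget = allowed_downtime_hours(target)
        print(f"{target}% availability allows about {budget:.2f} hours of downtime per year")
```

Under these assumptions, a 99.9% target leaves roughly 8.76 hours of downtime per year, which is the kind of figure an HA design has to fit inside.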

The following requirements are the most common IT considerations for establishing an HADR solution:

Recovery time objective (RTO)

The time measured from the moment the application becomes unavailable to the moment of recovery (resuming business operations).

Recovery point objective (RPO)

The last data point to which production is recovered upon a failure. Ideally, customers want the RPO to be zero lost data. Practically speaking, we tend to accept a recovery point associated with a particular application state.
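As an illustration of the two definitions above, the sketch below measures what an incident actually cost and compares it against the objectives. It is only a sketch: the function names and the timestamps are hypothetical, and it assumes the last recoverable data point is known (for example, the time of the last backup or replicated change).

```python
from datetime import datetime, timedelta


def measure_recovery(last_recoverable_point, outage_start, service_restored):
    """Return (achieved recovery time, achieved data-loss window)."""
    recovery_time = service_restored - outage_start
    data_loss_window = outage_start - last_recoverable_point
    return recovery_time, data_loss_window


def meets_objectives(recovery_time, data_loss_window, rto, rpo):
    """True only if both the RTO and the RPO were met."""
    return recovery_time <= rto and data_loss_window <= rpo


# Hypothetical incident: last good data at 11:45, outage at 12:00, service back at 12:30.
recovery_time, data_loss = measure_recovery(
    last_recoverable_point=datetime(2024, 1, 1, 11, 45),
    outage_start=datetime(2024, 1, 1, 12, 0),
    service_restored=datetime(2024, 1, 1, 12, 30),
)

if __name__ == "__main__":
    ok = meets_objectives(recovery_time, data_loss,
                          rto=timedelta(hours=1), rpo=timedelta(minutes=15))
    print("recovery time:", recovery_time, "data loss window:", data_loss, "met:", ok)
```

In this example the recovery took 30 minutes and 15 minutes of data were lost, so an RTO of one hour and an RPO of 15 minutes are both met, while an RPO of zero would not be.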

A comprehensive end-to-end HADR solution has the following basic components:

- Application data resiliency

Data resiliency is the foundational element of a high availability and disaster recovery solution deployment. Methods and characteristics:

Storage-based resiliency: Storage replication is the most commonly used technique for deploying cluster-wide data resiliency. There are two general categories for storage-based resiliency: shared-disk topology and shared-everything topology.

Log-based replication: Log-based replication is a form of resiliency primarily associated with databases. Typically, database logs are used to monitor changes that are then replicated to a second system where those changes are applied.
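The idea behind log-based replication can be sketched with a toy in-memory model. This is an illustrative assumption, not how any particular database implements it: the Primary and Standby classes and the sequence-number scheme are hypothetical, and a real product ships log records over a network rather than sharing a Python list.

```python
from dataclasses import dataclass, field


@dataclass
class Primary:
    data: dict = field(default_factory=dict)
    log: list = field(default_factory=list)  # ordered change records

    def put(self, key, value):
        # Record the change in the log first, then apply it locally.
        sequence = len(self.log) + 1
        self.log.append((sequence, key, value))
        self.data[key] = value


@dataclass
class Standby:
    data: dict = field(default_factory=dict)
    applied_sequence: int = 0  # sequence number of the last applied record

    def apply_from(self, log):
        # Apply only the records that have not been applied yet, in order.
        for sequence, key, value in log:
            if sequence > self.applied_sequence:
                self.data[key] = value
                self.applied_sequence = sequence


if __name__ == "__main__":
    primary, standby = Primary(), Standby()
    primary.put("order-1", "created")
    standby.apply_from(primary.log)
    primary.put("order-1", "shipped")  # not yet replicated
    print("standby lags:", standby.data != primary.data)
    standby.apply_from(primary.log)
    print("standby caught up:", standby.data == primary.data)
```

Any change that has been logged on the primary but not yet applied on the standby is the replication lag, which is what bounds the achievable RPO in this kind of design.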

- Application infrastructure resiliency

Infrastructure resiliency provides the overall environment that is required to resume full production at a standby node. This environment includes the entire list of resources that the application requires upon failover for the operations to resume automatically. Methods and characteristics:

Application infrastructure resiliency has two aspects. First, it provides the application with all the resources that it requires to resume operations at an alternate node in the cluster. These resources include items such as dependent hardware, middleware, IP connectivity, configuration files, attached devices (printers), security profiles, application-specific custom resources (for example, a crypto card), and the application data itself. Second, it provides for cluster integrity by using monitoring and verification.
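The two aspects can be sketched as a single failover decision: monitor the primary, and verify that the standby has every resource the application needs before letting it take over. This is a simplified sketch under stated assumptions; the resource names, the heartbeat timeout, and the function names are all hypothetical, and real cluster software performs far more checks.

```python
# Hypothetical list of resources the application needs on the standby node.
REQUIRED_RESOURCES = {
    "ip_address",
    "middleware",
    "config_files",
    "security_profiles",
    "application_data",
}


def node_is_alive(last_heartbeat, now, timeout_seconds):
    """Monitoring: the node counts as alive if its heartbeat is recent enough."""
    return (now - last_heartbeat) <= timeout_seconds


def missing_resources(available):
    """Verification: which required resources the standby does not have."""
    return sorted(REQUIRED_RESOURCES - set(available))


def failover_decision(primary_last_heartbeat, now, standby_resources, timeout_seconds=10.0):
    if node_is_alive(primary_last_heartbeat, now, timeout_seconds):
        return "stay"
    missing = missing_resources(standby_resources)
    if missing:
        return "blocked: missing " + ", ".join(missing)
    return "failover"


if __name__ == "__main__":
    # Times are plain seconds here; the primary was last heard from 20 seconds ago.
    print(failover_decision(100.0, 120.0, REQUIRED_RESOURCES))
    print(failover_decision(100.0, 120.0, {"ip_address", "middleware"}))
```

The point of the second check is that a standby node which is running but lacks, say, the configuration files is not a usable failover target, even though data replication may be healthy.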

- Application state resiliency

Application state resiliency is characterized by the application recovery point, that is, the state the application is in when the production environment resumes on a secondary node in the cluster. How the application resumes varies by application design and customer requirements.

Where will the application recovery point be with respect to the last application transaction? If your application is designed with commit boundaries and the outage is an unplanned failover, then the recovery point in the application will be that last commit boundary. If you are conducting a planned role swap, then the application is quiesced so that memory can be flushed to the shared-disk resource, and the data and application are subsequently varied on to the secondary node.
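The effect of commit boundaries on the recovery point can be sketched with a toy log replay. This is an illustration only: the log format, the "COMMIT" marker, and the function name are hypothetical, and a planned role swap is modeled simply as the application committing its in-flight work before the switch.

```python
COMMIT = "COMMIT"  # hypothetical marker for a commit boundary in the log


def recover_to_last_commit(log):
    """Replay the log, keeping only changes sealed by a commit boundary."""
    state, pending = {}, {}
    for entry in log:
        if entry == COMMIT:
            state.update(pending)  # the pending changes become durable
            pending = {}
        else:
            key, value = entry
            pending[key] = value
    return state  # changes after the last commit boundary are discarded


# Unplanned failover: the last two changes were never committed.
unplanned_log = [("balance", 100), COMMIT, ("balance", 80), ("fee", 5)]

# Planned role swap: the application is quiesced, so in-flight work is committed first.
planned_log = unplanned_log + [COMMIT]

if __name__ == "__main__":
    print("after unplanned failover:", recover_to_last_commit(unplanned_log))
    print("after planned role swap:", recover_to_last_commit(planned_log))
```

In the unplanned case the secondary node resumes with a balance of 100, the state at the last commit boundary, whereas the planned swap carries over all the work.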

A complete end-to-end solution incorporates all three elements into one integrated environment that addresses one or all of the outage types.

A solution for a customer depends upon the inclusion and incorporation of these basic elements into the clustering configuration. For example, you can have a solution based purely upon data resiliency and leave the application resiliency aspects of the final recovery process to IT operational procedures. Alternatively, you can incorporate the data resiliency into the overall clustering topology, enabling automated recovery processing.

References:

http://technet.microsoft.com/en-us/library/hh393522.aspx

http://www.redbooks.ibm.com/redpapers/pdfs/redp4669.pdf

http://www.drj.com/articles/online-exclusive/understanding-high-availability-and-disaster-recovery-in-your-overall-recovery-strategy.html
