High availability (HA) answers the
question 'What do I do if a single machine fails?' It means having a machine that can immediately take
over when the main machine has a problem, with little downtime and no
loss of data.
HA is the measurement of a system’s ability to remain accessible
in the event of a system component failure. Generally, HA is implemented by
building multiple levels of fault tolerance and/or load-balancing
capabilities into a system.
Disaster Recovery (DR), on the
other hand, answers 'What do I do if a disaster (fire, flood,
war, the ISP going bankrupt, whatever) hits the whole data center?' A DR solution is
intended to take over in the event of a
disaster at the main site.
DR is the
process by which a system is restored to a previous acceptable state, after a
natural or man-made disaster.
While both
increase overall availability, a notable difference is that with HA there is,
generally, no loss of service, whereas with DR there is usually a slight loss
of service while the DR plan is executed and the system is restored. HA refers
to the retaining of the service and DR to the retaining of the data. HA and DR
strategies should strive to address any non-functional requirements, such as
performance, system availability, fault tolerance, data retention, business
continuity, and user experience. It is imperative that selection of the
appropriate HA and DR strategy be driven by business requirements. For HA,
determine any service level agreements expected of your system. For DR, use
measurable characteristics, such as Recovery Time Objective (RTO) and Recovery Point Objective (RPO), to drive your DR
plan.
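To make a service level agreement concrete, it can help to translate an availability percentage into a downtime budget. The sketch below is only an illustration: the function name allowed_downtime_minutes and the sample targets are assumptions, not part of any particular SLA.

```python
def allowed_downtime_minutes(availability_percent: float, period_days: float = 365.0) -> float:
    """Return the downtime budget, in minutes, implied by an availability target.

    availability_percent is the agreed target (for example 99.9) and
    period_days is the length of the measurement period.
    """
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - availability_percent / 100.0)


# Illustrative targets only; real values come from the business requirements.
for target in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% availability allows about {minutes:.1f} minutes of downtime per year")
```

Each extra "nine" cuts the yearly downtime budget by a factor of ten, which is why the agreed target has such a strong influence on the HA design.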
The following requirements are the most common IT
considerations for establishing an HADR solution:
Recovery time objective (RTO)
The time measured from the moment the application becomes
unavailable to the moment of recovery (resuming business operations).
Recovery point objective (RPO)
The last data point to which production is recovered after a
failure. Ideally, customers want an RPO of zero lost data. Practically
speaking, we tend to accept a recovery point associated with a particular
application state.
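The two objectives can be checked against the timestamps of an outage or a DR test. The following is a minimal sketch under assumed names (measure_outage, meets_objectives) and made-up timestamps: recovery time runs from the failure to the resumption of service, and the data loss window runs from the last recoverable data point to the failure.

```python
from datetime import datetime, timedelta


def measure_outage(last_recoverable_point: datetime,
                   failure_time: datetime,
                   service_restored: datetime):
    """Return (recovery_time, data_loss_window) for one outage."""
    recovery_time = service_restored - failure_time          # compare with the RTO
    data_loss_window = failure_time - last_recoverable_point  # compare with the RPO
    return recovery_time, data_loss_window


def meets_objectives(recovery_time: timedelta, data_loss_window: timedelta,
                     rto: timedelta, rpo: timedelta) -> bool:
    """True when both measured values are within their objectives."""
    return recovery_time <= rto and data_loss_window <= rpo


# Hypothetical outage: last replicated data at 10:00, failure at 10:04,
# business operations resumed at 10:34.
recovery_time, data_loss_window = measure_outage(
    datetime(2024, 1, 1, 10, 0),
    datetime(2024, 1, 1, 10, 4),
    datetime(2024, 1, 1, 10, 34),
)
ok = meets_objectives(recovery_time, data_loss_window,
                      rto=timedelta(hours=1), rpo=timedelta(minutes=5))
print(recovery_time, data_loss_window, ok)
```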
A comprehensive end-to-end HADR solution has the following
basic components:
- Application data resiliency
Data resiliency is the base or foundational element for a
high availability and disaster recovery solution deployment. Methods and characteristics:
Storage-based resiliency: Storage replication is the most
commonly used technique for deploying cluster-wide data resiliency. There are
two general categories for storage-based resiliency: shared-disk topology and
shared-everything topology.
Log-based replication: Log-based replication is a form of
resiliency primarily associated with databases. Typically, database logs are
used to monitor changes that are then replicated to a second system where those
changes are applied.
- Application infrastructure resiliency
Infrastructure resiliency provides the overall environment
that is required to resume full production at a standby node. This environment
includes the entire list of resources that the application requires upon
failover for the operations to resume automatically. Methods and characteristics:
Application infrastructure resiliency has two aspects. First,
it provides the application with all the resources that it requires to resume
operations at an alternate node in the cluster. Second, it provides for cluster
integrity by using monitoring and verification. These resources include items
such as dependent hardware, middleware, IP connectivity, configuration files,
attached devices (such as printers), security profiles, application-specific custom
resources (such as a crypto card), and the application data itself.
- Application state resiliency
Application state resiliency is characterized by the
application recovery point, that is, the state the application is in when the
production environment resumes on a secondary node in the cluster. How the
application resumes varies by application design and customer requirements.
Where will the application recovery point be with respect to
the last application transaction? If your application is designed with commit boundaries
and the outage is an unplanned failover, then the recovery point in the
application will be to that last commit boundary. If you are conducting a planned
outage role swap, then the application is quiesced so that memory can be
flushed to the shared-disk resource and the data and application are
subsequently varied on to the secondary node.
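The link between log-based replication and commit boundaries can be sketched in a few lines. This is a simplified illustration, not how any particular database implements it: the log record layout (transaction id, operation, key, value) and the function name replay_to_last_commit are assumptions. The standby applies only the changes of committed transactions, so after an unplanned failover the recovery point is the last commit boundary.

```python
from typing import Dict, List, Tuple

# Hypothetical log record layout: (transaction_id, operation, key, value),
# where operation is either "write" or "commit".
LogRecord = Tuple[int, str, str, object]


def replay_to_last_commit(log: List[LogRecord]) -> Dict[str, object]:
    """Rebuild the standby state from a replicated log.

    Writes are buffered per transaction and applied only when that
    transaction's commit record is seen. Writes without a commit record
    are discarded.
    """
    state: Dict[str, object] = {}
    pending: Dict[int, List[Tuple[str, object]]] = {}
    for txn_id, operation, key, value in log:
        if operation == "write":
            pending.setdefault(txn_id, []).append((key, value))
        elif operation == "commit":
            for committed_key, committed_value in pending.pop(txn_id, []):
                state[committed_key] = committed_value
    return state


# Transaction 1 committed before the failure; transaction 2 did not.
replicated_log: List[LogRecord] = [
    (1, "write", "balance", 100),
    (1, "commit", "", None),
    (2, "write", "balance", 40),
]
standby_state = replay_to_last_commit(replicated_log)
print(standby_state)
```

In this example the uncommitted write of transaction 2 is not visible on the standby, so the application resumes from the state left by transaction 1.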
A complete end-to-end solution incorporates all three
elements into one integrated environment that addresses one or all of the
outage types.
The solution for a given customer depends on how these basic
elements are included and incorporated into the clustering configuration. For
example, you can have a solution based purely upon data resiliency and leave
the application resiliency aspects of the final recovery process to IT
operational procedures. Alternatively, you can incorporate the data resiliency
into the overall clustering topology, enabling automated recovery processing.