Administration Guide

High Availability

The computer systems that host data services contain many distinct components, and each component has a "mean time between failures" (MTBF) associated with it. The MTBF is the average time that a component remains usable. The MTBF for a quality hard drive is on the order of one million hours (approximately 114 years). While this seems like a long time, in a large population of disks roughly one out of 200 is likely to fail within any six-month period, because six months (about 4,380 hours) is close to 0.5 percent of the MTBF.
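
To make the arithmetic concrete, the following Python sketch reproduces both figures from the stated MTBF. It assumes a constant failure rate, which is a simplification; real drives fail more often early in life and near wear-out.

   MTBF_HOURS = 1_000_000           # quoted MTBF for a quality hard drive
   HOURS_PER_YEAR = 24 * 365        # 8,760 hours

   # MTBF expressed in years: about 114
   print(f"MTBF in years: {MTBF_HOURS / HOURS_PER_YEAR:.0f}")

   # Probability that a given disk fails within six months,
   # approximated as (elapsed time) / MTBF for small fractions:
   six_month_hours = HOURS_PER_YEAR / 2          # 4,380 hours
   p_fail = six_month_hours / MTBF_HOURS         # ~0.44%
   print(f"Six-month failure probability: {p_fail:.2%}, "
         f"about 1 disk in {round(1 / p_fail)}")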

Although there are a number of methods to increase availability for a data service, the most common is an HA cluster. A cluster, when used for high availability, consists of two or more machines, a set of private network interfaces, one or more public network interfaces, and some shared disks. This configuration allows a data service to be moved from one machine in the cluster to another so that the service can continue providing access to its data. Moving a data service from one machine to another is called a failover, as illustrated in Figure 118.

Figure 118. Failover

The private network interfaces are used to send heartbeat messages, as well as control messages, among the machines in the cluster. The public network interfaces are used to communicate directly with clients of the HA cluster. The disks in an HA cluster are connected to two or more machines in the cluster, so that if one machine fails, another machine has access to them.

A data service running on an HA cluster has one or more logical public network interfaces and a set of disks associated with it. The clients of an HA data service connect via TCP/IP to the logical network interfaces of the data service only. If a failover occurs, the data service, along with its logical network interfaces and set of disks, is moved to another machine.
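
The heartbeat and failover behaviour described above can be sketched in a few lines. The following Python fragment is purely illustrative: the node names, the two-second heartbeat interval, and the three-missed-beats threshold are assumptions for the example, not values from any particular cluster product.

   import time

   HEARTBEAT_INTERVAL = 2.0    # seconds between heartbeats on the private network
   MISSED_BEATS_LIMIT = 3      # declare a machine dead after this many missed beats

   last_heartbeat = {"node-a": time.monotonic(), "node-b": time.monotonic()}
   services = {"db-service": "node-a"}  # data service -> machine hosting it

   def record_heartbeat(node):
       """Called whenever a heartbeat arrives over the private network."""
       last_heartbeat[node] = time.monotonic()

   def check_and_failover(now):
       """Fail services over from any machine that has gone silent."""
       for node, seen in last_heartbeat.items():
           if now - seen > HEARTBEAT_INTERVAL * MISSED_BEATS_LIMIT:
               for service, host in services.items():
                   if host == node:
                       target = next(n for n in last_heartbeat if n != node)
                       # In a real cluster, the logical network interfaces and
                       # shared disks would be moved to the target machine here.
                       services[service] = target
                       print(f"{service}: failed over from {node} to {target}")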

One of the benefits of an HA cluster is that a data service can recover without the aid of support staff, and it can do so at any time. Another benefit is redundancy. All of the parts in the cluster should be redundant, including the machines themselves. The cluster should be able to survive any single point of failure.

Even though highly available data services can be very different in nature, they have some common requirements. Clients of a highly available data service expect the network address and host name of the data service to remain the same, and expect to be able to make requests in the same way, regardless of which machine the data service is on.

Consider a Web browser that is accessing a highly available Web server. The request is issued with a URL (Uniform Resource Locator), which contains both a host name and the path to a file on the Web server. The browser expects both the host name and the path to remain the same after a failover of the Web server. If the browser is downloading a file from the Web server when the server fails over, the browser needs to reissue the request.
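
Because the host name survives the failover, the client only needs to reissue the same request. The sketch below shows that retry behaviour in Python; the URL, timeout, and retry parameters are hypothetical values chosen for illustration.

   import time
   import urllib.request

   URL = "http://ha-web.example.com/reports/summary.html"  # hypothetical address

   def fetch_with_retry(url, attempts=5, delay=2.0):
       """Fetch a URL, reissuing the request if the connection drops."""
       for attempt in range(1, attempts + 1):
           try:
               with urllib.request.urlopen(url, timeout=10) as resp:
                   return resp.read()
           except OSError:           # URLError and timeouts are subclasses
               if attempt == attempts:
                   raise
               time.sleep(delay)     # give the cluster time to finish the failover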

Availability of a data service is measured by the amount of time the data service is available to its users. The most common unit of measurement for availability is the percentage of "up time"; this is often referred to as the number of "nines":

   99.99% => service is down for (at most) 52.6 minutes / yr
   99.999% => service is down for (at most) 5.26 minutes / yr
   99.9999% => service is down for (at most) 31.5 seconds / yr
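
These figures follow directly from the definition: the maximum downtime per year is (1 - availability) multiplied by the 525,600 minutes in a 365-day year. The following Python snippet reproduces the table:

   MINUTES_PER_YEAR = 365 * 24 * 60    # 525,600

   for label, availability in [("99.99%", 0.9999),
                               ("99.999%", 0.99999),
                               ("99.9999%", 0.999999)]:
       downtime = (1 - availability) * MINUTES_PER_YEAR
       print(f"{label}: at most {downtime:.2f} minutes "
             f"({downtime * 60:.1f} seconds) of downtime per year")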

When designing and testing an HA cluster:

  1. Ensure that the administrator of the cluster is familiar with the system and what should happen when a failover occurs.
  2. Ensure that each part of the cluster is truly redundant and can be replaced quickly if it fails.
  3. Force a test system to fail in a controlled environment, and make sure that it fails over correctly each time.
  4. Keep track of the reasons for each failover. Although failovers should not happen often, it is important to address any issue that makes the cluster unstable. For example, if one component of the cluster caused five failovers in one month, find out why and fix it.
  5. Ensure that the support staff for the cluster is notified when a failover occurs.
  6. Do not overload the cluster. Ensure that the remaining systems can still handle the workload at an acceptable level after a failover (see the capacity sketch after this list).
  7. Check failure-prone components (such as disks) often, so that they can be replaced before problems occur.
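
The capacity check in point 6 amounts to a simple calculation: after any single machine fails, the combined capacity of the survivors must still cover the peak workload. A minimal Python sketch, with illustrative node names and capacity figures:

   node_capacity = {"node-a": 1000, "node-b": 1000, "node-c": 800}  # e.g. requests/sec
   peak_workload = 1500

   def survives_single_failure(capacity, load):
       """True if the cluster can carry the load after losing any one machine."""
       for failed in capacity:
           remaining = sum(c for node, c in capacity.items() if node != failed)
           if remaining < load:
               print(f"Overloaded if {failed} fails: {remaining} < {load}")
               return False
       return True

   print(survives_single_failure(node_capacity, peak_workload))  # True for these figures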

Fault Tolerance and Continuous Availability

Another way to increase the availability of a data service is fault tolerance. A fault-tolerant machine has all of its redundancy built in and should be able to withstand a single failure of any part, including CPU and memory. Fault-tolerant machines are most often used in niche markets and are usually expensive to implement. By contrast, an HA cluster with machines in different geographical locations has the added advantage of being able to recover from a disaster that affects only a subset of those locations.

Continuous availability is a step above high availability: it shelters clients from both planned and unplanned downtime. With a continuous availability configuration, the client is completely unaffected if one of the machines hosting the data service fails or is brought down for maintenance. Continuous availability configurations are more complex and more expensive to implement.

An HA cluster is the most common solution to increase availability because it is scalable, easy to use, and relatively inexpensive to implement.

