Administration Guide

Failover Time

Failover time is measured from the moment data first becomes unavailable until it is available again. Several events that occur during a failover can contribute significantly to the failover time:

Disk deporting and importing.
Deporting and importing disks usually takes little time compared to other events, but it does add to the overall down time. The more disks that must be moved from one machine to another during a failover, the longer the process takes. Defective disks can make the process take even longer.
Fsck of the file systems that are mounted for a logical host.
Before the file systems of the logical host can be mounted, each must pass an fsck that verifies the health of the file system. The larger the file system, the longer this check takes. Using a journalled file system drastically reduces this time. Because journalled file systems are normally used in an HA environment, fsck time is usually not an issue.
User scripts called from the HA agent.
The HA agent calls user scripts if they exist and are executable. Some of these scripts run synchronously and therefore add to the time it takes to bring up the HA instances. Ensure that they run as quickly as possible; consider running any external programs that these scripts call in the background.
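As a minimal sketch of the background approach, assume a user script written in Python (the HA agent does not require this). The helper name `launch_in_background` and the stand-in command are hypothetical and chosen only for illustration. The helper starts an external program and returns immediately instead of waiting for it to finish, so the synchronous script itself stays short.

```python
import subprocess
import sys
import time


def launch_in_background(command):
    """Start an external program and return without waiting for it to exit.

    `command` is an argument list such as ["/path/to/program", "arg"].
    The helper name is hypothetical; it is not part of the HA agent.
    """
    return subprocess.Popen(
        command,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # detach the child from the script's session
    )


if __name__ == "__main__":
    # Stand-in for a slow external program: a child that sleeps for 2 seconds.
    slow_command = [sys.executable, "-c", "import time; time.sleep(2)"]
    started = time.monotonic()
    child = launch_in_background(slow_command)
    elapsed = time.monotonic() - started
    print(f"script continued after {elapsed:.2f} s; child pid {child.pid}")
```

Background only the work that later steps do not depend on. If the instance cannot start correctly until the program has finished, it has to stay synchronous.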
HA-NFS.
For a single EEE instance in a mutual takeover configuration, HA-NFS must be used for the home directory of the instance owner. HA-NFS adds to failover time because of the lockd grace period (defined in the HA agent for HA-NFS), which is 90 seconds when running HA-NFS: any process that locks a file on the HA-NFS file system after a failover must wait until the grace period is over. The HA agent for DB2 is the first process to lock a file under the instance owner's home directory after a failover, and it records the time it takes to obtain that first lock. This time is displayed in the status report after a failover.
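The following sketch shows one way to time a first lock on a UNIX system. It is not the HA agent's implementation. The function name `time_first_lock` and the temporary file are assumptions made for the example. On a file system still in its lockd grace period, the blocking lock request would be where the wait occurs; on a local file system the measured time is close to zero.

```python
import fcntl
import os
import tempfile
import time


def time_first_lock(path):
    """Return the seconds spent waiting for an exclusive lock on `path`.

    Hypothetical helper for illustration; uses POSIX record locks via
    fcntl.lockf, which blocks until the lock is granted.
    """
    with open(path, "a") as handle:
        started = time.monotonic()
        fcntl.lockf(handle, fcntl.LOCK_EX)  # blocks until the lock is granted
        waited = time.monotonic() - started
        fcntl.lockf(handle, fcntl.LOCK_UN)  # release the lock again
    return waited


if __name__ == "__main__":
    # A temporary local file stands in for a file under the instance
    # owner's home directory.
    fd, lock_path = tempfile.mkstemp()
    os.close(fd)
    try:
        print(f"first lock obtained after {time_first_lock(lock_path):.3f} s")
    finally:
        os.remove(lock_path)
```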
Starting DB2.
Starting DB2 contributes only a small amount to the failover time. For an EE instance, it contributes about 5 to 15 seconds on average. For an EEE instance, it contributes about 10 seconds, plus about 5 seconds per database partition that is being failed over. For example, if three database partitions are being failed over, starting them contributes approximately 25 seconds (10 + 3 x 5) to the failover time. This does not include crash recovery for the databases of the instance.
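The EEE estimate above is simple arithmetic and can be written as a small helper. This is only a sketch: the function name `estimate_eee_start_seconds` is made up for the example, and the defaults are the approximate figures quoted above, not measured values for any particular system.

```python
def estimate_eee_start_seconds(partitions, base_seconds=10, per_partition_seconds=5):
    """Estimate the seconds that starting DB2 adds to an EEE failover.

    Uses the approximate figures from the text: about 10 seconds, plus
    about 5 seconds per database partition being failed over. Crash
    recovery time is not included.
    """
    if partitions < 1:
        raise ValueError("at least one database partition must be failed over")
    return base_seconds + per_partition_seconds * partitions


if __name__ == "__main__":
    # Three partitions, as in the example in the text: 10 + 3 * 5
    print(estimate_eee_start_seconds(3))  # prints 25
```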
Database crash recovery.
Crash recovery often accounts for the majority of the down time associated with a failover. How long it takes to recover a database depends on a number of factors, including:
- Client workload. Only changes to the database are logged in the transaction logs. If the client workload is mostly read-only operations, relatively few transactions must be applied to the database during crash recovery.
- Disk and machine speed. The speed of the disks and the machine that is hosting the HA instance also contributes to the time it takes to recover the database. The faster the system, the shorter the crash recovery time.
- Value of the softmax database configuration parameter. The value of softmax is the percentage of the log file size at which a soft checkpoint is taken and a log control file is written. The log control file is used during crash recovery to determine which log records are actually needed to restore the database to a consistent state. Reducing this value causes the database manager to trigger the page cleaners more often and to take more frequent soft checkpoints; although performance is reduced, database recovery is faster.
- Whether the instance is EE or EEE. If the instance is an EEE instance, the database restart operations are done in parallel: each database partition is responsible for restarting its own portion of the databases. For example, if a database holds 50 GB of data, an instance with four database partitions can recover it roughly four times faster than an EE instance can.
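Two of the factors above come down to simple arithmetic, sketched here. Both function names are hypothetical. The first assumes the data is spread evenly across the database partitions, which the text does not state. The second reads softmax as a plain percentage of one log file size, in whatever unit that size is expressed.

```python
def data_per_partition_gb(total_gb, partitions):
    """Data each partition restarts, assuming an even spread (an assumption)."""
    if partitions < 1:
        raise ValueError("partitions must be at least 1")
    return total_gb / partitions


def log_between_soft_checkpoints(softmax_percent, log_file_size):
    """Amount of log written between soft checkpoints.

    `log_file_size` may be in any unit; the result is in the same unit.
    """
    return log_file_size * softmax_percent / 100


if __name__ == "__main__":
    # 50 GB over four partitions, as in the example in the text.
    print(data_per_partition_gb(50, 4))            # prints 12.5
    # Halving softmax halves the log written between soft checkpoints.
    print(log_between_soft_checkpoints(100, 1000))  # prints 1000.0
    print(log_between_soft_checkpoints(50, 1000))   # prints 500.0
```

With four partitions each working on about a quarter of the data in parallel, the roughly fourfold speedup follows directly, provided the partitions are on comparable hardware.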
