Failover time is measured from the moment data first becomes unavailable to the moment it is available again. Several events that occur during a failover can contribute significantly to this time:
Deporting and importing disks usually takes little time compared with the other events, but it does contribute to overall downtime. The more disks that must be moved from one machine to another during a failover, the longer the process takes, and defective disks can lengthen it further.
Before the file systems of the logical host can be mounted, each must pass an fsck check to verify its integrity. The larger the file system, the longer this check takes. A journalled file system reduces this time drastically, and because journalled file systems are normally used in an HA environment, fsck time is usually not an issue.
The HA agent will call user scripts if they exist and are executable. Some of these scripts are run synchronously, and can add to the time it takes to bring up the HA instances. Ensure that they run as quickly as possible; consider running any external programs called by these scripts in the background.
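As an illustration of the advice above, the following is a minimal sketch of how a user script might start a slow external program without waiting for it. It assumes a POSIX system; the helper name `start_in_background` and the placeholder slow task are hypothetical and are not part of the HA agent.

```python
import subprocess
import sys
import time


def start_in_background(command):
    """Launch command without waiting for it to finish; return the Popen handle."""
    return subprocess.Popen(
        command,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # detach from the calling script's session (POSIX)
    )


if __name__ == "__main__":
    # Hypothetical slow task standing in for an external program.
    slow_task = [sys.executable, "-c", "import time; time.sleep(5)"]
    started = time.monotonic()
    start_in_background(slow_task)
    # The script reaches this line immediately, not after five seconds.
    print(f"script returned after {time.monotonic() - started:.2f} s")
```

Only work that the HA instances do not depend on should be deferred this way, since the script returns before the background program has finished.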
For a single EEE instance in a mutual takeover configuration, HA-NFS must be used for the home directory of the instance owner. HA-NFS adds to failover time because of the lockd grace period (defined in the HA agent for HA-NFS), which is 90 seconds when running HA-NFS: any process that locks a file on the HA-NFS file system after a failover must wait until the grace period is over. The HA agent for DB2 is the first process to lock a file under the instance owner's home directory after a failover, and it records how long it takes to obtain that first lock. This time is displayed in the status report after a failover.
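The idea of timing the first lock can be sketched as follows. This is not the HA agent's implementation; it is an illustrative example, assuming a POSIX system, of measuring how long a blocking lock request waits. The function name `time_first_lock` and the file path passed to it are hypothetical.

```python
import fcntl
import time


def time_first_lock(path):
    """Return the seconds spent waiting for an exclusive lock on path."""
    # Open for appending so the file is writable and is created if missing.
    with open(path, "a") as handle:
        started = time.monotonic()
        fcntl.lockf(handle, fcntl.LOCK_EX)  # blocks until the lock is granted
        waited = time.monotonic() - started
        fcntl.lockf(handle, fcntl.LOCK_UN)  # release the lock again
    return waited
```

On a local file system the call returns almost immediately; on an NFS mount whose lock manager is still in its grace period, the wait would be expected to last until the grace period ends.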
Starting DB2 contributes only a small amount to the failover time. For an EE instance, it contributes about 5 to 15 seconds on average. For an EEE instance, it contributes about 10 seconds, plus about 5 seconds for each database partition that is being failed over. If three database partitions are being failed over, for example, starting them contributes approximately 25 seconds (10 + 3 × 5) to the failover time. This does not include crash recovery for the databases of the instance.
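The EEE rule of thumb above can be written as a small calculation. This is a rough sketch only: the function name is hypothetical, and the default values are the approximate figures quoted in the text, not measured constants.

```python
def estimate_eee_start_seconds(partitions, base_seconds=10, per_partition_seconds=5):
    """Approximate DB2 start time for an EEE instance, excluding crash recovery."""
    if partitions < 0:
        raise ValueError("partitions must not be negative")
    return base_seconds + per_partition_seconds * partitions


# Three partitions failing over: 10 + 3 * 5
print(estimate_eee_start_seconds(3))  # 25
```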
Crash recovery often accounts for the majority of the downtime associated with a failover. How long it takes to recover a database depends on a number of factors, including: