Active/passive cluster failover configurations

In an active/passive cluster failover configuration, one or more passive (standby) nodes are available to take over for failed nodes, and only the primary node is used for processing. When a node fails, a standby node takes over the resources and the identity of the failed node, and the services that the failed node provided are started on the standby node. After the takeover, clients can access the services without being aware that a different node now provides them.

The following figure illustrates an active/passive database failover configuration. The active and passive nodes share the same disk subsystem, although only the primary database server has access to it. The path from the standby node to the shared disk subsystem is not activated.

During normal operations, the application connects to the database server using the hostname dbprod, which resolves to the IP address 192.168.10.1.

[Figure: active/passive database failover configuration]
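The addressing described above can be sketched as follows. This is only an illustration: the two dictionaries are hypothetical stand-ins for name resolution and for the cluster's assignment of the service IP address, and the node labels are assumptions, not names from the product. The point is that the hostname-to-address mapping never changes. Only the node that holds the address does.

```python
# Hypothetical lookup tables; a real deployment would use DNS (or a hosts file)
# and cluster software rather than Python dictionaries.
name_to_ip = {"dbprod": "192.168.10.1"}      # hostname and address from the text
ip_to_node = {"192.168.10.1": "primary"}     # node currently holding the service IP


def node_serving(hostname):
    """Return the node that currently answers for the given service hostname."""
    return ip_to_node[name_to_ip[hostname]]


before = node_serving("dbprod")

# A failover moves the service IP address to the standby node.
# The hostname and the IP address that clients use stay the same.
ip_to_node["192.168.10.1"] = "standby"

after = node_serving("dbprod")
print(before, "->", after)
```

Because the application refers only to dbprod, nothing in its configuration needs to change when the service moves.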

Active/passive database failover process

During a node failure, the following steps typically occur:

On the original primary node:

  1. If the primary node is still up, the services on the primary node are brought down.
  2. All resources (specifically the disk subsystem) from the primary node are released.
  3. The service IP address (192.168.10.1) is released.

On the standby node:

  1. The disk subsystem is brought online.
  2. File systems are checked and repairs are made if needed.
  3. The service IP address (192.168.10.1) is configured.
  4. The services are started, and database rollforward recovery is initiated if necessary.
  5. The database services are opened.
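The ordering of the steps above can be sketched as a small simulation. This is a minimal sketch, not an implementation of any particular cluster product: the Node class, the function names, and the log messages are all hypothetical, and no real disk, network, or database operation is performed. It only shows that the primary node's resources are released before the standby node acquires them.

```python
from dataclasses import dataclass
from typing import List, Optional

SERVICE_IP = "192.168.10.1"  # service IP address used in the text


@dataclass
class Node:
    """Hypothetical in-memory stand-in for a cluster node (not a real cluster API)."""
    name: str
    alive: bool = True
    services_running: bool = False
    disk_online: bool = False
    service_ip: Optional[str] = None


def release_primary(primary: Node, log: List[str]) -> None:
    """Steps on the original primary node."""
    if primary.alive and primary.services_running:
        log.append("primary: services stopped")   # only possible if the node is still up
    primary.services_running = False
    primary.disk_online = False
    log.append("primary: disk subsystem released")
    primary.service_ip = None
    log.append("primary: service IP released")


def take_over(standby: Node, log: List[str], needs_recovery: bool) -> None:
    """Steps on the standby node, in the order listed above."""
    standby.disk_online = True
    log.append("standby: disk subsystem online")
    log.append("standby: file systems checked")   # placeholder; nothing is actually checked
    standby.service_ip = SERVICE_IP
    log.append("standby: service IP configured")
    log.append("standby: services started")
    if needs_recovery:
        log.append("standby: rollforward recovery run")
    standby.services_running = True
    log.append("standby: database services opened")


def fail_over(primary: Node, standby: Node, needs_recovery: bool = True) -> List[str]:
    """Run the release steps, then the takeover steps, and return the step log."""
    log: List[str] = []
    release_primary(primary, log)
    take_over(standby, log, needs_recovery)
    return log


primary = Node("node1", services_running=True, disk_online=True, service_ip=SERVICE_IP)
standby = Node("node2")
steps = fail_over(primary, standby)
for step in steps:
    print(step)
```

In this sketch the release steps always run first, so the service IP address and the disk subsystem are never held by both nodes at the same time.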

These failover or takeover steps can be automated with cluster failover software.

When fully automated, the failover can take 5 to 10 minutes.

Subsequent sections describe in more detail how active/passive failover configurations are used to protect many of the Sterling Selling and Fulfillment Foundation components.