Administering High-Availability (HA) systems

A high-availability (HA) system is made up of two or more machines (the primary and one or more backup machines) that are configured identically and designated as a cluster. Each machine is considered a node in the cluster. The primary and backup nodes share a cluster name and IP address. External processes use this name and IP address to access a service on the cluster, which runs on either the primary or one of the backup nodes. All nodes have access to a shared Redundant Array of Independent Disks (RAID) storage system. For Windows systems only, the shared RAID storage system is used solely by the active node.

The HA configuration provides shutdown and automatic restart of unresponsive (failed) software programs, and migration to the cluster backup node when failures on the active node are detected. The cluster backup node assumes the cluster name and IP address and automatically takes over system processing until the failure on the primary node is corrected and a failback is initiated (that is, processing is manually returned to the original system).
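On a Windows cluster managed by Microsoft Cluster Server (MSCS), for example, you can check group ownership and initiate a failback from a command prompt with the cluster.exe utility. The following is a sketch only; the group name "ICS Group" and the node name NODE1 are placeholders for the names used in your own configuration.

    rem Check which node currently owns the InterChange Server group
    cluster group "ICS Group" /status

    rem After the primary node is repaired, manually fail back by moving
    rem the group from the backup node back to the primary node (NODE1)
    cluster group "ICS Group" /moveto:NODE1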

This section provides the following information about how to manage a high-availability system that includes IBM WebSphere InterChange Server:

"Supported HA environments"

"Maintaining a Windows HA system"

Supported HA environments

The high-availability (HA) option is available on the operating systems listed at the following website:

http://www.ibm.com/software/integration/supportpace/category.html#cat2

In addition, both the System Installation Guide for UNIX and the System Installation Guide for Windows provide basic instructions on how to configure the hardware and software for use in an HA environment.

Maintaining a Windows HA system

Once an HA system is set up according to the configuration instructions provided in the System Installation Guide for Windows, it should need minimal maintenance or reconfiguration. This section summarizes some of the tasks for maintaining an HA system that is set up on a Windows operating system to use the Microsoft Cluster Server (MSCS) software. The following topics are covered:

"Checking cluster status"

"Detecting a failover"

"Moving groups to perform maintenance"

"Changing the status of a resource"

Checking cluster status

The MSCS Cluster Administrator is the primary administration tool that you use to administer and check the status of the cluster. Each resource, such as IBM WebSphere InterChange Server, is listed along with its state (online, offline, failed, online pending, or offline pending), its owner (node 1 or node 2), and the type of resource (a description such as IBM WebSphere InterChange Server, disk resource, or connector). From the Cluster Administrator window, you can see the status of the individual services and which of the cluster nodes is active.
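You can also check cluster status from a command prompt with the cluster.exe utility that is installed with MSCS. As a sketch:

    rem List each node in the cluster and its state (Up, Down, or Paused)
    cluster node /status

    rem List each group, its state, and the node that currently owns it
    cluster group /status

    rem List each resource, its state, and its owning group and node
    cluster resource /status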

Other Windows administrative tools provide information about the status of the cluster. In particular, check the MSCS online help and documentation for details about using the following tools to monitor the cluster:

Windows Event Viewer: View and manage the System, Security, and Application event logs.

Windows Services option in the Control Panel: Verify that the Cluster Service is running.
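You can also verify that the Cluster Service is running from a command prompt. The following sketch assumes the default service name, ClusSvc:

    rem Query the state of the Cluster Service
    sc query clussvc

    rem Alternatively, list the started services and filter for the cluster
    net start | find /i "cluster"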

Detecting a failover

Several types of icons appear in Cluster Administrator beside the various listings of groups, resources, and other elements. The most important icon to recognize is the node-down icon, which indicates that a failure has occurred on a cluster node and that its groups and resources have been transferred to the surviving cluster node. The node-down icon appears in Cluster Administrator as a cluster node icon with a red X through it.

The node-down icon does not necessarily mean that you have lost functionality in any of your groups or resources. In normal operation, the group fails over to the backup cluster node.
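A failover can also be detected from a command prompt; this sketch again uses the cluster.exe utility:

    rem A failed node is reported with the status Down
    cluster node /status

    rem The groups of the failed node should now list the surviving
    rem node as their owner
    cluster group /status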

Moving groups to perform maintenance

When you stop the Cluster Service on a node, you prevent clients from accessing cluster resources through that node, and all groups move to the other node (if the failover policies allow it). This can be useful when you need to take the primary node offline to perform maintenance or upgrade its software. The following steps describe how to move processing to the backup node so that you can maintain or upgrade the primary node (a command-line equivalent is sketched after the steps):

  1. Stop the Cluster Service on the backup node by clicking its node icon, clicking the File menu, and selecting Stop Cluster Service.
  2. Upgrade the backup node if a software upgrade is the maintenance being performed. Be sure to place executables and libraries on the node disk and data files on the shared RAID.
  3. Restart the Cluster Service on the backup node by clicking File > Start Cluster Service.
  4. On the primary node, right-click each group and select Move Group.

    This allows you to take a server offline without losing availability of your resources.

  5. Stop the Cluster Service on the primary node by clicking its node icon, then clicking File > Stop Cluster Service.
  6. Upgrade the primary node in the same manner as you upgraded the backup node.
  7. Restart the Cluster Service on the primary node by clicking File > Start Cluster Service.
  8. On the backup node, right-click each group and select Move Group to move the groups back to the primary node.
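If you prefer to perform the same procedure from a command prompt, the following sketch uses net.exe and cluster.exe. NODE1 (the primary), NODE2 (the backup), and the group name "ICS Group" are placeholders for your own configuration, and each net command must be run locally on the node indicated:

    rem Step 1: stop the Cluster Service on the backup node (run on NODE2)
    net stop clussvc

    rem Step 3: restart the Cluster Service on the backup node (run on NODE2)
    net start clussvc

    rem Step 4: move each group from the primary node to the backup node
    cluster group "ICS Group" /moveto:NODE2

    rem Step 5: stop the Cluster Service on the primary node (run on NODE1)
    net stop clussvc

    rem Step 7: restart the Cluster Service on the primary node (run on NODE1)
    net start clussvc

    rem Step 8: move each group back to the primary node
    cluster group "ICS Group" /moveto:NODE1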

Changing the status of a resource

Use Cluster Administrator to manually bring individual resources online or take them offline. Change the status by selecting the resource and, from the File menu, selecting Bring Online, Take Offline, or Initiate Failure. You should take offline or bring online only WebSphere MQ, connectors, and IBM WebSphere InterChange Server.
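The same status changes can be made with the cluster.exe utility. The resource name below is illustrative; use the names that your resources were given when the cluster was configured:

    rem Bring a resource online
    cluster resource "IBM WebSphere InterChange Server" /online

    rem Take a resource offline
    cluster resource "IBM WebSphere InterChange Server" /offline

    rem Simulate a resource failure (equivalent to Initiate Failure)
    cluster resource "IBM WebSphere InterChange Server" /fail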
