Sun Cluster 2.2 (SC2.2) is Sun Microsystems' clustering and high availability (HA) product. SC2.2 supports up to four machines in a single cluster. Using four Ultra Enterprise 10000s, a cluster can have up to 256 CPUs and 256 GB of RAM. The table below lists the processor, memory, and I/O capacity of each supported system.
System | UltraSPARC Processors | Memory Capacity | I/O Slots |
Ultra Enterprise 1 | 1 | 64MB-1GB | 3 SBus |
Ultra Enterprise 2 | 1-2 | 64MB-2GB | 4 SBus |
Ultra Enterprise 450 | 1-4 | 32MB-4GB | 10 PCI |
Ultra Enterprise 3000 | 1-6 | 64MB-6GB | 9 SBus |
Ultra Enterprise 4000 | 1-14 | 64MB-14GB | 21 SBus |
Ultra Enterprise 5000 | 1-14 | 64MB-14GB | 21 SBus |
Ultra Enterprise 6000 | 1-30 | 64MB-30GB | 45 SBus |
Ultra Enterprise 10000 | 1-64 | 512MB-64GB | 64 SBus |
The Sun Cluster software includes a number of high availability agents that are supported and shipped with the SC2.2 product. Other HA agents, such as the one for DB2, are developed outside of Sun, and are not shipped with the Sun Cluster software. The HA agent for DB2 is shipped with DB2, and supported by IBM.
The Sun Cluster software works with highly available data services by providing an opportunity to register methods (scripts or programs) that correspond to various components of the Sun Cluster software. Using these methods, the SC2.2 software can control a data service without having intimate knowledge of it. These methods include:
- START and START_NET
- STOP and STOP_NET
- ABORT and ABORT_NET
- FM_INIT, FM_START, FM_STOP, and FM_CHECK (the fault monitoring methods)
A high availability agent consists of one or more of these methods. The methods are registered with SC2.2 through the hareg command. Once registered, the Sun Cluster software calls the corresponding method to control the data service.
The HA agent for DB2 consists of the following scripts: START_NET, STOP_NET, FM_START, and FM_STOP. The ABORT, ABORT_NET, and FM_CHECK scripts are not run during a cluster reconfiguration.
It is important to remember that the ABORT and STOP methods of a service may not be called. These methods are intended for the controlled shutdown of a data service, and the data service must be able to recover if a machine fails without calling them.
For more information, refer to the Sun Cluster documentation.
The SC2.2 software uses the concept of a logical host. A logical host consists of a set of disks and one or more logical public network interfaces. A highly available data service is associated with a logical host, and requires the disks that are in the disk groups of the logical host. Logical hosts can be hosted by different machines in the cluster, and "borrow" the CPUs and memory of the machine on which they are running.
As with other UNIX-based operating systems, Solaris can host extra IP addresses on a network interface in addition to the primary one. Each extra IP address resides on a logical interface, in the same way that the primary IP address resides on the physical network interface. Following is an example of the logical interfaces on two machines in a cluster. There are two logical hosts, and both are currently hosted by the machine "thrash".
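As an illustration only (the interface name and address are taken from the example that follows, the netmask is an assumption, and SC2.2 performs these steps automatically for logical hosts), a logical interface is typically brought up and down on Solaris with commands along these lines:

# Sketch only -- SC2.2 does this automatically for logical hosts.
# Plumb a logical interface on the public adapter hme0 and assign
# an extra IP address to it (the netmask shown is an assumption).
ifconfig hme0:1 plumb
ifconfig hme0:1 9.21.55.109 netmask 255.255.255.0 up

# Take the address down again and remove the logical interface.
ifconfig hme0:1 0.0.0.0 down
ifconfig hme0:1 unplumb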
scadmin@crackle(202)# netstat -in
Name    Mtu   Net/Dest       Address        Ipkts    Ierrs Opkts    Oerrs Collis Queue
lo0     8232  127.0.0.0      127.0.0.1      289966   0     289966   0     0      0
hme0    1500  9.21.55.0      9.21.55.98     121657   6098  764122   0     0      0
scid0   16321 204.152.65.0   204.152.65.1   489307   0     476479   0     0      0
scid0:1 16321 204.152.65.32  204.152.65.33  0        0     0        0     0      0
scid1   16321 204.152.65.16  204.152.65.17  347317   0     348073   0     0      0

1. lo0 is the loopback interface
2. hme0 is the public network interface (Ethernet)
3. scid0 is the first private network interface (SCI, or Scalable Coherent Interface)
4. scid0:1 is a logical network interface that the Sun Cluster software uses internally
5. scid1 is the second private network interface

scadmin@thrash(203)# netstat -in
Name    Mtu   Net/Dest       Address        Ipkts    Ierrs Opkts    Oerrs Collis Queue
lo0     8232  127.0.0.0      127.0.0.1      1128780  0     118780   0     0      0
hme0    1500  9.21.55.0      9.21.55.92     1741422  5692  757127   0     0      0
hme0:1  1500  9.21.55.0      9.21.55.109    0        0     0        0     0      0
hme0:2  1500  9.21.55.0      9.21.55.110    0        0     0        0     0      0
scid0   16321 204.152.65.0   204.152.65.2   476641   0     489476   0     0      0
scid0:1 16321 204.152.65.32  204.152.65.34  0        0     0        0     0      0
scid1   16321 204.152.65.16  204.152.65.18  348199   0     347444   0     0      0

1. hme0:1 is a logical network interface for a logical host
2. hme0:2 is a logical network interface for another logical host
A logical host can have one or more logical interfaces associated with it. These logical interfaces move with the logical host from machine to machine, and are used to access the data service that is associated with the logical host. Because these logical interfaces move with the logical hosts, clients can access the data service independently of the machine on which it resides.
A highly available data service should bind to the TCP/IP address INADDR_ANY. This ensures that every IP address on the system can accept connections for the data service. If a data service binds to a specific IP address instead, it must bind to the address of the logical interface associated with the logical host that is hosting the data service. Binding to INADDR_ANY also removes the need to rebind when an IP address that the data service requires is brought up on the system.
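One quick way to verify the bind is with netstat. In this sketch, port 50000 is an assumed DB2 service port; a local address of *.50000 indicates a bind to INADDR_ANY, while an address such as 9.21.55.109.50000 indicates a bind to one specific address.

# Check which local address the service is listening on
# (50000 is an assumed DB2 port).
netstat -an | grep 50000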
Note: Clients of an HA instance should catalog the database using the host name for the logical IP address of a logical host. They should never use the primary host name for a machine, because there is no guarantee that DB2 will be running on that machine.
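For example, a client might be cataloged against the logical host name as shown below. This is a sketch: the node name hanode0, the database name sample, the alias hadb, and the port 50000 are hypothetical; log0 stands for the host name of the logical IP address of the logical host.

# Catalog the node using the logical host name, never the
# physical host name of a cluster machine.
db2 catalog tcpip node hanode0 remote log0 server 50000

# Catalog the database at that node and connect through it.
db2 catalog database sample as hadb at node hanode0
db2 connect to hadb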
Disks for a data service are associated with a logical host in groups (or sets). If the cluster is running Sun StorEdge Volume Manager (Veritas), the Sun Cluster software uses the Veritas "vxdg" utility to import and deport the disk groups for each logical host. Following is an example of the disk groups for two logical hosts, "log0" and "log1", which are being hosted by a machine called "thrash". The machine called "crackle" is not currently hosting any logical hosts.
scadmin@crackle(206)# vxdg list
NAME         STATE           ID
rootdg       enabled         899825206.1025.crackle

scadmin@thrash(205)# vxdg list
NAME         STATE           ID
rootdg       enabled         924176206.1025.thrash
data0        enabled         925142028.1157.crackle
data1        enabled         899826248.1108.crackle
The disk groups "data0" and "data1" correspond to the logical hosts "log0" and "log1", respectively. The disk group "data0" can be deported from "thrash" by running
vxdg deport data0
and imported to "crackle" by running
vxdg import data0
This is done automatically by the Sun Cluster software, and should not be done manually on a live cluster.
Each disk group contains a number of disks that can be shared between two or more machines in the cluster. A logical host can only be moved to another machine that has physical access to the disks in the disk groups that belong to it.
There are two files that control the file systems for each logical host:
/etc/opt/SUNWcluster/conf/hanfs/vfstab.<logical_host>
/etc/opt/SUNWcluster/conf/hanfs/dfstab.<logical_host>
where <logical_host> is the name of the associated logical host.
The vfstab file is similar to the /etc/vfstab file, except that it contains entries for the file systems to be mounted after the disk groups have been imported for a logical host. The dfstab file is similar to the /etc/dfs/dfstab file, except that it contains entries for file systems to export through HA-NFS for a logical host. Each machine has its own copy of these files, and care should be taken to ensure that they have the same content on each machine in the cluster.
Note: The paths for the vfstab and dfstab files of a logical host are misleading, because they contain the directory hanfs. Only the dfstab file for a logical host is used for HA-NFS. The vfstab file is used even if HA-NFS is not configured.
Following are examples from a cluster running DB2 Universal Database Enterprise - Extended Edition (EEE) in a mutual takeover configuration:
scadmin@thrash(217)# ls -l /etc/opt/SUNWcluster/conf/hanfs
total 8
-rw-r--r--   1 root     build        173 Apr 14 15:01 dfstab.log0
-rw-r--r--   1 root     build        316 Apr 26 12:07 vfstab.log0
-rw-r--r--   1 root     build        389 Apr 13 21:04 vfstab.log1

scadmin@thrash(218)# cat dfstab.log0
share -F nfs -o root=crackle:thrash:\
jolt:bump:crackle.torolab.ibm.com:thrash.torolab.ibm.com:\
jolt.torolab.ibm.com:bump.torolab.ibm.com /log0/home
The hosts that are given permission to mount the file system /log0/home are identified by the host names for all of the network interfaces (logical and physical) on each machine in the cluster. The file system is exported with root permissions.
scadmin@thrash(220)# cat vfstab.log0
#device to mount              device to fsck                 mount point  FS type  fsck pass  mount at boot  mount options
/dev/vx/dsk/data0/data1-stat  /dev/vx/rdsk/data0/data1-stat  /log0        ufs      2          no             -
/dev/vx/dsk/data0/vol01       /dev/vx/rdsk/data0/vol01       /log0/home   ufs      2          no             -
/dev/vx/dsk/data0/vol02       /dev/vx/rdsk/data0/vol02       /log0/data   ufs      2          no             -

scadmin@thrash(221)# cat vfstab.log1
#device to mount              device to fsck                 mount point  FS type  fsck pass  mount at boot  mount options
/dev/vx/dsk/data1/data1-stat  /dev/vx/rdsk/data1/data1-stat  /log1        ufs      2          no             -
/dev/vx/dsk/data1/vol01       /dev/vx/rdsk/data1/vol01       /log1/home   ufs      2          no             -
/dev/vx/dsk/data1/vol02       /dev/vx/rdsk/data1/vol02       /log1/data   ufs      2          no             -
/dev/vx/dsk/data1/vol03       /dev/vx/rdsk/data1/vol03       /log1/data1  ufs      2          no             -
The vfstab.log0 file contains three valid entries for file systems under the /log0 directory. Notice that the file systems for the logical host log0 use logical volume devices, which are part of the disk group data0 that is associated with the logical host.
The file systems in the vfstab files are mounted in order from top to bottom, so it is important to ensure that the file systems are listed in the correct order. File systems that are mounted underneath a particular file system should be listed below it. The actual file systems that are needed for a logical host depend on the needs of the data service, and will vary considerably from these examples.
During a failover, the SC2.2 software is responsible for ensuring that the disk groups and logical interfaces associated with a logical host follow it around the cluster from machine to machine. The highly available data service expects to have at least these resources available on a new system after a failover. In fact, many data services are not even aware that they are highly available, and must have these resources "appear" to be exactly the same after a failover.
The control methods are registered using hareg(1m). Once an HA service is registered, SC2.2 is responsible for calling the methods that were registered for the HA service at the appropriate times during a cluster reconfiguration or failover.
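The following sketch shows roughly how a custom HA service could be registered and activated. The service name, base directory, method file names, and time-outs are hypothetical, and the exact option syntax should be checked against the hareg(1m) man page.

# Register a hypothetical data service and its control methods
# (option syntax is a sketch -- verify against hareg(1m)).
hareg -r db2_log0 -b /opt/ha-db2/bin \
      -m START_NET=start_net,STOP_NET=stop_net,FM_START=fm_start,FM_STOP=fm_stop \
      -t START_NET=60,STOP_NET=60 \
      -h log0

# Activate the service so that SC2.2 starts calling its methods.
hareg -y db2_log0

# Deactivate and unregister the service.
hareg -n db2_log0
hareg -u db2_log0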
The following actions take place (in the given order) during a cluster reconfiguration (controlled failover). Actions preceding step 5c will not be taken if a machine crashes. (For more information about cluster reconfiguration, refer to the SC2.2 documentation.)
1. FM_STOP method is run.
2. STOP_NET method is run.
3. Logical interfaces for the logical host are brought offline.
   - ifconfig hme0:1 0.0.0.0 down
4. STOP method is run.
5. Disk groups and file systems are moved.
   a. Unmount logical host file systems.
   b. vxdg deport disk groups on one machine.
   -- Only the steps below are run if a machine crashes --
   c. vxdg import disk groups on the other machine.
   d. fsck logical host file systems.
   e. Mount logical host file systems.
6. START method is run.
7. Logical interfaces for the logical host are brought online.
   - ifconfig hme0:1 <ip address> up
8. START_NET method is run.
9. FM_INIT method is run.
10. FM_START method is run.
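A controlled failover of this kind can also be requested manually with the haswitch command. In the hypothetical example below, the logical host log0 is moved to the machine crackle (see haswitch(1m) for the exact syntax):

# Move the logical host log0, together with its disk groups,
# file systems, and logical interfaces, to the machine crackle.
haswitch crackle log0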
The control methods are run with the following command line arguments:
METHOD <logical hosts being hosted> <logical hosts not being hosted> <time-out>
The first argument is a comma delimited list of logical hosts that are currently being hosted, and the second is a comma delimited list of logical hosts that are not being hosted. The last argument is the time-out for the method, the amount of time that the method is allowed to run before the SC2.2 software aborts it.
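A minimal skeleton for a control method, assuming it is written in the Korn shell, might therefore look like the following; the script is a hypothetical START_NET method, and the per-logical-host actions are placeholders.

#!/bin/ksh
# Hypothetical START_NET method skeleton.
#   $1  comma-delimited list of logical hosts this machine is now hosting
#   $2  comma-delimited list of logical hosts this machine is not hosting
#   $3  time-out (in seconds) allowed for this method

HOSTED=$1
NOT_HOSTED=$2
TIMEOUT=$3

# Act only on the logical hosts that this machine is currently hosting.
for LHOST in `echo $HOSTED | tr ',' ' '`; do
        case $LHOST in
        log0|log1)
                : # placeholder: start the part of the data service
                  # that belongs to this logical host
                ;;
        esac
done

exit 0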
SC2.2 supports two volume managers: Sun StorEdge Volume Manager (Veritas) and Solstice Disk Suite. Although both work well, the StorEdge Volume Manager has some advantages in a clustered environment. In some cluster configurations, the controller number for a disk enclosure can be different for each machine in the cluster. If the controller number is different, the paths for the disk devices for the controller will also be different. Because Disk Suite works directly with the disk device paths, it will not work well in this situation. The StorEdge Volume Manager works with the disks themselves, regardless of the controller number, and is not affected if the controller numbers are different.
Since the goal of HA is to increase availability for a data service, it is important to ensure that all file systems and disk devices are mirrored, or in a RAID configuration. This will prevent failovers due to a failed disk, and increase the stability of the cluster.
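For example, with the StorEdge Volume Manager a mirrored volume can be created in a logical host's disk group along these lines; the disk group data0 is taken from the earlier examples, while the volume name and size are hypothetical.

# Create a 2 GB mirrored volume in the disk group data0 and
# build a UFS file system on it.
vxassist -g data0 make vol03 2g layout=mirror
newfs /dev/vx/rdsk/data0/vol03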
DB2 UDB EEE requires a shared file system when an instance is configured across multiple machines. A typical DB2 UDB EEE configuration has the home directory exported from one machine through NFS, and mounted on all of the machines participating in the EEE instance. For a mutual takeover configuration, DB2 UDB EEE depends on HA-NFS to provide a shared, highly available file system. One of the logical hosts exports a file system through HA-NFS, and each machine in the cluster then mounts the file system as the home directory of the EEE instance. For more information about HA-NFS, refer to the Sun Cluster documentation.
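As a sketch, the /etc/vfstab entry that each machine uses to mount the HA-NFS home directory might look like this; the exported path /log0/home is taken from the earlier dfstab example, while the mount point /db2home and the mount options are assumptions.

#device to mount  device to fsck  mount point  FS type  fsck pass  mount at boot  mount options
log0:/log0/home   -               /db2home     nfs      -          yes            rw,bg,hard,intr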
Two useful utilities that come with SC2.2 are cconsole and ctelnet. These utilities can be used to issue a single command to several machines in a cluster simultaneously. Editing a configuration file with these utilities ensures that it will remain identical on all of the machines in the cluster. These utilities can also be used to install software in exactly the same way on each machine. For more information about these utilities, refer to the Sun Cluster documentation.
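For example, assuming the cluster has been configured under the name "sc" (a hypothetical name), the utilities are typically started with the cluster name and open one window per machine plus a common input window; see cconsole(1m) and ctelnet(1m) for the exact arguments.

# Open a console window for each machine in the cluster, plus a
# common window whose keystrokes are sent to all of them.
cconsole sc

# The same idea, using telnet sessions instead of consoles.
ctelnet sc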
A cluster is called a campus cluster when its machines are not in the same building. A campus cluster is useful for removing the building itself as the single point of failure. For example, if the machines in the cluster are all in the same building, and it burns down, the entire cluster is affected. However, if the machines are in different buildings, and one of the buildings burns down, the cluster survives.
A continental cluster is a cluster whose machines are distributed among different cities. In this case, the goal is to remove the geographic region as the single point of failure. This type of cluster provides protection against catastrophic events, such as earthquakes and tidal waves.
Currently, a Sun Cluster can support machines as far apart as 10 km (about 6 miles). This makes campus clustering a viable option, but it requires high-speed connections between the two sites: two private interconnects, plus a number of fiber-optic cables for the shared disks. The cost of these connections may offset the benefits.
The SC2.2 software uses the Cluster Configuration Database, or CCD(4), to provide a single cluster-wide repository for the cluster configuration. The CCD has a private API and is stored under the /etc/opt/SUNWcluster/conf directory. In rare cases, the CCD can go out of synchrony, and may need to be repaired. The best way to repair the CCD in this situation is to restore it from a backup copy.
To back up the CCD, shut down the cluster software on all machines in the cluster, "tar" up the /etc/opt/SUNWcluster/conf directory, and store the tar file in a safe place. If the cluster software is not shut down when the backup is made, you may have trouble restoring the CCD. Ensure that the backup copy is kept up-to-date by taking a fresh backup any time that the cluster configuration is changed. To restore the CCD, shut down the cluster software on all machines in the cluster, move the conf directory to conf.old, and "untar" the backup copy. The cluster can then be started with the new CCD.
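As a sketch (the tar file name and location are arbitrary, and the cluster software must be stopped on all machines in both cases), the backup and restore steps look like this:

# Back up the CCD.
cd /etc/opt/SUNWcluster
tar cvf /var/tmp/ccd.conf.tar conf

# Restore the CCD from the backup copy.
cd /etc/opt/SUNWcluster
mv conf conf.old
tar xvf /var/tmp/ccd.conf.tar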