Administration Guide

Troubleshooting

The following table identifies problems that you might encounter, their probable causes, and actions that you can take to solve them.

Table 61. Troubleshooting High Availability on Sun Cluster 2.2
Symptom: Cannot mount logical host file system
Possible cause: The logical host file system is normally mounted and unmounted during the failover of a logical host. During failover, there should be no active processes or open files under the logical host file system. In rare cases, processes that cannot be killed have their current working directory under the logical host file system. To find out whether a process is under the mount point, use fuser(1M), or a GNU utility called lsof. Error messages are produced when the logical host file system cannot be mounted.a
Action: Reboot the system, or move the logical host file system to another name and recreate it. Doing this allows the frozen process to stay under the directory (since it cannot be killed), and allows the mount to take place.b
Symptom: The db2start or db2stop time-out does not work
Possible cause: A SIGALRM signal may not break out of a blocking system call. Instead, the system call will restart as if the SA_RESTART flag were set with sigaction(). This causes time-outs for the DB2 HA agents to be ignored, and the agent method will hang instead of recovering from a hung db2start or db2stop command.
Action: Apply the required patch, 105210-17 (or later), for Solaris 2.6.
Symptom: Logging into an instance hangs
Possible cause: Although there are numerous reasons why this can happen, the most common reasons include NFS problems and the /usr/sbin/quota program.
Action: Check the NFS mounts to ensure that they are healthy, and look for quota processes owned by the instance owner. At the discretion of the system administrator, changing the quota program to a symbolic link to /bin/true may solve the problem. This is not a recommended solution, but it may work.
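Both checks in this entry can be run from a shell; a minimal sketch (the commands are generic diagnostics, not output from this guide):

```shell
# Look for quota processes; the bracket trick keeps grep from
# matching its own command line. Check the owner column for the
# instance owner.
ps -ef | grep '[q]uota'

# A hung NFS mount typically stalls df; if this command hangs,
# inspect the NFS mounts before anything else.
df -k
```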
Symptom: I just set up an EEE instance, but it does not start
Possible cause: The hadb2_setup command does not add ports to the /etc/services file; it is expected that the administrator will add them manually. An error message is returned.c
Action: Ensure that you have appropriate ports named in the /etc/services file.
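A partitioned (EEE) instance needs a block of DB2_<instance> port entries for inter-partition communication. A sketch of what the entries might look like, assuming an instance named db2eee and an unused port range starting at 60000 (the names and numbers are examples, not values from this guide):

```
DB2_db2eee      60000/tcp
DB2_db2eee_1    60001/tcp
DB2_db2eee_2    60002/tcp
DB2_db2eee_END  60003/tcp
```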
Symptom: START_NET method cannot start DB2
Action: Turn off fault monitoring to ensure that the instance does not get failed over. Log in as the instance owner, and try to start DB2 manually. Then check the following:
  1. Ensure that the hadb2tab configuration file has the correct instance type specified. For example, having a db2nodes.cfg file for an EE administrative instance will cause problems, and the HA agent methods will not be able to recover from this.
  2. Ensure that the .rhosts file exists, and has valid entries in it.
  3. Ensure that the HA-NFS file system is shared with root permissions for all machines in the cluster.
  4. Check the kernel parameters, and ensure that they are correct.
  5. Ensure that the /etc/services file contains entries for the instance.
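Step 5 can be verified with a quick grep; a minimal sketch, assuming an instance named db2eee (a sample file stands in for /etc/services so the commands are self-contained):

```shell
# Check a services file for the DB2 port entries of a hypothetical
# instance named db2eee. A sample file stands in for /etc/services;
# on the cluster, grep /etc/services directly.
cat > /tmp/services.sample <<'EOF'
DB2_db2eee      60000/tcp
DB2_db2eee_END  60003/tcp
EOF

if grep -q '^DB2_db2eee' /tmp/services.sample; then
    echo "ports present"
else
    echo "ports missing"
fi
```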
Symptom: The instance only works on one machine
Possible causes:
  • The numeric uid for the instance may not be the same on each machine in the cluster.
  • The kernel parameters may not be valid on each machine in the cluster.
  • The hadb2tab file may not be the same on each machine in the cluster.
  • Other configuration files, such as the logical host vfstab file, may not be the same on each machine in the cluster.

Action: If none of these causes appears to apply, try logging in as the instance owner, and start DB2 manually. For EE instances, this should work if the logical host that is hosting the instance is being hosted by the current machine. For EEE instances, this should work from any machine in the cluster that can host the database partitions.
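The first bullet, the uid check, can be sketched as follows; the rsh commands in the comments are what you would actually run on the cluster, and the local lookups of root merely make the sketch self-contained:

```shell
# Compare the numeric uid of the instance owner as seen by each machine.
# On the cluster this would be, for example:
#   rsh machine1 id -u db2eee
#   rsh machine2 id -u db2eee
# (db2eee and the host names are assumptions.)
uid_a=$(id -u root)   # stand-in for the lookup on machine1
uid_b=$(id -u root)   # stand-in for the lookup on machine2

if [ "$uid_a" = "$uid_b" ]; then
    echo "uids match"
else
    echo "uid mismatch: $uid_a vs $uid_b"
fi
```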
Symptom: su - <instance> -c "db2start" does not work
Possible causes:
  • The .profile for the instance may not be su-"friendly".
  • There is a known problem with the Bourne shell (/bin/sh), in which the su command works manually, but not through the HA agent.

Actions:
  • As root, try running this command manually, and ensure that it works before trying again through the HA agent.
  • Switch to the Korn shell (/bin/ksh), if necessary.

Symptom: My EEE instance cannot start, but the home directory is mounted
Possible cause: The HA-NFS directory may not have been exported with "root" permissions to the machines in the cluster. Both DB2 and the HA agents require this to run properly.
Action: To test this, try to create a file (as root) under the instance owner's home directory.
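The suggested test can be scripted; a minimal sketch, where /tmp/db2eee_home stands in for the instance owner's real HA-NFS home directory:

```shell
# As root, try to create a file under the instance owner's home
# directory; a failure suggests the export lacks root access.
# /tmp/db2eee_home is a stand-in path so the sketch is self-contained;
# on the cluster, point HOMEDIR at the real home directory.
HOMEDIR=/tmp/db2eee_home
mkdir -p "$HOMEDIR"

if touch "$HOMEDIR/.root_write_test" 2>/dev/null; then
    echo "root write OK"
    rm -f "$HOMEDIR/.root_write_test"
else
    echo "no root access: re-share the file system with root permissions"
fi
```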
Symptom: Trying to access the EEE instance directory returns a "Stale NFS file handle" error
Possible cause: There may still be processes under the instance owner's home directory.
Action: Unmount the instance owner's home directory, and allow the HA agent to remount it. The HA agent will remount it if the hadb2 service is turned off and on again (see a description of the -s switch on the hadb2_setup command in The hadb2_setup Command).
Symptom: Control methods do not run successfully through SC2.2
Possible cause: The hadb2 service may not be registered with the Sun Cluster software, or it may not be turned on.
Action: If the control methods appear to run normally from the command line, check the SYSLOG files for error messages that may help to explain the problem. Ensure that the hadb2 service is registered with the Sun Cluster software, and that it is turned on.

Running the methods manually is useful for debugging a problem.d

The methods should be run as root and given the appropriate command line arguments. If the list of logical hosts is nil, the argument should be given as "". The double quotation marks without a blank space separator denote a blank argument. For example:

 hadb2_startnet log0,log1 "" 600

The first argument, log0,log1, tells the hadb2_startnet method that logical hosts log0 and log1 are being hosted by the current machine. The second argument is nil, which tells the hadb2_startnet method that there are no other logical hosts being hosted on other machines in the cluster (all of them are on the current machine). The third argument tells the method that SC2.2 will time out after 600 seconds.

Symptom: User scripts do not run
Possible cause: The user scripts can only be run if they exist in the appropriate directories and are executable.
Action: Check file ownership and attributes. If a script still fails to run, contact IBM service. Forward a directory listing of the script that does not run, and SYSLOG output for a failover or a cluster reconfiguration that should have run the script.
Symptom: Information is not being logged to the file specified in /etc/syslog.conf
Action: Use touch(1) to create the file that is specified in the /etc/syslog.conf file, and then restart the SYSLOG daemon.
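A sketch of the two steps, using a stand-in path (/tmp/hadb2.log) for the file actually named in /etc/syslog.conf:

```shell
# Create the log file named in /etc/syslog.conf, then have syslogd
# reread its configuration. /tmp/hadb2.log is a stand-in so the sketch
# is self-contained; on the cluster, touch the real file name.
LOGFILE=/tmp/hadb2.log
touch "$LOGFILE"
[ -f "$LOGFILE" ] && echo "log file created"

# On the cluster, as root (Solaris records syslogd's pid in
# /etc/syslog.pid):
#   kill -HUP "$(cat /etc/syslog.pid)"
```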

a Error messages that are produced when the logical host file system cannot be mounted may look something like the following:

Aug 17 11:14:01 rash ID[SUNWcluster.loghost.1170]: importing data1
Aug 17 11:14:06 rash ID[SUNWcluster.scnfs.3040]: mount -F ufs -o ""
     /dev/vx/dsk/data1/data1-stat /log1 failed.
Aug 17 11:14:07 rash ID[SUNWcluster.ccd.ccdd.5304]: error freeze cmd =
     /opt/SUNWcluster/bin/loghost_sync 
CCDSYNC_POST_ADDU LOGHOST_CM:log1:rash /etc/opt/SUNWcluster/conf/ccd.database
     2 "0 1" 1 error code = 1

b For example:

   scadmin@rash(218)# ps -fe | egrep db2
   db2ee 1984 1 0 0:01 <defunct>
 
   Solution:
 
      scadmin@rash(229)# cd /
      scadmin@rash(230)# mv /log1 /log1.bkp
      scadmin@rash(231)# mkdir /log1

c The error message may look something like the following:

   SQL6030N START or STOP DATABASE MANAGER failed. Reason code "13".

d For example, if the hadb2_startnet method cannot find libdb2.so.1, but it runs normally through the Sun Cluster software, no errors will be reported. Running the method manually results in the following:

   scadmin@crackle(213)# hadb2_startnet 'log0,log1' '' 600
   ld.so.1: hadb2_startnet: fatal: libdb2.so.1: open failed:
      No such file or directory
   Killed
