Troubleshooting Guide
This section describes how to deal with some problems you
could face when working with DB2 Enterprise - Extended Edition. It
addresses the following topics:
Important: This section represents a small sampling of the
information available from DB2 Customer Support. For a complete and
up-to-date source of DB2 information, see the DB2 Product and
Service Technical Library at
http://www.ibm.com/software/data/db2/library/.
- Symptom
- The DB2 Universal Database Enterprise - Extended Edition product will not
install.
- Action
- Ensure that:
- [ ]
- The DB2 product directory on each node has enough free disk space.
- [ ]
- All prerequisite software products are installed on the system.
- [ ]
- The same version of the product is not already installed on the
system.
To find out which node failed, check the db2diag.log on the
coordinating node for the application. If a node returns an error or
warning, the node number is identified in the SQLERRD(6) field of the
SQLCA. This number corresponds to the node number in the
db2nodes.cfg file. (If an SQL statement or API call is
successful, the node number in the SQLERRD(6) field is that of the coordinator
agent.)
- Symptom
-
The db2start returns with an error message having an identifier of
"N". For example, the error message SQL6048N is returned.
- Action
- Ensure that:
- [ ]
- Your user ID has SYSADM, SYSCTRL, or SYSMAINT authority. See the
Administration Guide: Implementation for details on these authorities.
- [ ]
- You have the same user ID, group identifier, and password set for the
instance owners on all nodes.
- [ ]
- You can use the rsh command (on UNIX platforms) from the DB2
instance ID to all nodes defined in the db2nodes.cfg file. To do
this, add an entry for the node in the $HOME/.rhosts file or the
hosts.equiv file. Ensure that the file permissions on these
files allow read access for all users.
Once these tasks are completed, you will be able to use UNIX commands such
as rsh date or the equivalent db2_all
'date' commands.
- [ ]
- The db2nodes.cfg file, located in the sqllib
subdirectory under the $HOME directory, contains the correct
information for all nodes defined in the system. Ensure that the host
names and net names are valid.
- [ ]
- The value for the DB2INSTANCE environment variable is the same for every
node and it matches the instance name that you are trying to start.
- [ ]
- You have the appropriate permissions. Check the permissions of the
tmp subdirectory under the sqllib subdirectory and, if
required, update them so that the instance ID has write permission for the
directory. If the permissions are not correct, a problem may have
occurred during installation and setup of the system.
- Symptom
-
The db2start returns with a critical system error message having an
identifier of "C". For example, the error messages SQL0902C or
SQL1042C are returned.
- Action
- Ensure that:
- [ ]
- The product was successfully installed.
- [ ]
- FCM communication is properly enabled. You must create an entry for
each DB2 logical port in the /etc/services file. For details on FCM,
see the DB2 Enterprise - Extended Edition Quick Beginnings manual
for your operating system.
- Symptom
-
The db2start returns with no response.
- Action
- Ensure that:
- [ ]
- If you are using the high-performance switch (HPS) on RS/6000 SP systems,
ensure that it is up on all nodes.
- [ ]
- Check the contents of the db2diag.log file for possible
reasons for the db2start failure.
- [ ]
- Ensure that the $HOME directory of the instance owner is
NFS-mounted across all nodes, and that NFS is running. If the
lockd and statd daemons are not running, db2start may
hang. To start the daemons, execute rc.nfs, and
ensure that this command is in the inittab subdirectory under the
etc subdirectory.
- Symptom
- Some partitions do not respond or hang.
- Action
- Ensure that:
- [ ]
- Attempt to isolate the problem by answering the following questions:
- Do all commands hang on all partitions?
- Do all commands hang on a particular partition?
If you respond "yes" to either of these questions, you should check
the hardware in your environment including the switch.
Then consider:
- Do all db2 commands hang on all partitions?
- Do all db2 commands hang on a particular partition?
There may be a process that you suspect is causing the problem. Use
the process identifier of the suspect process with the
db2_call_stack tool.
Use the db2_call_stack several minutes apart and
application snapshots to see if the application is making any progress.
See Trap Files for more information on
db2_call_stack.
- Symptom
- Some nodes cannot be started, or run extremely slowly.
- Action
- Ensure that:
- [ ]
- You installed the DB2 Enterprise - Extended Edition product on all
nodes.
Although you can install the product on an NFS-mounted file system in a
RISC/6000 cluster, performance will be compromised. You should install
the product on each node.
- [ ]
- All nodes are installed at the same level of DB2 Enterprise - Extended
Edition.
- [ ]
- NFS is running and there are enough NFS daemon (nfsd) processes assigned
on the machine where the home directory of the instance owner resides.
The number of NFS daemon (nfsd) processes should be the number of block
I/O daemon (biod) processes on a machine multiplied by the number of
machines in the instance. For example, if you had a four machine system
with ten biods, then there would be forty NFS daemon processes on the instance
home directory.
- Symptom
- A database cannot be created.
- Action
-
Determine which node is experiencing the problem. The failing node
number is stored in the SQLERRD(6) field of the SQLCA returned with the
CREATE DATABASE command. Once you have determined the
failing node, ensure that:
- [ ]
- Permission on the database directory path is correct.
The instance must have permission to write into the database directory
path.
- [ ]
- You are creating the database on a path that exists on all nodes.
- [ ]
- The file systems are mounted.
The file systems for the database may have been unmounted. Mount all
file systems again, and then try to create the database. You should
check for unmounted file systems on all nodes.
- [ ]
- There is enough disk space.
You may have created a small test database and decided that you needed very
little disk space. However, when a database is created, a certain
amount of disk space is required for the default table spaces and default
logs. See the Administration Guide:
Planning for details.
- Symptom
- DB2 or operating system commands are not recognized.
- Action
- Ensure that:
- [ ]
- Your DB2 Enterprise - Extended Edition system is properly
installed.
- [ ]
- You have sufficient authority to run the commands.
- [ ]
- If the home file system of the instance owner ID is mounted over Ethernet
or a high performance switch (on a RS/6000 SP machine), Ethernet or the switch
are running.
To determine if this is the problem, log onto the system as root
and try to access a file in the instance owner's home file system, or use
the netstat command to determine the status of the high performance
switch.
- Symptom
- Problems have been encountered while attempting to backup data on all of
the database partitions.
- Action
- If you want to use the db2_all command to back up a list of
database partitions, you must first back up the catalog node for the
database. After taking this back up, you can back up the remaining
database partitions. For example, after the catalog node is backed up,
you can issue the following db2_all command to back up the
remaining database partitions:
db2_all '<<-n< db2 backup db <database-alias>'
Where n is the node number of the catalog node.
- Symptom
- Partitions could disappear and data be lost when creating too many
partitions in Windows 2000 when using Dynamic Disk mode.
- Action
- Use Basic Disk mode instead. (Basic Disk mode is what Windows NT
uses.)
- Symptom
- When creating database partitions in Basic Disk mode, the partition size
is rounded (either up or down) to the nearest multiple of cylinders (minus a
single track in some situations). For example, if a cylinder has
approximately 7 MB then the partition size will differ +/-
3.5 MB from the requested size.
- Action
- You should create database partitions that are 3.5 MB larger than
what is specified for the table space.
For more information on using the LOAD utility or on using the AutoLoader
to split data and load it on a node, see the Data Movement Utilities Guide and Reference.
For more information on the db2atld command, type db2atld
-h from the misc subdirectory under the sqllib
subdirectory.
- Symptom
- LOAD does not load data
- Action
- Ensure that:
- [ ]
- LOAD was run by a user with the correct authority.
Users in the same group as the instance owner who have SYSADM or DBADM
authority can execute LOAD.
- [ ]
- The table or table space being loaded is not already being used by another
application.
LOAD cannot share tables, so it will not execute until it can lock all
required tables. Determine why the tables are already locked, and
ensure that the locks are released.
- [ ]
- LOAD was started on all nodes.
You must execute LOAD on all nodes for data to be loaded in
parallel. If you have a script that sends a shell to all nodes, ensure
that the LOAD execution is not serialized by the remote shell.
To check if LOAD failed on a node, look at the message files:
- If you issue the LOAD command directly, you can specify the name and
location of a message file using its MESSAGES option.
- If you use the db2atld command, message files are located in
the current working directory. There is a file for each node, and each
file is named load_log.nnn, where nnn is the node
number as specified in the db2nodes.cfg file.
- [ ]
-
The format of the data is correct.
See the Command Reference for more information on using correct data formats. Note that:
- Data cannot contain non-delimited blanks. LOAD processes
non-delimited blanks in a delimited ASCII (DEL) file as null, and cannot load
this data into a non-nullable column. You must use a string delimiter
for blank columns.
For non-delimited data, use the null indicator to indicate columns with
null values.
- Data must contain decimal points in the correct position.
For data being loaded from MVS systems, you must explicitly put the decimal
points in the data before loading it. Otherwise, they are assumed to be
at the end of the column.
- Symptom
- LOAD rejected all rows.
- Possible Cause
- Column definitions are not correct.
- Action
- When using METHOD L, ensure that the data column specification is
correct. Shifting columns can cause truncation errors or data that does
not match the table column definition.
- Symptom
- LOAD completed but no rows are loaded.
- Possible Cause
- LOAD rejected all rows.
- Action
- Check the db2load file in the temporary directory where you ran
db2atld to see if all the rows were rejected. If some rows
were loaded, see the Data Movement Utilities Guide and Reference.
- Symptom
- A load operation fails to complete.
- Possible Cause
- The load utility does not start because of a user error, such as a
nonexistent data file or invalid column names.
- Action
- Restart the load operation from the last consistency point (using the
RESTART option), or reload the entire table (using the REPLACE option).
Specify the same parameters as in the previous invocation so that the utility
can find the necessary temporary files. For additional information on
LOAD, see the Data Movement Utilities Guide and Reference.
- Symptom
- The db2atld program completed successfully, but the data is not
split.
- Possible Cause
- The db2atld program was set only to analyze the data.
- Action
- Check the AutoLoader configuration file and ensure that the
Mode parameter is not set to analyze data. This option does
not split data. It only analyzes the data and suggests a new
partitioning map. See the Data Movement Utilities Guide and Reference for details.
- Symptom
- The db2atld program completed successfully, but the data is not
split correctly.
- Action
- Ensure that:
- [ ]
- Binary data is not in a character column.
The db2atld program cannot detect binary values in a column of
any data type, unless the BINARYNUMERICS or PACKEDDECIMAL options are
specified on the load command.
- [ ]
- The SplitNodes and OutputNodes parameters are set
correctly.
Otherwise, output data files may not be correct. See the Data Movement Utilities Guide and Reference for details.
- Symptom
- The data that is loaded does not match the select count.
- Action
- Ensure that:
- [ ]
- The data is split correctly by your conversion program.
The db2atld program creates a header file to prevent data from
being loaded to the wrong node. The header information is checked by
LOAD.
If you use a conversion program to convert binary columns to the character
format, the conversion may not be the same as that done by the
db2atld program. Data will be split for the correct node,
but during the conversion data may be converted to a different value and may
not be hashed the same way.
- [ ]
- The partitioning columns match the header.
If the data is split using one set of partitioning keys, and is loaded into
a table that was partitioned on other columns, the LOAD operation will
fail. Partition column information is in the header of the split data
file. If you manually alter the header, this compromises the validation
measure built into the AutoLoader.
- Symptom
- The CREATE INDEX statement fails or hangs.
- Possible Cause
- The table space is full, or there is insufficient log space.
- Action
- Ensure that there is sufficient disk space for the index pages and for
sorting; and that there is sufficient log space.
For information on calculating disk size and on determining log size, see
the Administration Guide: Planning.
- Symptom
-
A connection appears suspended.
- Possible Cause
- The database was restarted and is still recovering.
When crash recovery is in progress, it can take a long time to recover the
database from a failure. This occurs when operations that require a
large volume of logging activities were running before the database
crashed. This is a normal situation.
- Action
- Check the db2diag.log file on all database partitions to see if
crash recovery is completed. Entries in the db2diag.log will
indicate when recovery is started, and when it is finished.
Check the log file directory on all database partitions for I/O
activity. If there is I/O activity, then crash recovery is
proceeding.
- Symptom
- The RESTART DATABASE command returns SQL1034C, SQL1042C, or
SQL1072C.
- Possible Cause
- InterProcess Communications (IPC) resources from the previous activity are
preventing crash recovery from succeeding.
- Action
- Bring the database instance down using a db2stop. Check
for DB2 processes on all partitions by issuing:
ps -ef | grep <instance name>
Check for IPC resources on all partitions by issuing:
ipcs | grep <instance name>
If either processes or IPC resources exist after the db2stop,
issue a db2_kill to delete them. After issuing the
db2_kill, check for processes and IPC resources again. If
they still exist, they will have to be removed individually using the
kill command for processes, and the ipcrm command for
IPC resources. The ipclean command can also be used before
running ipcrm on each resource.
Once all processes and IPC resources are cleared, log onto the catalog node
and issue:
db2 restart database <database-alias>
Next, issue a restart to the rest of the partitions or nodes:
db2_all "<<-n< db2 restart database <database-alias>"
Where n is the node number of the catalog node.
If the SQL1034C, SQL1042C, or SQL1072C error messages persist, contact DB2
Customer Support. You will be asked for the following
information:
- The db2diag.log from all database partitions
- The database partition(s) where RESTART DATABASE is failing
- A trace of the RESTART from the database partition(s) that are failing
- The database configuration from all nodes
- The database manager configuration from all nodes
- The db2nodes.cfg file
- The SQLOGCTL.LFH file from the failing database partition(s)
- The active log files from the failing database partition(s).
- Symptom
- The SQL1061W message is received during recovery.
- Possible Cause
- There are indoubt transactions that cannot complete, typically because one
or more nodes cannot be started. The database is recovered and open for
user connections, but the indoubt transactions are taking up memory and other
resources.
- Action
- Check the db2diag.log file of the coordinator node to see if all
nodes are started.
Resolve the indoubt transactions as soon as possible. For
information, see the discussion on heuristics in the Administration Guide: Planning.
In a partitioned database environment, DB2 breaks up SQL statements into
subsections, each of which is processed on the node that contains the relevant
data. As a result, an error may occur on a node that does not have
access to the application.
When developing applications for multiple nodes, consider the
following:
- Errors are reported in the SQLCA.
If a node returns an error or warning, a node number that maps to the
db2nodes.cfg file is placed in the SQLERRD(6) field of the
SQLCA. If an SQL statement or API call is successful, the node number
in the SQLERRD(6) field is that of the coordinator agent.
-
Indoubt transactions may be left in the database if an application ends
abnormally during 2-phase commit. An indoubt transaction pertains to
global transactions when some phases complete successfully, but the system
fails before all phases can complete. The database is left in an
inconsistent state. For information on handling indoubt transactions,
see the Administration Guide: Planning.
- An application may receive a subsequent error or warning after the problem
that caused the first error or warning is corrected. Only one severe
error is reported to the SQLCA at a time and the error is not
overwritten. Severe errors and deadlock errors are given higher
priority because they require immediate action by the coordinator
agent.
- Symptom
-
In a partitioned database environment, SQL statements from an application are
processed on different partitions. An application is not complete until
all of the SQL statements on all of the partitions are complete. If one
or more partitions does not complete, the application does not
complete.
- Action
- Ensure that:
- [ ]
- You should document the particular application(s) that appear to be hung
or slow. Determining which application and partition is causing the
delay can be done as follows:
- Check the application status on all partitions since the applications
could be in a "lock wait". This can be done by using the LIST
APPLICATIONS command to check the application status on all partitions.
- Check the output from the GET SNAPSHOT command for rows processed or
logged.
- Check the configuration parameter settings that affect the memory being
used since this could lead to performance problems. The Administration Guide: Performance should also be of assistance.
Please see Command Reference for more information on the LIST APPLICATIONS and GET SNAPSHOT
commands.
There are two distinct types of severe error:
- A DB2 process is killed because of a programming exception.
In this case, the database manager is immediately terminated at the node
and active units of work are not rolled back. When other nodes detect
the failure, depending on their relationship with the failing node they will
attempt to recover as follows:
- If the failing database partition was the catalog node of the database,
the entire database is brought down at all other nodes because all agents are
forced off the node.
In this situation, you must restart all the database partitions with
db2start and then go through the crash recovery procedure by running
RESTART DATABASE for all databases on that partition. Crash
recovery may be necessary on some of the subordinate nodes as well. See
the Command Reference for more information.
- If the failing database partition was the coordinator node for the
application, all subagents running at the node on behalf of the coordinator
node are forced off the database partition and the active unit of work is
rolled back.
In this situation, you must restart the coordinator node with db2start and
then go through the crash recovery procedure by running RESTART
DATABASE for all databases on that partition. See the Command Reference for more information.
- If the failing node was a subordinate node, any coordinator agent which
has an active unit of work that includes the failed node rolls back the unit
of work if it was not already in the COMMIT phase. If it is in the
COMMIT phase, the SQL code -279 is returned to indicate that the transaction
is in doubt.
- A problem caused the database to be marked as inconsistent.
Any attempt to access the database returns SQL code -1034 (SQL state 58031)
or SQL code -1015 (SQL state 55032).
In this case, all coordinator agents and subagents are forced off the
database. These agents then roll back the current unit of work and
disconnect from the database.
Before any application on any other node can access the database at this
node, you must put the database in a consistent state by running RESTART
DATABASE against it.
The severe error SQL code -1224 (SQL state 55032) can occur for a variety of
reasons. If you receive this message, check the SQLCA to determine
which node failed, and then the db2diag.log file for additional
details. For information, see Determining a Failing Node.
Note: In the event of a severe error where multiple
machines were involved, you may find diagnostic information in their syslog
files rather than in the db2diag.log file, if they were not able to
obtain an NFS lock on the db2diag.log file.
If you know the failing partition or node, then you can use the
db2start nodenum command. For example, if "3" is the
failing node, then:
db2start nodenum 3
If the db2start fails for the individual partition or node, then
you will need to run a db2stop and then a db2start on
all partitions. If no nodenum is specified, all partitions
defined in the node configuration file are started.
Restart the database on those partitions where it is inconsistent.
To see which partitions are consistent, look at the
database_consistent informational database parameter on each
partition. For more information on using the RESTART DATABASE command,
see the Command Reference manual.
[ Top of Page | Previous Page | Next Page ]