MustGather information for high CPU problems

The documentation required to diagnose high CPU problems with web server processes includes

The ServerDoc tool provided with ihsdiag automates much of the work of gathering this information. The user runs ServerDoc and provides the IHS installation directory and other information; ServerDoc creates a new directory to hold the required documentation, and stores information in that new directory.

Once the ServerDoc tool has completed, the user should copy any remaining log files and configuration files used by the web server and the plug-in into the new directory, and send in the directory to IBM support.

Known issues to check for first

IHS 2.0 or above on HP-UX

The most common cause of runaway CPU usage on HP-UX is an issue with the HP-UX sendfile() system call. The easiest work-around is to use EnableSendfile Off in the web server configuration file. The directive may already be there, but commented out. If sendfile use is not disabled, please make the configuration change, restart the web server, and see if the problem occurs again.

HP reports that this sendfile() issue is resolved, which would allow sendfile to be used without the related high CPU problem occurring. A relevant fix is now available for the problem of sendfile() returning zero. The relevant HP-UX patch for HP-UX11iV1 (11.11)) is PHNE_33159 and for HP-UX 11iV2U2 (11.23) it is in PHKL_31500 (which is included in the Update 2 patch bundle so no need to seperately apply there).

any level of IHS, with original WebSphere 5.0 plug-in, and ESI cache enabled

This level of the WebSphere plug-in had a problem which would result in a CPU loop in the ESI cache support. It is resolved by disabling the ESI cache or upgrading to a more recent plug-in.

IHS 2.0 or above on HP-UX or Solaris with WebSphere plug-in 5.1 or later

The WebSphere plug-in without the fix for APAR PK39645 may misinterpret some system error codes and enter a busy loop.

Attaching with tusc (HP-UX) or truss (Solaris) may show a looping system call such as write returning an error repeatedly.

Any platform with IHS 6.0.2.x and WebSphere Plugin 6.0.2.7

WebSphere Plugin version 6.0.2.7 (Bld date: Jan 30 2006, 17:02:44) has a potential busy loop when URL rewriting is used for session IDs. This is resolved with PK19649 in Plugin 6.0.2.9 and later.

Known configuration issues

Identifying high CPU processes

Special concerns for threaded processes on older Linux distributions
Customers may be using single-threaded IHS 1.3 on older Linux distributions such as RedHat 7.x. If third-party modules are loaded which create threads in the IHS processes, this high CPU tool may not be able to analyze those processes due to limitations of the system tools gdb and strace. In addition to this high CPU tool, also run the hang documentation tool, as it may be able to retrieve additional information. Submit both the high CPU doc and hang doc to IHS support.

Operating systems provide special programs for displaying the CPU usage of processes running on the system. These programs include ps, top, topas, nmon, and others. When a high CPU problem is suspected, one of these programs should be used to determine which processes are actively consuming high CPU. If a web server process is the highest consumer of CPU, documentation about the suspected problem should be submitted to IBM HTTP Server support for analysis.

Because ps is universally available, the example below will use ps to show how to identify the high CPU process.

ps invocations which show percent CPU

Platform ps invocation
AIX
ps -A -o pid,ppid,pcpu,time,args
Linux
ps -eo pid,ppid,pcpu,time,fname,cmd
Solaris
ps -A -o pid,ppid,pcpu,time,comm
HP-UX
UNIX95=1 ps -A -o pid,ppid,pcpu,time,args

example of using ps to find high CPU processes

Substitute the appropriate ps command for your platform.

# ps -A -o pid,ppid,pcpu,time,args
  PID  PPID  %CPU        TIME COMMAND
    0     0   0.0    00:07:56 swapper
    1     0   0.0    00:01:18 /etc/init
  516     0  96.4 89-10:46:02 wait
  774     0   0.0    00:00:09 reaper
 1032     0   0.0    00:00:15 xmgc
 1290     0   0.0    00:00:27 netm
 1548     0   0.0    00:24:25 gil
 1806     0   0.0    00:00:00 wlmsched
 2850     0   0.0    00:00:00 lvmbb
 3170  8258   0.0    00:00:00 /usr/sbin/sshd -D
 3410     1   0.9    19:02:01 /usr/sbin/syncd 60
 3650     1   0.0    00:00:00 /usr/dt/bin/dtlogin -daemon
 3964     1   0.0    00:00:00 /usr/lib/errdemon
 4148  3650   0.0    00:00:00 dtlogin <:0>        -daemon
 4420  3650   0.0    00:07:31 /usr/lpp/X11/bin/X -D /usr/lib/X11//rgb  -T -force :0 -auth /var/dt/A:0-svd
 5160     1   0.0    00:00:00 /usr/ccs/bin/shlap
 5422     1   0.0    00:00:00 /usr/sbin/srcmstr
 5962  5422   0.0    00:05:03 /opt/freeware/sbin/prngd -f /dev/egd-pool -m 666
 6210  5962          00:00:00 <defunct>
 6456  5962          00:00:00 <defunct>
 6714  5962          00:00:00 <defunct>
 7004     1   0.0    00:00:00 /usr/sbin/biod 8
 7226  5422   0.0    00:00:03 /usr/sbin/syslogd
 7482  5422   0.0    00:00:32 sendmail: accepting connections
 7740  5422   0.0    00:00:22 /usr/sbin/portmap
 7998  5422   0.0    00:00:00 /usr/sbin/inetd
 8258  5422   0.0    00:03:50 /usr/sbin/sshd -D
 8516     0   0.0    00:00:01 rtcmd
 8772     0   0.0    00:00:00 kbiod
 9064     1   0.0    00:00:00 /usr/sbin/uprintfd
 9300     1   0.0    00:00:00 /usr/sbin/rpc.statd
 9564     1   0.0    00:00:00 /usr/sbin/rpc.lockd
 9820     1   0.0    00:00:20 /usr/sbin/cron
10078  5422   0.0    00:00:01 /usr/sbin/qdaemon
10320     1   0.0    00:06:52 rpc.lockd
10578     0   0.0    00:00:00 cash
10836     1   0.0    00:00:26 aump
11094  5422   0.0    00:00:14 /usr/sbin/automountd
11384  5422   0.0    00:00:00 /usr/sbin/writesrv
12400     1   0.0    00:00:00 /usr/IMNSearch/httpdlite/httpdlite -r /etc/IMNSearch/httpdlite/httpdlite.c
13434  5422   0.0    00:00:08 /usr/sbin/rsct/bin/IBM.AuditRMd
13692     1   0.0    00:00:00 /usr/sbin/getty /dev/console
13946  5422   0.0    00:00:01 /bin/sh /etc/gsa_updmn
14450     1   0.0    00:00:00 /usr/sbin/getty /dev/lft0
14988  5422   0.0    00:00:20 /usr/sbin/rsct/bin/rmcd -r
15224  5422   0.0    00:22:29 /usr/opt/ifor/bin/i4llmd -b -n wcclwts -l /var/ifor/llmlg
15484 15224   0.0    00:00:00 /usr/opt/ifor/bin/i4llmd -b -n wcclwts -l /var/ifor/llmlg
15968 13946   0.0    00:00:00 sleep 14400
16278  7998   0.0    00:00:00 rpc.ttdbserver 100083 1
16888  5422   0.0    00:00:08 /usr/sbin/rsct/bin/IBM.ERrmd
17050  5422   0.0    00:00:08 /usr/sbin/rsct/bin/IBM.ServiceRMd
19404     1   0.0    00:00:00 frca
20298 33448   0.0    00:00:00 /usr/IBMIHS-1.3.28/bin/httpd -d /usr/IBMIHS-1.3.28
21772  4148   0.2    02:23:00 dtgreet
23438     1   0.0    00:00:00 /usr/lpp/diagnostics/bin/diagd
23920  5422   0.0    00:00:08 /usr/sbin/rsct/bin/IBM.CSMAgentRMd
24432  5422   0.0    00:00:05 /usr/sbin/rsct/bin/ctcasd
25988 45448   0.0    00:00:00 emacs ph/ihsdiag/ServerDoc/Platform.java
28506 25988   0.0    00:00:00 /opt/freeware/lib/emacs/20.7/powerpc-ibm-aix4.3.3.0/emacsserver
31872 33448   0.0    00:00:00 /usr/IBMIHS-1.3.28/bin/httpd -d /usr/IBMIHS-1.3.28
32442 33448   0.0    00:00:00 /usr/IBMIHS-1.3.28/bin/httpd -d /usr/IBMIHS-1.3.28
32662  3170   0.0    00:00:00 /usr/sbin/sshd -D
33448     1   0.0    00:00:00 /usr/IBMIHS-1.3.28/bin/httpd -d /usr/IBMIHS-1.3.28
35482 33448  40.0    00:09:12 /usr/IBMIHS-1.3.28/bin/httpd -d /usr/IBMIHS-1.3.28
37564 38136   0.0    00:00:00 -bash
37854 33448   0.0    00:00:00 /usr/IBMIHS-1.3.28/bin/httpd -d /usr/IBMIHS-1.3.28
38136 42606   0.0    00:00:00 /usr/sbin/sshd -D
39222 37564   0.0    00:00:00 ps -A -o pid,ppid,pcpu,time,args
42606  8258   0.0    00:00:00 /usr/sbin/sshd -D
45270 33448   0.0    00:00:00 /usr/IBMIHS-1.3.28/bin/httpd -d /usr/IBMIHS-1.3.28
45448 32662   0.0    00:00:00 -bash

The two processes highlighted are the only ones with a high current CPU utilization (third column in the ps output). This example is from AIX, where wait is a special process which represents all idle CPU on the system. So we have to ignore the wait process. This leaves process 35482, which is an IBM HTTP Server process.

Some confirmation that 35482 is actively consuming high amounts of CPU can be done by examining the forth column in the ps output over time.

# ps -A -o pid,ppid,pcpu,time,args | grep 35482
35482 33448  96.2    00:10:28 /usr/IBMIHS-1.3.28/bin/httpd -d /usr/IBMIHS-1.3.28

(wait 10 seconds or so)

# ps -A -o pid,ppid,pcpu,time,args | grep 35482
35482 33448  96.1    00:10:35 /usr/IBMIHS-1.3.28/bin/httpd -d /usr/IBMIHS-1.3.28

(wait 10 seconds or so)

[trawick@b80-2 trawick]$ ps -A -o pid,ppid,pcpu,time,args | grep 35482
35482 33448  96.6    00:10:44 /usr/IBMIHS-1.3.28/bin/httpd -d /usr/IBMIHS-1.3.28

The subsequent checks for process 35482 show that cumulative CPU time (the fourth column) continues to increase. At this point, the suspect process is clearly identified, and we should proceed to collecting information about the proces using the instructions below.

Making sure required support programs are available

Please refer to these instructions for verifying that required support programs are installed.

Running the tool

Run the tool as root to avoid any permissions problems with obtaining backtraces or reading files, such as log files and configuration files. (More information about the requirement to run this tool as root is available here.)

ServerDoc is passed in four parameters for gathering high CPU documentation:

  1. GatherHighCpuDoc
  2. the name of the IHS installation directory (e.g., /usr/HTTPServer)
  3. the process id of the high CPU process to examine
  4. the address of a non-SSL port handled by the web server (e.g., 127.0.0.1:80), or "-" if there is no non-SSL port
# java -jar ServerDoc.jar GatherHighCpuDoc /path/to/IHS 35482 127.0.0.1:80

Redhat Enterprise Linux 3.0 and SuSE SLES 9 - Some performance tools on these platforms may identify specific thread "pids" for high CPU. The high CPU tool cannot currently accept thread pids on these levels of Linux. Be sure to find the true process id for the entire process with high CPU, and specify that true process id on the invocation of the high CPU tool. The sample ps invocations shown above will only show real process ids which can be used with the high CPU tool.

The tool creates a new directory which contains a timestamp in the name, and the high CPU documentation will be saved in that directory.

determining the value of the non-SSL address parameter

If the IHS installation only supports SSL, then use - for this parameter. Otherwise, specify an IP address and port which can be used to reach the server from the local machine without using SSL.

Use the following table to determine the value of the non-SSL address parameter based on the form of a non-SSL Listen directive used in your configuration:

Listen directive looks like this use this for address parameter
(no non-SSL ports) -
Listen 80 127.0.0.1:80
Listen port 127.0.0.1:port
Listen 192.168.1.15:80 192.168.1.15:80
Listen ipaddress:port ipaddress:port
Listen myhostname:80 myhostname:80

a sample run

For this example, IHS is installed in /usr/IBMIHS-1.3.28, the high CPU process is 35482, the non-SSL port can be reached from the web server machine on address 127.0.0.1:8080, and ihsdiag was unpacked into directory /root/ihsdiag-1.3.4.

# java -jar /root/ihsdiag-1.3.4/ServerDoc.jar GatherHighCpuDoc /usr/IBMIHS-1.3.28 \
35482 127.0.0.1:8080
Tracing process for 10 seconds...

Seconds remaining before gathering information again:
60...54...48...42...36...30...24...18...12...6...

Tracing process for 10 seconds...

Seconds remaining before gathering information again:
30...27...24...21...18...15...12...9...6...3...

Tracing process for 10 seconds...

Reports, log files, and configuration files have been saved to directory
  HighCpuDoc.200501201122
If you have additional log files or configuration files, copy them there
before packing up the directory.
Web server log and conf files other than the default will have to be copied manually.
WebSphere plug-in conf and log files will have to be copied manually.

Hint for packing up the directory:
  tar -cf HighCpuDoc.200501201122.tar HighCpuDoc.200501201122
  gzip HighCpuDoc.200501201122.tar
[trawick@b80-2 ServerDoc]$ ls -l HighCpuDoc.200501201122/
total 11479
-rw-r--r--   1 root     system        32986 Jan 20 11:22 access_log
-rw-r--r--   1 root     system         7129 Jan 20 11:22 apachectl
-rw-r--r--   1 root     system         2251 Jan 20 11:22 error_log
-rw-r--r--   1 root     system       662246 Jan 20 11:22 httpd
-rw-r--r--   1 root     system        22835 Jan 20 11:22 httpd.conf
-rw-r--r--   1 root     system       197962 Jan 20 11:24 log
-rw-r--r--   1 root     system         1624 Jan 20 11:24 report
-rw-r--r--   1 root     system      1641481 Jan 20 11:22 trace.0
-rw-r--r--   1 root     system      1642385 Jan 20 11:24 trace.1
-rw-r--r--   1 root     system      1650769 Jan 20 11:24 trace.2
-rwxr-xr-x   1 root     system         1442 Jan 20 11:22 traceprocess.sh

Copying other web server and plug-in files

The next step is to copy any other web server or plug-in configuration files and logs into the new HighCpuDoc directory. Here is a list of files to copy if they are being used:

Saving the documentation directory

The last step is to pack up and compress the documentation directory using zip, tar followed by gzip, or tar followed by compress. The easiest way is to cut and paste the messages displayed by ServerDoc previously which showed the commands to use. The suggested commands will vary by platform. On z/OS, for example, compress will be suggested instead of gzip.

a sample run

# tar -cf HighCpuDoc.200501201122.tar HighCpuDoc.200501201122
# gzip HighCpuDoc.200501201122.tar

The resulting compressed file is the file to send to IBM support.

Understanding the root requirement

When gathering information on high CPU problems, the tool must attach to a live web server process to obtain information about the state of that process.

If the web server is started as root, then that process will be owned by root or by the web server user id (e.g., nobody or www). Only root has the authority to attach to any of the web server processes, so it is easiest if the tool itself is run as root. If the web server administrator does not have authority to log in or switch user to root, a simple script can be created to gather the high CPU documentation, and the system administrator can give the web server administrator sudo access to that script. sudo is a third-party tool available without cost for all appropriate platforms.

If the web server is not started as root, there are no such concerns, and the high CPU documentation tool may be run by the user id which starts the web server.

If the tool is run as non-root and it is unable to gather the required information, the problem will have to be recreated. It may not be possible to determine if this problem occurred until the documentation has been analyzed by IBM HTTP Server support.