Using TPROF data to aid in debugging high CPU and performance degradation issues for an AIX JVM
 Technote (FAQ)
 
Problem
Processes or threads may consume unexpected CPU cycles. Output from the AIX® tprof command can provide invaluable information in determining which process/thread is responsible for the high CPU usage. It provides CPU utilization information broken down by process ID, thread, library, and function. This technote will explore how to interpret tprof data to aid in debugging these type issues in a WebSphere® Application Server environment.
 
 
Solution
Data from the AIX tprof command is indispensable in debugging 100% CPU, high CPU usage, and performance degradation issues on WebSphere JVMs executing on the AIX operating system. In order to map threads in the tprof data to the Java™ stacks in Javacores, analysts will also need DBX data taken from the JVM process ID (PID). If the issue is related to CPU usage, the 100% CPU Mustgather technote will usually provide all the documentation needed to debug the problem.

The 100% CPU Mustgather technote contains a script that executes the tprof command with the following options:

tprof -skex sleep 60

The –skex parameters configure the following profiling:
-s Enables shared library profiling
-k Enables kernel profiling
-e Turns on kernel extension profiling
-x Program Specifies the program to be executed by tprof. Data collection stops when Program completes or trace is manually stopped with either trcoff or trcstop

The tprof utility will gather CPU statistics while the sleep 60 command executes for 60 seconds. On an AIX 5.2 machine, the output from the script is stored in the sleep.prof file and will have the following format:

Process Freq Total Kernel User Shared Other
======= ==== ===== ====== ==== ====== =====
wait 4 70.48 70.48 0.00 0.00 0.00
/usr/java131/jre/bin/java 7 25.04 0.01 0.00 25.02 0.01
/usr/bin/tprof 2 0.15 0.02 0.00 0.13 0.00
/home/db2inst1/sqllib/bin/db2fm 7 0.11 0.08 0.00 0.03 0.00
/usr/bin/sh 11 0.04 0.04 0.00 0.01 0.00
. . .
Process PID TID Total Kernel User Shared Other
======= === === ===== ====== ==== ====== =====
java131/jre/bin/java 20386 78913 25.00 0.00 0.00 25.00 0.00
wait 1290 1291 23.79 23.79 0.00 0.00 0.00
wait 516 517 23.60 23.60 0.00 0.00 0.00
wait 1032 1033 21.64 21.64 0.00 0.00 0.00
wait 774 775 1.47 1.47 0.00 0.00 0.00
/usr/bin/tprof 19884 112213 0.15 0.02 0.00 0.13 0.00
java131/jre/bin/java 12336 55499 0.01 0.00 0.00 0.00 0.00
java131/jre/bin/java 12336 117689 0.01 0.00 0.00 0.00 0.00

The first section presents a summation of the CPU usage based on the process name and breaks the usage into the Kernel, User, Shared, and Other library sections. In this example, the java131/jre/bin/java is consuming 25.04% of the CPU cycles during this period of time, most of which is used in the shared libraries. The “wait” process is using a total of 70.48% of the CPU all of which is in the kernel.

The second section breaks down the CPU usage by PID. The “ java131/jre/bin/java” PID 20386 is consuming 25% of the CPU in shared libraries. The Java PID 12336 is using very little CPU.

When debugging WebSphere Application Server issues, CPU usage from a Java process in the shared library generally indicates garbage collection (GC) activity. CPU usage in the Other category will usually indicate a problem in Java code. If tprof shows a suspicious process (non-wait) using CPU in the Kernel libraries, we can usually assume the usage is due to an AIX operating system issue.

The next section of data compiled by tprof will provide a break down of the CPU usage in each major category by Shared Object and Subroutine:

Total % For All Processes (SH-LIBs) = 25.22

Shared Object %
============= ======
/usr/WebSphere/AppServer/java/jre/bin/classic/libjvm.a 25.00
/usr/lib/libc.a[shr.o] 0.10
/usr/lib/libpthreads.a[shr_xpg5.o] 0.08
/usr/lib/libtrace.a[shr.o] 0.03
/usr/WebSphere/AppServer/java/jre/bin/libjitc.a 0.00
/usr/lib/libC.a[shr.o] 0.00
/usr/WebSphere/AppServer/java/bin/../jre/bin/libzip.a 0.00

Profile: /usr/WebSphere/AppServer/java/jre/bin/classic/libjvm.a

Total % For All Processes (/usr/WebSphere/AppServer/java/jre/bin/classic/libjvm.a) = 25.00

Subroutine % Source
========== ====== ======
.verifyHeapAfterCompaction0 25.00 ov/st/msc/gc_trace.c
.xeCreateNativeFrame 0.00 ov/xe/common/stack.c

Profile: /usr/lib/libc.a[shr.o]

Total % For All Processes (/usr/lib/libc.a[shr.o]) = 0.10

Subroutine % Source
========== ====== ======
._rec_mutex_lock 0.02 lib/libc/rec_mutex.c
.strpbrk 0.01 s/lib/libc/strpbrk.c
.strncpy 0.01 strncpy.s
.__divi64 0.01 divi64.s

From this data, you can determine the WebSphere Application Server library and the code being executed. In the above example, from the shared libraries category (SH-LIBS), the usr/WebSphere/AppServer/java/jre/bin/classic/libjvm.a is using 25% of the CPU. The verifyHeapAfterCompaction subroutine from the ov/st/msc/gc_trace.c source is responsible for all of that 25%. In this case, it appears that GC is responsible for 25% of the CPU usage during the period the tprof data was taken. This data can be invaluable in pinpointing the source of CPU usage.

Also note hat the process name from the Summary section of the tprof output shows the Java process as originating from the directory path /usr/java131/jre/bin/java instead of the /usr/WebSphere/AppServer/java/jre/bin shown in the Shared Objects and Subroutine sections:

Process Freq Total Kernel User Shared Other
======= ==== ===== ====== ==== ====== =====
wait 4 70.48 70.48 0.00 0.00 0.00
/usr/java131/jre/bin/java 7 29.04 0.01 0.00 29.02 0.01

vs.

Total % For All Processes (SH-LIBs) = 25.22

Shared Object %
============= ======
/usr/WebSphere/AppServer/java/jre/bin/classic/libjvm.a 25.00

This is a defect in the tprof utility addressed by the 5.2.0.51 release of the bos.perf.tools library.

On AIX 5.1 and earlier, tprof data will be captured in a _prof.all file and the CPU usage is not quantified as a percentage of the total but the number of ticks the processes use. The first section of the _prof.all file will contain the CPU usage of each thread in the system processes:

Process PID TID Total Kernel User Shared Other
======= === === ===== ====== ==== ====== =====
java 18932 49219 5940 117 0 4667 1156
wait 1290 1291 4405 4405 0 0 0
wait 774 775 3991 3991 0 0 0
wait 516 517 3472 3472 0 0 0
. . .
/usr/bin/grep 30300 100087 1 1 0 0 0
======= === === ===== ====== ==== ====== =====
Total 24005 15755 26 6083 2141

In this example, thread ID 49219 used 5940 of the total 24005 ticks during the sample period (about 25%).

The next section of data in the _prof.all file contains the summation of CPU usage by process name. Once again, instead of percentage CPU, we see the number of ticks. In this case, the Java process used 8551 of the 24005 ticks:

Process FREQ Total Kernel User Shared Other
======= === ===== ====== ==== ====== =====
wait 4 15237 15237 0 0 0
java 42 8551 376 0 6034 2141
PatrolAgent 6 48 17 16 15 0
. . .
/usr/bin/grep 1 1 1 0 0 0
======= === ===== ====== ==== ====== =====
Total 110 24005 15755 26 6083 2141

Finally, the _prof.all file also contains the breakdown of the CPU usage grouped by category (Shared, Other, etc.), Shared Object, and Subroutine. The format is similar to the one discussed above so we will not present it again here.

When debugging CPU usage issues, you should first inspect the tprof output to determine which processes/threads are consuming the CPU, usually concentrating on Java processes. You would then determine which libraries from which the process is using the CPU. If it is Java process using CPU from the shared libraries, you can quickly check the Shared Objects and Subroutine sections to determine if GC is the culprit. This can them be verified by analyzing the verbose GC data if it is provided.

If the CPU is consumed from the Other libraries, you should then attempt to map the problem thread ID back to the javacore file to determine the source of the problem. You will need dbxtrace data taken from the JVM PID in order to do so. Detailed instructions for mapping the tprof data back to the javacores can be found in the Correlating TPROF Output to a Specific Thread in the JAVACORE using DBXTrace -a Output technote.
 
Related information
Mustgather: 100% CPU Usage on AIX Platforms
Correlating TPROF output to a Javacore Thread
 
 


Document Information


Product categories: Software > Application Servers > Distributed Application & Web Servers > WebSphere Application Server > Hangs/Performance Degradation
Operating system(s): AIX
Software version: 6.0
Software edition:
Reference #: 1210870
IBM Group: Software Group
Modified date: Jun 30, 2005