Problem (Abstract)
Processes or threads may consume unexpected CPU cycles. Output from the AIX® tprof command can provide information about CPU utilization by process ID, thread, library, and function. This technote explores how to interpret tprof data to aid in debugging such issues in a WebSphere® Application Server environment.
Resolving the problem
Data from the AIX tprof command is indispensable in debugging 100% CPU, high CPU usage, and performance degradation issues on WebSphere JVMs running on the AIX operating system. To map threads in the tprof data to the Java™ stacks in javacores, analysts also need DBX data taken from the JVM process ID (PID). If the issue is related to CPU usage, the 100% CPU Mustgather technote will usually provide all the documentation needed to debug the problem.

The 100% CPU Mustgather technote contains a script that executes the tprof command with the following options:

tprof -skex sleep 60

The -skex parameters configure the following profiling (a scripted equivalent of this collection is sketched after the parameter descriptions):
-s          Enables shared library profiling
-k          Enables kernel profiling
-e          Turns on kernel extension profiling
-x Program  Specifies the program to be executed by tprof. Data collection
            stops when Program completes or the trace is manually stopped
            with either trcoff or trcstop.
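If you need to script a comparable collection yourself, the short sketch below drives the same tprof invocation from Python and first requests a javacore from the JVM so that thread stacks are available for later correlation. The PID is illustrative, the javacore step is not part of the documented Mustgather script, and the sketch assumes it runs on AIX with tprof in the PATH.

import os
import signal
import subprocess

# Hypothetical PID of the WebSphere JVM; substitute the real value.
JAVA_PID = 20386

# SIGQUIT (the equivalent of kill -3) asks the IBM JVM to write a javacore,
# whose Java stacks can later be correlated with the tprof thread IDs.
os.kill(JAVA_PID, signal.SIGQUIT)

# Profile the whole system for 60 seconds; on AIX 5.2 the report is written
# to sleep.prof in the current directory.
subprocess.run(["tprof", "-skex", "sleep", "60"], check=True)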
The tprof utility gathers CPU statistics while the sleep 60 command executes for 60 seconds. On an AIX 5.2 machine, the output from the script is stored in the sleep.prof file and will have the following format:

Process                           Freq  Total  Kernel  User  Shared  Other
=======                           ====  =====  ======  ====  ======  =====
wait                                 4  70.48   70.48  0.00    0.00   0.00
/usr/java131/jre/bin/java            7  25.04    0.01  0.00   25.02   0.01
/usr/bin/tprof                       2   0.15    0.02  0.00    0.13   0.00
/home/db2inst1/sqllib/bin/db2fm      7   0.11    0.08  0.00    0.03   0.00
/usr/bin/sh                         11   0.04    0.04  0.00    0.01   0.00
...
Process                 PID    TID     Total  Kernel  User  Shared  Other
=======                 ===    ===     =====  ======  ====  ======  =====
java131/jre/bin/java    20386   78913  25.00    0.00  0.00   25.00   0.00
wait                     1290    1291  23.79   23.79  0.00    0.00   0.00
wait                      516     517  23.60   23.60  0.00    0.00   0.00
wait                     1032    1033  21.64   21.64  0.00    0.00   0.00
wait                      774     775   1.47    1.47  0.00    0.00   0.00
/usr/bin/tprof          19884  112213   0.15    0.02  0.00    0.13   0.00
java131/jre/bin/java    12336   55499   0.01    0.00  0.00    0.00   0.00
java131/jre/bin/java    12336  117689   0.01    0.00  0.00    0.00   0.00
The first section presents a summation of CPU usage by process name and breaks the usage into the Kernel, User, Shared, and Other categories. In this example, the java131/jre/bin/java process is consuming 25.04% of the CPU cycles during this period, most of it in shared libraries. The “wait” process is using a total of 70.48% of the CPU, all of it in the kernel.

The second section breaks down CPU usage by PID and TID. The “java131/jre/bin/java” process with PID 20386 is consuming 25% of the CPU in shared libraries. The Java process with PID 12336 is using very little CPU.

When debugging WebSphere Application Server issues, CPU usage by a Java process in the shared libraries generally indicates garbage collection (GC) activity. CPU usage in the Other category usually indicates a problem in Java code. If tprof shows a suspicious (non-wait) process using CPU in the Kernel category, you can usually assume the usage is due to an AIX operating system issue.
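When the report is long, it can help to pull the per-process summary into a script and rank the consumers. The following sketch is only an illustration: it assumes the whitespace-separated, percentage-based layout shown above (the exact layout varies by AIX and tprof level) and flags Java processes whose CPU time is concentrated in the Shared or Other columns.

def parse_process_summary(lines):
    """Parse summary rows shaped like:
    /usr/java131/jre/bin/java  7  25.04  0.01  0.00  25.02  0.01
    """
    rows = []
    for line in lines:
        parts = line.split()
        if len(parts) != 7:
            continue  # skip blank lines, the per-thread section, and "..."
        try:
            freq = int(parts[1])
            total, kernel, user, shared, other = map(float, parts[2:])
        except ValueError:
            continue  # skip column headings and separator rows
        rows.append({"process": parts[0], "freq": freq, "total": total,
                     "kernel": kernel, "user": user, "shared": shared,
                     "other": other})
    return rows

with open("sleep.prof") as report:
    rows = parse_process_summary(report)

for row in sorted(rows, key=lambda r: r["total"], reverse=True):
    if "java" in row["process"] and row["shared"] + row["other"] > 1.0:
        print(f"{row['process']}: total={row['total']}% "
              f"shared={row['shared']}% other={row['other']}%")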
The next section of data compiled by tprof provides a breakdown of CPU usage in each major category by Shared Object and Subroutine:
Total % For All Processes (SH-LIBs) = 25.22

Shared Object                                                %
=============                                           ======
/usr/WebSphere/AppServer/java/jre/bin/classic/libjvm.a   25.00
/usr/lib/libc.a[shr.o]                                    0.10
/usr/lib/libpthreads.a[shr_xpg5.o]                        0.08
/usr/lib/libtrace.a[shr.o]                                0.03
/usr/WebSphere/AppServer/java/jre/bin/libjitc.a           0.00
/usr/lib/libC.a[shr.o]                                    0.00
/usr/WebSphere/AppServer/java/bin/../jre/bin/libzip.a     0.00

Profile: /usr/WebSphere/AppServer/java/jre/bin/classic/libjvm.a

Total % For All Processes (/usr/WebSphere/AppServer/java/jre/bin/classic/libjvm.a) = 25.00

Subroutine                        %  Source
==========                   ======  ======
.verifyHeapAfterCompaction0   25.00  ov/st/msc/gc_trace.c
.xeCreateNativeFrame           0.00  ov/xe/common/stack.c

Profile: /usr/lib/libc.a[shr.o]

Total % For All Processes (/usr/lib/libc.a[shr.o]) = 0.10

Subroutine              %  Source
==========         ======  ======
._rec_mutex_lock     0.02  lib/libc/rec_mutex.c
.strpbrk             0.01  s/lib/libc/strpbrk.c
.strncpy             0.01  strncpy.s
.__divi64            0.01  divi64.s
From this data, you can determine the WebSphere Application Server library and the code being executed. In the above example, from the shared libraries category (SH-LIBs), you can see that /usr/WebSphere/AppServer/java/jre/bin/classic/libjvm.a is using 25% of the CPU. The .verifyHeapAfterCompaction0 subroutine from the ov/st/msc/gc_trace.c source is responsible for all of that 25%. In this case, it appears that GC is responsible for 25% of the CPU usage during the period the tprof data was taken. This data can be invaluable in pinpointing the source of CPU usage.
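If many shared objects show up in the report, a small script can surface the hottest subroutine in each Profile block. The sketch below is illustrative and assumes the block layout shown above, where each block starts with a "Profile:" line and each subroutine row carries a name, a percentage, and a source file.

import re

def hot_subroutines(lines, threshold=1.0):
    """Print subroutines whose CPU percentage meets the threshold, grouped
    under the shared object named on the preceding "Profile:" line."""
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith("Profile:"):
            current = line.split("Profile:", 1)[1].strip()
            continue
        # Rows look like: .verifyHeapAfterCompaction0  25.00  ov/st/msc/gc_trace.c
        match = re.match(r"^(\.\S+)\s+(\d+\.\d+)\s+(\S+)$", line)
        if current and match and float(match.group(2)) >= threshold:
            print(f"{current}: {match.group(1)} {match.group(2)}% ({match.group(3)})")

with open("sleep.prof") as report:
    hot_subroutines(report)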
Also note that the process name from the Summary section of the
tprof output shows the Java process as originating from the
directory path /usr/java131/jre/bin/java instead of
the /usr/WebSphere/AppServer/java/jre/bin shown in the
Shared Objects and Subroutine sections:
Process                      Freq  Total  Kernel  User  Shared  Other
=======                      ====  =====  ======  ====  ======  =====
wait                            4  70.48   70.48  0.00    0.00   0.00
/usr/java131/jre/bin/java       7  29.04    0.01  0.00   29.02   0.01

vs.

Total % For All Processes (SH-LIBs) = 25.22

Shared Object                                                %
=============                                           ======
/usr/WebSphere/AppServer/java/jre/bin/classic/libjvm.a   25.00
This is a defect in the tprof utility that is addressed in the 5.2.0.51 release of the bos.perf.tools fileset.

On AIX 5.1 and earlier, tprof data is captured in a _prof.all file, and CPU usage is quantified not as a percentage of the total but as the number of ticks each process uses. The first section of the _prof.all file contains the CPU usage of each thread in the system processes:
Process          PID     TID     Total  Kernel  User  Shared  Other
=======          ===     ===     =====  ======  ====  ======  =====
java             18932    49219   5940     117     0    4667   1156
wait              1290     1291   4405    4405     0       0      0
wait               774      775   3991    3991     0       0      0
wait               516      517   3472    3472     0       0      0
...
/usr/bin/grep    30300   100087      1       1     0       0      0
=======          ===     ===     =====  ======  ====  ======  =====
Total                            24005   15755    26    6083   2141
In this example, thread ID 49219 used 5940 of the total 24005 ticks
during the sample period (about 25%).
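Because the 5.1-format report gives raw ticks rather than percentages, a quick conversion is handy when comparing it with the percentage-based output. A minimal sketch using the figures from the sample above:

def ticks_to_percent(ticks, total_ticks):
    """Convert a tick count from a _prof.all row into a percentage of all ticks."""
    return 100.0 * ticks / total_ticks

# Figures from the sample above: the busy Java thread (TID 49219) vs. the total.
print(f"{ticks_to_percent(5940, 24005):.1f}%")  # about 24.7% of the sample period
print(f"{ticks_to_percent(4667, 24005):.1f}%")  # its shared-library portion, about 19.4%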
The next section of data in the _prof.all file contains the
summation of CPU usage by process name. Once again, instead of percentage
CPU, we see the number of ticks. In this case, the Java process used 8551
of the 24005 ticks:
Process          FREQ  Total  Kernel  User  Shared  Other
=======          ====  =====  ======  ====  ======  =====
wait                4  15237   15237     0       0      0
java               42   8551     376     0    6034   2141
PatrolAgent         6     48      17    16      15      0
...
/usr/bin/grep       1      1       1     0       0      0
=======          ====  =====  ======  ====  ======  =====
Total             110  24005   15755    26    6083   2141
Finally, the _prof.all file also contains the breakdown of the CPU
usage grouped by category (Shared, Other, etc.), Shared
Object, and Subroutine. The format is similar to the one
discussed above.
When debugging CPU usage issues, first inspect the tprof output to determine which processes and threads are consuming the CPU, usually concentrating on the Java processes. Then determine the libraries from which the process is using the CPU. If it is a Java process using CPU from the shared libraries, you can quickly check the Shared Object and Subroutine sections to determine whether GC is the culprit. This can then be verified by analyzing the verbose GC data, if it is available.

If the CPU is consumed in the Other category, you should attempt to map the problem thread ID back to the javacore file to determine the source of the problem. You will need dbxtrace data taken from the JVM PID in order to do so. Detailed instructions for mapping the tprof data back to the javacores can be found in Correlating TPROF Output to a Specific Thread in the JAVACORE using DBXTrace -a Output.
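One small aid while following that procedure: tprof reports thread IDs in decimal, while the native IDs in javacore and dbx output are often printed in hexadecimal, so a quick conversion keeps the cross-referencing straightforward. The TID below is the busy Java thread from the sample output and is otherwise illustrative.

# TID 78913 is the busy Java thread from the per-thread tprof section above.
tid = 78913
print(hex(tid))             # 0x13441, the form often used for native thread IDs
print(int("0x13441", 16))   # back to the decimal form that tprof reports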