Problem
Processes or threads may consume unexpected CPU cycles. Output from the AIX® tprof command can provide invaluable information in determining which process or thread is responsible for the high CPU usage. It provides CPU utilization information broken down by process ID, thread, library, and function. This technote explores how to interpret tprof data to aid in debugging these types of issues in a WebSphere® Application Server environment.
Solution
Data from the AIX tprof command is
indispensable in debugging 100% CPU, high CPU usage, and performance
degradation issues on WebSphere JVMs executing on the AIX operating
system. In order to map threads in the tprof data to the Java™ stacks in
Javacores, analysts will also need DBX data taken from the JVM process ID
(PID). If the issue is related to CPU usage, the 100% CPU Mustgather technote will usually provide all the documentation needed to debug
the problem.
The 100% CPU Mustgather technote contains a script that executes the tprof command with the
following options:
tprof -skex sleep 60
The -skex parameters configure the following profiling:

-s          Enables shared library profiling
-k          Enables kernel profiling
-e          Enables kernel extension profiling
-x Program  Specifies the program to be executed by tprof. Data collection
            stops when Program completes or trace is manually stopped with
            either trcoff or trcstop.
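For reference, the same collection can be run by hand and, if necessary, cut short; a minimal sketch using only the commands named above:

# Profile the whole system; -s, -k, and -e enable shared library, kernel, and
# kernel extension profiling, while "sleep 60" defines the 60-second sample window.
tprof -skex sleep 60

# From another session, stop the underlying trace early if needed; data
# collection ends at that point, as described above.
trcstop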
The tprof utility will gather CPU statistics while
the sleep 60 command executes for 60
seconds. On an AIX 5.2 machine, the output from the script is stored in
the sleep.prof file and will have the
following format:
Process                           Freq  Total  Kernel  User  Shared  Other
=======                           ====  =====  ======  ====  ======  =====
wait                                 4  70.48   70.48  0.00    0.00   0.00
/usr/java131/jre/bin/java            7  25.04    0.01  0.00   25.02   0.01
/usr/bin/tprof                       2   0.15    0.02  0.00    0.13   0.00
/home/db2inst1/sqllib/bin/db2fm      7   0.11    0.08  0.00    0.03   0.00
/usr/bin/sh                         11   0.04    0.04  0.00    0.01   0.00
. . .

Process                 PID     TID  Total  Kernel  User  Shared  Other
=======                 ===     ===  =====  ======  ====  ======  =====
java131/jre/bin/java  20386   78913  25.00    0.00  0.00   25.00   0.00
wait                   1290    1291  23.79   23.79  0.00    0.00   0.00
wait                    516     517  23.60   23.60  0.00    0.00   0.00
wait                   1032    1033  21.64   21.64  0.00    0.00   0.00
wait                    774     775   1.47    1.47  0.00    0.00   0.00
/usr/bin/tprof        19884  112213   0.15    0.02  0.00    0.13   0.00
java131/jre/bin/java  12336   55499   0.01    0.00  0.00    0.00   0.00
java131/jre/bin/java  12336  117689   0.01    0.00  0.00    0.00   0.00
The first section presents a summation of the CPU usage by process name and breaks the usage into the Kernel, User, Shared, and Other library categories. In this example, /usr/java131/jre/bin/java is consuming 25.04% of the CPU cycles during this period, most of it in shared libraries. The “wait” process is using a total of 70.48% of the CPU, all of it in the kernel.

The second section breaks down the CPU usage by PID and thread ID (TID). The “java131/jre/bin/java” process with PID 20386 is consuming 25% of the CPU in shared libraries. The Java process with PID 12336 is using very little CPU.
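In a large report it can be tedious to scan the per-thread section by eye. A rough filter, assuming the AIX 5.2 sleep.prof layout shown above (eight columns in the per-thread section), lists the busiest Java threads first:

# Select java lines from the per-thread section (Process PID TID Total Kernel
# User Shared Other) and sort them by the Total column, highest first.
awk 'NF == 8 && $1 ~ /java/ { print }' sleep.prof | sort -k4,4nr | head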
When debugging WebSphere Application Server issues, CPU usage by a Java process in the Shared category generally indicates garbage collection (GC) activity. CPU usage in the Other category usually indicates a problem in Java code. If tprof shows a suspicious (non-wait) process consuming CPU in the Kernel category, the usage is usually due to an AIX operating system issue.
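If the Other category points at Java code, a javacore taken while the CPU is high shows which Java stacks are running. A hedged sketch using the standard IBM JDK signal trigger, with the PID taken from the sample tprof output above:

# Request a javacore from the suspect JVM. SIGQUIT does not terminate the IBM
# JVM; it writes a javacore text file to the working directory (or to the
# directory set by IBM_JAVACOREDIR, if defined).
kill -3 20386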
The next section of data compiled by tprof provides a breakdown of the CPU usage in each major category by Shared Object and Subroutine:
Total % For All Processes (SH-LIBs) = 25.22

Shared Object                                               %
=============                                            ======
/usr/WebSphere/AppServer/java/jre/bin/classic/libjvm.a    25.00
/usr/lib/libc.a[shr.o]                                     0.10
/usr/lib/libpthreads.a[shr_xpg5.o]                         0.08
/usr/lib/libtrace.a[shr.o]                                 0.03
/usr/WebSphere/AppServer/java/jre/bin/libjitc.a            0.00
/usr/lib/libC.a[shr.o]                                     0.00
/usr/WebSphere/AppServer/java/bin/../jre/bin/libzip.a      0.00

Profile: /usr/WebSphere/AppServer/java/jre/bin/classic/libjvm.a

Total % For All Processes (/usr/WebSphere/AppServer/java/jre/bin/classic/libjvm.a) = 25.00

Subroutine                      %     Source
==========                   ======   ======
.verifyHeapAfterCompaction0   25.00   ov/st/msc/gc_trace.c
.xeCreateNativeFrame           0.00   ov/xe/common/stack.c

Profile: /usr/lib/libc.a[shr.o]

Total % For All Processes (/usr/lib/libc.a[shr.o]) = 0.10

Subroutine           %     Source
==========         ======  ======
._rec_mutex_lock    0.02   lib/libc/rec_mutex.c
.strpbrk            0.01   s/lib/libc/strpbrk.c
.strncpy            0.01   strncpy.s
.__divi64           0.01   divi64.s
From this data, you can determine the WebSphere Application Server library and the code being executed. In the above example, from the shared libraries category (SH-LIBs), /usr/WebSphere/AppServer/java/jre/bin/classic/libjvm.a is using 25% of the CPU, and the .verifyHeapAfterCompaction0 subroutine from the ov/st/msc/gc_trace.c source is responsible for all of that 25%. In this case, it appears that GC is responsible for 25% of the CPU usage during the period the tprof data was taken. This data can be invaluable in pinpointing the source of CPU usage.
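When the report is long, it can help to pull out just the profile block for the library of interest. A small sketch, using the section markers from the sample output above:

# Print from the libjvm.a "Profile:" header up to, but not including, the next
# "Profile:" header.
awk '/^Profile:/ { show = /libjvm/ } show' sleep.prof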
Also note that the process name in the summary section of the tprof output shows the Java process as originating from the directory path /usr/java131/jre/bin/java instead of the /usr/WebSphere/AppServer/java/jre/bin path shown in the Shared Object and Subroutine sections:
Process                     Freq  Total  Kernel  User  Shared  Other
=======                     ====  =====  ======  ====  ======  =====
wait                           4  70.48   70.48  0.00    0.00   0.00
/usr/java131/jre/bin/java      7  29.04    0.01  0.00   29.02   0.01

vs.

Total % For All Processes (SH-LIBs) = 25.22

Shared Object                                               %
=============                                            ======
/usr/WebSphere/AppServer/java/jre/bin/classic/libjvm.a    25.00
This is a defect in the tprof utility addressed in the 5.2.0.51 release of the bos.perf.tools fileset.
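To check whether a system already has the fix, the installed fileset level can be queried directly with standard AIX commands (the target level comes from the paragraph above):

# Show the installed level of the tprof fileset; 5.2.0.51 or later contains the fix.
lslpp -l bos.perf.tools

# Show the overall AIX maintenance level for reference.
oslevel -r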
On AIX 5.1 and earlier, tprof data is captured in a _prof.all file, and the CPU usage is not quantified as a percentage of the total but as the number of ticks each process uses. The first section of the _prof.all file contains the CPU usage of each thread in the system processes:
Process          PID     TID  Total  Kernel  User  Shared  Other
=======          ===     ===  =====  ======  ====  ======  =====
java           18932   49219   5940     117     0    4667   1156
wait            1290    1291   4405    4405     0       0      0
wait             774     775   3991    3991     0       0      0
wait             516     517   3472    3472     0       0      0
. . .
/usr/bin/grep  30300  100087      1       1     0       0      0
=======          ===     ===  =====  ======  ====  ======  =====
Total                         24005   15755    26    6083   2141
In this example, thread ID 49219 used 5940 of the
total 24005 ticks during the sample period (about 25%).
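The conversion from ticks to a percentage is a simple division; for example, using the figures above:

# 5940 ticks out of 24005 total ticks is roughly 24.7% of the CPU.
echo "5940 24005" | awk '{ printf "%.1f%%\n", ($1 / $2) * 100 }'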
The next section of data in the _prof.all file contains the
summation of CPU usage by process name. Once again, instead of percentage
CPU, we see the number of ticks. In this case, the Java process used 8551
of the 24005 ticks:
Process        FREQ  Total  Kernel  User  Shared  Other
=======        ====  =====  ======  ====  ======  =====
wait              4  15237   15237     0       0      0
java             42   8551     376     0    6034   2141
PatrolAgent       6     48      17    16      15      0
. . .
/usr/bin/grep     1      1       1     0       0      0
=======        ====  =====  ======  ====  ======  =====
Total           110  24005   15755    26    6083   2141
Finally, the _prof.all file also contains
the breakdown of the CPU usage grouped by category (Shared, Other, etc.), Shared Object, and
Subroutine. The format
is similar to the one discussed above so we will not present it again
here.
When debugging CPU usage issues, you should first inspect the tprof output to determine which processes and threads are consuming the CPU, usually concentrating on Java processes. You would then determine from which libraries the process is consuming CPU. If a Java process is using CPU from the shared libraries, you can quickly check the Shared Object and Subroutine sections to determine whether GC is the culprit. This can then be verified by analyzing the verbose GC data if it is provided.

If the CPU is consumed from the Other libraries, you should then attempt to map the problem thread ID back to the javacore file to determine the source of the problem. You will need dbxtrace data taken from the JVM PID in order to do so. Detailed instructions for mapping the tprof data back to the javacores can be found in the Correlating TPROF Output to a Specific Thread in the JAVACORE using DBXTrace -a Output technote.
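For reference, a rough manual sketch of the dbx step follows. The Mustgather's dbxtrace script automates this, the thread number (5) is illustrative, and the exact field names and javacore correlation steps are covered in the technote referenced above:

dbx -a 20386          # attach to the JVM PID identified in the tprof output
(dbx) thread          # list all threads in the process
(dbx) thread info 5   # details for one thread, including its kernel thread ID,
                      # which matches the TID column in the tprof data
(dbx) detach          # detach from the JVM without terminating it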