|
Problem |
MustGather for problems with WebSphere Application Server
Load Balancer high availability failover. This document is specific to the
AIX® platform; however, it can be used to debug on other operating systems,
because the main differences are the network trace facility and the
directory structure. Gathering this information before calling IBM support
will help familiarize you with the troubleshooting process and save you
time. |
|
Cause |
There are only two ways in which a failover will occur.
- All of the defined Heartbeats fail.
- One machine can reach more of the defined reach targets.
A heartbeat can fail for a number of reasons, such as a network or
hardware problem. A network trace of the heartbeats (the iptrace facility
in AIX) might provide insight, but because you do not know when a failover
will happen, it can be difficult to catch: the trace keeps growing, and
there might not be a failover for some time. Even if you capture the
packets during a failover, you would probably only be able to tell whether
they were delayed or lost, or that they did not reach their destination.
The trace alone rarely shows the reason why. If a failover occurred, the
following command might offer some information about a failed mbuf:
Note: This will only tell you that the operating system failed to
send a packet, which might not be too helpful. |
|
Solution |
If you have already contacted support, continue on to the
component-specific MustGather information. Otherwise, click: MustGather:
Read first for all WebSphere Application Server products.
Load Balancer specific MustGather information
You have a few options for debugging why takeovers occur:
- Turn up server.log level to 5 and size to 50,000,000. Turn up
reach.log level to 5 and size to 50,000,000. This will indicate if
a failover occurred from reachability problems.
|
dscontrol set loglevel 5
dscontrol set logsize 50000000
dscontrol man reach set loglevel 5
dscontrol man reach set logsize 50000000 |
|
- Add additional logging in the go* scripts (goActive and goStandby).
Follow these steps to add the hidden "xlogging" to the environment while
the problem is occurring:
- Enter the following command on both HA partners:
- Add the following lines to the end of the goActive and goStandby
scripts.
goActive:
|
echo "After goActive: " >> /opt/ibm/lb/servers/logs/dispatcher/goActive.log
ifconfig -a >> /opt/ibm/lb/servers/logs/dispatcher/goActive.log
netstat -an >> /opt/ibm/lb/servers/logs/dispatcher/goActive.log
netstat -nr >> /opt/ibm/lb/servers/logs/dispatcher/goActive.log
arp -a >> /opt/ibm/lb/servers/logs/dispatcher/goActive.log
dscontrol e xlogtail 25000 >> /opt/ibm/lb/servers/logs/dispatcher/xlog.out |
|
goStandby: |
|
echo "After goStandby: " >> /opt/ibm/lb/servers/logs/dispatcher/goStandby.log
ifconfig -a >> /opt/ibm/lb/servers/logs/dispatcher/goStandby.log
netstat -an >> /opt/ibm/lb/servers/logs/dispatcher/goStandby.log
netstat -nr >> /opt/ibm/lb/servers/logs/dispatcher/goStandby.log
arp -a >> /opt/ibm/lb/servers/logs/dispatcher/goStandby.log
dscontrol e xlogtail 25000 >> /opt/ibm/lb/servers/logs/dispatcher/xlog.out |
|
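The lines added to goActive and goStandby differ only in the log file name, so they can be wrapped in one small function. The following is a minimal sketch, not part of the product scripts: the function name log_snapshot and the LOGDIR variable are hypothetical, and the added timestamp is an assumption that makes it easier to line the entries up with a network trace.

```shell
#!/bin/sh
# Hypothetical helper: append a timestamped snapshot to <label>.log.
# LOGDIR defaults to the log directory used in the scripts above.
LOGDIR="${LOGDIR:-/opt/ibm/lb/servers/logs/dispatcher}"

log_snapshot() {
    label=$1      # for example "goActive" or "goStandby"
    shift
    logfile="$LOGDIR/$label.log"
    {
        echo "After $label: $(date)"
        for cmd in "$@"; do
            echo "--- $cmd"
            # Run each command string; keep going if one fails or is missing.
            sh -c "$cmd" 2>&1 || echo "(command failed: $cmd)"
        done
    } >> "$logfile"
}

# Example call for the end of goActive (commands taken from the list above):
# log_snapshot goActive "ifconfig -a" "netstat -an" "netstat -nr" "arp -a"
```

The xlogtail line is left out of the sketch because it writes to a different file (xlog.out); keep that line in each script as shown above.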
- Run iptrace on the heartbeats. As mentioned previously, you might need to
periodically stop and restart the iptrace; otherwise the output file
continues to grow. Monitor it during working hours and try to stop it as
soon as you see an HA failover. The heartbeats that you need to trace are
those between the two NFAs. The trace syntax is as follows:
|
iptrace -i en0 -b <output_file> |
|
Where, |
en0 is the interface used for heartbeat traffic |
-b stands for bidirectional |
|
|
For Windows:
Use the Ethereal GUI. |
|
For Solaris: |
|
snoop -o trace.out |
|
For Linux®: |
|
tcpdump -i eth0 -w trace.out |
|
- Ideally, run these traces on both machines (Primary and Backup).
Note: The network trace will show the breaks in the communication
(for example: periodic slowdowns with gaps of 5 seconds or more). Depending
on what the iptrace shows, the answer is either to increase the time-out to
ride out the gaps or to debug the performance issues (OS or network) that
cause the time-outs. Past causes have included extreme network loads that
degrade performance on the cards themselves and delay the processing of the
heartbeats, large firewall filter rule sets that delay the heartbeats on
their way to the Load Balancer, and network congestion that delays the
packets before they even reach the box. From the iptrace, you might be able
to get an idea of what is happening. The following command might also
show adapter errors (dropped packets, and so on).
|
netstat -v |
|
The xlog will show the internal communications that are
transpiring. |
|
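The periodic stop-and-restart of iptrace mentioned above can be scripted so that the trace never grows without bound. This is a sketch under stated assumptions, not a tested procedure: TRACE_DIR, the file naming scheme, KEEP, and the function names are all hypothetical, and the way the running iptrace is located (matching "iptrace -i" in the ps -ef output) is an assumption about how the daemon was started.

```shell
#!/bin/sh
# Sketch only: restart the AIX iptrace daemon with a fresh, timestamped
# output file and keep only the most recent files.
TRACE_DIR="${TRACE_DIR:-/tmp/hbtrace}"   # assumed location
IFACE="${IFACE:-en0}"                    # heartbeat interface
KEEP="${KEEP:-5}"                        # number of trace files to keep

new_trace_name() {
    echo "$TRACE_DIR/iptrace.$(date +%Y%m%d%H%M%S).out"
}

prune_traces() {
    # Names sort chronologically, so everything after the first $KEEP
    # entries of the reversed listing is older and can be deleted.
    ls "$TRACE_DIR"/iptrace.*.out 2>/dev/null | sort -r |
        awk -v keep="$KEEP" 'NR > keep' |
        while read -r f; do rm -f "$f"; done
}

rotate_iptrace() {
    # Stop any running iptrace, then start a new one.
    pids=$(ps -ef | awk '$0 ~ /iptrace -i/ && $0 !~ /awk/ {print $2}')
    if [ -n "$pids" ]; then
        kill $pids
    fi
    mkdir -p "$TRACE_DIR"
    iptrace -i "$IFACE" -b "$(new_trace_name)"
    prune_traces
}
```

If rotate_iptrace is run on a schedule (for example from cron), stop the schedule as soon as a failover is seen; otherwise the file that contains the failover will eventually be pruned.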
- Once the failover has occurred, and you have captured this in the
logs, you can turn the logging down or off:
kill <pid> ("pid" of the iptrace)
dscontrol set loglevel 1
dscontrol man reach set loglevel 1
dscontrol e xloglevel 0 (turns it off) |
|
- One other thing that you can check is PMTU discovery: is the route table
clean, or have a number of routes been created? In AIX, for each packet,
the OS has to search through this list of routes to find the appropriate
one for that packet. For example, any packet that needs the default route
is matched against all of the other routes before the default route is
selected, which adds overhead on a per-packet basis.
To turn it off:
no -o tcp_pmtu_discover=0
no -o udp_pmtu_discover=0 |
|
If you elect to do this, add the preceding two commands to the end of the
rc.tcpip file, located in the /etc directory, so that PMTU discovery is
turned off every time the machine is rebooted.
Additional note: Starting with version 3.6 the heartbeats have
been encapsulated using GRE. Previously they were "kamikaze" style packets
"fin-syn-urge", only sending data and not expecting a reply. |
|
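When the two no commands are added to /etc/rc.tcpip, it helps to avoid adding them twice if the step is repeated. The following is a minimal sketch; the function name append_once is hypothetical, and the example calls are commented out because they must be run as root on the AIX machine.

```shell
#!/bin/sh
# Sketch: append a line to a file only if that exact line is not there yet.
append_once() {
    file=$1
    line=$2
    # -x matches whole lines only, -F treats the text as a fixed string.
    grep -qxF -e "$line" "$file" 2>/dev/null || echo "$line" >> "$file"
}

# Intended use (file and commands are taken from the text above):
# append_once /etc/rc.tcpip "no -o tcp_pmtu_discover=0"
# append_once /etc/rc.tcpip "no -o udp_pmtu_discover=0"
```

Consider keeping a copy of rc.tcpip before changing it, so that the change is easy to back out.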
- Provide the following information during the problem, from both
machines:
- xlog.out
- server.log (collected with loglevel 5)
- goActive.log
- goStandby.log
- configuration files
- highavailchange script (if applicable)
- goActive and goStandby scripts
- network trace in binary format
- netstat -m
- netstat -v
- Follow instructions to send
diagnostic information to IBM support.
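Collecting the items in the list above can be scripted so that the same set is gathered on both machines. This is a sketch only: the function name collect_mustgather, the staging directory layout, and the missing.txt file are hypothetical, and the file paths passed to it are up to you, because this document gives only the locations of xlog.out, goActive.log, and goStandby.log.

```shell
#!/bin/sh
# Sketch: copy the files that exist into a staging directory, save the
# netstat output next to them, and pack everything into one tar archive.
collect_mustgather() {
    outdir=$1
    shift
    mkdir -p "$outdir/files"
    for f in "$@"; do
        if [ -f "$f" ]; then
            cp "$f" "$outdir/files/"
        else
            # Record what could not be found instead of stopping.
            echo "missing: $f" >> "$outdir/missing.txt"
        fi
    done
    # Command output requested in the list above.
    netstat -m > "$outdir/netstat-m.out" 2>&1 || true
    netstat -v > "$outdir/netstat-v.out" 2>&1 || true
    # Creates <outdir>.tar next to the staging directory.
    (cd "$(dirname "$outdir")" &&
        tar -cf "$(basename "$outdir").tar" "$(basename "$outdir")")
}

# Example (log paths as used in the go* scripts above):
# L=/opt/ibm/lb/servers/logs/dispatcher
# collect_mustgather /tmp/lb-primary "$L/xlog.out" "$L/goActive.log" "$L/goStandby.log"
```

Add server.log, the configuration files, the go* scripts, the highavailchange script, and the binary network trace to the argument list from wherever they are located on your system.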
For a listing of all technotes, downloads, and educational materials
specific to the Load Balancer component, search the WebSphere Edge Server
and WebSphere Application Server support site. |
|
|
|
|
Cross Reference information |
Segment | Product | Component | Platform | Version | Edition |
Application Servers | WebSphere Application Server | Edge Component | AIX, HPUX, Linux, Solaris, Windows | 6.0 | Base, Network Deployment |
|
|
|
|