MustGather: WebSphere Load Balancer High Availability failover
 Technote (FAQ)
 
Problem
MustGather for problems with WebSphere Application Server Load Balancer high availability failover. This document is specific to the AIX® platform; however, it can be used to debug on other operating systems, the main differences being the network trace facility and the directory structure. Gathering this information before calling IBM support will help familiarize you with the troubleshooting process and save you time.
 
Cause
A failover can occur in only two ways:
  1. All of the defined heartbeats fail.
  2. One machine can reach more of the defined reach targets than the other.

There are a number of reasons why a heartbeat might fail: it could be the network, it could be hardware, and so forth. A network trace (the iptrace facility on AIX) of the heartbeats might provide insight, but since you do not know when the failure will happen, it can be difficult to catch; the trace also continues to grow, and a failover might not occur for some time. Even if capturing the packets during a failover were feasible, you would probably only be able to tell that they were delayed, lost, or never reached their destination; it would be difficult to say why. If a failover occurred, the following command might offer some information about mbuf failures:

netstat -m

Note: This will only tell you that the operating system failed to send a packet, which might not be too helpful.
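The mbuf check can be scripted so that it is easy to run right after a failover. The sample text and the "failures" pattern below are illustrative assumptions; the exact netstat -m output format varies by AIX level, so adjust the pattern to match your system:

```shell
# Sum mbuf allocation failures from "netstat -m" output. On a live system,
# pipe the real command into awk instead of the sample:
#   netstat -m | awk '/failures/ { sum += $NF } END { print sum + 0 }'
# The sample below stands in for live output; its format is an assumption.
sample='253 mbufs in use
0 requests for mbufs denied
mbuf allocation failures: 3'

failures=$(printf '%s\n' "$sample" |
    awk '/failures/ { sum += $NF } END { print sum + 0 }')

echo "mbuf allocation failures: $failures"
```

A nonzero count only confirms that the operating system failed to obtain buffers for sending; it does not say why.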
 
Solution
If you have already contacted support, continue on to the component-specific MustGather information. Otherwise, click: MustGather: Read first for all WebSphere Application Server products.


Load Balancer specific MustGather information
You have a few options for debugging why takeovers occur:
  1. Set the server.log level to 5 and its size to 50,000,000, and do the
    same for reach.log. These logs will indicate whether a failover
    occurred because of reachability problems.

    dscontrol set loglevel 5
    dscontrol set logsize 50000000
    dscontrol man reach set loglevel 5
    dscontrol man reach set logsize 50000000

  2. Add additional logging in the go* scripts (goActive and goStandby). Follow these steps to enable the hidden "xlogging" in the environment while the problem is occurring:
    1. Enter the following command on both HA partners:

      dscontrol e xloglevel 5

    2. Add the following lines to the end of the goActive and goStandby scripts.

      goActive:

      echo "After goActive: " >> /opt/ibm/lb/servers/logs/dispatcher/goActive.log
      ifconfig -a >> /opt/ibm/lb/servers/logs/dispatcher/goActive.log
      netstat -an >> /opt/ibm/lb/servers/logs/dispatcher/goActive.log
      netstat -nr >> /opt/ibm/lb/servers/logs/dispatcher/goActive.log
      arp -a >> /opt/ibm/lb/servers/logs/dispatcher/goActive.log
      dscontrol e xlogtail 25000 >> /opt/ibm/lb/servers/logs/dispatcher/xlog.out  

      goStandby:

      echo "After goStandby: " >> /opt/ibm/lb/servers/logs/dispatcher/goStandby.log
      ifconfig -a >>/opt/ibm/lb/servers/logs/dispatcher/goStandby.log
      netstat -an >> /opt/ibm/lb/servers/logs/dispatcher/goStandby.log
      netstat -nr >> /opt/ibm/lb/servers/logs/dispatcher/goStandby.log
      arp -a >> /opt/ibm/lb/servers/logs/dispatcher/goStandby.log
      dscontrol e xlogtail 25000 >> /opt/ibm/lb/servers/logs/dispatcher/xlog.out

  3. Iptrace the heartbeats. As mentioned previously, you might need to periodically stop and restart the iptrace; otherwise it continues to grow. Monitor it during working hours and stop it as soon as you see an HA failover. The heartbeats that you need to trace run between the two NFAs (nonforwarding addresses). The trace syntax is as follows:

    iptrace -i en0 -b <output_file>

    where:
      en0 is the interface used for heartbeat traffic
      -b  captures traffic in both directions (bidirectional)

    For Windows:
    Use the Ethereal GUI.

    For Solaris:

    snoop -o trace.out

    For Linux®:

    tcpdump -i eth0 -w trace.out

  4. These should, ideally, be done on both machines (Primary and Backup).

    Note: The network trace will show the breaks in the communication (for example, periodic slowdowns with gaps of more than 5 seconds). Depending on what the iptrace shows, the answer is either to increase the time-out to ride out the gaps or to debug the performance issues (OS or network) that cause the time-outs. Past causes have included extreme network loads that degrade performance on the network cards themselves and delay processing of the heartbeats, large firewall filter rule sets that delay the heartbeats reaching the Load Balancer, and network congestion that delays the packets even reaching the box. From the iptrace, you might get an idea of what is happening. The following command might also show adapter errors (dropped packets, and so on):

    netstat -v

    The xlog will show the internal communications that are transpiring.

  5. Once the failover has occurred, and you have captured this in the logs, you can turn the logging down or off:

    kill <pid>  ("pid" of the iptrace)
    dscontrol set loglevel 1
    dscontrol man reach set loglevel 1
    dscontrol e xloglevel 0 (turns it off)

  6. One other thing to check is PMTU discovery: is the routing table clean, or have a number of cloned routes been created? On AIX, for each packet, the OS has to search through this list of routes to find the appropriate one for that particular packet. For example, any packet that needs the default route has to search through all of the cloned routes before selecting the default route, causing added overhead on a per-packet basis.

    To turn it off:

    no -o tcp_pmtu_discover=0
    no -o udp_pmtu_discover=0

    If you elect to do this, add the preceding two commands to the end of the /etc/rc.tcpip file so that PMTU discovery is turned off every time the machine is rebooted.

    Additional note: Starting with version 3.6, the heartbeats are encapsulated using GRE. Previously they were "kamikaze"-style packets ("fin-syn-urge"), only sending data and not expecting a reply.

  7. Provide the following information, captured while the problem is occurring, from both machines:
    • xlog.out
    • server.log (collected with loglevel 5)
    • goActive.log
    • goStandby.log
    • configuration files
    • highavailchange script (if applicable)
    • goActive and goStandby scripts
    • network trace in binary format
    • netstat -m output
    • netstat -v output

  8. Follow instructions to send diagnostic information to IBM support.
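The periodic stop-and-restart of the trace described in steps 3 and 5 can be wrapped in a small rotation script so that no single capture file grows without bound while you wait for a failover. This is a sketch only; the function name, paths, and intervals are illustrative, and the trace command must be swapped in for iptrace, tcpdump, or snoop as appropriate:

```shell
# rotate_trace CMD OUTPREFIX INTERVAL COUNT
# Runs "CMD <outfile>" in the background COUNT times, INTERVAL seconds per
# slice, stopping each slice before starting the next so that each capture
# file stays bounded.
rotate_trace() {
    cmd=$1; out=$2; interval=$3; count=$4
    i=0
    while [ "$i" -lt "$count" ]; do
        i=$((i + 1))
        $cmd "$out.$i" &        # e.g. iptrace -i en0 -b /tmp/hb_trace.1
        pid=$!
        sleep "$interval"
        kill "$pid" 2>/dev/null # stop this slice before the next begins
        wait "$pid" 2>/dev/null
    done
}

# Hypothetical AIX usage: run in the background and stop it once the
# failover has been captured, keeping the slice that covers the event.
#   rotate_trace "iptrace -i en0 -b" /tmp/hb_trace 300 12 &
```

Note that iptrace can detach from the shell that started it, so in practice you might still need to find and kill the tracing process by name (as in step 5) rather than rely on the recorded pid.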

For a listing of all technotes, downloads, and educational materials specific to the Load Balancer component, search the WebSphere Edge Server and WebSphere Application Server support site.
 
Related information
Submitting information to IBM support
Steps to getting support
MustGather: Read first
Troubleshooting guide
 
 
Cross Reference information
Segment Product Component Platform Version Edition
Application Servers WebSphere Application Server Edge Component AIX, HPUX, Linux, Solaris, Windows 6.0 Base, Network Deployment
 
 


Document Information


Product categories: Software > Application Servers > Edge Servers > WebSphere Edge Server > Load Balancer
Operating system(s): HP-UX
Software version: 6.0.1
Software edition:
Reference #: 1218379
IBM Group: Software Group
Modified date: Sep 27, 2005