WebSphere plugin failover delayed when system is physically unavailable

Technote (FAQ)
Problem
When using WebSphere Application Server, many customers test failover by "pulling the plug" on a node, which is different from simply stopping an application server. When a node is powered off, TCP timeouts are a factor in the time it takes for failover to occur.
Solution
TCP Timeouts:
One reason that failover test results vary is operating system dependencies, specifically TCP timeout values. When the plug-in sends a request to a clone that is stopped but resides on an available machine, the connection attempt fails right away and the plug-in immediately forwards the request to the next available clone. When a clone resides on a machine that has been removed from the network (network cable unplugged, powered off), the plug-in still routes a request to it by using a connect() function call. Since the machine is not available, the plug-in cannot determine the clone’s status until the operating system TCP timeout expires. Only then will the plug-in forward the request to another available clone.
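
The difference between the two failure cases can be observed with a small timing probe. This is a minimal sketch in Python, not the plug-in's own code; the function name time_connect is hypothetical and introduced only for illustration. A stopped server on a reachable machine is reported as "refused" almost at once, whereas a host that is off the network would keep the call blocked until the operating system TCP timeout expires.

```python
import socket
import time


def time_connect(host, port):
    """Attempt a blocking TCP connect; return (outcome, seconds elapsed)."""
    start = time.monotonic()
    try:
        # No timeout is passed, so only the OS TCP timeout bounds this call.
        with socket.create_connection((host, port)):
            outcome = "connected"
    except ConnectionRefusedError:
        # The machine is up but nothing listens on the port:
        # the OS rejects the attempt right away.
        outcome = "refused"
    except OSError:
        # Any other failure, for example an unreachable host,
        # which is reported only after the OS gives up.
        outcome = "failed"
    return outcome, time.monotonic() - start
```

Calling time_connect("web1", 9080) against a stopped clone and then against a powered-off node should show the gap between the two elapsed times that this technote describes.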

The defaults for TCP Timeouts are:
  • AIX: 75 seconds
  • Solaris: 180 seconds
  • NT: 9 Seconds
These timeout values can be configured; however, take great care when changing them. Altering operating system parameters can affect other processes and may have unintended consequences.

If running WAS 4.0.2 or higher:
To overcome this problem, WebSphere Application Server V4.0.2 introduces an option in the plug-in configuration file that allows you to bypass the operating system timeout.
You can add an attribute named ConnectTimeout to the Server element, which makes the plug-in use a non-blocking connect. Set this attribute to an integer value greater than 0 to specify how many seconds the plug-in should wait for a response when attempting to connect to a server. For example, a setting of 10 means the plug-in waits up to 10 seconds before timing out.
The lowest value you can set ConnectTimeout to is 1 second. To determine what setting to use, consider how fast your network and servers are. Run some tests to measure network speed, and take peak network traffic and peak server usage into account. If the server cannot respond before the ConnectTimeout expires, the plug-in marks it as down. Because this setting is made on the Server tag, you can set it for each individual clone. For instance, suppose you have a system with 4 clones, 2 of which are on a remote node. The remote node is on another subnet, and network traffic sometimes takes longer to reach it. You might want to set up your server group like Example 5-17.

Example 5-17 ConnectTimeout Server attribute in plugin-cfg.xml
<ServerGroup Name="WebSvrGrp" RetryInterval="200">
<Server CloneID="tafcaflg" Name="WebClone0">
<Transport Hostname="web1" Port="9080" Protocol="http"/>
</Server>
<Server CloneID="tafcaflg" Name="WebClone1" ConnectTimeout="20">
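
The effect of a per-server connect timeout on failover can be sketched as follows. This is an illustration in Python under stated assumptions, not the plug-in's implementation: the function first_reachable and the tuple layout (name, host, port, timeout in seconds) are hypothetical, and a timeout of None stands for a Server element without ConnectTimeout, where only the operating system timeout applies.

```python
import socket


def first_reachable(servers):
    """Return the name of the first server that accepts a TCP connection.

    servers is a list of (name, host, port, connect_timeout) tuples.
    connect_timeout is in seconds; None means "wait for the OS timeout".
    """
    for name, host, port, connect_timeout in servers:
        try:
            # A numeric timeout bounds the wait for this server only.
            with socket.create_connection((host, port), timeout=connect_timeout):
                return name
        except OSError:
            # Refusals and timeouts are both OSError subclasses:
            # treat the server as down and try the next one.
            continue
    return None
```

With entries such as ("WebClone0", "web1", 9080, None) followed by ("WebClone1", host, port, 20), where the second host is a placeholder, the loop waits at most 20 seconds on WebClone1 before giving up, regardless of the operating system default.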


TCP Timeouts on AIX:
On AIX, there's a parameter named tcp_keepinit, set via the AIX ‘no’ command (the command name is short for "network options"). The default is 75 seconds; the units are half-seconds, so the default setting of tcp_keepinit is actually 150:

To view:
    # /usr/sbin/no -o tcp_keepinit
The output should be something like:
    tcp_keepinit = 150

To set (for example, to 100 half-seconds, which is 50 seconds):
    # /usr/sbin/no -o tcp_keepinit=100

TCP Timeouts on Solaris (2.6-2.8):
On Solaris, there's a parameter named tcp_ip_abort_cinterval, set via the Solaris ndd command. The default is 3 minutes; the units are milliseconds, so the default setting of tcp_ip_abort_cinterval is actually 180000:

To view:
    # ndd /dev/tcp tcp_ip_abort_cinterval
The output should be something like:
    180000

To set:
    # ndd -set /dev/tcp tcp_ip_abort_cinterval 60000
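
Because AIX counts in half-seconds and Solaris in milliseconds, it is easy to set the wrong value. The following Python sketch converts a desired timeout in seconds into the matching command line. The helper names are hypothetical, and the AIX command assumes the `no -o tunable=value` form for setting a network option.

```python
def aix_keepinit_command(seconds):
    """Build the AIX command for a connect timeout given in seconds."""
    half_seconds = round(seconds * 2)  # tcp_keepinit is in half-seconds
    return f"/usr/sbin/no -o tcp_keepinit={half_seconds}"


def solaris_abort_cinterval_command(seconds):
    """Build the Solaris command for a connect timeout given in seconds."""
    millis = round(seconds * 1000)  # tcp_ip_abort_cinterval is in milliseconds
    return f"ndd -set /dev/tcp tcp_ip_abort_cinterval {millis}"


if __name__ == "__main__":
    # Print the commands for a 60-second timeout on each platform.
    print(aix_keepinit_command(60))
    print(solaris_abort_cinterval_command(60))
```

The helpers only build the command strings; they do not run anything, so the result can be reviewed before it is applied as root on the target system.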

TCP Timeouts on NT/2000:
Service Pack 5 adds a new registry entry, InitialRtt, which allows the initial retransmission (timeout) interval to be changed from its default of 3000 milliseconds. The valid range is 0 - 65535 milliseconds. Set it as follows:
  1. Using regedit, move to: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters
  2. From the Edit menu select New - DWORD value
  3. Enter a name of InitialRtt and press Enter
  4. Double-click the new value and set it to the number of milliseconds for the timeout, e.g. 5000 for 5 seconds (the default is 3 seconds). Click OK
  5. Close the registry editor
  6. Restart the machine for the change to take effect
    • For instance, the default value is 3,000 milliseconds, or 3 seconds. By default, a connection request is retried 2 times, so the total timeout is (3+3+3) seconds, or 9 seconds.
    • If this registry value is set to 6,000 (6 seconds), the total timeout will be (6+6+6) seconds, or 18 seconds. During this time, an application can appear to stop responding (hang).
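
The arithmetic in the two examples above can be written as a small Python function. This is a sketch that follows this technote's calculation, which assumes the same interval is used for the initial attempt and for each retry. The function name and parameter names are hypothetical.

```python
def total_connect_timeout_ms(initial_rtt_ms=3000, retries=2):
    """Total connect timeout as calculated in this technote.

    Assumes one initial attempt plus `retries` retries, each waiting
    initial_rtt_ms milliseconds.
    """
    if not 0 <= initial_rtt_ms <= 65535:
        raise ValueError("InitialRtt must be between 0 and 65535 milliseconds")
    return initial_rtt_ms * (1 + retries)


if __name__ == "__main__":
    print(total_connect_timeout_ms())      # default InitialRtt of 3000
    print(total_connect_timeout_ms(6000))  # InitialRtt raised to 6000
```

The range check mirrors the 0 - 65535 limit stated for InitialRtt, so an out-of-range value is caught before it reaches the registry.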

More Documentation:
IBM publishes its performance recommendations in the WebSphere Application Server 4.0 Advanced Edition InfoCenter: http://www.ibm.com/software/webservers/appserv/doc/v40/ae/infocenter/index.html
See section 9, Tuning, for further recommendations on tasks such as using ndd on Solaris.

WebSphere 4.0 Scalability
http://www.redbooks.ibm.com/redpieces/pdfs/sg246192.pdf
See Section 5.7.3 for more information on the ConnectTimeout parameter.

    Document Information

    Product categories: Software, Application Servers, Distributed Application & Web Servers, WebSphere Application Server, Workload Management (WLM)
    Operating system(s): Multi-Platform
    Software version: 3.5, 4.0, 5.0, 5.1, 6.0
    Software edition: Edition Independent
    Reference #: 1052862
    IBM Group: Software Group
    Modified date: 2003-04-23