TCP Timeouts:
One reason test results vary in failover situations is operating system
dependencies, specifically TCP timeout values. When the plug-in sends a
request to a clone that is stopped but whose machine is still available,
the plug-in immediately forwards the request to the next available clone.
When a clone is on a machine that has been removed from the network
(network cable unplugged, powered off), the plug-in still tries to route
the request to it with a connect() function call. Because the machine is
not available, the plug-in cannot determine the clone's status until the
operating system TCP timeout expires. Only then does the plug-in forward
the request to another available clone.
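The failover behavior described above can be sketched in a few lines of
Python. This is only an illustration of the general technique, not the
plug-in's actual code; the function name first_available and the clone list
are hypothetical. With timeout=None the connect() call blocks until the
operating system gives up, which is the delay discussed in this section.

```python
import socket


def first_available(clones, timeout=None):
    """Try each (host, port) in order; return (socket, (host, port)).

    Returns (None, None) if no clone accepts a connection.
    timeout=None leaves the wait to the operating system TCP timeout.
    """
    for host, port in clones:
        try:
            # A stopped clone on a reachable machine is refused at once.
            # A machine that is off the network blocks here until the
            # timeout (ours, or the operating system's) expires.
            conn = socket.create_connection((host, port), timeout=timeout)
            return conn, (host, port)
        except OSError:
            # Refused, timed out, or unreachable: move to the next clone.
            continue
    return None, None


# Hypothetical usage (host names are placeholders):
# conn, clone = first_available([("web1", 9080), ("web2", 9080)])
```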
The defaults for TCP Timeouts are:
- AIX: 75 seconds
- Solaris: 180 seconds
- NT: 9 seconds
These timeout values can be configured, but take great care when changing
them. Altering operating system parameters can affect other processes and
may have unintended consequences.
If running WAS 4.0.2 or higher:
To overcome this problem, in WebSphere Application Server V4.0.2 there is
a new option within the plug-in configuration file that allows you to
bypass the operating system timeout.
It is now possible to add an attribute to the Server element called
ConnectTimeout which makes the plug-in use a non-blocking connect. Set
this attribute to an integer value greater than 0 to determine how long
the plug-in should wait for a response when attempting to connect to a
server. A setting of 10 means the plug-in waits up to 10 seconds before
timing out.
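As an illustration of what a non-blocking connect with a timeout looks
like, here is a minimal Python sketch. It is not the plug-in's
implementation; the function name connect_with_timeout is hypothetical, and
the sketch assumes a POSIX system (on Windows, a failed non-blocking
connect is reported differently by select).

```python
import errno
import select
import socket


def connect_with_timeout(host, port, connect_timeout):
    """Return a connected socket, or None on failure or timeout.

    connect_timeout is in seconds, like the ConnectTimeout attribute.
    Note: host name resolution itself is still blocking in this sketch.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    rc = sock.connect_ex((host, port))
    if rc not in (0, errno.EINPROGRESS, errno.EWOULDBLOCK):
        sock.close()          # failed immediately, e.g. refused
        return None
    # The socket becomes writable once the connect attempt has finished.
    _, writable, _ = select.select([], [sock], [], connect_timeout)
    if not writable:
        sock.close()          # no answer within connect_timeout
        return None
    # Writable does not imply success; check the pending socket error.
    if sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR) != 0:
        sock.close()
        return None
    sock.setblocking(True)
    return sock
```

The caller decides how long to wait, so the operating system TCP timeout
no longer determines how quickly an unreachable server is detected.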
The lowest value you can set ConnectTimeout to is 1 second. To determine
what setting to use, consider how fast your network and servers are. Test
your network speed, and take into account peak network traffic and peak
server usage. If the server cannot respond before the ConnectTimeout
expires, the plug-in marks it as down. Because this setting is made on the
Server tag, you can set it for each individual clone. For instance,
suppose you have a system with 4 clones, 2 of which are on a remote node.
The remote node is on another subnet, and network traffic sometimes takes
longer to reach it. You might want to set up your server group as in
Example 5-17.
Example 5-17 ConnectTimeout Server attribute in plugin-cfg.xml
<ServerGroup Name="WebSvrGrp" RetryInterval="200">
  <Server CloneID="tafcaflg" Name="WebClone0">
    <Transport Hostname="web1" Port="9080" Protocol="http"/>
  </Server>
  <Server CloneID="tafcaflg" Name="WebClone1" ConnectTimeout="20">
TCP Timeouts on AIX:
On AIX, there is a parameter named tcp_keepinit, set via the AIX no
command (the name is short for "network options"). The default is 75
seconds; the units are half-seconds, so the default setting of
tcp_keepinit is actually 150.
To view:
# /usr/sbin/no -o tcp_keepinit
The output should be something like:
tcp_keepinit = 150
To set:
# /usr/sbin/no -o tcp_keepinit=100
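Because tcp_keepinit is expressed in half-seconds, it is easy to set twice
or half the timeout you intended. The small Python sketch below converts
in both directions; the function names are hypothetical helpers, and the
input format is assumed to match the sample output shown above.

```python
def keepinit_seconds(no_output):
    """Convert a line such as 'tcp_keepinit = 150' to seconds."""
    name, _, value = no_output.partition("=")
    if name.strip() != "tcp_keepinit":
        raise ValueError("unexpected tunable: %r" % name.strip())
    return int(value) / 2          # units are half-seconds


def seconds_to_keepinit(seconds):
    """Convert a timeout in seconds to a tcp_keepinit value."""
    return int(round(seconds * 2))
```

For example, a tcp_keepinit value of 100 corresponds to a 50 second
timeout, not 100 seconds.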
TCP Timeouts on Solaris (2.6-2.8):
On Solaris, there's a parameter named tcp_ip_abort_cinterval, set via the
Solaris ndd command. The default is 3 minutes; the units are milliseconds,
so the default setting of tcp_ip_abort_cinterval is actually 180000:
To view:
# ndd /dev/tcp tcp_ip_abort_cinterval
The output should be something like:
180000
To set:
# ndd -set /dev/tcp tcp_ip_abort_cinterval 60000
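Since the Solaris parameter is in milliseconds, a helper that builds the
ndd arguments from a value in seconds can prevent unit mistakes. The
Python sketch below only constructs the argument list used in the command
above; it does not run it, and the function name is hypothetical.

```python
def ndd_set_command(seconds):
    """Build the ndd argument list for a connect timeout in seconds."""
    if seconds <= 0:
        raise ValueError("timeout must be positive")
    milliseconds = int(round(seconds * 1000))   # ndd expects milliseconds
    return ["ndd", "-set", "/dev/tcp",
            "tcp_ip_abort_cinterval", str(milliseconds)]
```

A value of 60 seconds yields the 60000 shown in the command above.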
TCP Timeouts on NT/2000:
Service Pack 5 adds a new registry entry, InitialRtt, which allows the
retransmission (or timeout) time to be changed from its default of 3000
milliseconds. The range is 0 - 65535 milliseconds, and the value can be
set as follows:
- Using regedit, move to:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters
- From the Edit menu select New - DWORD value
- Enter a name of InitialRtt and press Enter
- Double click the new value and set to the number of milliseconds for
the timeout, e.g. 5000 for 5 seconds (the old default was 3 seconds).
Click OK
- Close the registry editor
- Restart the machine for the change to take effect
For instance, the default value is 3,000, or 3 seconds. By default, a
connection request is retried 2 times, so the total timeout is (3+3+3)
seconds, or 9 seconds. If this registry value is set to 6,000 (6 seconds),
the total timeout will be (6+6+6) seconds, or 18 seconds. During this
time, an application can appear to stop responding (hang).
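The arithmetic above can be written as a short Python sketch. It follows
the simple model used in this section, where the initial attempt and each
retry wait the same InitialRtt interval; it is not a measurement of the
TCP stack, and the function name is hypothetical.

```python
def total_connect_timeout_ms(initial_rtt_ms, retries=2):
    """Total wait, in milliseconds, under the model described above."""
    if not 0 <= initial_rtt_ms <= 65535:
        raise ValueError("InitialRtt must be between 0 and 65535")
    # One initial attempt plus `retries` further attempts.
    return initial_rtt_ms * (retries + 1)
```

With this model, 3000 gives 9000 milliseconds (9 seconds) and 6000 gives
18000 milliseconds (18 seconds), matching the figures above.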
More Documentation:
WebSphere InfoCenters: http://www-306.ibm.com/software/webservers/appserv/was/library/
IBM WebSphere Application Server Network Deployment V5.0 - Tuning
Plug-in Workload Management Failover