Problem(Abstract)
After setting up the HTTP plug-in for load balancing in a clustered IBM® WebSphere® environment, the HTTP plug-in does not fail over in a timely manner, or at all, when a cluster member becomes unavailable.
Cause
In most cases, this behavior results from a misunderstanding of how HTTP plug-in failover works, or from an improper configuration. The type of Web server being used (multi-threaded versus single-threaded) can also affect this behavior.
Resolving the problem
The following document is designed to help you understand how HTTP plug-in failover works, and provides tuning parameters and suggestions to maximize the ability of the HTTP plug-in to fail over effectively and in a timely manner.
Note: The following information is written specifically for IBM HTTP Server; however, it is generally applicable to other Web servers that currently support the HTTP plug-in (for example: IIS, SunOne, Domino®, and so on).
Failover - Background
In clustered IBM WebSphere Application Server environments, the HTTP plug-in can fail over client requests when it is no longer able to send requests to a particular cluster member. By default, the HTTP plug-in marks a cluster member down and fails requests over to another cluster member that is still able to receive connections under the following conditions:
- The HTTP plug-in is unable to establish a connection to a cluster member's Application Server transport.
- The HTTP plug-in detects a newly connected socket that was prematurely closed by a cluster member during an active read or write.
Several configurable settings in the plugin-cfg.xml file can be tuned to affect how quickly the HTTP plug-in marks a cluster member down and fails over to another cluster member.
- ConnectTimeout
The ConnectTimeout attribute of a Server element enables the HTTP plug-in to perform non-blocking connections to a backend cluster member. Non-blocking connections are beneficial when the HTTP plug-in is unable to contact the destination to determine whether the port is available or unavailable for a particular cluster member. For example:

<Server CloneID="10k66djk2" ConnectTimeout="10" ExtendedHandshake="false" LoadBalanceWeight="1000" MaxConnections="0" Name="Server1_WebSphere_Appserver" WaitForContinue="false">
   <Transport Hostname="server1.domain.com" Port="9091" Protocol="http"/>
</Server>
If no ConnectTimeout value is specified, the HTTP plug-in performs a blocking connect, in which the HTTP plug-in waits until an operating system TCP timeout occurs (as long as 2 minutes, depending on the platform) before it can mark the cluster member unavailable. A value of 0 also causes the HTTP plug-in to perform a blocking connect. A value greater than 0 specifies the number of seconds the HTTP plug-in waits for a successful connection. If a connection is not established within that interval, the HTTP plug-in marks the cluster member unavailable and fails over to one of the other cluster members defined in the cluster.
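For example, with the ConnectTimeout="10" setting shown above, the HTTP plug-in waits at most 10 seconds for the connection to the transport on port 9091 to be established; if it is not, the cluster member is marked unavailable and the request is routed to another available cluster member.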
Caution: In an environment with a heavy workload or a slow network connection, setting this value too low could cause the HTTP plug-in to falsely mark a cluster member down. Therefore, use caution when choosing a value for ConnectTimeout.
- ServerIOTimeout
The ServerIOTimeout attribute of a Server element enables the HTTP plug-in to set a timeout value, in seconds, for sending requests to and reading responses from a cluster member. If a value is not set for the ServerIOTimeout attribute, the HTTP plug-in, by default, uses blocked I/O to write requests to and read responses from the cluster member until the TCP connection times out. For example, if you specify:

<Server CloneID="10k66djk2" ServerIOTimeout="120" ConnectTimeout="10" ExtendedHandshake="false" LoadBalanceWeight="1000" MaxConnections="0" Name="Server1_WebSphere_Appserver" WaitForContinue="false">
   <Transport Hostname="server1.domain.com" Port="9091" Protocol="http"/>
</Server>
In this case, if a cluster member stops responding to requests, the HTTP plug-in waits 120 seconds (2 minutes) before timing out the TCP connection. Setting the ServerIOTimeout attribute to a reasonable value enables the HTTP plug-in to time out the connection sooner and transfer requests to another cluster member when possible.
When selecting a value for this attribute, remember that it can sometimes take a couple of minutes for a cluster member to process a request. Setting the value of the ServerIOTimeout attribute too low could cause the HTTP plug-in to send a false server error response to the client.
The ServerIOTimeout attribute is ideal for situations where Keep-Alive connections exist between the WebSphere Application Server and the HTTP plug-in, and the Application Server machine is abruptly disconnected from the network. Without ServerIOTimeout, the HTTP plug-in can take a long time to detect that the connection was closed abruptly on the WebSphere Application Server machine, as the following scenario illustrates:
When an application host machine is shut down abruptly, the Keep-Alive connections between the HTTP plug-in and the Application Server might not be closed completely. As a result, when the HTTP plug-in needs to route a request to that host machine, it reuses an existing Keep-Alive connection if one is available in the pool. When the HTTP plug-in sends the request over such a connection, the HTTP plug-in machine does not receive any TCP packets to close the connection, because the host machine was taken down abruptly. The request write does not return a failure until the connection times out at the TCP level. The HTTP plug-in then tries to contact the same application server by establishing a new connection, and the connect() call fails only after the TCP timeout. As a result, depending on the operating system TCP timeout setting, it can take a considerable amount of time for the HTTP plug-in to detect the application server status and mark it down before failing over to another application server. If many requests are sent to the server during this time, this delay applies to every request.
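With the ServerIOTimeout="120" setting shown earlier, for example, the write or read on such a stale Keep-Alive connection fails after at most 120 seconds instead of waiting for the operating system TCP timeout, so the HTTP plug-in can mark the member down and retry the request on another cluster member much sooner.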
Note: To avoid the preceding behavior, the ServerIOTimeout attribute was introduced with APAR PQ96015 and included in WebSphere Application Server V5.0.2.10 and V5.1.1.4.
Caution: When both ConnectTimeout and ServerIOTimeout are specified, it could take as long as (ConnectTimeout + ServerIOTimeout) for the HTTP plug-in to detect and mark a server down.
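For example, with ConnectTimeout="10" and ServerIOTimeout="120" as in the earlier example, detection could take up to roughly 10 + 120 = 130 seconds in the worst case before the HTTP plug-in marks the cluster member down and fails over.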
- RetryInterval
The RetryInterval attribute is an integer specifying the length of time, in seconds, that elapses from the time a server is marked down to the time the HTTP plug-in retries a connection to it. The default is 60 seconds.
This setting is specified in the ServerCluster element. An example of this in the plugin-cfg.xml file is as follows:

<ServerCluster CloneSeparatorChange="false" LoadBalance="Round Robin" Name="Server_WebSphere_Cluster" PostSizeLimit="10000000" RemoveSpecialHeaders="true" RetryInterval="120">
This means that if a cluster member is marked down, the HTTP plug-in does not retry it for 120 seconds.
There is no single recommended value; the value you choose depends on your environment. For example, if you have numerous cluster members, and one cluster member being unavailable does not affect the performance of your application, then you can safely set the value to a very high number.
Alternatively, if your optimum load has been calculated assuming that all cluster members are available, or if you do not have many cluster members, then you will want your cluster members to be retried more often to maintain the load.
Also, take into consideration the time it takes to restart your server. If a server takes a long time to boot up and load applications, then you will need a longer retry interval.
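As an illustration only (the 300-second value below is hypothetical, not a recommendation): if a cluster member typically needs about five minutes to restart and load its applications, a RetryInterval of at least 300 seconds keeps the HTTP plug-in from retrying the member before it is ready to accept requests:

<ServerCluster CloneSeparatorChange="false" LoadBalance="Round Robin" Name="Server_WebSphere_Cluster" PostSizeLimit="10000000" RemoveSpecialHeaders="true" RetryInterval="300">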
- PrimaryServers versus BackupServers
The HTTP plug-in can be configured for true failover by using the PrimaryServers and BackupServers elements in the plugin-cfg.xml configuration file.
In the following example, the plug-in load balances only between the two servers defined in the PrimaryServers element, Server1_WebSphere_Appserver and Server2_WebSphere_Appserver. However, in the event that both Server1_WebSphere_Appserver and Server2_WebSphere_Appserver become unavailable and are marked down, the HTTP plug-in fails over and starts sending requests to Server3_WebSphere_Appserver, which is defined in the BackupServers element.
<ServerCluster CloneSeparatorChange="false" LoadBalance="Round Robin" Name="Server_WebSphere_Cluster" PostSizeLimit="10000000" RemoveSpecialHeaders="true" RetryInterval="120">
   <Server CloneID="10k66djk2" ServerIOTimeout="120" ConnectTimeout="10" ExtendedHandshake="false" LoadBalanceWeight="1000" MaxConnections="0" Name="Server1_WebSphere_Appserver" WaitForContinue="false">
      <Transport Hostname="server1.domain.com" Port="9091" Protocol="http"/>
   </Server>
   <Server CloneID="10k67eta9" ServerIOTimeout="120" ConnectTimeout="10" ExtendedHandshake="false" LoadBalanceWeight="999" MaxConnections="0" Name="Server2_WebSphere_Appserver" WaitForContinue="false">
      <Transport Hostname="server2.domain.com" Port="9091" Protocol="http"/>
   </Server>
   <Server CloneID="10k68xtw10" ServerIOTimeout="120" ConnectTimeout="10" ExtendedHandshake="false" LoadBalanceWeight="998" MaxConnections="0" Name="Server3_WebSphere_Appserver" WaitForContinue="false">
      <Transport Hostname="server3.domain.com" Port="9091" Protocol="http"/>
   </Server>
   <PrimaryServers>
      <Server Name="Server1_WebSphere_Appserver"/>
      <Server Name="Server2_WebSphere_Appserver"/>
   </PrimaryServers>
   <BackupServers>
      <Server Name="Server3_WebSphere_Appserver"/>
   </BackupServers>
</ServerCluster>
- Additional Failover Reference
Tuning Plug-in Workload Management Failover