OSE WLM Failover in the Plugin

Technote (FAQ)
Problem
In a multinode environment where WebSphere is cloned horizontally across nodes, pulling the network cable from one of the nodes induces failover. The time taken for this failover is critical for many customers. The explanation below describes how the plugin handles such a situation and how the failover can be optimized.
Solution
OSE WLM failover in the plugin depends on each concurrent single-threaded IHS httpd process individually detecting the failure and marking the clone as down. Furthermore, existing connections must not only detect the failure of a clone, but must also go through the TCP/IP timeout routine. Without PQ60039 and an explicit ose.connect.timeout setting in the bootstrap.properties file, each request routed to the failed machine waits for the AIX system TCP/IP timeout interval (75 seconds by default) before the connection times out, which delays failover. This initial wait cannot be avoided, but it occurs only once for each single-threaded IHS httpd process.
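
For illustration only, the following is a minimal sketch in C (not the actual plugin source) of how a configurable connect timeout such as ose.connect.timeout can be enforced: the socket is placed in non-blocking mode, connect() is started, and select() waits at most the configured number of seconds instead of the operating system's TCP/IP timeout. The address, function name, and error handling are hypothetical.

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/select.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Attempt a TCP connect to ip:port, giving up after timeout_secs
 * rather than waiting out the system TCP/IP timeout (75 seconds by
 * default on AIX). Returns the connected socket descriptor, or -1
 * on failure or timeout. */
static int connect_with_timeout(const char *ip, int port, int timeout_secs)
{
    int sd = socket(AF_INET, SOCK_STREAM, 0);
    if (sd < 0)
        return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    inet_pton(AF_INET, ip, &addr.sin_addr);

    /* Non-blocking mode: connect() returns at once with EINPROGRESS. */
    fcntl(sd, F_SETFL, fcntl(sd, F_GETFL, 0) | O_NONBLOCK);

    if (connect(sd, (struct sockaddr *)&addr, sizeof(addr)) < 0
            && errno != EINPROGRESS) {
        close(sd);
        return -1;
    }

    /* Wait for the socket to become writable for at most timeout_secs;
     * on timeout the caller can mark the clone down immediately. */
    fd_set wfds;
    FD_ZERO(&wfds);
    FD_SET(sd, &wfds);
    struct timeval tv = { timeout_secs, 0 };
    if (select(sd + 1, NULL, &wfds, NULL, &tv) <= 0) {
        close(sd);
        return -1;
    }

    /* The connect attempt finished; check whether it succeeded. */
    int err = 0;
    socklen_t len = sizeof(err);
    getsockopt(sd, SOL_SOCKET, SO_ERROR, &err, &len);
    if (err != 0) {
        close(sd);
        return -1;
    }
    return sd;
}

int main(void)
{
    /* Example: try the OSE transport port seen in the logs below. */
    int sd = connect_with_timeout("192.0.2.1", 8998, 45);
    printf(sd < 0 ? "clone marked down\n" : "connected\n");
    if (sd >= 0)
        close(sd);
    return 0;
}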

As such, we saw the following in the plugin logs on Clone 0 when Clone 1 was unplugged from the network. The identifier in each entry (00004fd4, 0000409a, 0000587e) distinguishes the different threads / requests.

Sat Jun 29 17:21:49 2002 - 00004fd4 00000000 - Warning - open_inet_client_socket
port 8998 sd 6 failed after 1 attempts, error 78
Sat Jun 29 17:21:49 2002 - 00004fd4 00000000 - Error - mediate_service_to_clone:
Error 8 from clone 1, uri /wps/portal/
Sat Jun 29 17:21:56 2002 - 0000409a 00000000 - Warning - open_inet_client_socket
port 8998 sd 6 failed after 1 attempts, error 78
Sat Jun 29 17:21:56 2002 - 0000409a 00000000 - Error - mediate_service_to_clone:
Error 8 from clone 1, uri /wps/portal/
Sat Jun 29 17:22:10 2002 - 0000587e 00000000 - Warning - open_inet_client_socket
port 8998 sd 6 failed after 1 attempts, error 78
Sat Jun 29 17:22:10 2002 - 0000587e 00000000 - Error - mediate_service_to_clone:
Error 8 from clone 1, uri /wps/portal/

1. The steps needed to evaluate and optimize OSE WLM failover in the plugin are as follows:
Install PQ60039 and PQ61926, with a suggested starting ose.connect.timeout value of 45 seconds in each plugin's bootstrap.properties file. Reducing the value further may be achievable, but findings suggest that reducing it too far can leave sockets stuck in the FIN_WAIT_1 state.

Also of concern is the ose.failure.retry_timer parameter in the bootstrap.properties file. This value dictates how long each single-threaded process holds off before retrying the clone to see whether it is back up. This value should potentially be increased from 160 seconds to, say, 600 seconds. Otherwise, we again find ourselves cycling through the routine of marking the clone as down (if it is still down) and waiting on the TCP/IP timeout sooner than desired.

As such, include the following in each bootstrap.properties file.

ose.connection.domain.retries=1
ose.connection.inet.retries=1
ose.failure.retry_timer=600
ose.connect.timeout=45
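
As a rough worst-case illustration under these suggested settings: when a clone first fails, each single-threaded httpd process pays at most one 45-second connect timeout before marking the clone down, and then at most one further 45-second timeout per 600-second retry interval for as long as the clone stays down. With the unmodified values, the same process would wait up to the 75-second system TCP/IP timeout on each attempt and retry after only 160 seconds.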

2. Turning session affinity off also seems to result in better failover behavior.
This too is expected. If multiple requests go to one machine (Bianca, in this example) and establish affinity with it, then once that machine is down those requests are still directed to the down server first and must go through the process discussed above. However, if session affinity is turned off, a round-robin workload management routine is used, so only every other request goes to the down clone and the others do not experience the delay.
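
To make the interaction concrete, here is a simplified sketch in C of round-robin clone selection with a per-process down flag and a retry timer in the spirit of ose.failure.retry_timer. This is not the plugin's actual selection logic; the structure and names are hypothetical.

#include <stdio.h>
#include <time.h>

#define NUM_CLONES  2
#define RETRY_TIMER 600          /* seconds, per the suggested setting above */

struct clone {
    int    down;                 /* 1 if this process has marked it down */
    time_t marked_at;            /* when it was marked down */
};

static struct clone clones[NUM_CLONES];
static int next_clone = 0;

/* Mark a clone down for this process only; every other single-threaded
 * httpd process must detect the failure for itself. */
static void mark_down(int c)
{
    clones[c].down = 1;
    clones[c].marked_at = time(NULL);
}

/* Pick the next clone round-robin, skipping clones still marked down.
 * A down clone becomes eligible again once RETRY_TIMER has elapsed. */
static int select_clone(void)
{
    time_t now = time(NULL);
    for (int tries = 0; tries < NUM_CLONES; tries++) {
        int c = next_clone;
        next_clone = (next_clone + 1) % NUM_CLONES;
        if (!clones[c].down || now - clones[c].marked_at >= RETRY_TIMER)
            return c;            /* healthy, or time to retry it */
    }
    return -1;                   /* every clone is marked down */
}

int main(void)
{
    mark_down(1);                /* simulate Clone 1 being unplugged */
    for (int i = 0; i < 4; i++)
        printf("request %d -> clone %d\n", i, select_clone());
    return 0;
}

Because each single-threaded httpd process keeps its own copy of this state, a failed clone must be detected (and the connect timeout paid) once per process, which is why the first wave of requests after a failure still sees the delay.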

    Document Information

    Product categories: Software, Application Servers, Distributed Application & Web Servers, WebSphere Application Server, Workload Management (WLM)
    Operating system(s): AIX, HPUX, Linux, Solaris
    Software version: 3.5
    Software edition: Advanced, Enterprise, Standard
    Reference #: 1175275
    IBM Group: Software Group
    Modified date: 2004-07-26