Fix (APAR): PK20304
Status: Fix
Release: 6.0.2.9, 6.0.2.8, 6.0.2.7, 6.0.2.6, 6.0.2.5, 6.0.2.4, 6.0.2.3, 6.0.2.2, 6.0.2.1, 6.0.2
Operating System: AIX, HP-UX, i5/OS, Linux, Linux pSeries, Linux Red Hat - pSeries, Linux zSeries, OS/390, OS/400, Solaris, Windows, z/OS
Supersedes Fixes:
CMVC Defect: 363613
Byte size of APAR: 44273
Date: 2006-07-25

Abstract: In order to make the version 6.0.2 code function more like version 5, with quick timeouts for clients when all the EJB servers are down, two custom properties were created: one for the callback timeout and one to enable the preload function.

Description/symptom of problem:

PK20304 resolves the following problem:

ERROR DESCRIPTION:
WLM cluster data would not be gathered for a particular cluster until the moment the first request for that cluster came in. This caused a NO_IMPLEMENT to be seen on the client once (sometimes 2-3 times) until the cluster data was created and propagated. The NO_IMPLEMENT would not be seen after the data was propagated.

LOCAL FIX:
None

PROBLEM SUMMARY
USERS AFFECTED: WebSphere Application Server version 6.0.2 users of Workload Management (WLM) who are concerned about client timeout lengths or about hitting an infinite loop in WLM selection.

PROBLEM DESCRIPTION: In order to make the version 6.0.2 code function more like version 5, with quick timeouts for clients when all the EJB servers are down, two custom properties had to be created: one for the callback timeout and one to enable the preload function.

RECOMMENDATION: None

The callback timeout was hardcoded to 3 minutes, which could cause clients to time out waiting on requests when all servers are down. This is a regression of behavior from version 5, in which clients would get an immediate response that there were no members available. The reason is that the version 6 code, which allows asynchronous updates and selection, gave the WLM code no way to differentiate between a "no cluster data available" exception raised because the process is just starting up and has not yet initialized the data for that cluster, and one raised because all the servers are down. To handle the startup scenario, the callback timeout was added so that a request could wait for the cluster data to be populated and then be sent through. However, this left the problem described above: the wait would also occur when all the servers are down. There was an additional possibility of an infinite loop in selection if the retry limit was reached.

PROBLEM CONCLUSION:
A custom property was created to let the customer set the length of the callback timeout (including the ability to skip it altogether) in order to get around the NO_IMPLEMENT on first touch of a cluster. In addition, the pre-fetch logic from the next release was backported (it is called PreLoad in 6.0.x to indicate that the code is not exactly the same); when enabled, it causes a node agent to preload all of the cluster data without waiting for the first request to come in. The combination of the custom property that enables the PreLoad logic and the custom property that sets the callback timeout fixes the issue of clients timing out on requests that would not have timed out in version 5, while still allowing the first request to a cluster to succeed. An additional fix was added to solve a possible infinite loop in the selection logic when the retry limit is reached: a minor code of 40 instead of 42 is now thrown if a scenario is ever reached in which the retry attempts run out. The WLM code will retry at a high level if the minor code on a NO_IMPLEMENT is 42, but will not retry with a minor code of 40.
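For illustration only, the following simplified Python sketch models the retry decision described above. It is not WebSphere source code; the names are invented, and the bare values 40 and 42 stand in for the real vendor-prefixed CORBA minor codes. It shows why surfacing a non-retryable minor code once the retry limit is exhausted ends the loop.

    RETRYABLE = 42       # minor code the high-level WLM code retries on
    NOT_RETRYABLE = 40   # minor code surfaced once retry attempts run out

    class NoImplement(Exception):
        """Stand-in for a CORBA NO_IMPLEMENT carrying a minor code."""
        def __init__(self, minor):
            super().__init__("NO_IMPLEMENT minor=%d" % minor)
            self.minor = minor

    def select_member(attempts_left):
        # Stand-in for WLM member selection; in this sketch it always fails.
        if attempts_left > 0:
            raise NoImplement(RETRYABLE)
        # Before the fix this path also used 42, so the caller retried forever;
        # with the fix it uses 40 and the loop below terminates.
        raise NoImplement(NOT_RETRYABLE)

    def invoke_with_retry(retry_limit=3):
        attempts_left = retry_limit
        while True:
            try:
                return select_member(attempts_left)
            except NoImplement as e:
                if e.minor != RETRYABLE:
                    raise            # minor code 40: give up instead of looping
                attempts_left -= 1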
In order to enable and use either of the custom properties, take the following steps. In the administrative console, click "System Administration" on the left side, click "Cell" underneath that, then click "Custom Properties" in the middle frame. Create two properties:

IBM_CLUSTER_CALLBACK_TIMEOUT, with a value for the timeout in milliseconds (10000 is 10 seconds, for example). Click Apply, then back out to the custom properties screen and go in again to create the second property as a new entry (otherwise you will overwrite the first one).

IBM_CLUSTER_ENABLE_PRELOAD, with a value of true to enable the preload function. This value must be true if the callback timeout is set to zero. Setting it to true is not recommended for customers with extremely large topologies, as the node agent can take much longer to start up. If you have it set to true and see long node agent start times, set it to false to determine whether the issue is in the preloading of the cluster data.

Save those changes and synchronize the configuration with the nodes, then shut everything down (deployment manager, node agents and servers) and start it up again. If you have WLM trace enabled, you should see this in the trace logs:

[3/23/06 15:17:59:828 CST] 0000000a ProcessRuntim > loadCustomProperties Entry
[3/23/06 15:17:59:828 CST] 0000000a ProcessRuntim 1 Loaded custom property IBM_CLUSTER_ENABLE_PRELOAD false/true - the value set in the console
[3/23/06 15:17:59:828 CST] 0000000a ProcessRuntim 1 Loaded custom property IBM_CLUSTER_CALLBACK_TIMEOUT ##### - whatever value was set in the console
[3/23/06 15:17:59:828 CST] 0000000a ProcessRuntim > loadCustomProperties Exit

If this is seen, then the cell has loaded the custom properties correctly and they will be used at runtime where applicable.
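As an alternative to the administrative console steps above, the two properties can also be created with a wsadmin script. The following Jython sketch is illustrative only and is not part of the official instructions; it assumes wsadmin is connected to the deployment manager, and 'myCell' is a placeholder for your actual cell name. Verify the values in the console afterwards, then synchronize the nodes and restart the processes as described above.

    # Illustrative wsadmin (Jython) sketch: create the two cell-level custom
    # properties. 'myCell' is a placeholder for your cell name.
    cell = AdminConfig.getid('/Cell:myCell/')

    # Callback timeout in milliseconds (10000 = 10 seconds in this example).
    AdminConfig.create('Property', cell,
                       [['name', 'IBM_CLUSTER_CALLBACK_TIMEOUT'], ['value', '10000']])

    # Enable the PreLoad function (must be 'true' if the callback timeout is 0).
    AdminConfig.create('Property', cell,
                       [['name', 'IBM_CLUSTER_ENABLE_PRELOAD'], ['value', 'true']])

    # Persist the changes, then synchronize the nodes and restart the deployment
    # manager, node agents and servers as described above.
    AdminConfig.save()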
The fix for this APAR is currently targeted for inclusion in fixpack 6.0.2.11. Please refer to the recommended updates page for delivery information:
http://www.ibm.com/support/docview.wss?rs=180&uid=swg27004980

Directions to apply fix:

NOTE: Choose the:
1) Release the fix applies to
2) The Editions that apply
3) Delete the Editions & Methods that do not apply and this Note

Fix applies to Editions:
Release 6.0
__ Application Server (Express or BASE)
__ Network Deployment (ND)
__ WebSphere Business Integration Server Foundation (WBISF)
__ Edge Components
__ Developer
__ Extended Deployment (XD)

Install Fix to:
Method:
__ Application Server Nodes
__ Deployment Manager Nodes
__ Both

NOTE: The user must:
* Have Administrative rights in Windows, or be the actual root user in a UNIX environment.
* Be logged in with the same authority level when unpacking a fix, fix pack or refresh pack.
* Be at V6.0.2.2 or newer of the Update Installer. This can be checked by reviewing the level of the Update Installer in the file /updateinstaller/version.txt.

The Update Installer can be downloaded from the following link:
http://www.ibm.com/support/docview.wss?rs=180&uid=swg21205991

For detailed instructions to extract the Update Installer, see the following Technote:
http://www-1.ibm.com/support/docview.wss?rs=180&uid=swg21205400

1) Copy the PKxxxxx.pak file directly to the maintenance directory.
2) Shut down WebSphere. Manually execute setupCmdLine.bat in Windows or . ./setupCmdLine.sh in UNIX from the WebSphere instance that maintenance is being applied to.
3) Launch the Update Installer.
4) Enter the installation location of the WebSphere product you want to update.
5) Select the "Install maintenance package" operation.
6) Enter the file name of the maintenance package to install (the PKxxxxx.pak file that was copied into the maintenance directory).
7) Install the maintenance package.
8) Restart WebSphere.

Directions to remove fix:

NOTE:
* The user must have Administrative rights in Windows, or be the actual root user in a UNIX environment.
* FIXES MUST BE REMOVED IN THE ORDER THEY WERE APPLIED
* DO NOT REMOVE A FIX UNLESS ALL FIXES APPLIED AFTER IT HAVE FIRST BEEN REMOVED
* YOU MAY REAPPLY ANY REMOVED FIX

Example: If your system has fix1, fix2, and fix3 applied in that order and fix2 is to be removed, fix3 must be removed first, fix2 removed, and fix3 re-applied.

1) Shut down WebSphere. Manually execute setupCmdLine.bat in Windows or . ./setupCmdLine.sh in UNIX from the WebSphere instance that the uninstall is being run against.
2) Start the Update Installer.
3) Enter the installation location of the WebSphere product you want to remove the fix from.
4) Select the "Uninstall maintenance package" operation.
5) Enter the file name of the maintenance package to uninstall (PKxxxxx.pak).
6) Uninstall the maintenance package.
7) Restart WebSphere.

Directions to re-apply fix:
1) Shut down WebSphere.
2) Follow the fix instructions to apply the fix.
3) Restart WebSphere.

Additional Information: