PQ54436: WHEN ONE OF MULTIPLE WAS NODES IS PHYSICALLY DOWN STARTING THE ADMIN CONSOLE BRINGS UP EMPTY TOPOLOGY AND APP SERVER GOES DOWN

A fix is available
WebSphere Application Server Version 3.5 Fix Pack 7 (3.5.7)

APAR

APAR status
Closed as program error.

Error description
Environment:
WebSphere Application Server (WAS) 3.5.3 AE on 2 Solaris 2.6
      nodes sharing the same repository database
   eFix PQ43351.1 is installed
   WAS security is enabled using an LDAP server for
      authentication
   Application server is running as non-root, but it has
      read/write privilege on the sas.server.props and
      secboottrap files
.
Description:
In the above configuration, one node is physically shut down. Starting the admin console on the remaining node causes a running application server to die 5 minutes after the admin console is started. The app server attempts to restart, but fails and eventually stays down. The admin console also displays a blank topology, even after waiting 15 minutes for the "Console ready" message to appear in the admin console window. This problem does not occur if WAS is merely stopped on the node to simulate a hardware failure.
Local fix
Problem summary
****************************************************************
* USERS AFFECTED: All users of WebSphere Application Server    *
*                 3.5 and 4.0 who use SSL (Secure Sockets      *
*                 Layer).                                      *
****************************************************************
* PROBLEM DESCRIPTION: In a multi-node domain with security    *
*                      enabled, when one node is down, the     *
*                      Admin Console cannot be started on the  *
*                      working node.                           *
****************************************************************
* RECOMMENDATION:                                              *
****************************************************************
A synchronized block within the ORB code includes the socket
create. The socket create is also retried when it does not
succeed the first time. On Solaris systems, each socket create
can take 4 to 5 minutes to time out. The time spent inside this
synchronized block causes the transaction used by System
Management to fail.
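The effect described above can be illustrated with a small sketch. This is not the ORB code: the class name LockedConnectDemo, the lock object, and the slowConnect() helper are hypothetical, and a short sleep stands in for a socket create that waits on the TCP timeout of an unreachable node. It only shows that a second caller cannot enter the synchronized block until the slow operation finishes.

```java
import java.util.concurrent.CountDownLatch;

public class LockedConnectDemo {
    // Hypothetical lock guarding a shared connection table.
    private static final Object TABLE_LOCK = new Object();

    // Stand-in for a socket create against a node that is physically down.
    // A real connect would block on the OS TCP timeout; here we just sleep.
    static void slowConnect(long millis) throws InterruptedException {
        Thread.sleep(millis);
    }

    // Returns how long (in ms) a second caller waited for the lock while
    // the first caller held it across the slow connect.
    static long measureWaitMillis(long connectMillis) throws InterruptedException {
        CountDownLatch holderInside = new CountDownLatch(1);
        Thread holder = new Thread(() -> {
            synchronized (TABLE_LOCK) {
                holderInside.countDown();      // signal: lock is now held
                try {
                    slowConnect(connectMillis); // blocking call inside the lock
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        holder.start();
        holderInside.await();                   // wait until the lock is held

        long start = System.nanoTime();
        synchronized (TABLE_LOCK) {
            // Entered only after the holder thread has left the block.
        }
        long waited = (System.nanoTime() - start) / 1_000_000;
        holder.join();
        return waited;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("second caller waited about "
                + measureWaitMillis(500) + " ms");
    }
}
```

With a multi-minute timeout in place of the 500 ms sleep, any caller with its own shorter deadline, such as a transaction, would expire while waiting for the lock.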
Problem conclusion
The synchronized block was reduced so that it no longer includes
the socket write, and the retry for SSL socket creation has been
removed.
Temporary fix
The ORB/SSL code enclosed the socket.connect() operation and a
connection table update operation inside a Java synchronized
block. The socket.connect() operation waits for the operating
system's TCP timeout (on Solaris, this defaults to 3 minutes)
before returning with a "connection failed" response. While one
thread is inside the synchronized block, no other process can
connect to the adminserver. To resolve this, the development
team moved the socket.connect() operation out of the
synchronized block and left the connection table update
operation inside it.
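The change described above follows a general pattern: do the blocking connect without holding the lock, and take the lock only for the table update. The following is a minimal sketch of that pattern, not the actual ORB/SSL fix. The ConnectionTable class, its fields, and the explicit connect timeout are assumptions added for illustration; Socket.connect(SocketAddress, int) comes from later JDK levels than the one shipped with WAS 3.5.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.HashMap;
import java.util.Map;

// Hypothetical connection table; names are illustrative only.
public class ConnectionTable {
    private final Map<String, Socket> table = new HashMap<>();
    private final int connectTimeoutMillis;

    public ConnectionTable(int connectTimeoutMillis) {
        this.connectTimeoutMillis = connectTimeoutMillis;
    }

    public Socket getOrConnect(String host, int port) throws IOException {
        String key = host + ":" + port;

        // Short critical section: table lookup only.
        synchronized (table) {
            Socket existing = table.get(key);
            if (existing != null && !existing.isClosed()) {
                return existing;
            }
        }

        // The blocking connect runs outside the lock, so a dead node
        // delays only this caller and not every user of the table.
        Socket fresh = new Socket();
        try {
            fresh.connect(new InetSocketAddress(host, port), connectTimeoutMillis);
        } catch (IOException e) {
            fresh.close();
            throw e; // no retry, mirroring the removal of the retry
        }

        // Short critical section: table update only.
        synchronized (table) {
            Socket existing = table.get(key);
            if (existing != null && !existing.isClosed()) {
                fresh.close(); // another thread connected first; reuse its socket
                return existing;
            }
            table.put(key, fresh);
            return fresh;
        }
    }

    public int size() {
        synchronized (table) {
            return table.size();
        }
    }
}
```

The second lookup inside the final synchronized block is a design choice in this sketch: because the connect is no longer serialized, two threads may connect to the same target at once, and the later one discards its socket.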
Comments
APAR information
APAR number: PQ54436
Reported component name: WAS ADVANCED SU
Reported component ID: 5648C8402
Reported release: 350
Status: CLOSED PER
PE: NoPE
HIPER: NoHIPER
Submitted date: 2001-11-06
Closed date: 2002-01-15
Last modified date: 2002-01-29

APAR is sysrouted FROM one or more of the following:

APAR is sysrouted TO one or more of the following:

PQ56709

Modules/Macros
ORB

Fix information
Fixed component name: WAS ADVANCED SU
Fixed component ID: 5648C8402

Applicable component levels
R350 PSYUP

Document Information

Product categories: Software, Application Servers, Distributed Application & Web Servers, WebSphere Application Server, General
Software version: 350
Reference #: PQ54436
IBM Group: Software Group
Modified date: 2002-01-29