PQ56709: WHEN ONE OF MULTIPLE WAS NODES IS PHYSICALLY DOWN, STARTING THE ADMIN CONSOLE BRINGS UP EMPTY TOPOLOGY AND APP SERVER GOES DOWN

APAR status
Closed as program error.

Error description
Environment:
   WebSphere Application Server 3.5.3 AE on 2 Solaris 2.6
      nodes sharing the same repository database
   eFix PQ43351.1 is installed
   Application Server security is enabled, using an LDAP
      server for authentication
   The application server runs as non-root, but it has
      read/write privileges on the sas.server.props and
      secbootstrap files
.
Description:
   In the above configuration, one node is physically brought
down (shut down). Bringing up the admin console on the node that
remains up causes a running application server to die 5 minutes
after the admin console is started. The app server attempts to
restart, but fails and eventually stays down. The admin console
also displays a blank topology, even after waiting 15 minutes for
the "Console ready" message to appear in the admin console
window. This problem doesn't occur if the Application Server
is merely stopped on the node that is to simulate a hardware
failure.
Local fix
Problem summary
****************************************************************
* USERS AFFECTED: All users of WebSphere Application Server    *
*                 3.5 and 4.0 who use SSL (Secure Sockets      *
*                 Layer).                                      *
****************************************************************
* PROBLEM DESCRIPTION: On a multi-node domain with security    *
*                      enabled, when one node is down, the     *
*                      Admin Console cannot be started on the  *
*                      working node.                           *
****************************************************************
* RECOMMENDATION:                                              *
****************************************************************
A synchronized block within the ORB code includes the socket
create.  The socket create is also retried when it does not
succeed the first time.  On Solaris systems, each socket create
can take 4 to 5 minutes to time out.  The time spent inside this
synchronized block causes the transaction used by System
Management to fail.
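The effect of holding a lock across a slow socket create can be
reproduced in miniature.  The following is a minimal sketch, not
the ORB source: the class and method names are hypothetical, and
Thread.sleep() stands in for a socket create that waits on the
TCP timeout of an unreachable node.  It measures how long a
caller that needs the same lock for a trivial operation is
stalled.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical illustration only; none of these names come from WebSphere.
class SerializedConnector {
    private final Object lock = new Object();
    private final long simulatedTimeoutMillis;

    SerializedConnector(long simulatedTimeoutMillis) {
        this.simulatedTimeoutMillis = simulatedTimeoutMillis;
    }

    // Stand-in for a socket create to a node that is physically down.
    // The sleep (assumed to model the TCP timeout) runs while the lock is held.
    void connectToDeadNode(CountDownLatch lockHeld) throws InterruptedException {
        synchronized (lock) {
            lockHeld.countDown();               // signal: the lock is now held
            Thread.sleep(simulatedTimeoutMillis);
        }
    }

    // Stand-in for a quick connection to a healthy node. It needs the same
    // lock, so it cannot proceed until the slow caller leaves the block.
    long connectToLiveNode() {
        long start = System.nanoTime();
        synchronized (lock) {
            // a connection table update would go here
        }
        return (System.nanoTime() - start) / 1_000_000L;
    }

    // Returns how many milliseconds the live-node caller had to wait.
    static long measureStall(long timeoutMillis) {
        SerializedConnector c = new SerializedConnector(timeoutMillis);
        CountDownLatch lockHeld = new CountDownLatch(1);
        Thread slow = new Thread(() -> {
            try {
                c.connectToDeadNode(lockHeld);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        slow.start();
        try {
            lockHeld.await();                   // wait until the slow thread owns the lock
            long waited = c.connectToLiveNode();
            slow.join();
            return waited;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while measuring", e);
        }
    }
}
```

Calling SerializedConnector.measureStall(400) should report a
wait close to the simulated timeout.  With a timeout of several
minutes in place of 400 ms, the same stall would plausibly
outlast a transaction that is waiting on the connection.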
Problem conclusion
The synchronized block was reduced so that it no longer includes
the socket write, and the retry for SSL socket creation has been
removed.
Temporary fix
The ORB/SSL code enclosed the socket.connect() operation and a
connection table update operation inside a Java synchronized
block.  The socket.connect() operation uses the operating
system's TCP timeout (on Solaris, this defaults to 3 minutes)
before returning with a "connection failed" response.  While
the synchronized block is held, no other process can connect
to the adminserver.  To resolve this, the development team
removed the socket.connect() operation from the synchronized
block, leaving only the connection table update operation
inside it.
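The change described above can be sketched as follows.  This is
an illustration under stated assumptions, not the actual ORB
code: the class name, the string endpoint keys, and the
pluggable connector function (which stands in for
socket.connect()) are all hypothetical.  The slow connect runs
with no lock held, and only the connection table lookup and
update are synchronized.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CountDownLatch;
import java.util.function.Function;

// Hypothetical illustration only; none of these names come from WebSphere.
class NarrowLockConnectionTable<C> {
    private final Map<String, C> table = new HashMap<>();
    private final Function<String, C> connector;   // assumed stand-in for socket.connect()

    NarrowLockConnectionTable(Function<String, C> connector) {
        this.connector = connector;
    }

    C getOrConnect(String endpoint) {
        synchronized (table) {                      // short critical section: lookup only
            C existing = table.get(endpoint);
            if (existing != null) {
                return existing;
            }
        }
        C fresh = connector.apply(endpoint);        // slow step, performed without the lock
        synchronized (table) {                      // short critical section: table update
            // If another thread registered a connection first, keep that one.
            // Real code would also close the redundant connection here.
            C raced = table.putIfAbsent(endpoint, fresh);
            return raced != null ? raced : fresh;
        }
    }

    // Returns how many milliseconds a live-node caller waits while another
    // thread is stuck in a simulated connect to a dead node.
    static long measureLiveLatency(long deadTimeoutMillis) {
        CountDownLatch deadConnectStarted = new CountDownLatch(1);
        NarrowLockConnectionTable<String> t =
            new NarrowLockConnectionTable<String>(endpoint -> {
                if (endpoint.equals("dead-node")) {
                    deadConnectStarted.countDown();
                    try {
                        Thread.sleep(deadTimeoutMillis);   // models the TCP timeout
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
                return "connection to " + endpoint;
            });
        Thread slow = new Thread(() -> t.getOrConnect("dead-node"));
        slow.start();
        try {
            deadConnectStarted.await();             // the slow connect is now in progress
            long start = System.nanoTime();
            t.getOrConnect("live-node");
            long waited = (System.nanoTime() - start) / 1_000_000L;
            slow.join();
            return waited;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while measuring", e);
        }
    }
}
```

With this structure, measureLiveLatency(400) should return a
value far below the simulated timeout, because the caller that
connects to the live node no longer queues behind the thread
that is waiting on the dead one.  The cost of the narrower lock
is that two threads may connect to the same endpoint at once,
which is why the second synchronized section re-checks the
table before registering the new connection.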
Comments
APAR information
APAR number PQ56709
Reported component name WEBSPHERE AE SO
Reported component ID 5630A2202
Reported release 400
Status CLOSED PER
PE NoPE
HIPER NoHIPER
Submitted date 2002-01-15
Closed date 2002-01-15
Last modified date 2003-04-29

APAR is sysrouted FROM one or more of the following:
PQ54436

APAR is sysrouted TO one or more of the following:

Modules/Macros
ORB          

Fix information
Fixed component name WEBSPHERE AE SO
Fixed component ID 5630A2202

Applicable component levels
R400 PSY    UP


Document Information


Product categories: Software > Application Servers > Distributed Application & Web Servers > WebSphere Application Server > General
Operating system(s):
Software version: 400
Software edition:
Reference #: PQ56709
IBM Group: Software Group
Modified date: Apr 29, 2003