PQ56709: WHEN ONE OF MULTIPLE WAS NODES IS PHYSICALLY DOWN STARTING THE ADMIN CONSOLE BRINGS UP EMPTY TOPOLOGY AND APP SERVER GOES DOWN
APAR status
Closed as program error.

Error description
Environment:
WebSphere Application Server 3.5.3 AE on 2 Solaris 2.6 nodes sharing the same repository database.
eFix PQ43351.1 is installed.
Application Server security is enabled, using an LDAP server for authentication.
The application server runs as non-root, but it has read/write privilege on the sas.server.props and secboottrap files.

Description:
In the above configuration, one node is physically brought down (shut down). Bringing up the admin console on the remaining node causes a running application server to die 5 minutes after the admin console is started. The app server attempts to restart, but fails and eventually stays down. The admin console also displays a blank topology, even after waiting 15 minutes for the "Console ready" message to appear in the admin console window.
This problem does not occur if the Application Server is merely stopped on the node that is meant to simulate a hardware failure.

Local fix

Problem summary
****************************************************************
* USERS AFFECTED: All users of WebSphere Application Server    *
*                 3.5 and 4.0 who use SSL (Secure Sockets      *
*                 Layer).                                      *
****************************************************************
* PROBLEM DESCRIPTION: On a multi-node domain with security    *
*                      enabled, when one node was down, the    *
*                      Admin Console could not be started on   *
*                      the working node.                       *
****************************************************************
* RECOMMENDATION:                                              *
****************************************************************
There is a synchronized block within the ORB code that includes the socket create. The socket create is also retried when it does not succeed the first time. On Solaris systems, each socket create could take 4 to 5 minutes to time out. The time spent inside this synchronized block causes the transaction used by System Management to fail.

Problem conclusion
The synchronized block was reduced so that it no longer includes the socket write, and the retry for SSL socket creation has been removed.

Temporary fix
The ORB/SSL code enclosed the socket.connect() operation and a connection table update operation inside a Java synchronized block. The socket.connect() operation uses the operating system's TCP timeout (on Solaris, this defaults to 3 minutes) before returning with a "connection failed" response. While a connect attempt is pending inside the synchronized block, no other process can connect to the adminserver. To resolve this, the development team removed the socket.connect() operation from the synchronized block, leaving only the connection table update operation inside it.

Comments
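The change described under Temporary fix can be illustrated with a minimal Java sketch. This is not the actual ORB/SSL code: the class ConnectionTableSketch, the Connector interface, the tcpConnect helper, and the explicit connect timeout are all hypothetical names and choices introduced here for illustration. The sketch only shows the general pattern of performing the slow connect outside the synchronized block and holding the lock just for the connection table update.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; not the WebSphere ORB implementation.
final class ConnectionTableSketch {

    /** Hypothetical hook for the connect step, so it can be stubbed out. */
    interface Connector {
        Socket connect(String host, int port) throws IOException;
    }

    private final Map<String, Socket> table = new HashMap<>();
    private final Connector connector;

    ConnectionTableSketch(Connector connector) {
        this.connector = connector;
    }

    /**
     * Plain TCP connect with an explicit timeout. The timeout is an
     * assumption of this sketch; the APAR text only mentions the
     * operating system's TCP timeout.
     */
    static Socket tcpConnect(String host, int port, int timeoutMillis) throws IOException {
        Socket socket = new Socket();
        try {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return socket;
        } catch (IOException e) {
            socket.close();
            throw e;
        }
    }

    /** Returns a cached connection, or creates one without holding the lock. */
    Socket get(String host, int port) throws IOException {
        String key = host + ":" + port;

        // Short critical section: table lookup only.
        synchronized (table) {
            Socket cached = table.get(key);
            if (cached != null) {
                return cached;
            }
        }

        // Slow, possibly blocking step happens OUTSIDE the lock, so a dead
        // node cannot stall threads that want other connections.
        Socket fresh = connector.connect(host, port);

        // Short critical section: connection table update only.
        synchronized (table) {
            Socket existing = table.get(key);
            if (existing != null) {
                // Another thread connected first; keep its socket.
                fresh.close();
                return existing;
            }
            table.put(key, fresh);
            return fresh;
        }
    }

    int size() {
        synchronized (table) {
            return table.size();
        }
    }
}
```

A caller might construct the table with `new ConnectionTableSketch((h, p) -> ConnectionTableSketch.tcpConnect(h, p, 5000))`, where the 5000 ms value is arbitrary. One consequence of this design is that two threads may both connect to the same target at once; the second lookup inside the lock discards the duplicate socket.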
APAR is sysrouted FROM one or more of the following:
PQ54436

APAR is sysrouted TO one or more of the following:

Modules/Macros
Document Information
Product categories: Software > Application Servers >
Distributed Application & Web Servers > WebSphere Application
Server > General
Operating system(s):
Software version: 400
Software edition:
Reference #: PQ56709
IBM Group: Software Group
Modified date: Apr 29, 2003
(C) Copyright IBM Corporation 2000, 2006. All Rights Reserved.