PQ75605: Periodic loss of administrative functions due to Admin Server hanging.

 Fixes are available

PQ75605, 4.0.4, 4.0.5, 4.0.6: Periodic loss of Administration
4.0.7: WebSphere Application Server Version 4.0 Fix Pack 7



APAR status
Closed as program error.

Error description
The customer is seeing limited to no administration capability,
which appears to stem from the Admin Server hanging.
Scenario:
Users cannot start Application Servers through WSCP or the Admin
Console (the customer who reported this issue does not use
XML Config). Users can still browse object trees through an
existing Admin Console or WSCP session, but a start/stop
operation never returns.
Users can also start a new Admin Console or WSCP session, and
the object trees show up without any error, but any start/stop
operation against any Application Server fails. Even a force
stop does not work. At that point users must restart the Admin
Server before they can start or stop Application Servers again.
Further Insight:
1. The repository database is intact, and there is no JDBC
connection or database lock problem between the Admin Server and
the repository database. The repository and the JDBC connections
are therefore probably not related to this problem.
2. In the Admin Server javacore dump gathered when this problem
occurred, all working threads responsible for start/stop
operations were in "waiting" status. Further investigation
suggested that while the first working thread is waiting, all
other threads must wait for it to return before they can
continue their own start/stop operations. If the first thread
never returns, the other threads wait indefinitely.
3. The reason the first thread did not return has not been
determined so far, but we have several theories. One possibility
is as follows. When a user issues a start/stop command against
an Application Server or an Enterprise Application, the Admin
Server assigns the task to a free working thread. Before the
working thread tries to start the Application Server or
Enterprise Application, it makes sure all the parent objects
are running. If the working thread believes one of the parent
objects is not running, it waits until that object starts.
In the case the reporting customer was seeing, this object
appears to be the Node object. That is abnormal, because
whenever a working thread is running, the Node (in other words,
the Admin Server) must be running. We have not determined why
the working thread believes that the node is stopped.
4. This analysis exposed another problem. The WebSphere Admin
Server is not designed to handle a large number of parallel
operations performed by multiple admin agents. In fact, to keep
the repository database consistent, only one operation is
allowed at any given time. This customer, however, is developing
servers that have 51 Application Servers and hundreds of
Enterprise Applications defined on a single machine and in a
single domain. They have multiple configurators who may start
multiple WSCP sessions or Admin Consoles and attempt start/stop
operations concurrently, which is likely to put working threads
in conflict. As a long-term solution, even after the working
thread conflicts are fixed, the customer will need to split this
very large domain into smaller ones by installing multiple
instances of WebSphere on this machine, and to limit the number
of configurators who can access the Admin Server.
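The blocking effect described in points 2 and 4 (one operation at
a time, so a first operation that never returns stalls every later
one) can be illustrated with a minimal Java sketch. This is a toy
model under stated assumptions, not WebSphere code: the class
SerializedAdminSketch, its methods, and the use of a single-thread
executor to stand in for "one operation at a time" are all
hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical model, not WebSphere code: operations are serialized on one
// worker, so a single operation that never returns starves every later one.
class SerializedAdminSketch {
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    // Stands in for "the parent object is reported as running".
    private final CountDownLatch parentRunning = new CountDownLatch(1);

    // First operation: waits with no timeout until the parent is running.
    Future<String> startWaitingOnParent(String name) {
        return worker.submit(() -> {
            parentRunning.await();
            return name + " started";
        });
    }

    // Later operation: needs no parent, but still queues behind the first.
    Future<String> startIndependent(String name) {
        return worker.submit(() -> name + " started");
    }

    void markParentRunning() {
        parentRunning.countDown();
    }

    // Returns the result, or null if the operation is still pending
    // after the given number of milliseconds.
    static String resultWithin(Future<String> f, long millis) {
        try {
            return f.get(millis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return null;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    void shutdown() {
        worker.shutdownNow();
    }
}
```

In this model, an operation submitted through startIndependent
stays pending for as long as the first operation is waiting, and
completes only after markParentRunning is called. If that call
never comes, the only way out is to discard the worker, which
corresponds to restarting the Admin Server.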
Local fix
What is needed:
1. Prevent marking the node offline in a single-node
configuration. Multi-node configurations will continue to mark
nodes offline as appropriate.
2. Change the asynchronous wait in
checkOutOfOrderContainedObjectInvocation to a timed wait, and add
error handling that notifies the end user of an Application
Server or Enterprise Application startup failure so that
corrective action can be taken.
3. Ensure asynchronous notifications are sent whenever the epoch
is set.
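Items 2 and 3 above can be sketched together in Java: a waiter
that gives up after a timeout, and a setter that notifies waiters
on every epoch update. This is a minimal illustration under
stated assumptions, not the actual fix. EpochGate and its methods
are hypothetical names, and the real
checkOutOfOrderContainedObjectInvocation logic is not shown in
this APAR.

```java
// Hypothetical sketch, not WebSphere code: a timed wait for an epoch change.
class EpochGate {
    private long epoch;

    // Every epoch update wakes all waiters, so no waiter misses a change.
    synchronized void setEpoch(long newEpoch) {
        epoch = newEpoch;
        notifyAll();
    }

    synchronized long currentEpoch() {
        return epoch;
    }

    // Waits until the epoch differs from seenEpoch or the timeout elapses.
    // Returns true on a change, false on timeout or interruption.
    synchronized boolean awaitEpochChange(long seenEpoch, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (epoch == seenEpoch) {
            long remainingMillis = (deadline - System.nanoTime()) / 1_000_000L;
            if (remainingMillis <= 0) {
                return false; // caller can now report the startup failure
            }
            try {
                wait(remainingMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }
}
```

A caller that receives false can report the failed start to the
user instead of blocking forever. The loop re-checks the epoch
after every wake-up, which guards against spurious wake-ups.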
LOCAL FIX:
Stop and restart the Admin Server.
Problem summary
****************************************************************
* USERS AFFECTED: WebSphere Application Server 4.0 users with  *
*                 a heavily loaded system (more than 50        *
*                 application servers).                        *
****************************************************************
* PROBLEM DESCRIPTION: Intermittent hangs occur when a system  *
*                      is heavily loaded (more than 50         *
*                      application servers).                   *
****************************************************************
* RECOMMENDATION:                                              *
****************************************************************
On heavily loaded systems, the admin server will hang in the
checkOutOfOrders method waiting for an epoch change, which will
never happen.
It was determined that whenever a module failed to start, an
epoch change was erroneously being made by the node, resulting
in the node marking itself offline.  Then, during subsequent
application server starts, the admin server will wait for an
epoch change on the node, which never takes place.
Problem conclusion
Two modifications were made:
1.  The code will prevent a node from marking itself offline.
2.  The indefinite wait in the CheckOutOfOrders module was
changed to a timed wait and rollback.
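As an illustration of what "timed wait and rollback" can look
like, the following Java sketch starts an operation, waits a
bounded time for its precondition, and restores the previous
state if the wait times out. It is a sketch under stated
assumptions, not the shipped fix: StartWithRollback, its State
values, and the latch that stands in for the awaited condition
are all hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not WebSphere code: a start operation that rolls back
// its state change when the awaited condition does not arrive in time.
class StartWithRollback {
    enum State { STOPPED, STARTING, RUNNING }

    private State state = State.STOPPED;

    synchronized State state() {
        return state;
    }

    private synchronized void setState(State s) {
        state = s;
    }

    // Returns true if the server reached RUNNING, or false if the wait
    // timed out (or was interrupted) and the state was rolled back to STOPPED.
    boolean start(CountDownLatch parentReady, long timeoutMillis) {
        setState(State.STARTING);
        boolean ready;
        try {
            ready = parentReady.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            ready = false;
        }
        // Roll back to STOPPED on failure so a later start can be retried.
        setState(ready ? State.RUNNING : State.STOPPED);
        return ready;
    }
}
```

The design point is that a failed start leaves the object in a
state from which the operation can be retried, rather than
leaving a thread parked indefinitely.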
Temporary fix
Comments
APAR information
APAR number PQ75605
Reported component name WEBSPHERE AE SO
Reported component ID 5630A2202
Reported release 400
Status CLOSED PER
PE NoPE
HIPER NoHIPER
Submitted date 2003-06-25
Closed date 2003-08-12
Last modified date 2004-03-22

APAR is sysrouted FROM one or more of the following:

APAR is sysrouted TO one or more of the following:

Modules/Macros
AdmnSvr Object        

SRLS

Fix information

Applicable component levels
R400 PSY    UP


Document Information


Product categories: Software > Application Servers > Distributed Application & Web Servers > WebSphere Application Server > General
Operating system(s):
Software version: 400
Software edition:
Reference #: PQ75605
IBM Group: Software Group
Modified date: Mar 22, 2004