PQ75605: Periodic loss of administrative functions due to Admin Server hanging.
APAR status

Closed as program error.

Error description

The customer is seeing limited to no administration capability, which appears to stem from a hung Admin Server.

Scenario: Users cannot start Application Servers through WSCP or the Admin Console (the customer who reported this issue does not use XML Config). Users can browse object trees through an existing Admin Console or WSCP session, but a start/stop operation never returns. Users can also start a new Admin Console or WSCP session, and the object trees appear without error, but any start/stop operation against any Application Server fails. Even a force stop does not work. At that point users must restart the Admin Server before they can start or stop Application Servers again.

Further insight:

1. The repository database is intact, and there is no JDBC connection or database lock problem between the Admin Server and the repository database. The repository and JDBC connections are therefore probably unrelated to this problem.

2. In the Admin Server javacore dump gathered when the problem occurred, all worker threads responsible for start/stop operations were in "waiting" status. Further investigation suggested that while the first worker thread is waiting, all other threads must wait for it to return before they can continue their own start/stop operations. If the first thread never returns, the waiting threads wait indefinitely.

3. The reason the first thread did not return has not been determined so far, though we have several theories. One possibility: when a user issues a start/stop command against an Application Server or an Enterprise Application, the Admin Server assigns the task to a free worker thread. Before the worker thread tries to start the Application Server or Enterprise Application, it makes sure all parent objects are running. If it believes one of the parent objects is not running, it waits until that object starts. In the reported case this object appears to be the Node object. This is abnormal, because whenever a worker thread is running, the Node (in other words, the Admin Server) must be running. We have not determined why the worker thread believes the node is stopped.

4. This analysis exposed another problem. WebSphere Admin Server is not designed to handle a large number of parallel operations performed by multiple admin agents. In fact, to keep the repository database consistent, only one operation is allowed at any given time. This customer, however, has servers with 51 Application Servers and hundreds of Enterprise Applications defined on a single machine and in a single domain. They have multiple configurators who may start multiple WSCP sessions or Admin Consoles and attempt start/stop operations concurrently, which is likely to put worker threads in conflict. As a long-term solution, even after the worker thread conflicts are fixed, the customer will need to split this very large domain into smaller ones by installing multiple instances of WebSphere on the machine, and limit the number of configurators who can access the Admin Server.

Local fix

What is needed:

1. Prevent marking the node offline in a single-node configuration. Multi-node configurations will continue to mark nodes offline as appropriate.

2. Change the asynchronous wait in checkOutOfOrderContainedObjectInvocation to a timed wait, and add error handling that notifies the end user of an Application Server or Enterprise Application startup failure so that corrective action can be taken.

3. Ensure asynchronous notifications are sent whenever the epoch is set.
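
The timed wait and unconditional notification described in items 2 and 3 can be illustrated with a minimal Java sketch. This is not the WebSphere code: the classes EpochGate and StartOperationSketch, their methods, and the timeout values are hypothetical names chosen for illustration. Only the idea of waiting for an epoch change with a time limit, and reporting a failure instead of blocking forever, is taken from this APAR.

```java
/** Hypothetical stand-in for the node state that a worker thread waits on. */
final class EpochGate {
    private long epoch = 0L;

    synchronized long currentEpoch() {
        return epoch;
    }

    /** Sets the epoch and always wakes every waiter, so no change goes unnoticed. */
    synchronized void setEpoch(long newEpoch) {
        epoch = newEpoch;
        notifyAll();
    }

    /**
     * Waits until the epoch differs from seenEpoch or the timeout elapses.
     * Returns true on an epoch change, false on timeout or interruption.
     */
    synchronized boolean awaitEpochChange(long seenEpoch, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        try {
            while (epoch == seenEpoch) {
                long remainingMillis = (deadline - System.nanoTime()) / 1_000_000L;
                if (remainingMillis <= 0L) {
                    return false; // timed out; wait(0) is avoided because it blocks forever
                }
                wait(remainingMillis); // the loop re-checks the epoch after every wakeup
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }
}

/** Hypothetical caller: reports a startup failure instead of hanging. */
final class StartOperationSketch {
    static String startServer(EpochGate node, String serverName, long timeoutMillis) {
        long seen = node.currentEpoch();
        if (!node.awaitEpochChange(seen, timeoutMillis)) {
            return "FAILED: " + serverName + " (no epoch change within " + timeoutMillis + " ms)";
        }
        return "STARTED: " + serverName;
    }

    public static void main(String[] args) {
        // Nothing ever calls setEpoch here, so the wait times out and a failure is reported.
        System.out.println(startServer(new EpochGate(), "server1", 100L));
    }
}
```

In this sketch a worker whose epoch change never arrives gets control back after the timeout, so other start/stop operations are not blocked behind it indefinitely. Any rollback of a partially started server is not modeled here.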
LOCAL FIX: Stop and restart the Admin Server.

Problem summary

****************************************************************
* USERS AFFECTED: WebSphere Application Server 4.0 users with  *
*                 a heavily loaded system (more than 50        *
*                 application servers).                        *
****************************************************************
* PROBLEM DESCRIPTION: Intermittent hangs occur when a system  *
*                      is heavily loaded (more than 50         *
*                      application servers).                   *
****************************************************************
* RECOMMENDATION:                                              *
****************************************************************

On heavily loaded systems, the admin server hangs in the checkOutOfOrders method waiting for an epoch change that never happens. It was determined that whenever a module failed to start, the node erroneously made an epoch change, which resulted in the node marking itself offline. During subsequent application server starts, the admin server then waits for an epoch change on the node, which never takes place.

Problem conclusion

Two modifications were made:

1. The code now prevents a node from marking itself offline.

2. The indefinite wait in the CheckOutOfOrders module was changed to a timed wait and rollback.

Temporary fix

Comments
APAR is sysrouted FROM one or more of the following:

APAR is sysrouted TO one or more of the following:

Modules/Macros
SRLS
Document Information
Product categories: Software > Application Servers >
Distributed Application & Web Servers > WebSphere Application
Server > General
Operating system(s):
Software version: 400
Software edition:
Reference #: PQ75605
IBM Group: Software Group
Modified date: Mar 22, 2004
(C) Copyright IBM Corporation 2000, 2006. All Rights Reserved.