APAR status
Closed as program error.
Error description
The Deployment Manager (dmgr) discovers only one Node Agent when
it runs in a multi-node environment. The problem is
intermittent and usually disappears when discovery-related
trace is enabled. It is not specific to any particular Node
Agent; which Node Agent is discovered varies at random.
Topology Example:
Machine 1(AIX 5.1)-> DevNode1 (Running Cell Manager and Node
Agent 1)
Machine 2(AIX 5.1)-> DevNode2 (Running Node Agent 2)
L2 analysis:
-- The trace file shows the following error:
[2/24/04 16:38:15:272 CST] 129ccaea d UOW=
source=com.ibm.ws.management.discovery.DiscoveryAdapter org=IBM
prod=WebSphere component=Application Server event --
unrecognized response
It seems the dmgr is not able to recognize the response from a
Node Agent, which prevents the dmgr from discovering the node.
-- The cause of this "unrecognized response" points to the
queryId in the jxta:DiscoveryResponse message, as the
Step_2_FindTheSource folder/readme.txt suggests:
In the failing case, traces had the same query IDs:
[2/24/04 20:53:01:761 CST] 497c6891 DiscoveryServ d End-2-end
messaging: queryId: 1
[2/24/04 20:53:01:761 CST] 2c706891 DiscoveryServ d End-2-end
messaging: queryId: 1
In the successful case, traces had different query IDs:
[2/24/04 20:17:14:625 CST] 7f51e4a0 DiscoveryServ d End-2-end
messaging: queryId: 1
[2/24/04 20:17:14:625 CST] 5cf1a4a0 DiscoveryServ d End-2-end
messaging: queryId: 2
For some reason, the queryId was not being incremented in the
failing case.
-- The class that is responsible for incrementing the query ID
is com.ibm.ws.management.discovery.DiscoveryService, and the
method is sendQuery().
The following line of code in sendQuery() may be causing the
problem:
long queryId = querySerialNumber++;
Note that querySerialNumber is a private static long field. In
the traces, we see two threads invoking the sendQuery() method.
Because the post-increment of a shared field is not atomic, we
are probably seeing a non-thread-safe implementation here with
the static querySerialNumber field.
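To illustrate how two callers could end up with the same queryId, the following is a minimal sketch, not the actual DiscoveryService code. It replays, in a single thread and in a fixed order, the read and write-back steps that querySerialNumber++ performs, with the second caller reading before the first has written back. The class QueryIdRaceSketch and the method interleavedIds() are hypothetical names, and the starting value of 1 is an assumption taken from the traces above.

```java
// Hypothetical illustration only; not taken from the WebSphere source.
class QueryIdRaceSketch {
    // Stand-in for the shared static counter named in the analysis.
    private static long querySerialNumber = 1;

    // Replays one unlucky interleaving of two callers step by step.
    // Returns {id seen by caller A, id seen by caller B, final counter}.
    static long[] interleavedIds() {
        querySerialNumber = 1;                 // assumed starting value

        long readByA = querySerialNumber;      // caller A reads the counter
        long readByB = querySerialNumber;      // caller B reads before A writes back

        querySerialNumber = readByA + 1;       // caller A writes back 2
        querySerialNumber = readByB + 1;       // caller B also writes back 2

        return new long[] { readByA, readByB, querySerialNumber };
    }

    public static void main(String[] args) {
        long[] r = interleavedIds();
        System.out.println("queryId for caller A: " + r[0]);
        System.out.println("queryId for caller B: " + r[1]);
        System.out.println("counter afterwards:   " + r[2]);
    }
}
```

Both callers come away with queryId 1 and one increment is lost, which matches the pattern in the failing trace.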
This also explains why the problem does not occur when traces
are turned on. Since both threads write to the same log file,
they need to wait their turn to write to the log and are forced
to run in a more synchronized fashion.
Local fix
Problem summary
****************************************************************
* USERS AFFECTED: All WebSphere Application Server users who *
* are using a multi-node system. *
****************************************************************
* PROBLEM DESCRIPTION: Dmgr start up process is not able to *
* recognize a node agent in a two node *
* environment *
****************************************************************
* RECOMMENDATION: *
****************************************************************
The dmgr startup process is not able to recognize a node agent
in a two-node environment. The problem disappears if
discovery-related traces are turned on. This problem appears
intermittently and not on one specific node.
Problem conclusion
A non-thread-safe querySerialNumber was being used, which
caused problems when two threads invoked the sendQuery()
method. We are now using a thread-safe implementation of this
value.
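The APAR does not show the fixed code. As one possible shape of a thread-safe counter, the sketch below guards the increment with a synchronized static method so that the read and write-back of two callers cannot interleave. The class QueryIdSafeSketch and the methods nextQueryId() and distinctIds() are hypothetical names, and the use of synchronized is an assumption, not the shipped fix.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; the real fix is not shown in this APAR.
class QueryIdSafeSketch {
    private static long querySerialNumber = 1;

    // Only one thread at a time may read and advance the counter.
    static synchronized long nextQueryId() {
        return querySerialNumber++;
    }

    // Starts two threads that each draw perThread IDs and returns
    // how many distinct IDs were handed out in total.
    static int distinctIds(final int perThread) {
        final Set<Long> seen = Collections.synchronizedSet(new HashSet<Long>());
        Runnable task = () -> {
            for (int i = 0; i < perThread; i++) {
                seen.add(nextQueryId());
            }
        };
        Thread a = new Thread(task);
        Thread b = new Thread(task);
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // Every ID is unique, so the count equals 2 * perThread.
        System.out.println(distinctIds(10000));
    }
}
```

With the increment serialized, the two threads never receive the same ID, so the number of distinct IDs equals the number of calls.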
Temporary fix
Sent to customer.
Comments
APAR information
APAR number | PQ87224
Reported component name | WAS BASE 5.0
Reported component ID | 5630A3600
Reported release | 00A
Status | CLOSED PER
PE | NoPE
HIPER | NoHIPER
Special Attention | NoSpecatt
Submitted date | 2004-04-06
Closed date | 2004-06-28
Last modified date | 2004-07-08
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
Modules/Macros
Publications Referenced
Applicable component levels
R003 PSY | UP
R00A PSY | UP
R00H PSY | UP
R00I PSY | UP
R00P PSY | UP
R00S PSY | UP
R00W PSY | UP
R103 PSY | UP
R10A PSY | UP
R10H PSY | UP
R10I PSY | UP
R10P PSY | UP
R10S PSY | UP
R10W PSY | UP