If the workload management component is not properly distributing
the workload across servers in a multi-node configuration, use the following
options to isolate the problem.
Note: This topic references one or more of the application
server log files. As a recommended alternative, you can configure
the server to use the High Performance Extensible Logging (HPEL) log
and trace infrastructure instead of using the SystemOut.log, SystemErr.log, trace.log, and activity.log files on distributed and IBM® i systems. You can also use
HPEL in conjunction with your native z/OS® logging facilities. If you are using HPEL, you can access
all of your log and trace information using the LogViewer command-line
tool from your server profile bin directory. For more information, see the topic
about using HPEL to troubleshoot applications.
Eliminate environment or configuration issues
Determine if the servers are capable of serving the
applications for which they have been enabled. Identify the cluster
that has the problem.
- Are there network connection problems with the members of the
cluster or the administrative servers, for example, the deployment manager
or node agents?
- If so, ping the machines to ensure that they
are properly connected to the network.
- Is there other activity on the machines where the servers are
installed that is impacting the servers' ability to service a request?
For example, check the processor utilization as measured by the task
manager, processor ID, or some other outside tool to see if:
- It is not what is expected, or is erratic rather than constant.
- It shows that a newly added, installed, or upgraded member of
the cluster is not being utilized.
- Are all of the application servers you started on each node running,
or are some stopped?
- Are the applications installed and operating?
- If the problem relates to distributing workload across container-managed
persistence (CMP) or bean-managed persistence (BMP) enterprise beans,
have you configured the supporting JDBC providers and JDBC data source
on each server?
If you are experiencing workload management problems related
to HTTP requests, such as HTTP requests not being served by all members
of the cluster, be aware that the HTTP plug-in balances the load across
all servers that are defined in the PrimaryServers list if affinity
has not been established. If you do not have a PrimaryServers list
defined then the plug-in load balances across all servers that are
defined in the cluster if affinity has not been established. If affinity
has been established, the plug-in should go directly to that server
for all requests.
For workload management problems relating
to enterprise bean requests, such as enterprise bean requests not
getting served by all members of a cluster:
- Are the weights set to the allowed values?
- For the cluster in question, log onto the administrative console
and:
- Select .
- Select your cluster from the list.
- Select Cluster members.
- For each server in the cluster, click on server_name and note the assigned weight of the server.
- Ensure that the weights are within the valid range of 0-20. If
a server has a weight of 0, no requests are routed to it. Weights
greater than 20 are treated as 0.
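As an illustration of this weighting rule, the following Python sketch (a hypothetical helper, not a product tool) converts assigned weights into the fraction of requests each member should receive, treating out-of-range weights as 0:

```python
def routing_shares(weights):
    """Map each cluster member's assigned weight to its expected share
    of requests. A weight of 0 routes no requests to that server, and
    weights outside the 0-20 range are treated as 0."""
    effective = {name: (w if 0 <= w <= 20 else 0) for name, w in weights.items()}
    total = sum(effective.values())
    return {name: (w / total if total else 0.0)
            for name, w in effective.items()}

# Server4's weight of 25 is out of the 0-20 range, so it receives no requests.
shares = routing_shares({"Server1": 5, "Server2": 3, "Server3": 2, "Server4": 25})
```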
The remainder of this article deals with enterprise bean
workload balancing only. For more help on diagnosing problems in distributing
web (HTTP) requests, view the "Web server plug-in troubleshooting
tips" and "Web resource does not display" topics.
![[IBM i]](../images/iseries.gif)
Browse log files
for WLM errors and CORBA minor codes
If you still encounter
problems with enterprise bean workload management, the next step is
to check the activity log for entries that show:
- A server that has been marked unusable more than once and remains
unusable.
- All servers in a cluster have been marked unusable and remain unusable.
- A Location Service Daemon (LSD) has been marked
unusable more than once and remains unusable.
![[IBM i]](../images/iseries.gif)
To do this, use the Log and Trace
Analyzer to open the service log (activity.log) on the affected servers, and look for the following entries:
To do this, open the service log on the affected
servers, and look for the following entries:
- WWLM0061W: An error was encountered sending a request to
cluster member member and that member has been
marked unusable for future requests to the cluster cluster.
Note: It is not unusual for a server to be marked unusable.
The server may be tagged unusable for normal operational reasons,
such as a ripple start being executed while there is still a load
on the server from a client.
- WWLM0062W: An error was encountered sending a request to
cluster member member and that member has been marked
unusable for future requests to the cluster cluster two or more times.
- WWLM0063W: An error was encountered attempting to use the
LSD LSD_name to resolve an object reference for
the cluster cluster and has been marked unusable
for future requests to that cluster.
- WWLM0064W: Errors have been encountered attempting to send
a request to all members in the cluster cluster and all of the members have been marked unusable for future requests
to that cluster.
- WWLM0065W: An error was encountered attempting to update
a cluster member server in cluster cluster, as it was not reachable from the deployment manager.
- WWLM0067W: Client is signalled to retry a request. A server
request could not be transparently retried by WLM because of exception:{0}
In attempting to service a request, WLM encountered a condition that
would not allow the request to be transparently resubmitted. The originating
exception is being caught, and a new CORBA.TRANSIENT with minor code
0x49421042 (SERVER_SIGNAL_RETRY) is being thrown to the client.
If any of these warnings are encountered, follow the user
response given in the log. If, after following the user response,
the warnings persist, look at any other errors and warnings in the
Log and Trace Analyzer on the affected servers to look for:
- A possible user response, such as changing a configuration setting.
- Base class exceptions that might indicate a product defect.
You may also see exceptions with "CORBA" as part
of the exception name, since WLM uses CORBA (Common Object Request
Broker Architecture) to communicate between processes. Look for a
statement in the exception stack specifying a "minor code".
These codes denote the specific reason a CORBA call or response could
not complete. WLM minor codes fall in the range of 0x49421040 - 0x4942104F.
For an explanation of minor codes related to WLM, see the topic "Reference: Generated API documentation" for the package and class
com.ibm.websphere.wlm.WsCorbaMinorCodes.
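When scanning an exception stack, a quick range check tells you whether a minor code belongs to WLM. This sketch uses the range 0x49421040 - 0x4942104F, consistent with the SERVER_SIGNAL_RETRY code 0x49421042 shown in message WWLM0067W; the helper function itself is illustrative only:

```python
# WLM CORBA minor codes occupy a contiguous range.
WLM_MINOR_CODE_FIRST = 0x49421040
WLM_MINOR_CODE_LAST = 0x4942104F

# Thrown with CORBA.TRANSIENT when a request cannot be transparently retried.
SERVER_SIGNAL_RETRY = 0x49421042

def is_wlm_minor_code(minor_code):
    """Return True if the CORBA minor code falls in the WLM range."""
    return WLM_MINOR_CODE_FIRST <= minor_code <= WLM_MINOR_CODE_LAST
```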
![[IBM i]](../images/iseries.gif)
Analyze PMI data
The purpose for analyzing the PMI data is to
understand the workload arriving for each member of a cluster. The
data for any one member of the cluster is only useful within the context
of the data of all the members of the cluster.
Use the Tivoli® Performance Viewer to verify that, based on the weights
assigned to the cluster members (the steady-state weights), each server
is getting the correct proportion of the requests.
To use the
Tivoli Performance Viewer to capture PMI metrics, in the Tivoli Performance
Viewer product navigation complete the following actions:
- Select Data Collection in the tree view. Servers that do
not have PMI enabled are grayed out.
- For each server that you wish to collect data on, click Specify...
- You can now enable the metrics. Set the monitoring level to low on the Performance Monitoring Setting panel.
- Click OK.
- You must click Apply for the changes you have made to be
saved.
WLM PMI metrics can be viewed on a server by server basis.
In the Tivoli Performance Viewer select . By default the
data is shown in raw form in a table, collected every 10 seconds,
as an aggregate number. You can also choose to see the data as a delta
or rate, add or remove columns, clear the buffer, reset the metrics
to zero, and change the collection rate and buffer size.
After
you have obtained the PMI data, you should calculate the percentage
of numIncomingRequests for each member of the cluster to the total
of the numIncomingRequests of all members of the cluster. A comparison
of this percentage value to the percentage of weights directed to
each member of the cluster provides an initial look at the balance
of the workload directed to each member of a cluster.
In addition
to numIncomingRequests, two other metrics show how work is balanced
between the members of a cluster: numIncomingStrongAffinityRequests
and numIncomingNonWLMObjectRequests. These two metrics show the number
of requests directed to a specific member of a cluster that could
only be serviced by that member.
For example, consider a 3-server
cluster. The following weights are assigned to each of these three
servers:
- Server1 = 5
- Server2 = 3
- Server3 = 2
Allow our cluster of servers to start servicing requests,
and wait for the system to reach a steady state; that is, the number
of incoming requests to the cluster equals the number of responses
from the servers. In such a situation, we would expect the percentage
of requests routed to each server to be:
- % routed to Server1 = weight1 / (weight1+weight2+weight3) = 5/10
or 50%
- % routed to Server2 = weight2 / (weight1+weight2+weight3) = 3/10
or 30%
- % routed to Server3 = weight3 / (weight1+weight2+weight3) = 2/10
or 20%
Now let us consider a case where there are no incoming
requests with strong affinity and no non-WLM object requests.
In this scenario, let us assume that the PMI metrics gathered
show the number of incoming requests for each server are:
- numIncomingRequestsServer1 = 390
- numIncomingRequestsServer2 = 237
- numIncomingRequestsServer3 = 157
Thus, the total number of requests coming into the cluster is:
- numIncomingRequestsCluster = numIncomingRequestsServer1 + numIncomingRequestsServer2
+ numIncomingRequestsServer3 = 784
- numIncomingStrongAffinityRequests = 0
- numIncomingNonWLMObjectRequests = 0
Can we decide
based on this data if WLM is properly balancing the incoming requests
among the servers in our cluster? Since there are no requests with
strong affinity, the question we need to answer is, are the requests
in the ratios we expect based on the assigned weights? The computation
to answer that question is straightforward:
- % (actual) routed to Server1 = 390 / 784 = 49.8%
- % (actual) routed to Server2 = 237 / 784 = 30.2%
- % (actual) routed to Server3 = 157 / 784 = 20.0%
So WLM is behaving as designed; the data are exactly what
is expected, based on the weights assigned to the servers.
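The comparison worked through above can be scripted. The following Python sketch recomputes the expected and actual shares from the example's weights and PMI counts (the variable names are illustrative, not product APIs):

```python
weights = {"Server1": 5, "Server2": 3, "Server3": 2}
incoming = {"Server1": 390, "Server2": 237, "Server3": 157}

# Expected share: each member's weight over the sum of all weights.
weight_total = sum(weights.values())
expected = {name: w / weight_total for name, w in weights.items()}

# Actual share: each member's numIncomingRequests over the cluster total.
request_total = sum(incoming.values())
actual = {name: n / request_total for name, n in incoming.items()}

for name in weights:
    print(f"{name}: expected {expected[name]:.1%}, actual {actual[name]:.1%}")
```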
Now
let us consider another 3-server cluster. We have assigned the following
weights to each of these three servers:
- Server1 = 5
- Server2 = 5
- Server3 = 5
Allow this cluster of servers to start servicing requests
and wait for the system to reach a steady state; that is, the number
of incoming requests to the cluster equals the number of responses
from the servers. In such a situation, we would expect that the percentage
of requests that are routed to Server1-3 would be:
- % routed to Server1 = weight1 / (weight1+weight2+weight3) = 5/15
or 1/3 of the requests.
- % routed to Server2 = weight2 / (weight1+weight2+weight3) = 5/15
or 1/3 of the requests.
- % routed to Server3 = weight3 / (weight1+weight2+weight3) = 5/15
or 1/3 of the requests.
In this scenario, let us assume that the PMI metrics gathered
show the number of incoming requests for each server are:
- numIncomingRequestsServer1 = 1236
- numIncomingRequestsServer2 = 1225
- numIncomingRequestsServer3 = 1230
Thus, the total number of requests coming into the cluster:
- numIncomingRequestsCluster = numIncomingRequestsServer1 + numIncomingRequestsServer2
+ numIncomingRequestsServer3 = 3691
- numIncomingStrongAffinityRequests = 445, and all 445 requests
are aimed at Server1.
- numIncomingNonWLMObjectRequests = 0.
In this case, we see that the total number of requests was
split almost evenly among the three servers. The distribution
is:
- % (actual) routed to Server1 = 1236 / 3691= 33.49%
- % (actual) routed to Server2 = 1225 / 3691= 33.19%
- % (actual) routed to Server3 = 1230 / 3691= 33.32%
However, the correct interpretation of this data is that the
routing of requests was not perfectly balanced, because Server1 also received
several hundred strong affinity requests. WLM attempts to compensate
for strong affinity requests directed to one or more servers by distributing
new incoming requests preferentially to the servers that are not participating
in transactional affinity, to compensate for those servers that are
participating in transactions. For incoming requests with both
strong affinity and non-WLM object requests, the analysis is
analogous to this case.
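The compensation can be made visible by separating the pinned strong-affinity requests from the requests WLM was free to route. A sketch using the numbers from this example (illustrative only):

```python
incoming = {"Server1": 1236, "Server2": 1225, "Server3": 1230}
strong_affinity = {"Server1": 445, "Server2": 0, "Server3": 0}

# Requests WLM was free to place on any member of the cluster.
routable = {name: incoming[name] - strong_affinity[name] for name in incoming}

# Server1 received far fewer WLM-routed requests than the other members,
# which is why the totals ended up nearly equal despite the 445 affinity
# requests pinned to Server1.
print(routable)
```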
If, after you have analyzed the PMI
data and accounted for transactional affinity and non-WLM object requests,
the percentage of actual incoming requests to the servers in a cluster
does not reflect the assigned weights, then requests
are not being properly balanced. If this is the case, it is recommended
that you repeat the steps described above for eliminating environment
and configuration issues and browsing log files before proceeding.
Resolve problem or contact IBM support
![[IBM i]](../images/iseries.gif)
If the PMI data
or client logs indicate an error in WLM, collect the following information
and contact IBM support.
If the client logs indicate
an error in WLM, collect the following information and contact IBM
support.
- A detailed description of your environment.
- A description of the symptoms.
![[IBM i]](../images/iseries.gif)
The SystemOut.log and SystemErr.log
files for all servers in the cluster.
The server log files for all servers in the cluster.
![[IBM i]](../images/iseries.gif)
The activity.log file.
![[IBM i]](../images/iseries.gif)
The First Failure Data Capture log files.
![[IBM i]](../images/iseries.gif)
The PMI metrics.
- A description of what the client is attempting to do, and a description
of the client. For example, 1 thread, multiple threads, servlet, J2EE
client, etc.
If none of these steps solves the problem, check to see
if the problem has been identified and documented using the links
in the "Diagnosing and fixing problems: Resources for learning" topic. If you do not see a problem that resembles yours, or if the
information provided does not solve your problem, contact IBM support
for further assistance.
![[AIX Solaris HP-UX Linux Windows]](../images/dist.gif)
If you do not find your problem listed there,
contact IBM Support.
For current information available from IBM Support
on known problems and their resolution, see the IBM Support page. You should also refer to
this page before opening a PMR because it contains documents that
can save you time gathering information needed to resolve a problem.
For current information available from IBM Support
on known problems and their resolution, see the IBM i software page. You should also refer
to this page before opening a PMR because it contains information
about the documents that you have to gather and send to IBM to receive
help with a problem.