Information about the Workload Management Service
The Workload Management (WLM) component in IBM WebSphere Application
Server provides routing services for incoming application requests so that
they can be distributed to server-side resources, such as Enterprise
Java™ Beans (EJBs) and servlets, that are capable of processing them.
WLM also provides failover capabilities when the servers hosting the
applications are members of a cluster.
More details about Workload Management are available in the WebSphere
Application Server V5.1 Information Center.
What symptoms are you experiencing?
- Application requests fail to get serviced and runtime exceptions
are seen in the SystemOut logs
- org.omg.CORBA.TRANSIENT: SIGNAL_RETRY: This is a
transient exception indicating that the workload management routing
service attempted to route a request to a target server but could not
complete it.
This exception is thrown to the client when a request is sent out to a
target server and no reply ever comes back. It informs the application
that another target might be available to satisfy the request, but that
WLM could not fail the request over transparently because the completion
status was not determined to be “no”. In this case the client application
must decide whether to resend the request (a handling sketch follows this
list of exceptions).
- org.omg.CORBA.NO_IMPLEMENT: This exception is
thrown when none of the servers participating in the workload management
of the Enterprise JavaBeans (EJB) are available and the routing service
cannot locate a suitable target for the request.
The exception occurs, for example, if the cluster is stopped or if the
application does not have a path to any of the cluster members. There are
several kinds of NO_IMPLEMENT, which can be distinguished by the message
or minor code associated with the exception.
- NO_IMPLEMENT: Retry Limit Reached & NO_IMPLEMENT:
Forward Limit Reached: Each of these exceptions is thrown when WLM
attempts to route a request to a server and either receives an exception
that is considered retryable or has the request forwarded.
To avoid an infinite selection loop, these exceptions are thrown if errors
are received from, or forwarding is done by, the same server ten
consecutive times.
- NO_IMPLEMENT: No Cluster Data: This exception,
often seen in the Node Agent, is thrown when WLM is asked to make a
selection over a cluster, but no data has been found or gathered for that
particular cluster.
This error is often seen briefly when the first requests for a cluster
arrive after startup of a cell; in that case it can usually be resolved by
setting the relevant custom WLM properties to true. Exceptions of this
type that remain persistent should be reported to IBM Support.
- NO_IMPLEMENT: No Available Target: This is a more
general exception meaning that WLM may have some cluster data (perhaps not
all), but cannot find a valid target for the request in the data currently
available.
It is possible that members have been marked unusable, or simply that WLM
does not yet have the current data necessary to route the request to the
intended resource.
- NoAvailableTargetException: This exception is
internal to IBM. You may see it printed in traces with the WLM trace
specification enabled, but it is caught and handled inside the WLM code.
This exception is often expected, especially in failover and startup
scenarios; if a real problem exists, it manifests itself as one of the
NO_IMPLEMENT exceptions above.
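Where the client implements its own exception handling, the decision logic
described above can be sketched as follows. This is a minimal illustration
only, not IBM-provided code: the remote method businessCall(), the stub
interface, the retry limit, and the idempotency check are all hypothetical
placeholders for your application.

    import org.omg.CORBA.NO_IMPLEMENT;
    import org.omg.CORBA.TRANSIENT;

    public class WlmAwareClient {

        // Hypothetical interface standing in for your remote EJB stub.
        public interface MyBeanStub {
            Object businessCall() throws Exception;
        }

        private static final int MAX_RETRIES = 3;

        public static Object callWithRetry(MyBeanStub stub) throws Exception {
            for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
                try {
                    return stub.businessCall();
                } catch (TRANSIENT te) {
                    // WLM transparently retries only requests whose completion
                    // status was "no". A TRANSIENT that reaches the client may
                    // already have executed, so resend only if the operation is
                    // safe to run twice.
                    if (!isIdempotent()) {
                        throw te;
                    }
                } catch (NO_IMPLEMENT ne) {
                    // The NO_IMPLEMENT variants are distinguished by message
                    // text (or minor code). "No Cluster Data" is often a
                    // transient startup condition, so back off and retry; the
                    // other variants usually warrant investigating the cluster
                    // state first.
                    String msg = String.valueOf(ne.getMessage());
                    if (msg.contains("No Cluster Data")) {
                        Thread.sleep(2000L * attempt);
                    } else {
                        throw ne;
                    }
                }
            }
            throw new Exception("Request failed after " + MAX_RETRIES + " attempts");
        }

        // Application-specific: whether businessCall() can safely run twice.
        private static boolean isIdempotent() {
            return false;
        }
    }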
- Enterprise JavaBean requests are not distributed to all servers
- Make sure the target servers are started. Use the Administrative
console to try starting them, or if a target server that is failing to
service requests is already started, try restarting it.
- Try accessing the enterprise bean directly on the problem server
(see the lookup sketch after this list). Perhaps the issue is not related
to workload management. If this fails, review the topic Cannot
access enterprise bean from a servlet, JSP, stand-alone program, or other
client in the WebSphere Application Server V5.1 Information Center
- Check your configuration. Review the Troubleshooting
the Workload Management component topics in the WebSphere
Application Server V5.1 Information Center.
- WebSphere Application Server V6 - Make sure the server is "in view" with
respect to HAManager and the core group the server belongs to. If the
server is not in view, it may be islanded from the rest of the cell and
not seen as available by the other servers, and subsequently by the
client.
- WebSphere Application Server V6.0 and V6.1 - Make sure static routing
is not enabled; it can cause the same islanding problem as above.
- WebSphere Application Server V6.0 and V6.1 - Make sure HAManager is
enabled; disabling it also causes WLM to function improperly.
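To test a specific cluster member directly, a client can point its JNDI
provider URL at that member's BOOTSTRAP_ADDRESS so the lookup bypasses
cluster routing. The following is a minimal sketch; the host name, port,
and JNDI name are placeholders for your environment.

    import java.util.Hashtable;
    import javax.naming.Context;
    import javax.naming.InitialContext;

    public class DirectLookup {
        public static void main(String[] args) throws Exception {
            Hashtable<String, String> env = new Hashtable<String, String>();
            env.put(Context.INITIAL_CONTEXT_FACTORY,
                    "com.ibm.websphere.naming.WsnInitialContextFactory");
            // BOOTSTRAP_ADDRESS of the specific cluster member under test
            // (placeholder host and port).
            env.put(Context.PROVIDER_URL, "corbaloc:iiop:problemhost:2809");

            Context ctx = new InitialContext(env);
            // Placeholder JNDI name; narrow the result with
            // PortableRemoteObject for EJB 2.x home interfaces as usual.
            Object home = ctx.lookup("ejb/MyBeanHome");
            System.out.println("Lookup succeeded: " + home.getClass().getName());
        }
    }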
- Enterprise JavaBean requests are not distributed evenly
- Possible reasons for this behavior are:
- Improper configuration
- Environment issues such as the availability of servers or
applications.
- A large number of requests that involve transactional
affinity, or
- A small number of clients
- Things to consider:
- WLM sprays a variety of different requests (referred to as
WLMable requests).
If you are only tracking the spraying of a particular application request
and it is unbalanced, that does not mean that WLM is spraying improperly.
The classic example is a cluster of two members with the same server
weights and a client application that loops on two WLMable requests,
operationA and operationB. If a tracking system only looks at how
operationA requests are being distributed, they will all be sent to one
server. This is not a bug or problem, as all operationB requests are sent
to the other server. This “pattern problem” is often seen in small
test environments with only a few servers and is rarely seen in production
systems with more cluster members.
- In WebSphere Application Server V5.0 and V5.1, the
Workload Management service uses a round robin scheme to distribute
Enterprise JavaBean requests (an illustrative sketch follows this list):
- To determine whether the distribution of requests is being
properly balanced, compare the number of requests processed by each
cluster member with its corresponding weight. This can be done by
following the steps in the Troubleshooting
the Workload Management component topic in the Information Center.
- When the percentage of requests arriving for each member
of the cluster is consistent with their weights, analysis of the
application is required to determine why the workload is imbalanced even
though the number of requests is balanced.
- When numIncomingNonWLMObjectRequests
is not balanced among the members of the cluster and is large in relation
to numIncomingRequests, the imbalance is caused by
non-distributable components installed on the members of the cluster.
A modification to the configuration will yield a more balanced
environment.
- When numIncomingStrongAffinityRequests
is not balanced among the members of the cluster and is large in relation
to numIncomingRequests, the imbalance is caused by requests
invoked within a transaction. These can be reduced by installing the
objects involved in a transaction within the same cluster.
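The round robin scheme can be pictured with a small standalone sketch.
This is illustrative only, not the actual WLM implementation: each member
starts a cycle with its configured weight, each selection decrements that
member's remaining weight, and the cycle restarts once every remaining
weight reaches zero.

    public class WeightedRoundRobin {
        private final int[] weights;    // configured server weights
        private final int[] remaining;  // weight left in the current cycle
        private int next = 0;

        public WeightedRoundRobin(int[] weights) {
            this.weights = weights.clone();
            this.remaining = weights.clone();
        }

        public synchronized int selectMember() {
            for (int i = 0; i < remaining.length; i++) {
                int candidate = (next + i) % remaining.length;
                if (remaining[candidate] > 0) {
                    remaining[candidate]--;
                    next = (candidate + 1) % remaining.length;
                    return candidate;
                }
            }
            // All weights exhausted: start a new cycle and select again.
            System.arraycopy(weights, 0, remaining, 0, weights.length);
            return selectMember();
        }

        public static void main(String[] args) {
            WeightedRoundRobin wrr = new WeightedRoundRobin(new int[] {2, 3});
            for (int i = 0; i < 10; i++) {
                System.out.print(wrr.selectMember() + " ");
            }
            // Over 10 requests: member 0 is selected 4 times, member 1 six
            // times, matching the 2:3 weight ratio.
        }
    }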
- In WebSphere Application Server V6.0 and V6.1, the
Workload Management service uses a weighted proportional scheme to
distribute Enterprise JavaBean requests (an illustrative sketch follows
below).
In V6.0.2, and much more so in V6.1, the WLM selection logic gained
feedback mechanisms that can change routing behavior on the fly. WLM
reacts to various scenarios and even server load when making routing
decisions, so it is entirely possible for WLM to function perfectly while
requests are not balanced exactly to the configured server weights. For
example, if a cluster contains two machines, one a powerful 8-way with
plenty of RAM and the other a single-processor desktop machine, then even
with both server weights set to 2 you could see 80% or more of the
requests go to the 8-way machine, simply because the desktop machine
cannot keep up with it.
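The effect of such feedback can be pictured with a simple sketch. The
scaling formula below is invented purely for illustration and is not the
actual V6 selection logic: the chance that a member is selected is
proportional to its configured weight, scaled down as its count of
in-flight requests grows, so a slower machine naturally receives fewer
requests.

    import java.util.Random;

    public class WeightedProportional {
        private final int[] weights;     // configured server weights
        private final int[] outstanding; // in-flight requests per member
        private final Random random = new Random();

        public WeightedProportional(int[] weights) {
            this.weights = weights.clone();
            this.outstanding = new int[weights.length];
        }

        public synchronized int selectMember() {
            double[] effective = new double[weights.length];
            double total = 0.0;
            for (int i = 0; i < weights.length; i++) {
                // A slow member accumulates outstanding requests, shrinking
                // its effective weight even when configured weights are equal.
                effective[i] = weights[i] / (1.0 + outstanding[i]);
                total += effective[i];
            }
            double pick = random.nextDouble() * total;
            for (int i = 0; i < effective.length; i++) {
                pick -= effective[i];
                if (pick <= 0.0) {
                    outstanding[i]++;
                    return i;
                }
            }
            outstanding[effective.length - 1]++; // floating-point edge case
            return effective.length - 1;
        }

        public synchronized void requestCompleted(int member) {
            if (outstanding[member] > 0) {
                outstanding[member]--;
            }
        }
    }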
- A failing server still receives Enterprise JavaBean requests
(failover fails)
Some possible causes are:
- The client might have been in a running transaction with
an Enterprise JavaBean on the server that went down.
This might be working as designed, letting this particular exception
flow back to the client. Because the transaction might have completed,
failing the request over to another server could result in the request
being serviced twice. This function is referred to as “Quiesce mode”.
Quiesce mode is entered when a server is asked to shut down. While in
Quiesce mode, the server rejects all incoming requests that it
determines are new work, but still allows in-flight requests to complete.
This is primarily designed to let transactional work finish, as above, to
prevent unnecessary TRANSACTION_ROLLEDBACK exceptions, although requests
other than transactional ones can also be allowed into the server. By
default, Quiesce mode lasts for a maximum of 3 minutes (this is
configurable), although a server can exit quiesce earlier if all
registered components agree that it is okay to do so based on their own
criteria. If a request is rejected by a server in quiesce mode, WLM
receives an org.omg.CORBA.COMM_FAILURE with a completion status of “no”,
and the request is automatically retried by WLM.
- If the requests sent to the servers consistently come back to the
client with any other exceptions, it might be that no servers are
available.
In this case, follow the resolution steps outlined in Troubleshooting
the Workload Management component in the WebSphere Application
Server V5.1 Information Center.
- Stopped or hung servers do not share the workload after being
restored
This error occurs when servers that were unavailable are not
recognized by the Workload Management component after they are restored.
There is an unusable interval, determined by the property
com.ibm.websphere.wlm.unusable.interval, during which the workload
manager waits before sending requests to a server that has been marked
unusable. By default this interval is 5 minutes. You can confirm that this
is the problem by ensuring that the server is up and running and then
waiting for the unusable interval to elapse before checking whether
failover occurs. If the server still does not participate in the workload,
see bullet #2 above for additional reasons why this could occur. A sketch
of setting the property follows.
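The unusable interval can be tuned, for example while testing failover
recovery, by setting the property as a system property on the client
process (for instance, through the generic JVM arguments). The value is
in seconds; lowering it to 60 seconds would look like this:

    -Dcom.ibm.websphere.wlm.unusable.interval=60

Shortening the interval makes restored servers rejoin the workload sooner
at the cost of more frequent probing of servers that are still down.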
- What to do next?
If the scenarios and steps above did not resolve your problem, access
the information available from the WebSphere
Application Server Support site and browse for known
problems and their resolutions.
If no useful hints are found on the support site, see the MustGather
for WLM problems and collect the requested information before opening a
PMR.