Problem (Abstract)
Ensuring WebSphere® for z/OS® Application Server reliability and availability when Timeout ABENDs occur
Resolving the problem
WebSphere Application Servers can encounter problems during normal operation. Some problems are caused by faulty applications; others are caused by components that WebSphere interacts with, such as DB2®, WSAM, Java™, and RRS.
This technote describes errors that might disrupt the High Availability
characteristic of WebSphere for z/OS servers. Timeout ABENDs are reviewed
in some detail because
- they account for a substantial portion of all servant
region failures.
- they generally require timeout root cause analysis, and
tuning of one sort or another to resolve.
In this article we will refer to specific WebSphere z/OS Information
Center articles on High Availability and Timeouts. You can find the most
up-to-date version of the WebSphere for z/OS V5.1 Information Center at
this URL:
http://publib.boulder.ibm.com/infocenter/wasinfo/v5r1/index.jsp
From this page, look in the Contents pane on the left for the entry
"WebSphere Application Server for z/OS, Version 5.1.x". Click this entry
to expand it. Look for the entry "Troubleshooting" and expand that
section. This is where the problem determination information is anchored
within the Infocenter.
A WebSphere for z/OS Version 5.1.x server consists of a controller region
address space and one or more servant region address spaces. The servant
regions execute the customer's business logic and the controller manages
the servant regions. Servant regions can terminate for many reasons: out-of-memory conditions, program checks within the JVM™ or WebSphere runtime, certain errors in DB2 or other MVS subsystems, unexpected CANCELs of the server from the MVS console, and some hardware errors. Installations
implementing high availability (24 hours a day/7 days a week) service must
have plans in place to mitigate the loss of a servant or a controller, no
matter what the cause. Well-designed WebSphere server installations strive
to eliminate every single point of failure (SPOF). Such installations can
continue functioning if a servant (or a controller or an entire LPAR) goes
down.
The Information Center describes best practices for achieving High
Availability. Two characteristics of this best practice are:
- multiple servants per controller region
- multiple instances of the server spread across multiple distinct LPARs
(a server cluster)
Such a configuration reduces the impact of
- a massive LPAR failure, where the controller and all servant regions on one machine abruptly cease to exist
- a controller region ABEND that takes down all the servants belonging to that controller
- a servant region ABEND
Please refer to the Information Center sections on High Availability for details on these best practices.
Of all the errors that can occur in a WebSphere server, Timeout ABENDs are
possibly the most common. They can occur due to unusual delays in the
network and in backend data stores (for example, DB2). Timeouts can be
caused by locking problems within the application. They can also be caused
by infinite loops in the application. A hung or a looping work request has
the potential to hold up other work due to the application's locking
structure. If a stuck work request holds the lock for a critical resource,
then other unrelated work will also become hung. If the work request is
looping then not only will other work requests get stuck behind the
looper, the overall system performance will suffer because of the wasted
CPU time.
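The lock-contention effect described above can be sketched in a few lines of Java. This is an illustration of the general mechanism only, not WebSphere internals; all class and variable names here are hypothetical. One "hung" request acquires a lock and never releases it, and an unrelated request that briefly needs the same lock times out waiting behind it:

```java
import java.util.concurrent.*;
import java.util.concurrent.locks.ReentrantLock;

// Sketch: one stuck request holding the lock for a critical resource
// causes other, unrelated work that needs the same lock to hang too.
public class StuckLockSketch {
    public static void main(String[] args) throws Exception {
        ReentrantLock criticalResource = new ReentrantLock();
        ExecutorService servant = Executors.newFixedThreadPool(2);
        CountDownLatch neverReleased = new CountDownLatch(1);

        // A "hung" request: acquires the lock and then waits forever.
        servant.submit(() -> {
            criticalResource.lock();
            try { neverReleased.await(); } catch (InterruptedException e) { }
            finally { criticalResource.unlock(); }
        });
        Thread.sleep(100);  // let the hung request grab the lock first

        // An unrelated request that merely needs the same lock briefly.
        Future<Boolean> other = servant.submit(
            () -> criticalResource.tryLock(200, TimeUnit.MILLISECONDS));

        System.out.println("unrelated work got the lock: " + other.get());

        neverReleased.countDown();  // release the stuck request so the JVM can exit
        servant.shutdown();
    }
}
```

In a real servant the second request would block indefinitely rather than time out after 200 ms; the bounded wait here just makes the hang observable.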
WebSphere for z/OS protects itself from hung and looping work requests by
implementing timers to keep track of how long various kinds of work have
been under dispatch. These timers are discussed in the Infocenter; please see the article "Understanding how timers work".
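As a rough illustration of the idea (not WebSphere's actual timer implementation; the names below are made up), such a timer amounts to recording a start time for each dispatched request and periodically comparing elapsed time against a configured limit:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: record when each request was dispatched, then flag any request
// whose elapsed time under dispatch exceeds the configured limit.
public class DispatchTimerSketch {
    public static void main(String[] args) throws Exception {
        Map<String, Long> dispatchStart = new ConcurrentHashMap<>();
        long limitMillis = 100;  // stands in for a configured dispatch timeout

        dispatchStart.put("request-1", System.currentTimeMillis());
        Thread.sleep(150);  // request-1 has now been under dispatch too long

        // The periodic check a controller-style timer would perform:
        for (Map.Entry<String, Long> e : dispatchStart.entrySet()) {
            long elapsed = System.currentTimeMillis() - e.getValue();
            if (elapsed > limitMillis) {
                System.out.println(e.getKey() + " timed out after " + elapsed + " ms");
            }
        }
    }
}
```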
The controller keeps track of how long each request has been running in a
servant. When a work request does not complete in the configured amount of
time, the request is said to have timed out. When a timeout occurs,
WebSphere typically reports the timeout to the requestor and terminates
the servant address space. When the servant region ends, the other work requests present in that servant also end. Servant region
termination effectively cleans up the resources held by the stuck work
request and notifies the request invokers that the requests have failed.
Typically, when a servant region times out, an EC3 ABEND with a reason code in the range 04130002 through 04130008 occurs and is accompanied by an SVCDUMP. If DAE is active on the system and an SVCDUMP for the same symptoms has already been taken, MVS might suppress the dump. Analysis of the dump is important: with it, the cause of the timeout can often be determined, and knowing the cause generally points out what needs to change to avoid that timeout.
The Information Center has several articles discussing the possible causes and fixes for timeout conditions; please refer to them for more detail.
It is possible to prevent Timeout ABENDs for HTTP timeouts by setting the variable protocol_https_timeout_output_recovery=session.
This setting should be used with great care. When the timeout is detected,
the requestor is notified that the request has timed out, but the request
is allowed to remain executing on the thread until it completes or the
server is stopped. Since the servant region is not ended, the hung or
looping request could continue to tie up resources a very long time. See
the article Controlling
behavior through timeout values for details.
It is also possible to set a timeout variable to zero, in which case that timeout will never occur. This is not recommended: once a thread in a servant hangs, it can no longer accept new work, and if all the threads in a servant hang, the servant becomes useless. If the server is allowed only one servant region, an outage occurs because no work can progress through the server.
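The exhaustion scenario above can be sketched with a small fixed thread pool standing in for a servant's worker threads. All names are illustrative; the point is only that with no timeout to clean them up, hung requests permanently occupy threads until no new work can start:

```java
import java.util.concurrent.*;

// Sketch: with timeouts disabled, each hung request permanently occupies a
// worker thread; once every thread is hung, new work queues forever.
public class HungPoolSketch {
    public static void main(String[] args) throws Exception {
        int workerThreads = 2;
        ExecutorService servantPool = Executors.newFixedThreadPool(workerThreads);
        CountDownLatch hangForever = new CountDownLatch(1);

        // Two "hung" requests with no timeout to clean them up.
        for (int i = 0; i < workerThreads; i++) {
            servantPool.submit(() -> {
                try { hangForever.await(); } catch (InterruptedException e) { }
            });
        }

        // A new request queues but never starts: the servant is useless.
        Future<String> newWork = servantPool.submit(() -> "done");
        boolean completed;
        try {
            newWork.get(200, TimeUnit.MILLISECONDS);
            completed = true;
        } catch (TimeoutException e) {
            completed = false;
        }
        System.out.println("new work completed: " + completed);

        hangForever.countDown();   // release the "hung" threads so the JVM exits
        servantPool.shutdown();
    }
}
```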