WebSphere z/OS High Availability and Timeout ABENDs
 Technote (troubleshooting)
 
Problem (Abstract)
Ensuring WebSphere® for z/OS® Application Server reliability and availability when Timeout ABENDs occur
 
Resolving the problem
WebSphere Application Servers can encounter problems during normal operation. Some problems are caused by faulty applications. Others are caused by components that WebSphere interacts with, such as DB2®, WSAM, Java™ and RRS.
This technote describes errors that might disrupt the High Availability characteristic of WebSphere for z/OS servers. Timeout ABENDs are reviewed in some detail because:
  • they account for a substantial portion of all servant region failures.
  • resolving them generally requires root cause analysis of the timeout, followed by tuning of one sort or another.

In this article we will refer to specific WebSphere z/OS Information Center articles on High Availability and Timeouts. You can find the most up-to-date version of the WebSphere for z/OS V5.1 Information Center at this URL:

http://publib.boulder.ibm.com/infocenter/wasinfo/v5r1/index.jsp

From this page, look in the Contents pane on the left for the entry "WebSphere Application Server for z/OS, Version 5.1.x". Click this entry to expand it. Look for the entry "Troubleshooting" and expand that section. This is where the problem determination information is anchored within the Infocenter.

A WebSphere for z/OS Version 5.1.x server consists of a controller region address space and one or more servant region address spaces. The servant regions execute the customer's business logic, and the controller manages the servant regions. Servant regions can terminate for many reasons: out-of-memory conditions, program checks within the JVM™ or WebSphere runtime, certain errors in DB2 or other MVS subsystems, unexpected CANCELs of the server from the MVS console, and some hardware errors. Installations implementing high availability (24 hours a day, 7 days a week) service must have plans in place to mitigate the loss of a servant or a controller, whatever the cause. Well-designed WebSphere server installations strive to eliminate every single point of failure (SPOF). Such installations can continue functioning if a servant (or a controller, or an entire LPAR) goes down.

The Information Center describes best practices for achieving High Availability. Two characteristics of this best practice are:
  1. multiple servants per controller region
  2. multiple instances of the server spread across multiple distinct LPARs (a server cluster)

Such a configuration reduces the impact of
  • massive LPAR failure where the controller and all servant regions on one machine abruptly cease to exist
  • controller region ABEND that takes down all the servants belonging to that controller
  • servant region ABEND
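
To make the effect of this layout concrete, here is a minimal Python sketch. It is not WebSphere code. It assumes a simplified model in which each LPAR hosts one controller with some number of servants, and it counts how many servants are still running after each kind of failure listed above. The class `Topology`, the method `surviving_servants`, and the LPAR names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Topology:
    """Hypothetical model: each LPAR hosts one controller with N servants."""
    servants_per_lpar: dict  # LPAR name -> number of servant regions

    def surviving_servants(self, failure: str, lpar: str) -> int:
        """Count servants still running after one failure on the given LPAR.

        failure is one of:
          "lpar"       - the whole LPAR is lost
          "controller" - the controller ABENDs and takes its servants with it
          "servant"    - a single servant region ABENDs
        """
        remaining = dict(self.servants_per_lpar)
        if failure in ("lpar", "controller"):
            remaining[lpar] = 0
        elif failure == "servant":
            remaining[lpar] = max(remaining[lpar] - 1, 0)
        else:
            raise ValueError(f"unknown failure type: {failure}")
        return sum(remaining.values())


# One servant on one LPAR versus two servants on each of two LPARs.
single = Topology({"LPAR1": 1})
clustered = Topology({"LPAR1": 2, "LPAR2": 2})
for failure in ("servant", "controller", "lpar"):
    print(failure,
          single.surviving_servants(failure, "LPAR1"),
          clustered.surviving_servants(failure, "LPAR1"))
```

In this model the single-servant server has no surviving servant after any of the three failures, while the clustered layout always keeps at least two.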

Please refer to the following sections in the Information Center for details on these best practices:

Of all the errors that can occur in a WebSphere server, Timeout ABENDs are possibly the most common. They can result from unusual delays in the network or in backend data stores (for example, DB2), from locking problems within the application, or from infinite loops in the application. A hung or looping work request can hold up other work because of the application's locking structure: if a stuck work request holds the lock for a critical resource, other unrelated work that needs that lock also hangs. If the work request is looping, other work requests get stuck behind the looper, and overall system performance also suffers because of the wasted CPU time.

WebSphere for z/OS protects itself from hung and looping work requests by implementing timers that keep track of how long various kinds of work have been under dispatch. These timers are discussed in the Infocenter. Please see the Infocenter article "Understanding how timers work".

The controller keeps track of how long each request has been running in a servant. When a work request does not complete in the configured amount of time, the request is said to have timed out. When a timeout occurs, WebSphere typically reports the timeout to the requestor and terminates the servant address space. When the servant region ends, the other work requests present in that servant also end. Servant region termination effectively cleans up the resources held by the stuck work request and notifies the request invokers that the requests have failed.
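
The behavior described above can be sketched as a small Python model. This is an illustration only, not the WebSphere implementation. The names `Request`, `Servant`, `check_timeouts`, and the 300-second limit are hypothetical. The sketch shows the key consequence: when one request exceeds its limit, the whole servant is ended and every other request running in it fails too.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    request_id: str
    dispatched_at: float      # time (in seconds) the request was dispatched
    status: str = "running"


@dataclass
class Servant:
    name: str
    requests: list = field(default_factory=list)
    alive: bool = True


def check_timeouts(servant: Servant, now: float, limit_seconds: float) -> list:
    """Return the IDs of timed-out requests; if any exist, end the servant."""
    if not servant.alive:
        return []
    expired_ids = [r.request_id for r in servant.requests
                   if r.status == "running"
                   and now - r.dispatched_at > limit_seconds]
    if expired_ids:
        # Ending the servant ends every request still running in it.
        for r in servant.requests:
            if r.status == "running":
                r.status = ("timed_out" if r.request_id in expired_ids
                            else "failed_with_servant")
        servant.alive = False
    return expired_ids


servant = Servant("SR1", [Request("A", dispatched_at=0.0),
                          Request("B", dispatched_at=290.0)])
print(check_timeouts(servant, now=301.0, limit_seconds=300.0))  # ['A']
print([(r.request_id, r.status) for r in servant.requests])
```

Request B has been running for only 11 seconds, yet it fails because it shares a servant with request A.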

Typically, when a servant region times out, an EC3 ABEND with a reason code from 04130002 through 04130008 occurs and is accompanied by an SVCDUMP. If DAE (Dump Analysis and Elimination) is active on the system, MVS might suppress the dump when an SVCDUMP for the same symptoms has already been taken. Analysis of the dump is important: it often reveals the cause of the timeout, and knowing the cause generally indicates what needs to change in order to avoid that timeout.
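
When scanning logs for these failures, a small helper can flag the ABEND and reason code combination named above. The following Python sketch assumes the ABEND code and reason code have already been extracted as strings. The function name `is_timeout_abend` is hypothetical, and the sketch does not attempt to interpret what each individual reason code means.

```python
TIMEOUT_ABEND_CODE = "EC3"
TIMEOUT_REASON_FIRST = 0x04130002
TIMEOUT_REASON_LAST = 0x04130008


def is_timeout_abend(abend_code: str, reason_code: str) -> bool:
    """True if the pair matches EC3 with reason 04130002 through 04130008.

    reason_code is expected as a hexadecimal string such as "04130007".
    """
    if abend_code.strip().upper() != TIMEOUT_ABEND_CODE:
        return False
    try:
        reason = int(reason_code.strip(), 16)
    except ValueError:
        return False  # not a valid hexadecimal reason code
    return TIMEOUT_REASON_FIRST <= reason <= TIMEOUT_REASON_LAST


print(is_timeout_abend("EC3", "04130007"))  # True
print(is_timeout_abend("EC3", "04130009"))  # False
```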

The Information Center has several articles discussing the possible causes and fixes for timeout conditions. Please refer to these articles:
It is possible to stop Timeout ABENDs from happening for HTTP timeouts by setting the variable protocol_http_timeout_output_recovery=SESSION.
This setting should be used with great care. When the timeout is detected, the requestor is notified that the request has timed out, but the request is allowed to remain executing on the thread until it completes or the server is stopped. Since the servant region is not ended, the hung or looping request could continue to tie up resources for a very long time. See the article "Controlling behavior through timeout values" for details.
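
The trade-off between the two recovery behaviors can be summarized in a short Python sketch. It is a simplified model under stated assumptions, not WebSphere code. The function `apply_recovery` is hypothetical, and the mode label "servant" is used here only as a name for the behavior in which the servant region is ended on timeout.

```python
def apply_recovery(mode: str, in_flight_requests: int) -> dict:
    """Model the outcome for one servant when one request times out.

    mode "servant": the servant region is ended, so every other in-flight
                    request fails and the timed-out request's thread is gone.
    mode "session": only the requestor is notified; the timed-out request
                    keeps its thread until it completes or the server stops.
    """
    if in_flight_requests < 1:
        raise ValueError("at least the timed-out request must be in flight")
    if mode == "servant":
        return {"requestor_notified": True,
                "servant_terminated": True,
                "other_requests_failed": in_flight_requests - 1,
                "threads_still_held": 0}
    if mode == "session":
        return {"requestor_notified": True,
                "servant_terminated": False,
                "other_requests_failed": 0,
                "threads_still_held": 1}
    raise ValueError(f"unknown recovery mode: {mode}")


for mode in ("servant", "session"):
    print(mode, apply_recovery(mode, in_flight_requests=4))
```

With four requests in flight, ending the servant fails three unrelated requests but frees everything, while the session behavior spares them at the cost of one thread that stays occupied indefinitely.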

It is also possible to set a timeout variable to zero, in which case that timeout never occurs. This is not recommended, because once a thread in a servant hangs, it can no longer accept new work. If all the threads in a servant hang, the servant becomes useless. If the server is allowed only one servant region, an outage occurs because no work can progress through the server.
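
The following Python sketch illustrates why disabling a timeout is risky. It is a toy model, not a measurement of any real server. It assumes that a request which hangs never releases its worker thread and that every other request finishes immediately. The function name `free_threads_without_timeout` and the thread count are hypothetical.

```python
def free_threads_without_timeout(total_threads: int, arrivals: list) -> list:
    """Track free worker threads in one servant when no timeout ever fires.

    arrivals is a list of booleans, one per incoming request:
    True means the request hangs, False means it completes normally.
    Returns the number of free threads after each arrival.
    """
    free = total_threads
    history = []
    for hangs in arrivals:
        if free > 0 and hangs:
            free -= 1  # the hung request keeps its thread forever
        # A normal request borrows a thread and returns it, so no net change.
        # With free == 0 the request cannot be dispatched at all.
        history.append(free)
    return history


# Three worker threads, three of the six requests hang.
print(free_threads_without_timeout(3, [False, True, False, True, True, False]))
# [3, 2, 2, 1, 0, 0]
```

Once the count reaches zero it never recovers in this model, which corresponds to the outage described above when the server has only one servant region.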
 
 
 


Document Information


Current web document: swg21215659.html
Product categories: Software > Application Servers > Distributed Application & Web Servers > WebSphere Application Server for z/OS > General
Operating system(s): z/OS
Software version: 6.0.2
Software edition:
Reference #: 1215659
IBM Group: Software Group
Modified date: Sep 8, 2005