WebSphere Extended Deployment, Version 6.0.x
             Operating Systems: AIX, HP-UX, Linux, Solaris, Windows, z/OS


Troubleshooting health management

This topic describes some problems to look for when health management is not working, or not working the way you expect.

Finding the right logs

The health management controller runs as part of the node agent on non-deployment manager nodes.

You can use the Runtime Topology function in the administrative console to locate the active health controller instance. Click Runtime Operations > Runtime Topology and look for the red cross icon on the Runtime Topology panel. If node groups are configured, select them and the unassigned nodes from the second menu. The health management log messages are displayed in the node agent log on the node with the red cross icon.

Health controller settings

The following list contains issues that are encountered as a result of the health controller settings:
Health management controller is disabled
Verify the setting in the administrative console by clicking Operational policies > Autonomic controllers > Health controller and checking both the Configuration and Runtime tabs. The health management controller is enabled by default.
No health controller icon in the Runtime Topology panel
Determine whether the health management controller is running by running the wsadmin checkHmmLocation.jacl script, which is located in the install_root/bin directory of non-deployment manager nodes. This script displays the current location of the controller, if it is running. See Locating the health management controller with scripts for more information. Also, try the Force Data Update option on the Runtime Topology panel to get the health controller icon to display.
Restarts are prohibited at this time
Verify the prohibited restart times in the administrative console by clicking Operational policies > Autonomic controllers > Health controller and by selecting the Prohibited restart field. By default, no times are prohibited.
Restarting too soon after the previous restart
To check the minimum restart interval in the administrative console, click Operational policies > Autonomic controllers > Health controller and check the Minimum Restart Interval field, modifying it if necessary. No minimum interval is defined by default.
Control cycle is too long
To check the control cycle length in the administrative console, click Operational policies > Autonomic controllers > Health controller, check the control cycle length value, and adjust it if necessary. The health controller checks for policy violations periodically. If its control cycle length is too long, it might not detect violations and restart servers quickly enough.
The server has been restarted X times consecutively, and the health condition continues to be violated
In this case, X indicates the maximum consecutive restart parameter of the health controller. The health management controller concludes that restarts are not fixing the problem, and disables the restarts for the server. The following message displays in the log:

WXDH0011W: Server servername exceeded maximum verification failures: disabling restarts.

The health management controller continues to monitor the server and displays messages in the log if the health policy is violated:

WXDH0012W: Server servername with restarts disabled failed health check.

You can enable restarts for the server by performing any of the following actions:
  • Disable and then enable the health management controller.
  • Adjust the Maximum Consecutive Restarts controller setting.
  • Run the following command from the prompt:

    wsadmin -profile HmmControllerProcs.jacl enableServer servername

    This script is available in the install_root/bin directory on the non-deployment manager nodes. This script requires a running deployment manager.
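
The controller settings in this section can be read as a series of checks that a restart must pass. The following Python sketch models that reading to help you reason about why a restart did not happen. It is not product code: the class, the function, the field names, and the value 3 for the maximum consecutive restarts are hypothetical, and the sketch assumes that prohibited restart windows do not span midnight.

```python
from dataclasses import dataclass, field
from datetime import datetime, time, timedelta
from typing import List, Optional, Tuple


@dataclass
class ControllerSettings:
    """Hypothetical mirror of the health controller console fields."""
    enabled: bool = True                      # enabled by default
    prohibited_windows: List[Tuple[time, time]] = field(default_factory=list)
    min_restart_interval: Optional[timedelta] = None  # none by default
    max_consecutive_restarts: int = 3         # assumed value, illustration only


def why_no_restart(settings: ControllerSettings,
                   now: datetime,
                   last_restart: Optional[datetime],
                   consecutive_restarts: int) -> Optional[str]:
    """Return the first setting that would block a restart, or None."""
    if not settings.enabled:
        return "controller disabled"
    # Assumption: each window starts and ends on the same day.
    for start, end in settings.prohibited_windows:
        if start <= now.time() < end:
            return "prohibited restart time"
    if (settings.min_restart_interval is not None
            and last_restart is not None
            and now - last_restart < settings.min_restart_interval):
        return "minimum restart interval not elapsed"
    if consecutive_restarts >= settings.max_consecutive_restarts:
        return "maximum consecutive restarts reached"
    return None


if __name__ == "__main__":
    settings = ControllerSettings(
        prohibited_windows=[(time(9, 0), time(17, 0))],
        min_restart_interval=timedelta(minutes=10),
    )
    print(why_no_restart(settings, datetime(2007, 11, 30, 12, 0), None, 0))
```

The checks are ordered to match the order of the issues in this section; the actual evaluation order inside the controller is not documented here.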

Health policy settings

The following issues are encountered as a result of the health policy settings:
The server is not part of a health policy
Verify that the health policy memberships apply to your server in the administrative console by clicking Operational policies > Health policies.
The reaction mode of a policy containing the server is supervised
Check the administrative console by clicking Runtime Operations > Task Management > Runtime tasks to find approval requests for a restart action for a policy in Supervised mode. Servers are restarted automatically when you set Automatic as the reaction mode. The following message is written to the log for the supervised condition:
WXDH0024I: Server server name has violated the health policy health condition, reaction mode is supervised.
The server is a member of a static cluster and is the only cluster member running
The health policy does not bring down all members of a cluster at the same time. If a cluster has only one member, or only one member is running, then that member is not restarted.
The server is a member of a dynamic cluster, the number of running instances does not exceed the minimum value, and the placement controller is disabled
Check the minimum number of instances required for the dynamic cluster by clicking Servers > Dynamic clusters in the administrative console. In this case, health management treats the dynamic cluster like a static cluster, using the minimum number of instances parameter.
The health management controller has not received the policy
The health management controller does not run on the deployment manager where the health policies are created. If the deployment manager is restarted after the health management controller started, the health management controller might not have the new policy.
You can alleviate this problem by performing the following steps:
  1. Disable the health management controller. In the administrative console, click Operational policies > Autonomic managers > Health controller.
  2. Synchronize the configuration repositories with the back-end nodes. In the administrative console, click System Administration > Nodes. Select the nodes to synchronize, and click Synchronize.
  3. Restart the health management controller. In the administrative console, click Operational policies > Autonomic managers > Health controller.
  4. Synchronize the configuration repositories with the back-end nodes. In the administrative console, click System Administration > Nodes. Select the nodes to synchronize, and click Synchronize.
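
The cluster rules described in this section for static clusters and for dynamic clusters with the placement controller disabled can be summarized in a small decision function. This is only a sketch of one reading of the text: the function name and parameters are hypothetical, and flooring the dynamic cluster threshold at one running member is an assumption made so that the static cluster rule still holds.

```python
from typing import Optional


def cluster_allows_restart(running_members: int,
                           dynamic: bool = False,
                           placement_enabled: bool = False,
                           min_instances: int = 1) -> Optional[bool]:
    """Return True or False when the cluster rule decides the outcome.

    Returns None for a dynamic cluster with the placement controller
    enabled, because the placement controller makes that decision.
    """
    if dynamic and placement_enabled:
        return None
    if dynamic:
        # Treated like a static cluster, using the minimum number of
        # instances. Assumption: the threshold is never less than 1.
        threshold = max(1, min_instances)
    else:
        # Static cluster: never bring down the only running member.
        threshold = 1
    return running_members > threshold


if __name__ == "__main__":
    # Static cluster with a single running member.
    print(cluster_allows_restart(1))
    # Dynamic cluster, placement controller disabled, minimum of 2.
    print(cluster_allows_restart(2, dynamic=True, min_instances=2))
```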

Placement controller interactions

The following list contains issues that are encountered as a result of the health management and placement controller interactions:
The server is a member of a dynamic cluster, but the placement controller cannot be contacted
For dynamic cluster members, health monitoring checks with the placement controller to determine whether a server can be restarted. If the placement controller is enabled, but cannot be contacted, the following message displays in the log:

WXDH1018E: Could not contact the placement controller

Verify that the placement controller is running. You can locate the placement controller on one of the nodes that display in the Runtime Topology panel or by using the checkPlacementLocation.jacl script.
The server is a member of a dynamic cluster, the placement controller is running, and the placement controller instructs health management not to restart the server
The placement controller might require the server instance to remain running.
The server is stopped, but not started.
In a dynamic cluster, a restart can take one of several forms:
  • Restart in place (stop server, start server).
  • Start a server instance on another node, and stop the failing one.
  • Stop the failing server only, assuming that the remaining application instances can satisfy demand.

The placement controller determines which form a restart takes, and if necessary, where to start the new instance. After a restart is performed in a dynamic cluster, health management issues a request to the placement controller to recompute its placement.

Node group membership settings

The following list contains issues that are encountered as a result of the health management and node group membership settings:
The server is on a node that is in maintenance mode.
Health management does not restart a server on a node in maintenance mode. You can take a node out of maintenance mode by clicking System administration > Nodes, selecting the node, and clicking Unset maintenance.

Sensor problems

The following list contains issues that are encountered as a result of sensor problems:
No sensor data is received for the server.
Health management cannot detect a policy violation if it receives no data from the sensors that are required by the policy. If sensor data is not received during the control cycle, health management prints the following log message:

WXDH3001E: No sensor data received during control cycle from server servername for health class healthpolicy.

For response time conditions, health management receives data from the on demand router (ODR). No data is generated for these conditions until requests are sent through the ODR.
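
When you review a node agent log, it can help to collect the health management messages named in this topic in one pass. The following Python sketch scans log lines for those message IDs and groups the matching lines. The function name and the hint text are illustrative, the log line layout around the message ID is an assumption, and the sample lines use server1 as a made-up server name.

```python
import re
from typing import Dict, Iterable, List

# Message IDs taken from this topic; the hint text is a paraphrase.
HINTS: Dict[str, str] = {
    "WXDH0011W": "restarts disabled after maximum consecutive restarts",
    "WXDH0012W": "health check failed while restarts are disabled",
    "WXDH0024I": "policy violated in supervised mode; check runtime tasks",
    "WXDH1018E": "placement controller could not be contacted",
    "WXDH3001E": "no sensor data received during the control cycle",
}

# Assumption: the ID appears verbatim somewhere in the log line.
MESSAGE_ID = re.compile(r"\b(WXDH\d{4}[IWE])\b")


def triage(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group log lines by the health management message ID they contain."""
    found: Dict[str, List[str]] = {}
    for line in lines:
        match = MESSAGE_ID.search(line)
        if match and match.group(1) in HINTS:
            found.setdefault(match.group(1), []).append(line.rstrip("\n"))
    return found


if __name__ == "__main__":
    sample = [
        "WXDH0011W: Server server1 exceeded maximum verification failures: disabling restarts.",
        "some unrelated log line",
        "WXDH1018E: Could not contact the placement controller",
    ]
    for message_id, hits in triage(sample).items():
        print(message_id, "-", HINTS[message_id], "-", len(hits), "line(s)")
```

To scan a real log, pass an open file object to triage, using the path of the node agent log on the node that hosts the health controller.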



Related tasks
Configuring health management

Last updated: Nov 30, 2007 3:59:37 PM EST
http://publib.boulder.ibm.com/infocenter/wxdinfo/v6r0/index.jsp?topic=/com.ibm.websphere.xd.doc/info/odoe_task/rodhealthfail.html