You can look for the following problems when health management
is not working, or is not working the way that you expect.
Finding the right logs
The health controller is
a distributed resource that is managed by the high availability (HA) manager.
It exists within all node agent and deployment manager processes and is active
within one of these processes. If a process fails, the controller becomes
active on another node agent or deployment manager process.
You
can use the Runtime Topology function in the administrative console to locate
the active health controller instance. Click Runtime Operations > Runtime
Topology and look for the red cross icon on the Runtime Topology panel.
If node groups are configured, select them and the unassigned nodes from
the second menu. The health management log messages are displayed in the node
agent log on the node with the red cross icon.
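For example, the following minimal sketch (a standalone Python script, not part of the product) scans a node agent log for health management messages, which use the WXDH message prefix. The installation path, profile name, and log location are illustrative assumptions; substitute the values for your own installation.
# Sketch: scan a node agent SystemOut.log for health management (WXDH) messages.
# The installation and profile paths are assumptions; substitute your own values.
log_path = '/opt/IBM/WebSphere/AppServer/profiles/node01Profile/logs/nodeagent/SystemOut.log'
for line in open(log_path):
    if 'WXDH' in line:
        print(line.rstrip())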
Performance advisor is enabled with the default memory leak
health policy
The default memory leak health policy uses the performance
advisor functionality, so the performance advisor is enabled when this policy
has members assigned. To disable the performance advisor, you must remove
this health policy or narrow the membership of the health policy. To preserve
the health policy for future use, consider keeping the default memory leak
policy, but removing all of the members. To change the members, click Operational
policies > Health policies > Default_Memory_Leak. You can edit
the health policy memberships by adding and removing specific members from
the policy.
Health controller settings
The following list contains
issues that are encountered as a result of the health controller settings:
- Health controller is disabled
- Verify the setting in the administrative console by clicking Operational
policies > Autonomic controllers > Health controller and selecting both the
Configuration and Runtime tabs. The health controller is enabled by default.
- No health controller icon in the Runtime Topology panel
- Determine whether the health controller is running by using the wsadmin checkHmmLocation.jacl script,
which is located in the install_root/bin directory
of non-deployment manager nodes. This script displays the current location
of the controller, if it is running. See checkHmmLocation.jacl for
more information. Also, try the Force Data Update option on the Runtime
Topology panel to get the health controller icon to display.
- Restarts are prohibited at this time
- Verify the prohibited restart times in the administrative console by clicking Operational
policies > Autonomic controllers > Health controller and by selecting
the Prohibited restart field. By default, no times are prohibited.
- Restarting too soon after the previous restart
- To check the minimum restart interval in the administrative console, click Operational
policies > Autonomic controllers > Health controller and modify the Minimum
Restart Interval field. No minimum interval is defined by default.
- Control cycle is too long
- To check the control cycle length in the administrative console, click Operational
policies > Autonomic controllers > Health controller and adjust the value
if necessary. The health controller checks for policy violations periodically.
If its control cycle length is too long, it might not restart servers quickly
enough.
- The server has been restarted X times consecutively, and the health
condition continues to be violated
- In this case, X indicates the maximum consecutive restart parameter
of the health controller. The health controller concludes that restarts are
not fixing the problem, and disables restarts for the server. The following
message displays in the log:
WXDH0011W: Server servername exceeded maximum verification failures: disabling restarts.
The health controller continues to monitor the server and displays
messages in the log if the health policy is violated:
WXDH0012W: Server servername with restarts disabled failed health check.
You can enable restarts for the server by performing any of
the following actions:
Health policy settings
The following issues are encountered
as a result of the health policy settings:
- The server is not part of a health policy
- Verify that the health policy memberships apply to your server in the
administrative console by clicking Operational policies > Health policies.
- The reaction mode of a policy containing the server is supervised
- Check the administrative console by clicking Runtime Operations >
Task Management > Runtime tasks to find approval requests for a restart
action for a policy in Supervised mode. Servers are restarted automatically
when you set Automatic as the reaction mode. The following message
is written to the log for the supervised condition:
WXDH0024I: Server server name has violated the health policy health condition, reaction mode is supervised.
- The server is a member of a static cluster and is the only cluster member
running
- The health policy does not bring down all members of a cluster at the
same time. If a cluster has only one cluster member, or only one cluster member is running,
then the cluster is not restarted.
- The server is a member of a dynamic cluster, the number of running instances
does not exceed the minimum value, and the placement controller is disabled
- Check the minimum number of instances required for the dynamic
cluster by clicking Servers > Dynamic clusters in the administrative
console. In this case, health management treats the dynamic cluster like a
static cluster, using the minimum number of instances parameter.
- The health controller has not received the policy
- The health controller might not be running on the deployment manager where the
health policies are created. If the deployment manager is restarted after
the health controller started, the health controller might not have the new
policy.
You can alleviate this problem by performing the following steps:
- Disable the health controller. In the administrative console, click Operational
policies > Autonomic managers > Health controller.
- Synchronize the configuration repositories with the back-end nodes. In
the administrative console, click System Administration > Nodes. Select
the nodes to synchronize, and click Synchronize.
- Restart the health controller. In the administrative console, click Operational
policies > Autonomic managers > Health controller.
- Synchronize the configuration repositories with the back-end nodes. In
the administrative console, click System Administration > Nodes. Select
the nodes to synchronize, and click Synchronize.
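If you prefer scripting, the synchronization steps can also be performed from wsadmin. The following minimal sketch uses the wsadmin Jython syntax (start wsadmin with the -lang jython option); it is not part of the product, and the node names are illustrative assumptions.
# Sketch: synchronize the configuration repository on specific nodes (wsadmin Jython).
# Replace the example node names with your own node names.
for node_name in ['node01', 'node02']:
    # A running node agent exposes a NodeSync MBean for its node.
    node_sync = AdminControl.completeObjectName('type=NodeSync,node=%s,*' % node_name)
    if node_sync:
        AdminControl.invoke(node_sync, 'sync')
        print('Synchronized node: ' + node_name)
    else:
        print('No NodeSync MBean found for node: ' + node_name)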
Application placement controller interactions
The following
list contains issues that are encountered as a result of the health management
and application placement controller interactions:
- The server is a member of a dynamic cluster, but the placement controller
cannot be contacted
- For dynamic cluster members, health monitoring checks with the application
placement controller to determine whether a server can be restarted. If the
application placement controller is enabled, but cannot be contacted, the
following message displays in the log:
WXDH1018E: Could not contact the placement controller
Verify
that the placement controller is running. You can
locate the placement controller on one of the nodes that display in the Runtime
Topology panel or by using the checkPlacementLocation.jacl script.
- The server is a member of a dynamic cluster, the placement controller
is running, and the placement controller instructs health management not to
restart the server
- The placement controller might require the server instance to remain running.
- The server is stopped, but not started.
- In a dynamic cluster, a restart can take one of several forms:
- Restart in place (stop server, start server).
- Start a server instance on another node, and stop the failing one.
- Stop the failing server only, assuming that the remaining application
instances can satisfy demand.
The placement controller determines which form a restart takes, and
if necessary, where to start the new instance. After a restart is performed
in a dynamic cluster, health management issues a request to the placement
controller to recompute its placement.
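To confirm which members of a cluster are currently running, for example after health management stops a failing instance, you can query the Server MBeans from wsadmin. The following sketch uses the wsadmin Jython syntax; the member names are illustrative assumptions, and only running servers register a Server MBean.
# Sketch: check which cluster members are currently running (wsadmin Jython).
members = ['member1', 'member2', 'member3']
for member in members:
    mbean = AdminControl.completeObjectName('type=Server,name=%s,*' % member)
    if mbean:
        print(member + ' is running, state: ' + AdminControl.getAttribute(mbean, 'state'))
    else:
        print(member + ' is not running')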
Node group membership settings
The
following list contains issues that are encountered as a result of the health
management and node group membership settings:
- The server is on a node that is in maintenance mode.
- Health management does not restart a server on a node in maintenance mode.
You can take a node out of maintenance mode by clicking System administration
> Nodes > node_name > Unset maintenance.
Sensor problems
The following list contains issues
that are encountered as a result of problems with the health management
sensors:
- No sensor data is received for the server.
- Health management cannot detect a policy violation if it receives no data
from the sensors that are required by the policy. If sensor data is not received
during the control cycle, health management prints the following log message:
WXDH3001E: No sensor data received during control cycle from server server_name for health class healthpolicy.
For response
time conditions, health management receives data from the on demand router
(ODR). No data is generated for these conditions until requests are sent
through the ODR.
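If a response time condition never reports data, you can confirm the request path by driving a few requests through the ODR. The following minimal sketch is a standalone Python script; the ODR host, port, and application URL are illustrative assumptions.
# Sketch: send requests through the on demand router so that response time
# sensor data can be generated. Host, port, and path are assumptions.
import urllib.request
odr_url = 'http://odr-host:80/myApplication/'
for i in range(20):
    with urllib.request.urlopen(odr_url) as response:
        response.read()
        print('Request %d returned HTTP %d' % (i + 1, response.status))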