You can look for the following problems when health management
is not working, or not working the way that you expect.
Finding the correct logs
The health controller is a distributed resource that is managed by
the high availability (HA) manager. It exists within all node agent
and deployment manager processes, but is active within only one of
these processes at a time. If the active process fails, the controller
becomes active on another node agent or deployment manager process.
To determine where the health controller is
running, click in the administrative console. The location and stability
status of the health controller displays.
The performance advisor is enabled with the predefined memory leak health policy
The predefined memory leak health policy uses the performance
advisor function, so the performance advisor
is enabled when this policy has assigned members. To disable the performance
advisor, remove this health policy or narrow the membership of the
health policy. To preserve the health policy for future use, keep
the memory leak policy, but remove all members. To change the members,
click . You can edit the health policy memberships by adding
and removing specific members.
Health controller settings
The following list contains issues that are encountered as a result
of the health controller settings:
- Health controller is disabled
- To verify the setting in the administrative console, click and select
both the Configuration and Runtime tabs.
The health controller is enabled by default.
- Restarts are prohibited at this time
- To verify the prohibited restart times in the administrative console,
click and select the Prohibited restart field.
By default, no time values are prohibited.
- Restarting too soon after the previous restart
- To check the minimum restart interval in the administrative console,
click and modify the Minimum restart interval field.
No minimum interval is defined by default.
- Control cycle is too long
- To check the control cycle length in the administrative console,
click and adjust the value if necessary. The health controller
checks for policy violations periodically. If its control cycle length
is too long, it might not restart servers quickly enough.
- The server was restarted X times consecutively, and the
health condition continues to be violated
- In this case, X indicates the maximum consecutive restart
parameter of the health controller. The health controller concludes
that restarts are not fixing the problem, and disables the restarts
for the server. The following message displays in the log:
WXDH0011W: Server servername exceeded maximum verification failures: disabling restarts.
The health controller continues to monitor the server and displays
messages in the log if the health policy is violated:
WXDH0012W: Server servername with restarts disabled failed health check.
You can enable restarts for the server by performing any of the
following actions:
Health policy settings
The following issues are encountered as a result of the health policy
settings:
- The server is not part of a health policy
- To verify that the health policy memberships apply to your server
in the administrative console, click .
- The reaction mode of a policy containing the server is a supervised
mode
- To check the administrative console, click . Find approval
requests for a restart action for a policy in Supervised mode.
Servers are restarted automatically when you set the reaction mode
to Automatic. The following message is written
to the log for the supervised condition:
WXDH0024I: Server server name has violated the health policy health condition,
reaction mode is supervised.
- The server is a member of a static cluster, and it is the only
cluster member that is running
- The health policy does not bring down all members of a cluster
at the same time. If a cluster has only one cluster member, or only
one cluster member is running, then that member is not restarted.
- The server is a member of a dynamic cluster. The number of running
instances does not exceed the minimum value, and the placement controller
is disabled
- To check the minimum number of instances required for
the dynamic cluster, click in the administrative console. In this case, health
management treats the dynamic cluster like a static cluster, using
the minimum number of instances parameters.
- The health controller has not received the policy
- The health controller does not run on the deployment manager where
the health policies are created. If the deployment manager is restarted
after the health controller started, the health controller might not
have the new policy.
To resolve this problem, perform the following
steps:
- Disable the health controller. In the administrative console,
click .
- Synchronize the configuration repositories with the back-end nodes.
In the administrative console, click .
Select the nodes to synchronize, and click Synchronize.
- Restart the health controller. In the administrative console,
click .
- Synchronize the configuration repositories with the back-end nodes.
In the administrative console, click .
Select the nodes to synchronize, and click Synchronize.
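The cluster-related conditions in this section can be summarized as a small decision function. This is only an illustrative sketch of the rules as they are described here, not the product's implementation. The names `ClusterState` and `restart_allowed` are hypothetical, and the behavior when the placement controller is enabled is deliberately not modelled.

```python
from dataclasses import dataclass


@dataclass
class ClusterState:
    """Hypothetical snapshot of the cluster that contains the server."""
    dynamic: bool                    # True for a dynamic cluster
    running_members: int             # members that are currently running
    min_instances: int = 1           # only meaningful for dynamic clusters
    placement_enabled: bool = True   # application placement controller state


def restart_allowed(cluster: ClusterState) -> bool:
    """Return True if the rules described above would permit a restart."""
    if not cluster.dynamic:
        # Static cluster: never bring down the last running member.
        return cluster.running_members > 1
    if not cluster.placement_enabled:
        # A dynamic cluster without placement control is treated like a
        # static cluster, using the minimum number of instances.
        return cluster.running_members > cluster.min_instances
    # With placement control enabled, the placement controller takes part
    # in the decision; that interaction is not modelled in this sketch.
    return True
```

For example, `restart_allowed(ClusterState(dynamic=False, running_members=1))` returns `False`, which corresponds to the case where the server is the only running member of a static cluster.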
Application placement controller interactions
The following list contains issues triggered by health management and
application placement controller interactions:
- The server is a member of a dynamic cluster, but the placement
controller cannot be contacted
- For dynamic cluster members, health monitoring checks with the
application placement controller to determine whether a server can
be restarted. If the application placement controller is enabled,
but cannot be contacted, the following message displays in the log:
WXDH1018E: Could not contact the placement controller
Verify that the placement controller is running. To determine where the
health controller is running, click in the administrative
console. The location and stability status of the health controller
displays. The health controller logs messages to the particular node
agent or deployment manager that is indicated by the current location.
- The server is stopped, but not started.
- In a dynamic cluster, a restart can take one of several forms:
Sensor problems
The following list contains issues that are related to health
management sensors:
- No sensor data is received for the server.
- Health management cannot detect a policy violation if it receives
no data from the sensors that are required by the policy. If sensor
data is not received during the control cycle, health management prints
the following log message:
WXDH3001E: No sensor data received during control cycle from server server_name for
health class healthpolicy.
For response time conditions, health management receives data from
the on demand router (ODR). No data is generated for these conditions
until requests are sent through the ODR.
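When you review the logs of the process where the health controller is active, it can help to count the WXDH messages that this topic mentions. The following is a minimal sketch that assumes you have already read the log into a sequence of text lines; the function name `summarize`, the hint texts, and the server name `server1` in the sample are made up for illustration, and only the message IDs and message texts come from this topic.

```python
import re
from collections import Counter
from typing import Iterable

# Message IDs quoted in this topic, with a short reminder of the cause.
WXDH_HINTS = {
    "WXDH0011W": "maximum consecutive restarts reached; restarts disabled",
    "WXDH0012W": "health check failed while restarts are disabled",
    "WXDH0024I": "policy violated, but reaction mode is supervised",
    "WXDH1018E": "placement controller could not be contacted",
    "WXDH3001E": "no sensor data received during the control cycle",
}

_ID_PATTERN = re.compile(r"\b(WXDH\d{4}[IWE])\b")


def summarize(lines: Iterable[str]) -> Counter:
    """Count occurrences of the known WXDH message IDs in log lines."""
    counts: Counter = Counter()
    for line in lines:
        match = _ID_PATTERN.search(line)
        if match and match.group(1) in WXDH_HINTS:
            counts[match.group(1)] += 1
    return counts


# Sample lines built from the message texts above (server name is invented).
sample = [
    "WXDH0011W: Server server1 exceeded maximum verification failures: disabling restarts.",
    "WXDH0012W: Server server1 with restarts disabled failed health check.",
    "WXDH0012W: Server server1 with restarts disabled failed health check.",
    "WXDH1018E: Could not contact the placement controller",
]

for msg_id, count in sorted(summarize(sample).items()):
    print(f"{msg_id} x{count}: {WXDH_HINTS[msg_id]}")
```

To use it on a real log, pass an open file object to `summarize`. The log location depends on which node agent or deployment manager currently hosts the health controller.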
Task management status
Sometimes a Restart action task status ends up in Failed or Unknown
state. This scenario happens when the server does not stop during the
time period that is allocated by default, or when the task times out.
Use the following cell level property to adjust the timeout for your
environment:
HMM.StopServerTimeout.
The value is expressed in minutes.
This property allows health management to extend the wait time for
server stop notifications that are received from the on demand configuration.
To increase the timeout for your environment, go to . The default
value is 5 minutes. The restart task starts after twice the amount
that is specified, allowing the server to stop and start.
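As a quick check of the "twice the amount" rule, the following sketch computes the delay before the restart task starts for a given stop timeout. The function name `restart_task_delay` is hypothetical, and the calculation simply restates the rule from this section rather than reproducing how the product schedules the task.

```python
from datetime import timedelta


def restart_task_delay(stop_timeout: timedelta) -> timedelta:
    """Delay before the restart task starts: twice the stop timeout."""
    return 2 * stop_timeout


# With the 5 minute default described above, the delay is 10 minutes.
default_delay = restart_task_delay(timedelta(minutes=5))
print(default_delay)
```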