You can configure the amount of time between system checks
for failed servers with the heartbeat interval setting. This setting
applies to catalog servers only.
About this task
Configuring failover varies depending on the type of environment
you are using. If you are using a stand-alone environment, you can
configure failover with the command line. If you are using a
WebSphere® Application Server Network Deployment environment,
you must configure failover in the
WebSphere Application Server Network Deployment administrative
console.
Procedure
- Configure failover for stand-alone environments. Set the heartbeat level in one of the following ways:
  - With the -heartbeat parameter of the startOgServer script when you start the catalog server.
  - With the heartBeatFrequencyLevel property in the server properties file for the catalog server.
Use one of the following values:
Table 1. Valid heartbeat values

| Value | Action | Description |
| --- | --- | --- |
| -1 | Aggressive | Specifies an aggressive heartbeat level. With this value, failures are detected more quickly, but more processor and network resources are used. This level is more sensitive to missing heartbeats when the server is busy. Failovers are typically detected within 5 seconds. |
| 0 | Typical (default) | Specifies a heartbeat level at a typical rate. With this value, failover detection occurs at a reasonable rate without overusing resources. Failovers are typically detected within 30 seconds. |
| 1 | Relaxed | Specifies a relaxed heartbeat level. With this value, a decreased heartbeat frequency increases the time to detect failures, but also decreases processor and network use. Failovers are typically detected within 180 seconds. |

An aggressive heartbeat interval can be useful when the
processes and network are stable. If the network or processes are
not optimally configured, heartbeats might be missed, which can result
in a false failure detection.
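As a rough illustration of the trade-off in Table 1, the following Python sketch picks the most relaxed heartbeat level whose typical detection time still meets a target. The detection times come from the table; the function name and the dictionary are hypothetical and are not part of eXtreme Scale.

```python
# Typical failover detection times from Table 1, keyed by heartbeat level.
TYPICAL_DETECTION_SECONDS = {-1: 5, 0: 30, 1: 180}


def most_relaxed_level(target_seconds):
    """Return the most relaxed level that typically detects failover in time.

    Hypothetical helper for illustration only. Returns None if even the
    aggressive level (-1) is typically too slow for the target.
    """
    candidates = [
        level
        for level, seconds in TYPICAL_DETECTION_SECONDS.items()
        if seconds <= target_seconds
    ]
    # A higher level number means fewer heartbeats and lower resource use.
    return max(candidates) if candidates else None
```

The resulting value is what you would pass with the -heartbeat parameter or set as the heartBeatFrequencyLevel property.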
- Configure failover for WebSphere Application Server environments.
You can configure WebSphere Application Server Network Deployment Version 6.1 and later to allow WebSphere eXtreme Scale to fail over very quickly.
The default failover time for hard failures is approximately 200 seconds.
A hard failure is a physical computer or server crash, a network cable
disconnection, or an operating system error. Soft failures, such as process
crashes, typically fail over in less than one second.
Soft failures are detected when the operating system of the server that
hosts the process automatically closes the network sockets of the dead
process.
Core group heartbeat configuration
WebSphere eXtreme Scale running in a WebSphere Application Server process inherits the failover characteristics from the core group
settings of the application server. The following sections describe
how to configure the core group heartbeat settings for different versions
of WebSphere Application
Server Network Deployment:
- Update the
core group settings for WebSphere Application Server Network Deployment Version 6.1 and 7.0:
Specify the heartbeat interval in seconds on WebSphere Application Server versions from Version
6.0 through Version 6.1.0.12, or in milliseconds starting with Version
6.1.0.13. You must also specify the number of missed heartbeats. This
value indicates how many heartbeats can be missed before a peer Java™ virtual
machine (JVM) is considered failed.
The hard failure detection time is approximately the product of the
heartbeat interval and the number of missed heartbeats.
Specify these properties as custom properties on the core group by using
the WebSphere administrative console. See "Core group custom properties"
for configuration details. These properties must be specified for all core groups used by
the application:
- The heartbeat interval is specified using either the IBM_CS_FD_PERIOD_SEC
custom property for seconds or the IBM_CS_FD_PERIOD_MILLIS custom
property for milliseconds (requires Version 6.1.0.13 or later).
- The number of missed heartbeats is specified using the IBM_CS_FD_CONSECUTIVE_MISSED
custom property.
The default value for the IBM_CS_FD_PERIOD_SEC property
is 20 and for the IBM_CS_FD_CONSECUTIVE_MISSED property is 10. If
the IBM_CS_FD_PERIOD_MILLIS property is specified, it overrides
any IBM_CS_FD_PERIOD_SEC value that is set. The values
of these properties must be positive integers.
Use the following
settings to achieve a 1500 ms failure detection time for
WebSphere Application Server Network Deployment Version
6.x servers:
- Set IBM_CS_FD_PERIOD_MILLIS = 750 (WebSphere Application Server Network Deployment V6.1.0.13
and later)
- Set IBM_CS_FD_CONSECUTIVE_MISSED = 2
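The arithmetic behind these values can be checked with a short Python sketch. It assumes only the relationships stated above: the detection time is approximately the interval multiplied by the number of missed heartbeats, and the millisecond property overrides the seconds property. The function and its parameter names are illustrative stand-ins for the custom properties, not an IBM API.

```python
def hard_failure_detection_ms(consecutive_missed, period_sec=20, period_millis=None):
    """Approximate hard failure detection time in milliseconds.

    Parameter names are illustrative stand-ins for IBM_CS_FD_CONSECUTIVE_MISSED,
    IBM_CS_FD_PERIOD_SEC, and IBM_CS_FD_PERIOD_MILLIS.
    """
    for value in (consecutive_missed, period_sec, period_millis):
        if value is not None and (not isinstance(value, int) or value <= 0):
            raise ValueError("property values must be positive integers")
    # When the millisecond property is set, it overrides the seconds property.
    if period_millis is not None:
        interval_ms = period_millis
    else:
        interval_ms = period_sec * 1000
    return interval_ms * consecutive_missed
```

With the defaults (20 seconds, 10 missed heartbeats) the result is 200000 ms, which matches the default hard failover time of approximately 200 seconds. With a 750 ms period and 2 missed heartbeats the result is 1500 ms.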
- Update the core group settings for WebSphere Application Server Network Deployment Version 7.0:
WebSphere Application Server Network Deployment Version
7.0 provides two core group settings that can be adjusted to increase
or decrease the failover detection time:
- Heartbeat transmission period. The default is 30000 milliseconds.
- Heartbeat timeout period. The default is 180000 milliseconds.
For more details on how to change these settings, see "Discovery and failure detection settings" in the WebSphere Application Server Network Deployment Information Center.
Use the following settings to achieve a 1500
ms failure detection time for WebSphere Application Server Network Deployment Version 7 servers:
- Set the heartbeat transmission period to 750 milliseconds.
- Set the heartbeat timeout period to 1500 milliseconds.
What to do next
When you modify these settings to provide short failover
times, be aware of some system-tuning issues. First, Java is not a real-time environment.
Threads can be delayed if the JVM is experiencing long
garbage collection times. Threads might also be delayed if the machine
that hosts the JVM is heavily loaded, either by the JVM itself or by
other processes running on the machine. If threads
are delayed, heartbeats might not be sent on time. In the worst case,
heartbeats are delayed by as long as the required failover time, and
false failure detections occur. Tune and size the system
to ensure that false failure detections do not happen in
production. Adequate load testing is the best way to ensure this.
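To see why a long pause can trigger a false detection, the following Python sketch models a peer that declares a failure whenever the gap between two consecutive heartbeats exceeds the timeout. This gap-based model, the pause length, and all names are simplifying assumptions for illustration; it is not the actual core group failure detector.

```python
def false_detections(send_times_ms, timeout_ms):
    """Return the (earlier, later) heartbeat pairs whose gap exceeds timeout_ms."""
    flagged = []
    for earlier, later in zip(send_times_ms, send_times_ms[1:]):
        if later - earlier > timeout_ms:
            flagged.append((earlier, later))
    return flagged


# Heartbeats sent every 750 ms with no delays.
on_time = [0, 750, 1500, 2250, 3000]

# Same schedule, but a hypothetical 2000 ms garbage collection pause that
# starts right after the 1500 ms heartbeat delays the next one until 3500 ms.
delayed = [0, 750, 1500, 3500, 4250]
```

Under this model, a 1500 ms timeout flags the delayed schedule even though the process never failed, while the default 180000 ms timeout tolerates the same pause.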
Note: The current version of eXtreme Scale supports WebSphere Real Time.