You can configure the amount of time between system checks
for failed servers with the heartbeat interval setting. This setting
applies to catalog servers only.
About this task
Configuring failover varies depending on the type of environment
you are using. If you are using a stand-alone environment, you can
configure failover with the command line. If you are using a
WebSphere® Application Server Network Deployment environment,
you must configure failover in the
WebSphere Application Server Network Deployment administrative
console.
Procedure
- Configure failover for stand-alone environments. Set the heartbeat interval in one of the following ways:
  - Use the -heartbeat parameter of the startOgServer or startXsServer script when you start the catalog server.
  - Set the heartBeatFrequencyLevel property in the server properties file for the catalog server.
  Use one of the following values:
Table 1. Valid heartbeat values

| Value | Action | Description |
| --- | --- | --- |
| -1 | Aggressive | Specifies an aggressive heartbeat level. With this value, failures are detected more quickly, but additional processor and network resources are used. This level is more sensitive to missing heartbeats when the server is busy. Failovers are typically detected within 5 seconds. |
| -10 | Semi-aggressive | Failovers are typically detected within 15 seconds. |
| 0 | Typical (default) | Specifies a heartbeat level at a typical rate. With this value, failover detection occurs at a reasonable rate without overusing resources. Failovers are typically detected within 30 seconds. |
| 10 | Semi-relaxed | Failovers are typically detected within 90 seconds. |
| 1 | Relaxed | Specifies a relaxed heartbeat level. With this value, a decreased heartbeat frequency increases the time to detect failures, but also decreases processor and network use. Failovers are typically detected within 180 seconds. |

An aggressive heartbeat interval can be useful when the
processes and network are stable. If the network or processes are
not optimally configured, heartbeats might be missed, which can result
in a false failure detection.
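
As an illustration of how you might choose a value from Table 1, the following Python sketch picks the most relaxed heartbeat level that still meets a target detection time and formats the matching server properties entry. The function names pick_level and properties_line are hypothetical helpers for this example and are not part of the product. The key=value form of the properties entry is assumed from standard Java properties syntax.

```python
# Typical failover detection times (seconds) from Table 1,
# keyed by the heartbeat value.
HEARTBEAT_LEVELS = {
    -1: ("Aggressive", 5),
    -10: ("Semi-aggressive", 15),
    0: ("Typical (default)", 30),
    10: ("Semi-relaxed", 90),
    1: ("Relaxed", 180),
}


def pick_level(max_detection_seconds):
    """Return the most relaxed value whose typical detection time fits the target."""
    candidates = [
        (seconds, value)
        for value, (_, seconds) in HEARTBEAT_LEVELS.items()
        if seconds <= max_detection_seconds
    ]
    if not candidates:
        raise ValueError("no heartbeat level detects failures that quickly")
    # The slowest level that still fits sends the fewest heartbeats.
    _, value = max(candidates)
    return value


def properties_line(value):
    """Format a server properties entry (assumed key=value syntax)."""
    if value not in HEARTBEAT_LEVELS:
        raise ValueError(f"unsupported heartbeat value: {value}")
    return f"heartBeatFrequencyLevel={value}"


# Example: failures must typically be detected within 20 seconds.
print(properties_line(pick_level(20)))  # heartBeatFrequencyLevel=-10
```

Choosing the most relaxed level that meets the target follows the trade-off in the table: more aggressive levels detect failures sooner but use more processor and network resources.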
- Configure failover for WebSphere Application Server environments.
You can configure WebSphere Application Server Network Deployment Version 7.0 and later to allow WebSphere eXtreme Scale to fail over very quickly.
The default failover time for hard failures is approximately 200 seconds.
A hard failure is a physical computer or server crash, a network cable
disconnection, or an operating system error. Soft failures, such as process
crashes, typically fail over in less than one second.
Soft failures are detected when the operating system of the server that
hosts the process automatically closes the network sockets of the dead
process.
Core group heartbeat configuration
WebSphere eXtreme Scale running in a WebSphere Application Server process
inherits the failover characteristics from the core group settings
of the application server. The following sections describe how to
configure the core group heartbeat settings for different versions
of WebSphere Application
Server Network Deployment:
Update the core group settings for WebSphere Application Server Network Deployment Version 7.0
WebSphere Application Server Network Deployment Version
7.0 provides two core group settings that can be adjusted to increase
or decrease failover detection:
- Heartbeat transmission period. The default is 30000 milliseconds.
- Heartbeat timeout period. The default is 180000 milliseconds.
For more details on how to change these settings, see the WebSphere Application Server Network Deployment Information Center: Discovery and failure detection settings.
Use the following settings to achieve a 1500 ms failure detection
time for WebSphere Application Server Network Deployment Version
7.0 servers:
- Set the heartbeat transmission period to 750 milliseconds.
- Set the heartbeat timeout period to 1500 milliseconds.
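
To compare the default and tuned settings, the following Python sketch computes how many heartbeat transmission periods fit into one timeout period. It assumes that a member is considered failed when no heartbeat arrives within the timeout period, which this topic does not state explicitly. The function name heartbeats_per_timeout is a hypothetical helper for this example.

```python
def heartbeats_per_timeout(transmission_ms, timeout_ms):
    """Return how many transmission periods fit in one timeout period."""
    if transmission_ms <= 0 or timeout_ms < transmission_ms:
        raise ValueError("timeout must be at least one transmission period")
    return timeout_ms / transmission_ms


# Default core group settings for Version 7.0.
print(heartbeats_per_timeout(30000, 180000))  # 6.0

# Settings for a 1500 ms failure detection time.
print(heartbeats_per_timeout(750, 1500))  # 2.0
```

Under that assumption, the tuned settings leave room for only two heartbeat periods before the timeout expires, compared with six for the defaults, so delayed heartbeats are less likely to be tolerated.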
What to do next
When you modify these settings to provide short failover
times, be aware of some system-tuning issues. Java™ is not a real-time environment.
Threads can be delayed if the JVM is experiencing long
garbage collection times. Threads might also be delayed if the machine
that hosts the JVM is
heavily loaded, either by the JVM itself or by other processes that run on the machine. If threads
are delayed, heartbeats might not be sent on time. In the worst case,
heartbeats might be delayed by the required failover time, and
false failure detections occur. Tune and size the system
to ensure that false failure detections do not happen in
production. Adequate load testing is the best way to ensure this.
Note: The current version of eXtreme Scale supports WebSphere Real Time.
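
When you review load test results, one simple check is to compare observed thread pauses against the heartbeat timeout period. The following Python sketch flags pauses that are long enough to delay heartbeats past the timeout. The pause values are hypothetical sample data, the function name risky_pauses is a hypothetical helper, and the sketch assumes that a pause at least as long as the timeout period can cause a false failure detection.

```python
def risky_pauses(pause_times_ms, timeout_ms):
    """Return the pauses long enough to delay heartbeats past the timeout."""
    return [pause for pause in pause_times_ms if pause >= timeout_ms]


# Hypothetical pause durations (ms), for example garbage collection
# pauses collected during a load test.
observed = [120, 480, 1700, 950, 2100]

# Compare against a 1500 ms heartbeat timeout period.
print(risky_pauses(observed, 1500))  # [1700, 2100]
```

An empty result does not prove that false detections cannot occur. It only indicates that none of the sampled pauses reached the timeout period.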