Automatic failure recovery provides a way for the system to automatically restart critical system services and enables you to customize application (service) error handling for each of your applications. Symphony handles a number of failure recovery scenarios.
Application isolation: failure of one application does not affect any other application, and failure or unavailability of a resource management (EGO) component has no impact on running workload.
Fault-tolerant tasks: with recoverable workload configured, automated failover and data persistence ensure that running workload submitted by an application client continues to run without user intervention when system processes or hosts fail.
Cluster reliability: master host failover and automatic restart of critical system services ensure high resource availability.
Changing any of these attributes could affect session manager failover. For detailed descriptions of these attributes, see the Platform Symphony Reference.
The master candidate list defines which hosts are master candidates. By default, the list includes just one host, the master host, and there is no failover. If you configure additional candidates to enable failover, the master host is first in the list. If the master host becomes unavailable, the next host becomes the master, and so on down the list.
For master candidate failover to work properly, the master candidates must share a file system that must always be available.
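The failover order described above can be illustrated with a minimal Python sketch. It is not Symphony code: the function name `select_master` and the host names are hypothetical, and the sketch only models which host is chosen when a failover occurs. It does not model whether a recovered host later reclaims the master role.

```python
from typing import Optional, Sequence, Set


def select_master(candidates: Sequence[str], unavailable: Set[str]) -> Optional[str]:
    """Return the first available host in the ordered master candidate list.

    Hypothetical helper for illustration only; returns None when no
    candidate is available.
    """
    for host in candidates:
        if host not in unavailable:
            return host
    return None


# Hypothetical host names; the master host is first in the list.
candidates = ["hostA", "hostB", "hostC"]

print(select_master(candidates, set()))                        # hostA
print(select_master(candidates, {"hostA"}))                    # hostB
print(select_master(candidates, {"hostA", "hostB", "hostC"}))  # None
```

With the default list of a single host, the last case applies as soon as that host fails, which is why no failover takes place until you add candidates.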
If you have configured at least one management host for your cluster in addition to the master host but have not selected any failover candidates, the Platform Management Console dashboard displays a reminder message in red with a link to the page from which you define the master candidate list.
Automatic failure recovery behavior depends on which process fails or becomes unavailable, and on the type of host on which the process runs.
You can define the actions retry or fail for the SessionEnter, SessionUpdate, and Invoke methods.
If blockHost is defined as the actionOnSI for a service instance exit, timeout, exception, or control code, the system terminates the running service instance on this host and does not use this host to start any other service instance for the application. If restart is defined as the actionOnSI, the service instance tries to restart on the original host.
You can define the following actions for service instances based on specific states of the service lifecycle: keepAlive, restartService, or blockHost. Respectively, the session manager continues to run the service, restarts the service on the same host, or, through communication with the Virtual Execution Machine Kernel Daemon (vemkd), blocks the host from use by the application associated with the service.
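The three service instance actions can be modeled with a short Python sketch. It is an illustration under assumptions, not the session manager implementation: `AppHostState` and `handle_service_event` are hypothetical names, and only the action values keepAlive, restartService, and blockHost come from the text.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class AppHostState:
    """Hypothetical per-application record of hosts blocked for that application."""
    blocked_hosts: Set[str] = field(default_factory=set)


def handle_service_event(action_on_si: str, host: str, state: AppHostState) -> str:
    """Describe what happens to the service instance on `host` for an actionOnSI value."""
    if action_on_si == "keepAlive":
        return f"keep service instance running on {host}"
    if action_on_si == "restartService":
        return f"restart service instance on {host}"
    if action_on_si == "blockHost":
        # The host is no longer used to start service instances for this application.
        state.blocked_hosts.add(host)
        return f"terminate service instance and block {host} for this application"
    raise ValueError(f"unknown actionOnSI: {action_on_si!r}")


state = AppHostState()
print(handle_service_event("restartService", "hostA", state))
print(handle_service_event("blockHost", "hostA", state))
print(state.blocked_hosts)  # {'hostA'}
```

The blocked-host set is kept per application in this sketch because the text states that a blocked host is excluded only for the application associated with the service.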
The session manager requeues and reruns tasks for the session that was running on the service instance manager; no workload is lost.
If blockHostOnTimeout="true" in the SOAM > SIM section of the application profile and, after a service instance manager is started, the service instance manager process cannot contact the session manager within the startUpTimeout, the system does not use this host to start any other service instance managers for the application. If blockHostOnTimeout="false", the system tries again to start the service instance manager on the original host.
If the service instance manager dies after starting successfully, the associated service instance exits. The session manager then restarts the service instance manager.
For recoverable sessions, the session manager persists the information needed to resume the workload without loss of data, and session manager failover or recovery is transparent to the client application. For non-recoverable sessions, the workload is lost and the client must resubmit the workload.
When it restarts, the session manager re-registers with the resource management component (EGO) and obtains a list of resources that were previously allocated to the session manager. The session manager stops and restarts all running service instance managers on those resources.
The service instance managers associated with the failed session manager also die, and requests from the Platform Management Console and command line interface fail. The session director restarts the session manager. On restart, the session manager reads only the task and session control objects, not the input/output messages; the session manager reads those messages as required when dispatching a task. Session manager monitoring information resets; the following statistical values apply to the time period that begins with session manager restart.
Session director failure has no impact on running workload; the session manager handles workload execution. For new workload, clients submitting workload wait momentarily for the EGO service controller to restart the session director.
Session director failure has no impact on resource allocation. The session director saves information about the resources it uses and, after restart, uses the same resources.
While the session director is momentarily down, requests from the Platform Management Console and command line interface fail. If you set view preferences for the dashboard to automatically refresh, the request succeeds once the session director has restarted. When the session director is unavailable, clients cannot create new SDK connections.
The EGO service controller usually restarts the session director within a few seconds on the original host or on a new host if the original host has no available resources. The EGO service controller tries up to 10 times to restart the session director before setting the status to ERROR.
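The restart policy described above (prefer the original host, fall back to another host, give up after 10 attempts) can be sketched in Python. This is an illustrative model only, not the EGO service controller's logic: the names `pick_host`, `restart_service`, and `try_start` and the `STARTED` status label are assumptions. Only the limit of 10 attempts and the ERROR status come from the text.

```python
from typing import Callable, Optional, Sequence, Tuple

MAX_RESTART_ATTEMPTS = 10  # from the text: up to 10 tries before ERROR


def pick_host(original: str, free_hosts: Sequence[str]) -> Optional[str]:
    """Prefer the original host; otherwise fall back to any host with free resources."""
    if original in free_hosts:
        return original
    return free_hosts[0] if free_hosts else None


def restart_service(
    original: str,
    free_hosts: Sequence[str],
    try_start: Callable[[str], bool],
) -> Tuple[str, int]:
    """Attempt restarts until one succeeds or the limit is reached.

    Returns (status, attempts_used). "STARTED" is a hypothetical label.
    """
    for attempt in range(1, MAX_RESTART_ATTEMPTS + 1):
        host = pick_host(original, free_hosts)
        if host is not None and try_start(host):
            return "STARTED", attempt
    return "ERROR", MAX_RESTART_ATTEMPTS


# Simulated start results: two failures, then success on the fallback host.
outcomes = iter([False, False, True])
print(restart_service("hostA", ["hostB"], lambda host: next(outcomes)))  # ('STARTED', 3)

# No host has free resources, so every attempt fails.
print(restart_service("hostA", [], lambda host: True))  # ('ERROR', 10)
```

The same bounded-retry pattern applies to the other services that the EGO service controller restarts, such as the repository service and the Web service manager.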
Repository service failure has no effect on running workload. New workload that needs to download a service package must wait until the repository service becomes available.
Repository service failure has no effect on resource allocation.
The EGO service controller restarts the repository service on the original host or on a new host if the original host has no available resources. The EGO service controller tries up to 10 times to restart the repository service before setting the status to ERROR.
Web service manager failure has no effect on resource allocation.
The EGO service controller restarts the Web service manager on the original host or on a new host if the original host has no available resources. The EGO service controller tries up to 10 times to restart the Web service manager before setting the status to ERROR. The Web service manager monitors the Java process of Tomcat (a key component of the Platform Management Console) and restarts the Java process if it goes down.
Loader controller failure has no effect on resource allocation.
If the loader controller becomes unavailable, the Platform Enterprise Reporting Framework cannot collect sampling data for reporting purposes. The EGO service controller restarts the loader controller on the original host or on a new host if the original host has no available resources. The EGO service controller tries up to 10 times to restart the loader controller before setting the status to ERROR.
If the data purger becomes unavailable, the database could temporarily grow until the data purger recovers and can once again purge the data. The time it takes for the database to run out of space depends on the size of your system. The EGO service controller restarts the data purger on the original host or on a new host if the original host has no available resources. The EGO service controller tries up to 10 times to restart the data purger before setting the status to ERROR.
Master load information manager failure has no effect on running workload. Clients submitting new workload receive an exception.
The system considers the master host unavailable and a master candidate takes over as master host. During failover to the master candidate, the system does not respond to resource allocation requests.
If no master candidate is available, the cluster is down. The system cannot restart the master load information manager; however, you can restart it manually with the egosh ego start all command.
Virtual Execution Machine Kernel Daemon failure has no effect on running workload. Clients submitting new workload receive an exception.
During failure recovery, the system does not respond to resource allocation requests.
The master load information manager restarts the Virtual Execution Machine Kernel Daemon.
Process execution monitor failure has no effect on running workload.
Process execution monitor failure has no effect on resource allocation.
The load information manager restarts the process execution monitor on a compute or management host. The master load information manager restarts the process execution monitor on the master host.
EGO service controller failure has no effect on running workload.
EGO service controller failure has no effect on resource allocation.
The Virtual Execution Machine Kernel Daemon restarts the EGO service controller.
The system considers the host unavailable and terminates workload on the unavailable host. EGO notifies the SOAM component (session director or session manager) that has been allocated to the unavailable host. The session director or session manager stops the service (service instance and service instance manager) on that host and requests another resource.
The system does not allocate any resources on the unavailable host.
The master load information manager restarts the load information manager on the compute or management host.
The majority of the time required for failover of compute, management, and master hosts is used to confirm that the host is actually unavailable. This prevents temporary network delays or instability from triggering frequent and unnecessary host switches.
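The idea of confirming unavailability before switching hosts can be illustrated with a small Python sketch. The text does not describe how the system performs this confirmation, so everything here is an assumption made for illustration: the `HostMonitor` class, the use of periodic heartbeat checks, and the threshold of three consecutive misses are all hypothetical.

```python
class HostMonitor:
    """Hypothetical monitor: a host counts as unavailable only after several
    consecutive missed checks, so one delayed response does not trigger failover."""

    def __init__(self, misses_required: int = 3) -> None:
        self.misses_required = misses_required  # assumed threshold, not a product default
        self.consecutive_misses = 0

    def record(self, heartbeat_received: bool) -> bool:
        """Record one check; return True once the host is considered unavailable."""
        if heartbeat_received:
            self.consecutive_misses = 0  # any successful check resets the count
        else:
            self.consecutive_misses += 1
        return self.consecutive_misses >= self.misses_required


monitor = HostMonitor(misses_required=3)
checks = [False, True, False, False, False]  # one transient miss, then a real outage
print([monitor.record(ok) for ok in checks])  # [False, False, False, False, True]
```

In this model, the isolated miss at the start never reaches the threshold, while the sustained outage does. The cost is the extra time spent waiting for confirmation, which matches the trade-off the text describes.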
No actions required. For recoverable sessions, session manager failover or recovery is transparent to the application client.
You can monitor automatic failure recovery through the Platform Management Console and from the command line. You can also set up SNMP traps to capture system events.