There are four types of error that can occur in service integration:
errors that a messaging engine can recover from while it is running, errors
that can be resolved by an automatic restart of the messaging engine, errors
that require a user to intervene, and errors that are not detectable by the
messaging engine.
Errors that a messaging engine can recover from while it is running
The system corrects these recoverable errors automatically, without
restarting or failing over the messaging engine. The system also adds an entry
to the system error log that explains the error and suggests any user actions.
The messaging engine continues to run and to honor the quality of service
specified for the messages it is processing.
Errors that can be resolved by an automatic restart of the messaging engine (local errors)
A local error can be resolved by restarting
the messaging engine, either on its current server or on an alternative server.
For example, if a messaging engine cannot connect to its data store, the cause
might be that the server in which it is running cannot create a connection,
while another server in the same cluster might still have access. The HAManager
fails over the messaging engine and shuts down the server on which it was running.
If the configured deployment has no failover capability, for example, because
there is only one server rather than a cluster, the server is shut down and the
messaging engine is restarted only after the server is restarted.
Errors that require the user's intervention (global errors)
A global error cannot be fixed by restarting or failing over the messaging engine.
For example, if a messaging engine's data store becomes corrupted, the messaging
engine cannot run on a different server because it will encounter the same
problem. If a messaging engine in this situation were failed over, it would
be failed over continually because it could not run on any server. This would
cause unwanted disruption to the cluster as servers
attempted to run the messaging engine and were shut down. To avoid such a
situation, if a global error is encountered, the messaging engine logs an
error, stops processing messages, and is not failed over. The messaging engine
cannot be restarted until you have corrected the global error condition and
restarted the server.
Errors not detectable by the messaging engine
Errors such as a spinning thread (a thread that is trapped in a tight loop
and no longer performs useful work) or a deadlock (two threads that are blocking
each other) might be detectable only by explicit health monitoring. The HAManager
provides such monitoring and periodically tests the health of the messaging
engine. If the HAManager detects that the messaging engine cannot run properly,
it shuts down the server that hosts the messaging engine. If the server was in
a cluster, the messaging engine is restarted on an alternative server,
if its policy allows. The node agent restarts the server that was shut down.
If the server was not in a cluster, the server must be restarted; the
messaging engine then restarts on that server.
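
The four cases above can be summarized as a small decision table. The following
Python sketch is illustrative only: the names ErrorCategory and recovery_actions
are hypothetical and are not part of any product API, and the action strings
simply paraphrase the behavior described in this topic.

```python
from enum import Enum, auto


class ErrorCategory(Enum):
    """Hypothetical labels for the four error types described above."""
    RECOVERABLE = auto()   # corrected while the messaging engine keeps running
    LOCAL = auto()         # resolved by restarting or failing over the engine
    GLOBAL = auto()        # requires the user to intervene
    UNDETECTABLE = auto()  # found only by explicit health monitoring


def recovery_actions(category, clustered):
    """Return the ordered actions this topic describes for an error category.

    `clustered` is an assumed flag meaning the deployment has failover
    capability (more than one server can host the messaging engine).
    """
    if category is ErrorCategory.RECOVERABLE:
        return [
            "correct error automatically",
            "log error and suggested user actions",
            "keep running",
        ]
    if category is ErrorCategory.GLOBAL:
        # Failing over would only repeat the failure on every server.
        return [
            "log error",
            "stop processing messages",
            "wait for user to correct the error and restart the server",
        ]
    # LOCAL and UNDETECTABLE errors both lead to the hosting server
    # being shut down; what happens next depends on the deployment.
    actions = ["shut down hosting server"]
    if clustered:
        actions.append("restart engine on an alternative server")
    else:
        actions.append("restart engine when the server is restarted")
    return actions
```

In this sketch, a global error yields the same actions whether or not the
deployment is clustered, which reflects the point that failover cannot help
when every server would encounter the same problem.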