Managing failures in the InterChange Server Express system consists of using troubleshooting resources to resolve problems. Critical errors that may cause events to fail can occur with a system component such as a connector or collaboration, or a third-party component such as an integrated application.
When an error causes an event to fail, the InterChange Server Express system has built-in capabilities that let you resolve the problems. The system can be set up to pause collaborations within InterChange Server Express if a failure occurs. This section covers the following topics:
"Failure recovery for service calls"
"Strategies for InterChange Server Express recovery"
"Lost connection to application"
"Unknown connector agent status"
"Database connection failures"
To avoid sending duplicate events to the destination application, you may want to prevent a recovery from automatically resubmitting all service calls that were in transit when a failure occurred. To do so, configure a nontransactional collaboration, before any server failure occurs, to persist service call events in the In-Transit state across a failure and recovery. When InterChange Server Express recovers, the service call events remain in the In-Transit state, and you can use Flow Manager or Failed Event Manager to examine individual failed events and control when (or if) they are resubmitted.
To configure a collaboration to persist service calls in the In-Transit state, go to System Manager and select the Persist Service Call In Transit State check box in the Collaboration General Properties window.
If InterChange Server Express fails while processing events, all the events currently in the Work-in-Progress (WIP) queue must be either recovered or otherwise processed when the server reboots. Because of its memory requirements, the recovery of the WIP events can slow or even halt the server reboot. The InterChange Server Express product provides two features--deferred recovery and asynchronous recovery--for reducing the time it takes the server to reboot and for making the server available for other work before all events have been recovered.
Flow control and the storing of business object keys as part of the WIP data make deferred recovery and asynchronous recovery more efficient. Both features reduce the amount of memory needed during an InterChange Server Express recovery and can therefore significantly decrease the time InterChange Server Express needs to reboot during a recovery.
Storing business object keys as part of the WIP data means that during recovery, the business object key is retrieved without deserializing the business object, avoiding an MQ or a database round-trip. Flow control is a service that allows you to configure either system-wide or component-level queue depth parameters to control the memory demands on InterChange Server Express. For more information about configuring flow control, see Steps for configuring system-wide flow control.
In deferred recovery, recovery of a collaboration's WIP events is deferred until after the server has rebooted, thereby saving the memory usage associated with those events.
After the server has rebooted, you can resubmit the events manually. Note the following recommendations:
You establish deferred recovery by setting the RECOVERY_MODE property of a collaboration object.
The RECOVERY_MODE property has two settings, Deferred and Always, which do the following when a server failure and reboot occur:
Deferred: Events that were in the Working state before the server failure are changed to the Deferred state. No events for this collaboration are recovered until you resubmit them manually.
Always: Events that were in the Working state before the server failure are recovered. Events in the Deferred state remain deferred until you resubmit them.
The default setting is Always.
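The effect of the two settings on WIP events can be sketched as a small state transition. This is only an illustrative model, not product code: it assumes the two settings are named Always and Deferred, and the State enum, the RECOVERED label, and the apply_recovery_mode function are hypothetical names introduced for the example.

```python
from enum import Enum


class State(Enum):
    WORKING = "Working"
    DEFERRED = "Deferred"
    RECOVERED = "Recovered"  # hypothetical label meaning "recovery was performed"


def apply_recovery_mode(recovery_mode, event_states):
    """Return the state of each WIP event after a server failure and reboot.

    recovery_mode: "Always" or "Deferred" (assumed names of the two settings).
    event_states: list of State values as they were before the failure.
    """
    if recovery_mode not in ("Always", "Deferred"):
        raise ValueError(f"unknown RECOVERY_MODE: {recovery_mode!r}")

    result = []
    for state in event_states:
        if state is State.WORKING:
            # Always: recover at boot. Deferred: hold for manual resubmission.
            recovered = recovery_mode == "Always"
            result.append(State.RECOVERED if recovered else State.DEFERRED)
        else:
            # Events that are already Deferred stay deferred under either setting.
            result.append(state)
    return result


before = [State.WORKING, State.DEFERRED, State.WORKING]
after_always = apply_recovery_mode("Always", before)
after_deferred = apply_recovery_mode("Deferred", before)
```

With the Deferred setting, every event in the example ends up in the Deferred state, so nothing is loaded for recovery during the reboot.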
Figure 60. Properties dialog box, Collaboration General Properties tab
Always: The collaboration recovers all WIP events whose state is Working and that it owns at the time of server boot.
Deferred: The collaboration changes the WIP event to the Deferred recovery state. You must process those events at a later time using the Flow Manager or Failed Event Manager. For more information, see Working with failed events.
InterChange Server Express does not wait for collaborations and connectors to recover before it completes the boot process; collaborations and connectors are allowed to recover asynchronously after InterChange Server Express has booted. This makes it possible to use troubleshooting tools -- such as System Monitor, Failed Event Manager, and Flow Manager -- while the connectors and collaborations are recovering.
Critical errors in the InterChange Server Express system can cause problems in your run-time environment. A critical error as defined in the InterChange Server Express system can be generated by one of the following situations:
By default, a collaboration continues processing subsequent initiators after a flow has failed. However, you can configure a collaboration to pause automatically when a critical error occurs that might cause flows to fail. Because a paused collaboration processes no more initiators after a flow fails, the next flow cannot fail for the same reason. This is essential if the sequence in which initiators are processed must be maintained: when the collaboration pauses, the order in which initiators arrived at the server is preserved. At this point, you can fix the critical error, resolve the failed flow, and then restart the collaboration. If no waiting initiators depend on the initiator that is associated with the failed flow, you can resume the collaboration and resolve the failed flow at a later time. See Flow failures for more information on submitting failed events.
To configure a collaboration object to pause after a critical error occurs, select the Pause when critical error occurs check box in the Collaboration General Properties tab of the Properties dialog box.
If this value is set, the collaboration pauses when a critical error occurs and remains paused until either of the following occurs:
If you do not configure the collaboration to pause when a critical error occurs, the following situation might happen:
Two initiators, E1 and E2, are waiting to be processed by a collaboration. E1 creates a new customer and E2 updates that customer. Because E2 updates the customer that E1 creates, E1 must be processed before E2. If a critical error occurs when a collaboration is processing E1 and E1 fails as a result, then E1 is moved to the resubmission queue. If you do not select the Pause when critical error occurs check box, the collaboration attempts to process E2. E2 fails because it relies on the successful processing of E1.
If the collaboration property CONVERT_UPDATE is set to true, then E2, which updates E1, becomes a create and creates the new customer with the updated data. Data in E1 is now old and should not be manually submitted because it overwrites data delivered by E2.
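The E1 and E2 scenario can be traced with a short simulation. This is a simplified sketch, not the server's actual logic: process_initiators and its parameters are hypothetical names, CONVERT_UPDATE is assumed to be off, and a dependent initiator is assumed to fail whenever its prerequisite has not succeeded.

```python
from collections import deque


def process_initiators(initiators, fails, pause_on_critical_error):
    """Simulate one collaboration draining its initiator queue.

    initiators: ordered list of (name, depends_on) tuples; depends_on may be None.
    fails: set of initiator names that hit a critical error.
    Returns (succeeded, resubmission_queue, still_waiting, paused).
    """
    waiting = deque(initiators)
    succeeded, resubmission_queue = [], []
    paused = False

    while waiting:
        name, depends_on = waiting.popleft()
        critical = name in fails
        # A dependent initiator cannot succeed if its prerequisite never completed.
        broken_dependency = depends_on is not None and depends_on not in succeeded
        if critical or broken_dependency:
            resubmission_queue.append(name)
            if critical and pause_on_critical_error:
                paused = True
                break  # stop here so the arrival order of the rest is preserved
        else:
            succeeded.append(name)

    still_waiting = [name for name, _ in waiting]
    return succeeded, resubmission_queue, still_waiting, paused


events = [("E1", None), ("E2", "E1")]
without_pause = process_initiators(events, {"E1"}, pause_on_critical_error=False)
with_pause = process_initiators(events, {"E1"}, pause_on_critical_error=True)
```

Without the pause, both E1 and E2 end up in the resubmission queue. With the pause, only E1 fails and E2 is still waiting in its original position when the collaboration is resumed.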
Collaborations that are running assume that connectors have live connections to their applications. If a connector's application becomes unavailable, the connector is unable to poll the application for events and to satisfy collaboration requests.
When an application is unavailable, a connector that polls that application for events generates an error at each polling attempt. If the connector determines that the connection with the application has been lost, the connector agent terminates and returns a status to the connector controller requesting that the connector controller also terminate.
If a collaboration sends a request to a connector when the connector is up but its application has failed, the request returns to the collaboration with a failure status. This happens only if the connector property ControllerStoreAndForwardMode is set to false. The collaboration fails, logging one of the following messages: 17050, 17058, 17059, or 17060. If you receive such messages, check the status of the application.
The status of the connector agent is crucial to the InterChange Server Express system because it is a starting point for application events. A connector controller maintains the status of its connector agent and relays this information to System Manager.
The connector controller maintains the status of its connector agent by sending response requests to the connector agent at 15-second intervals.
If the connector agent does not respond after three consecutive checks, its status is assumed to be unknown. An unknown connector agent status might mean that the connector agent has failed or, if the connector agent is installed across the network, that the network connection has failed.
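The status check described above amounts to counting consecutive missed responses. The following sketch models only that rule: the 15-second interval and the limit of three checks come from the text, while AgentStatusTracker, record_check, and the "active" status label are hypothetical names used for illustration.

```python
CHECK_INTERVAL_SECONDS = 15  # interval between response requests (from the text)
MAX_MISSED_CHECKS = 3        # consecutive misses before the status is unknown


class AgentStatusTracker:
    """Tracks connector agent status the way a connector controller might."""

    def __init__(self):
        self.missed = 0
        self.status = "active"  # hypothetical label for a responsive agent

    def record_check(self, responded):
        """Record the outcome of one response request and return the status."""
        if responded:
            # Any response resets the count of consecutive misses.
            self.missed = 0
            self.status = "active"
        else:
            self.missed += 1
            if self.missed >= MAX_MISSED_CHECKS:
                self.status = "unknown"
        return self.status


tracker = AgentStatusTracker()
checks = [True, False, False, True, False, False, False]
statuses = [tracker.record_check(responded) for responded in checks]

# Time between the first missed check and the third, when the status changes.
seconds_from_first_miss = (MAX_MISSED_CHECKS - 1) * CHECK_INTERVAL_SECONDS
```

In this run, the two misses followed by a response do not change the status; only the final three misses in a row mark the agent as unknown.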
Setting the ControllerStoreAndForwardMode property for the connector to true makes the connector controller wait for the connector agent to start before delivering pending events. Setting this property to false makes the connector controller fail collaboration requests. The failed collaboration requests are moved to the resubmission queue and can be resubmitted using Flow Manager. See Flow failures for more information.
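The two values of ControllerStoreAndForwardMode lead to two different outcomes for a request that arrives while the connector agent is not running. A minimal sketch of that decision follows; route_request, the two lists, and the returned labels are hypothetical and stand in for the controller's internal handling, which the text does not describe.

```python
def route_request(request, agent_running, store_and_forward,
                  pending, resubmission_queue):
    """Decide what a connector controller does with one collaboration request."""
    if agent_running:
        return "delivered"
    if store_and_forward:
        # true: hold the request until the connector agent starts.
        pending.append(request)
        return "pending"
    # false: fail the request; it can be resubmitted later with Flow Manager.
    resubmission_queue.append(request)
    return "failed"


pending, resubmission_queue = [], []
held = route_request("req-1", False, True, pending, resubmission_queue)
failed = route_request("req-2", False, False, pending, resubmission_queue)
```

Here req-1 waits in the pending list, and req-2 lands in the resubmission queue.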
When the Pause when critical error occurs check box is selected for the associated collaborations, the collaborations bound to this connector pause upon receiving the unknown status of the connector agent. An error message is logged and e-mail is sent if the e-mail connector is configured.
When InterChange Server Express needs a database connection for one of its services but finds that the maximum number of connections is already in use, the server tries to free a connection that is idle. If the server is unsuccessful, the connection attempt fails and InterChange Server Express logs error 5010: "Unable to find an available connection in the cache. The maximum number of connections max-connections-value has been reached."
If you set a constraint on the number of InterChange Server Express connections by setting the MAX_CONNECTIONS parameter, you should monitor error 5010 messages because a connection failure can have undesirable consequences. For example, when InterChange Server Express cannot obtain a connection for its event management service, it stops running. By default, this constraint is set to an unlimited number of connections.
Connection failures indicate that the maximum number of allocated connections is insufficient to meet the run-time work load. If you cannot allocate more connections to InterChange Server Express in the current database, consider partitioning its workload across multiple databases.
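The behavior behind error 5010 can be illustrated with a toy connection cache. This sketch assumes a simple model in which each service holds its own busy and idle connections and the server may reclaim an idle connection from any service; ConnectionCache, ConnectionCacheError, and the service names are hypothetical, and only the message text is taken from the description above.

```python
class ConnectionCacheError(Exception):
    """Stands in for error 5010 in this sketch."""


class ConnectionCache:
    def __init__(self, max_connections=None):
        # None models the default: no limit on the number of connections.
        self.max_connections = max_connections
        self.in_use = {}  # service name -> number of busy connections
        self.idle = {}    # service name -> number of idle connections it holds

    def total(self):
        return sum(self.in_use.values()) + sum(self.idle.values())

    def acquire(self, service):
        if self.idle.get(service, 0) > 0:
            # Reuse an idle connection the service already holds.
            self.idle[service] -= 1
        elif self.max_connections is not None and self.total() >= self.max_connections:
            # At the limit: try to free an idle connection held by any service.
            donor = next((s for s, n in self.idle.items() if n > 0), None)
            if donor is None:
                raise ConnectionCacheError(
                    "5010: Unable to find an available connection in the cache. "
                    f"The maximum number of connections {self.max_connections} "
                    "has been reached."
                )
            self.idle[donor] -= 1
        self.in_use[service] = self.in_use.get(service, 0) + 1

    def release(self, service):
        # A released connection stays allocated but becomes idle.
        self.in_use[service] -= 1
        self.idle[service] = self.idle.get(service, 0) + 1


cache = ConnectionCache(max_connections=2)
cache.acquire("repository")
cache.acquire("event management")
cache.release("repository")
cache.acquire("event management")  # succeeds by reclaiming the idle connection

try:
    cache.acquire("repository")    # nothing idle and the limit is reached
    error_message = None
except ConnectionCacheError as exc:
    error_message = str(exc)
```

With the limit left at its default (no maximum), the same sequence of calls never raises the error, which is why the failure points to an allocation that is too small for the workload.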
At times, the InterChange Server Express system or its associated applications may fail. Because flows carry data through the InterChange Server Express system, processing them successfully is essential to maintaining data consistency in a run-time environment. System failures such as system errors, data errors, and critical errors can cause flows to fail. The InterChange Server Express system has built-in capabilities that allow you to handle such failures.
A system configuration, object definition, application-specific, or data consistency error can cause a flow to fail when the InterChange Server Express system is processing that flow. Improperly functioning InterChange Server Express components, such as business object mapping failures or the unavailability of a connector, can generate system errors, which cause flows to fail. Data inconsistencies, such as an isolation violation of application data during execution of a collaboration, generate data errors, which also cause flows to fail.
If an error occurs when a connector controller or a collaboration is processing a flow, the flow fails and is moved to the event resubmission queue. From here, you have the following choices:
For instructions on resolving failed flows, see "Working with failed events".
System and data errors can cause a transactional collaboration to fail. When one of these errors occurs, the collaboration attempts a rollback. If the rollback of a collaboration's compensation steps fails, the collaboration is in an "in-doubt" state. If an error occurs during run-time recovery, the collaboration is put into a list of failed transactional collaborations owned by the corresponding collaboration. A failed transactional collaboration is a collaboration whose compensation steps failed to roll back.
After a transactional collaboration fails, you must resolve it. You can process a failed transactional collaboration by using Flow Manager. For instructions on resolving failed transactional collaborations, see "Working with failed events".
The default behavior for a failed transactional collaboration is to pause. You can prevent failed transactional collaborations from pausing by adding a property called PAUSE_ON_COMPENSATION_FAILURE to the collaboration template and changing the setting from TRUE (default) to FALSE.
Perform the following steps to add the new property and change the setting to FALSE: