Managing failures in the IBM WebSphere ICS system consists of using troubleshooting resources to resolve problems. Critical errors that may cause events to fail can occur with a system component, such as a connector or collaboration, or with a third-party component, such as an integrated application.
When an error causes an event to fail to process properly, the WebSphere ICS system has built-in capabilities that let you resolve the problems. The WebSphere ICS system can be set up to pause collaborations within InterChange Server if a failure occurs. This section covers the following topics:
"Failure recovery for service calls"
"Strategies for InterChange Server recovery"
"Lost connection to application"
"Unknown connector agent status"
"Database connection failures"
To avoid sending duplicate events to a destination application, you may want to prevent a recovery from automatically resubmitting all service calls that were in transit when a failure occurred. To do so, before any server failure, configure a nontransactional collaboration to persist any service call event in the In-Transit state when a failure and recovery occur. When InterChange Server recovers, the service call events remain in the In-Transit state, and you can use the Unresolved Flows dialog to examine individual failed events and control when (or if) they are resubmitted.
To configure a collaboration to persist a failed service call in the In-Transit state, select the Persist Service Call In Transit State check box in the Collaboration General Properties window.
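The effect of this setting on recovery can be pictured with a minimal sketch. This is an illustrative model only, not InterChange Server code: the `ServiceCall` class, the `recover_service_calls` function, and the `Resubmitted` state label are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass
class ServiceCall:
    event_id: str
    state: str = "In-Transit"  # state at the moment the server failed


def recover_service_calls(calls, persist_in_transit):
    """Return the IDs that recovery resubmits automatically (hypothetical model).

    With persist_in_transit=True, in-transit calls keep their state so that
    an operator can review them; otherwise they are resubmitted, which risks
    duplicate events in the destination application.
    """
    resubmitted = []
    for call in calls:
        if call.state != "In-Transit":
            continue
        if persist_in_transit:
            continue  # left In-Transit for manual review
        call.state = "Resubmitted"
        resubmitted.append(call.event_id)
    return resubmitted


calls = [ServiceCall("evt-1"), ServiceCall("evt-2")]
held = recover_service_calls(calls, persist_in_transit=True)
```

In this model, nothing is resubmitted when the option is on, and every in-transit call is resubmitted when it is off.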
If InterChange Server (ICS) fails while processing events, all the events currently in the Work-in-Progress (WIP) queue must be either recovered or otherwise dealt with when the server reboots. Because of memory requirements, recovering the WIP events can slow or even halt the server reboot. The IBM WebSphere ICS product provides two features, deferred recovery and asynchronous recovery, that shorten the time the server takes to reboot and make the server available for other work before all events have been recovered.
In release 4.2 of the product, two new features were added that improve the efficiency of deferred recovery and asynchronous recovery: flow control and storing business object keys as part of the WIP data. Both features reduce the amount of memory needed during an ICS recovery and should therefore significantly decrease the time ICS needs to reboot during a recovery.
Storing business object keys as part of the WIP data means that during recovery, the business object key is retrieved without deserializing the business object, thus preventing the need to do an MQ or a database round-trip. Flow control is a service that allows you to configure either system-wide or component-level queue depth parameters in an effort to control the memory demands on ICS.
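The idea behind a queue depth limit can be sketched with a bounded queue. This is a simplified illustration, not the product's flow control service: the `MAX_EVENT_CAPACITY` name and the choice to refuse new events when the limit is reached are assumptions made for the example.

```python
import queue

# Hypothetical depth limit; the real parameter names are not given in this section.
MAX_EVENT_CAPACITY = 3


def offer(q, event):
    """Try to enqueue an event; return False when the queue is at its depth limit."""
    try:
        q.put_nowait(event)
        return True
    except queue.Full:
        return False


wip = queue.Queue(maxsize=MAX_EVENT_CAPACITY)
results = [offer(wip, f"evt-{i}") for i in range(5)]
```

Capping the depth bounds the memory the queue can consume, at the cost of making producers wait or retry once the limit is reached.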
This section covers the following topics:
"Deferred recovery of collaboration events"
In deferred recovery, recovery of a collaboration's WIP events is deferred until after the server has rebooted, thereby saving the memory usage associated with those events.
After the server has rebooted, you can resubmit the events manually. Note the following recommendations:
You establish deferred recovery by setting the RECOVERY_MODE property of a collaboration object.
The RECOVERY_MODE property has two settings, which have these behaviors when a server failure and reboot occurs:
Deferred: Events that were in the Working state before the server failure will be transitioned to Deferred. No events for this collaboration will be recovered until you resubmit them manually.
Always: Events that were in the Working state before the server failure will be recovered. Any existing events that were previously transitioned to the Deferred state remain deferred until you resubmit them.
The default setting is Always.
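The two behaviors can be summarized as a small state-transition sketch. It is a model of the description above, not product code: the function name and the `Recovered` label are hypothetical, and the setting name `Deferred` is assumed from the name of the Deferred state (only `Always` is named explicitly in this section).

```python
def states_after_reboot(event_states, recovery_mode="Always"):
    """Map each WIP event state to its state after a server failure and reboot."""
    result = []
    for state in event_states:
        if state == "Working":
            # Working events are either deferred or recovered, depending on the mode.
            result.append("Deferred" if recovery_mode == "Deferred" else "Recovered")
        else:
            # Events deferred earlier stay deferred until resubmitted manually.
            result.append(state)
    return result


always = states_after_reboot(["Working", "Deferred"], "Always")
deferred = states_after_reboot(["Working", "Deferred"], "Deferred")
```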
Figure 14. Properties dialog box, Collaboration General Properties tab
The collaboration will recover all WIP events whose state is WORKING and which it owns at the time of server boot.
The collaboration will change the WIP event states to a deferred recovery state. You will then need to handle those events at a later time using the Flow Manager. For instructions on using Flow Manager, see System Administration Guide.
InterChange Server does not wait for collaborations and connectors to recover before it completes boot-up; collaborations and connectors are allowed to recover asynchronously after InterChange Server has booted. This makes it possible to use troubleshooting tools, such as System Monitor and Flow Manager, while the connectors and collaborations are still recovering.
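The general pattern of asynchronous recovery can be sketched with background threads. This is only an analogy under stated assumptions: `boot_server`, `recover`, and the component names are hypothetical, and the real server's recovery mechanism is not described in this section.

```python
import threading


def boot_server(components, recover):
    """Start recovery for each component in the background and return at once."""
    threads = [threading.Thread(target=recover, args=(name,)) for name in components]
    for t in threads:
        t.start()
    return threads  # the caller can use the server while these run


recovered = []
gate = threading.Event()


def recover(name):
    gate.wait()  # simulate recovery work that takes a while
    recovered.append(name)


threads = boot_server(["CollabA", "ConnectorB"], recover)
booted_before_recovery = recovered == []  # boot returned; nothing recovered yet
gate.set()
for t in threads:
    t.join()
```

The point of the sketch is that `boot_server` returns before any component has finished recovering, which is what leaves the server free for monitoring tools in the meantime.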
Critical errors in the WebSphere ICS system can cause problems in your runtime environment. A critical error as defined in the WebSphere ICS system can be generated by one of the following situations:
By default, a collaboration continues processing subsequent initiators after a flow has failed. However, you can configure a collaboration to pause automatically when a critical error occurs that could cause flows to fail. Because a paused collaboration processes no more initiators after a flow fails, the next flow cannot fail for the same reason. This is critical if the sequence in which initiators are processed must be maintained: if the collaboration pauses, the order in which initiators arrived at the server is preserved. At this point, you can fix the critical error, resolve the failed flow, and then restart the collaboration. If collaborations are not dependent on an initiator that is associated with a failed flow, you can choose to resume the collaboration and resolve the failed flow at a later time. See "Failed flows" for more information on submitting failed events.
Figure 15. Properties dialog box showing "Pause when critical error occurs" option
To configure a collaboration object to pause after a critical error occurs, put a check in the "Pause when critical error occurs" box in the Collaboration General Properties tab of the Properties dialog box.
If this value is set, the collaboration will pause when a critical error occurs and will remain paused until either of the following occurs:
If you do not configure the collaboration to pause when a critical error occurs, the following situation might happen:
Two initiators, E1 and E2, are waiting to be processed by a collaboration. E1 creates a new customer and E2 updates E1. Since E2 updates E1, E1 must process before E2. If a critical error occurs while a collaboration is processing E1 and E1 fails as a result, then E1 is moved to the resubmission queue. If the Pause when critical error occurs box is not checked, the collaboration attempts to process E2. E2 fails because it relies on the successful processing of E1.
If the collaboration property CONVERT_UPDATE is set to true, then E2, which updates E1, becomes a create and creates the new customer with the updated data. Data in E1 is now old and should not be manually submitted because it will overwrite data delivered by E2.
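The E1/E2 scenario can be traced with a short simulation. It is a simplified model with hypothetical names (`run`, `critical_failures`, `depends_on`), and it ignores the CONVERT_UPDATE case.

```python
def run(initiators, critical_failures, depends_on, pause_on_critical_error):
    """Process initiators in arrival order.

    Returns (processed, resubmission_queue, still_waiting).
    """
    processed, resubmission = [], []
    waiting = list(initiators)
    while waiting:
        event = waiting.pop(0)
        prerequisite = depends_on.get(event)
        blocked = prerequisite is not None and prerequisite not in processed
        if event in critical_failures or blocked:
            resubmission.append(event)  # the failed flow goes to the resubmission queue
            if event in critical_failures and pause_on_critical_error:
                break  # collaboration pauses; remaining initiators keep their order
        else:
            processed.append(event)
    return processed, resubmission, waiting


no_pause = run(["E1", "E2"], {"E1"}, {"E2": "E1"}, pause_on_critical_error=False)
with_pause = run(["E1", "E2"], {"E1"}, {"E2": "E1"}, pause_on_critical_error=True)
```

Without the pause, both E1 and E2 end up in the resubmission queue. With the pause, only E1 fails and E2 is still waiting, in order, for the collaboration to resume.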
Collaborations that are running assume that connectors have live connections to their applications. If a connector's application becomes unavailable, the connector is unable to poll the application for events and to satisfy collaboration requests.
When an application is unavailable, a connector that polls that application for events generates an error at each polling attempt. If the connector determines that the connection with the application has been lost, the connector agent terminates and returns a status to the connector controller requesting that the connector controller also terminate.
If a collaboration sends a request to a connector while the connector is up but its application is down, the request returns with a failure status to the collaboration. This happens only if the connector property ControllerStoreAndForwardMode is set to false. The collaboration fails, logging one of the following messages: 17050, 17058, 17059, or 17060. If you receive such messages, check the status of the application.
The status of the connector agent is crucial to the WebSphere ICS system because the connector agent is a starting point for the application events that the system processes. A connector controller maintains the status of its connector agent and relays this information to System Manager.
The connector controller maintains the status of its connector agent by sending response requests to the connector agent at 15-second intervals.
If the connector agent does not respond to three consecutive checks, its status is assumed to be unknown. An unknown connector agent status might mean that the connector agent is down or, if the connector agent is installed across the network, that the network connection is down.
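The three-consecutive-misses rule can be expressed as a small counter. This is a sketch of the rule as described above, not the connector controller's implementation: the `agent_status` function and the `active` label are hypothetical, and the 15-second timing between checks is left out.

```python
MAX_MISSED_CHECKS = 3  # three consecutive checks, as stated in the text


def agent_status(responses):
    """Return the status after a sequence of check results (True = agent responded)."""
    missed = 0
    for responded in responses:
        # Any response resets the count; only consecutive misses accumulate.
        missed = 0 if responded else missed + 1
        if missed >= MAX_MISSED_CHECKS:
            return "unknown"
    return "active"


intermittent = agent_status([True, False, False, True, False])
silent = agent_status([True, False, False, False])
```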
Setting the ControllerStoreAndForwardMode property for the connector to true makes the connector controller wait for the connector agent to come up before delivering any pending events. Setting this property to false makes the connector controller fail collaboration requests. The failed collaboration requests are moved to the resubmission queue and can be resubmitted using Flow Manager. See "Failed flows" for more information.
When the Pause when critical error occurs box is checked for the associated collaborations, the collaborations bound to this connector pause upon receiving the unknown status of the connector agent. An error message is logged and e-mail is sent if the e-mail connector is configured.
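The effect of ControllerStoreAndForwardMode on a collaboration request can be sketched as a routing decision. The function and the list names below are hypothetical, introduced only to illustrate the two outcomes described above.

```python
def handle_request(request, agent_up, store_and_forward, pending, resubmission):
    """Route a collaboration request according to the connector agent's state."""
    if agent_up:
        return "delivered"
    if store_and_forward:
        pending.append(request)  # held until the connector agent comes back up
        return "pending"
    resubmission.append(request)  # failed; can be resubmitted later
    return "failed"


pending, resubmission = [], []
held = handle_request("req-1", agent_up=False, store_and_forward=True,
                      pending=pending, resubmission=resubmission)
failed = handle_request("req-2", agent_up=False, store_and_forward=False,
                        pending=pending, resubmission=resubmission)
```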
When InterChange Server needs a database connection for one of its services but finds that the maximum number of connections is already in use, the server tries to free an idle connection. If the server is unsuccessful, the connection attempt fails and InterChange Server logs error 5010: Unable to find an available connection in the cache. The maximum number of connections max-connections-value has been reached.
If you constrain the number of InterChange Server connections by setting the MAX_CONNECTIONS parameter, monitor error 5010 messages, because a connection failure can have undesirable consequences. For example, when InterChange Server cannot obtain a connection for its event management service, it stops running. By default, the number of connections is unlimited.
Connection failures indicate that the maximum number of allocated connections is insufficient to meet the runtime work load. If you cannot allocate more connections to InterChange Server in the current database, consider partitioning its work load across multiple databases.
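The behavior of a capped connection cache can be modeled in a few lines. This is a generic sketch, not InterChange Server's implementation: `ConnectionCache` and `ConnectionLimitError` are hypothetical names, and the exception merely stands in for error 5010.

```python
class ConnectionLimitError(Exception):
    """Stands in for error 5010 in this sketch."""


class ConnectionCache:
    def __init__(self, max_connections=None):
        self.max_connections = max_connections  # None models "unlimited"
        self.in_use = set()
        self.idle = set()
        self._next_id = 0

    def acquire(self):
        if self.idle:
            conn = self.idle.pop()  # reuse an idle connection first
        elif self.max_connections is None or len(self.in_use) < self.max_connections:
            conn = self._next_id  # room left: open a new connection
            self._next_id += 1
        else:
            raise ConnectionLimitError(
                f"maximum number of connections {self.max_connections} has been reached"
            )
        self.in_use.add(conn)
        return conn

    def release(self, conn):
        self.in_use.remove(conn)
        self.idle.add(conn)


cache = ConnectionCache(max_connections=2)
first = cache.acquire()
second = cache.acquire()
try:
    cache.acquire()
    limit_hit = False
except ConnectionLimitError:
    limit_hit = True
cache.release(first)
reused = cache.acquire()
```

A third request fails while both connections are busy, and succeeds once one of them has been released, which is why frequent failures of this kind point to a limit that is too low for the work load.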
At times, the WebSphere ICS system or its associated applications may fail. Because flows carry data through the WebSphere ICS system, processing them successfully is essential to maintaining data consistency in a runtime environment. System failures such as system errors, data errors, and critical errors can cause flows to fail to process. The WebSphere ICS system has some built-in capabilities that allow you to handle system failures. The following topics describe two different types of failures:
"Failed flows"
"Failed transactional collaborations"
For information on managing, resolving, and preventing flow failures, see System Administration Guide.
A system configuration, object definition, application-specific, or data consistency error can cause a flow to fail when the WebSphere ICS system is processing that flow. Improperly functioning InterChange Server components, such as business object mapping failures, or the unavailability of a connector, can generate system errors, which cause flows to fail. Data inconsistencies, such as an isolation violation of application data during execution of a collaboration, generate data errors, which also cause flows to fail.
If an error occurs while a connector controller or a collaboration is processing a flow, the flow fails and is moved to the event resubmission queue. From here, you have the following choices:
For instructions on resolving failed flows, see System Administration Guide.
System and data errors can cause a transactional collaboration to fail. When one of these errors occurs, the collaboration attempts a rollback. If the rollback of a collaboration's compensation steps fails, the collaboration is in an "in-doubt" state. If an error occurs during runtime recovery, the collaboration is put into a list of failed transactional collaborations owned by the corresponding collaboration. A failed transactional collaboration is a collaboration whose compensation steps failed to roll back.
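The relationship between rollback, compensation steps, and the in-doubt state can be illustrated with a generic compensation pattern. This is a sketch of the general technique, not the collaboration runtime: `run_transaction` and the outcome labels `committed` and `rolled-back` are hypothetical.

```python
def run_transaction(steps):
    """Run (action, compensation) pairs; compensate in reverse order on failure."""
    completed = []
    try:
        for action, compensation in steps:
            action()
            completed.append(compensation)
    except Exception:
        try:
            for compensation in reversed(completed):
                compensation()
        except Exception:
            return "in-doubt"  # a compensation step itself failed
        return "rolled-back"
    return "committed"


def ok():
    pass


def boom():
    raise RuntimeError("simulated system or data error")


clean = run_transaction([(ok, ok)])
rolled_back = run_transaction([(ok, ok), (boom, ok)])
in_doubt = run_transaction([(ok, boom), (boom, ok)])
```

The third case corresponds to a failed transactional collaboration: an action failed, and the compensation meant to undo earlier work failed as well, so the outcome cannot be determined automatically.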
Once a transactional collaboration fails, you need to resolve it. You can handle a failed transactional collaboration by using Flow Manager. For instructions on resolving failed transactional collaborations, see System Administration Guide.
The default behavior for a failed transactional collaboration is to pause. You can prevent failed transactional collaborations from pausing by adding a property called PAUSE_ON_COMPENSATION_FAILURE to the collaboration template and changing the setting from TRUE (default) to FALSE. To add the new property and change the setting to FALSE, do the following: