The high availability of the transaction service enables any server in a cluster to recover the transactional work for any other server in the same cluster. This facility forms part of the overall WebSphere Application Server high availability (HA) strategy.
As a vital part of providing recovery for transactions, the transaction service logs information about active transactional work in the transaction recovery log. The transaction recovery log stores the information in a persistent form, which means that any transactional work in progress at the time of a server failure can be resolved when the server is restarted. This activity is known as transaction recovery processing. In addition to completing outstanding transactions, this processing also ensures that any locks held in the associated resource managers are released.
Peer recovery processing
When an application server restarts, the standard recovery process is for the server to retrieve and process the logged transaction information, recover transactional work, and complete in-doubt transactions. The transactional work is completed (and any database locks held by the transactions are released) only after the server has successfully restarted and processed its transaction logs. If the server is slow to recover or requires manual intervention, the transactional work cannot be completed and access to associated databases is disrupted.
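The restart recovery described above can be pictured with a small in-memory model. This is only a sketch under stated assumptions: RecoveryLogSketch, Entry, and the commitLogged flag are hypothetical names invented for illustration, not WebSphere Application Server APIs, and a real recovery log is persistent rather than held in memory.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical in-memory stand-in for a transaction recovery log; not a WebSphere API. */
final class RecoveryLogSketch {

    /** One logged transaction plus the database rows it holds locks on. */
    static final class Entry {
        final String txId;
        final boolean commitLogged; // assumption: the log records whether commit was decided
        final Set<String> lockedRows;

        Entry(String txId, boolean commitLogged, Set<String> lockedRows) {
            this.txId = txId;
            this.commitLogged = commitLogged;
            this.lockedRows = new HashSet<>(lockedRows);
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    /** Record active transactional work so that it can be resolved after a failure. */
    void log(Entry entry) {
        entries.add(entry);
    }

    /**
     * Recovery processing at restart: complete each logged transaction and
     * release its locks. Returns one "txId:outcome" string per transaction.
     */
    List<String> recover(Set<String> locksHeldInDatabase) {
        List<String> outcomes = new ArrayList<>();
        for (Entry entry : entries) {
            String outcome = entry.commitLogged ? "commit" : "rollback";
            // Completing the transaction, whichever way, releases its locks.
            locksHeldInDatabase.removeAll(entry.lockedRows);
            outcomes.add(entry.txId + ":" + outcome);
        }
        entries.clear(); // nothing is left to recover afterwards
        return outcomes;
    }
}
```

In this model the locks stay in the database set until recover is called, which mirrors why a server that is slow to restart leaves access to the associated databases disrupted.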
To minimize such disruption to transactional work and the associated databases, WebSphere Application Server provides a high availability strategy known as transaction peer recovery.
Peer recovery is provided within a server cluster. A peer server (another cluster member) can process the recovery logs of a failed server while the peer continues to manage its own transactional workload. You do not have to wait for the failed server to restart, or start a new application server specifically to recover the failed server.
The peer recovery process is the logical equivalent of restarting the failed server, but it does not constitute a complete restart of the failed server within the peer server. It only provides an opportunity for outstanding work to be completed. The peer recovery process cannot start new work beyond recovery processing. In other words, no "forward processing" is possible for the failed server.
Peer recovery moves the high availability requirements away from individual servers and onto the server cluster. If a server fails, all that is required is to tidy up work that was active on the failed server and redirect requests to an alternate server. After such a failure, the management system of the cluster dispatches new work onto the remaining servers; the only difference is a potential drop in overall system throughput.
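As a rough illustration of how new work keeps flowing to the remaining cluster members after a failure, the following sketch uses simple round-robin dispatch. The class DispatchSketch and the round-robin policy are assumptions made for this example; the text does not say which routing policy the cluster management system actually uses.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical dispatcher; round-robin is an assumption, not the documented policy. */
final class DispatchSketch {
    private final List<String> members = new ArrayList<>();
    private int next = 0;

    DispatchSketch(List<String> initialMembers) {
        members.addAll(initialMembers);
    }

    /** Stop routing new work to a failed server. */
    void markFailed(String server) {
        members.remove(server);
    }

    /** Pick the server that receives the next unit of new work. */
    String dispatch() {
        if (members.isEmpty()) {
            throw new IllegalStateException("no servers available");
        }
        String target = members.get(next % members.size());
        next++;
        return target;
    }
}
```

With three members and one marked as failed, every later call to dispatch returns one of the two remaining servers, so each of them carries a larger share of the work.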
Peer recovery example
The following diagrams illustrate the peer recovery process that takes place if a single server fails. Figure 2 shows three stable servers running in a WebSphere Application Server cluster. The workload is balanced between these servers, which results in locks being held by the backend database on behalf of each of them.
Figure 3 shows the state of the system after server 1 has failed without clearing locks from the database. Servers 2 and 3 are able to run their existing transactions to completion and release existing locks in the backend database, but further access may be impaired because of the locks still held on behalf of server 1. In practice, some level of access by servers 2 and 3 should still be possible, assuming appropriately configured lock granularity, but for this example assume that servers 2 and 3 have attempted to access locked records and become blocked.
Figure 4 shows a peer recovery process for server 1 running inside server 3. The transaction service portion of the recovery process retrieves the information persisted by server 1, and uses that information to complete any in-doubt transactions. In this figure, the peer recovery process is partially complete as some locks are still held by the database on behalf of server 1.
Figure 5 shows the state of the server cluster when the peer recovery process has completed. The system is in a stable state with just two servers, between which the workload is balanced. Server 1 can be restarted at some time in the future, when it will have no recovery processing of its own to perform.
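The sequence in Figures 2 to 5 can also be walked through with a toy model. The sketch below is illustrative only: ClusterSketch and its methods are hypothetical names, locks are modelled as a simple row-to-server map, and each server's recovery log is reduced to the set of rows its logged work has locked.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical model of the example cluster; none of these names are WebSphere APIs. */
final class ClusterSketch {
    private final Set<String> running = new HashSet<>();
    private final Map<String, String> lockOwner = new HashMap<>();         // row -> server holding the lock
    private final Map<String, Set<String>> recoveryLogs = new HashMap<>(); // server -> rows locked by logged work

    /** Starts or restarts a server, which first processes its own log. Returns locks released. */
    int start(String server) {
        running.add(server);
        return processLog(server);
    }

    /** Tries to lock a row for new work. Returns false if another server holds the lock. */
    boolean lockRow(String server, String row) {
        if (!running.contains(server)) {
            // No forward processing is possible for a server that is down.
            throw new IllegalStateException(server + " is not running");
        }
        String owner = lockOwner.get(row);
        if (owner != null && !owner.equals(server)) {
            return false; // blocked by a lock held on behalf of another server
        }
        lockOwner.put(row, server);
        recoveryLogs.computeIfAbsent(server, s -> new HashSet<>()).add(row);
        return true;
    }

    /** The server fails without clearing locks: its locks and its log stay behind. */
    void fail(String server) {
        running.remove(server);
    }

    /** A running peer processes the log of a failed server. Returns locks released. */
    int peerRecover(String peer, String failed) {
        if (!running.contains(peer) || running.contains(failed)) {
            throw new IllegalStateException("peer must be running and target must be down");
        }
        return processLog(failed);
    }

    /** Completes the logged work of one server and releases the locks it held. */
    private int processLog(String server) {
        Set<String> rows = recoveryLogs.remove(server);
        if (rows == null) {
            return 0; // nothing to recover
        }
        for (String row : rows) {
            lockOwner.remove(row, server);
        }
        return rows.size();
    }
}
```

In this model, after server 1 fails while holding a lock, an attempt by server 2 to lock the same row returns false until peerRecover("server3", "server1") has run. A later start("server1") returns 0, matching the statement that the restarted server has no recovery processing of its own to perform.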
Related concepts
Transaction support in WebSphere Application Server
Related tasks
Configuring transaction properties for peer recovery