In a distributed data grid, partitions are distributed across multiple Java™ virtual machines (JVM). These
JVMs can be on more than one system. A transaction that writes to
multiple partitions might involve transactional decisions that affect
more than one system. When the transaction is committed with a two-phase
commit protocol, the commit process ensures that either the entire transaction
is persisted or none of the transaction is persisted. The two-phase commit
process ensures this outcome despite partition, system, or communication
failures. If a failure occurs in the second phase, the WebSphere® eXtreme Scale client attempts to
resolve the failure automatically, unless the error meets certain
criteria, in which case you can intervene manually.
A transaction that is enabled to write to multiple partitions uses
the two-phase commit protocol. A two-phase commit protocol ensures
that the commit process is consistent across all partitions and systems. WebSphere eXtreme Scale acts as the coordinator
that controls the two-phase commit process. The partitions that are
involved in the transaction are called the participants or resource
managers (RM). During the second phase of the commit protocol, the
coordinator delegates one of the partitions to act as the transaction
manager (TM). The TM is responsible for tracking the decision of each
transaction and recovering the transaction if a failure occurs.
- First phase:
- When an application commits a transaction, the WebSphere eXtreme Scale client starts the first
phase by sending a prepare-to-commit request to each partition identified
as an RM. Each partition applies the transaction changes to the backing
maps and holds all locks to ensure data integrity. Each RM then notifies the WebSphere eXtreme Scale client. After all partitions
identified as an RM respond with success, the WebSphere eXtreme Scale client begins the second
phase of the commit protocol.
- Second phase:
- If at least one partition fails during the first phase, then the
coordinator rolls back all partitions during the second phase. If
all RM partitions respond with success, then the WebSphere eXtreme Scale client delegates one
of the partitions to act as the TM partition. As the coordinator, WebSphere eXtreme Scale begins the second phase
of the commit protocol by sending a commit or a rollback request to
all partitions that are involved in the transaction. Each partition
that is identified as an RM then either applies or rolls back the
changes to the backing map and releases all the locks. The RM then
notifies the WebSphere eXtreme Scale client.
If at least one partition fails during the second phase, then the
delegated TM partition automatically recovers the transaction. Automatic
recovery ensures that all the partitions that are involved in the transaction
are consistent.
- Indoubt phase:
- The indoubt phase is the period after the RM partition
successfully processes the first phase and while it waits to begin the
second phase. During the indoubt period, the RM partition does not
know whether to commit or roll back the transaction, so it continues
to hold its locks. Holding the locks can increase lock
contention for other transactions.
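The phases described above can be illustrated with a small in-memory simulation. This is only a sketch under stated assumptions: the `TwoPhaseCommitSketch`, `Participant`, and `commitTransaction` names are hypothetical and are not WebSphere eXtreme Scale APIs, failures are simulated with a flag, and the delegation of a TM partition is left out for brevity.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical in-memory model of a two-phase commit.
// None of these types are WebSphere eXtreme Scale APIs.
final class TwoPhaseCommitSketch {

    enum State { IDLE, PREPARED, COMMITTED, ROLLED_BACK }

    /** A partition that acts as a resource manager (RM). */
    static final class Participant {
        final String name;
        final boolean failsPrepare; // simulates a first-phase failure
        State state = State.IDLE;
        boolean locksHeld = false;

        Participant(String name, boolean failsPrepare) {
            this.name = name;
            this.failsPrepare = failsPrepare;
        }

        /** First phase: apply the changes, hold the locks, report the result. */
        boolean prepare() {
            if (failsPrepare) {
                return false;
            }
            state = State.PREPARED; // indoubt until the second phase arrives
            locksHeld = true;
            return true;
        }

        /** Second phase, commit decision: keep the changes, release the locks. */
        void commit() {
            state = State.COMMITTED;
            locksHeld = false;
        }

        /** Second phase, rollback decision: discard the changes, release the locks. */
        void rollback() {
            state = State.ROLLED_BACK;
            locksHeld = false;
        }
    }

    /** Coordinator logic: returns true if every participant committed. */
    static boolean commitTransaction(List<Participant> participants) {
        // First phase: send prepare to every participant and collect the votes.
        boolean allPrepared = true;
        for (Participant p : participants) {
            allPrepared = p.prepare() && allPrepared;
        }
        // Second phase: commit only if every vote was a success, else roll back all.
        for (Participant p : participants) {
            if (allPrepared) {
                p.commit();
            } else {
                p.rollback();
            }
        }
        return allPrepared;
    }

    public static void main(String[] args) {
        List<Participant> participants = Arrays.asList(
                new Participant("partition-0", false),
                new Participant("partition-1", true));
        boolean committed = commitTransaction(participants);
        System.out.println("committed=" + committed);
        for (Participant p : participants) {
            System.out.println(p.name + " -> " + p.state);
        }
    }
}
```

In this model, a participant is indoubt between a successful `prepare()` and the later `commit()` or `rollback()` call, which is exactly the window in which `locksHeld` stays true.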
Error recovery during a two-phase commit
If a failure occurs during the first phase, the WebSphere eXtreme Scale client rolls back the
transaction. If one of the partitions fails to commit the transaction
during the second phase, then the TM ensures that the transaction is
committed by periodically attempting to commit it. An example of a log
message that occurs in this scenario follows:
00000099 TransactionLog I
CWOBJ8705I: Automatic resolution of transaction WXS-40000139-DF01-216D-E002-1CB456931719
at RM:TestGrid:TestSet2:20 is still waiting for a decision. Another
attempt to resolve the transaction will occur in 30 seconds.
Allow the WebSphere eXtreme Scale client to resolve the transaction. Attempt to intervene manually
only if the transaction is not recovered within 1 minute or if the application
is experiencing a high volume of lock contention because the transaction
is indoubt. For more information about how to manually recover
a transaction, see
Troubleshooting lock timeout exceptions for a multi-partition transaction.
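The periodic resolution described in this section can be pictured as a bounded retry loop. The following is a minimal sketch, not product code: `InDoubtResolutionSketch`, `resolve`, and the `attemptCommit` callback are hypothetical names, and the retry interval and attempt limit are plain parameters. In the scenario above, the interval would correspond to the 30 seconds in the log message and the limit to the 1 minute guideline for manual intervention.

```java
import java.util.function.BooleanSupplier;

// Hypothetical illustration of periodic transaction resolution.
// This is NOT how WebSphere eXtreme Scale is implemented or configured.
final class InDoubtResolutionSketch {

    enum Outcome { RESOLVED, NEEDS_MANUAL_INTERVENTION }

    /**
     * Calls attemptCommit up to maxAttempts times, pausing retryDelayMillis
     * between attempts. attemptCommit returns true when the commit succeeds.
     */
    static Outcome resolve(BooleanSupplier attemptCommit, int maxAttempts, long retryDelayMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (attemptCommit.getAsBoolean()) {
                return Outcome.RESOLVED;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(retryDelayMillis); // wait before the next attempt
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt flag
                    return Outcome.NEEDS_MANUAL_INTERVENTION;
                }
            }
        }
        // Automatic resolution did not succeed within the allowed attempts.
        return Outcome.NEEDS_MANUAL_INTERVENTION;
    }

    public static void main(String[] args) {
        // Simulated partition that becomes reachable on the third attempt.
        int[] calls = {0};
        Outcome outcome = resolve(() -> ++calls[0] >= 3, 5, 10L);
        System.out.println(outcome + " after " + calls[0] + " attempts");
    }
}
```

The loop only flags the transaction when automatic resolution runs out of attempts, which mirrors the guidance to let automatic recovery run before intervening.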