Replication architecture

With eXtreme Scale, an in-memory database or shard can be replicated from one Java™ virtual machine (JVM) to another. A shard represents a partition that is placed on a container. Multiple shards that represent different partitions can exist on a single container. Each partition has an instance that is a primary shard and a configurable number of replica shards. The replica shards are either synchronous or asynchronous. The types and placement of replica shards are determined by eXtreme Scale using a deployment policy, which specifies the minimum and maximum number of synchronous and asynchronous shards.
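The deployment policy is an XML descriptor. As a rough illustration (names and values here are illustrative, not a complete descriptor), a map set that requires one synchronous replica and allows one asynchronous replica might be configured like this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<deploymentPolicy xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns="http://ibm.com/ws/objectgrid/deploymentPolicy">
  <objectgridDeployment objectgridName="Grid">
    <!-- minSyncReplicas/maxSyncReplicas/maxAsyncReplicas control shard placement -->
    <mapSet name="mapSet" numberOfPartitions="13"
        minSyncReplicas="1" maxSyncReplicas="1" maxAsyncReplicas="1">
      <map ref="Map1"/>
    </mapSet>
  </objectgridDeployment>
</deploymentPolicy>
```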

Shard types

Replication uses three types of shards:

  • Primary
  • Synchronous replica
  • Asynchronous replica

The primary shard receives all insert, update and remove operations. The primary shard adds and removes replicas, replicates data to the replicas, and manages commits and rollbacks of transactions.

Synchronous replicas maintain the same state as the primary. When a primary replicates data to a synchronous replica, the transaction is not committed until it commits on the synchronous replica.

Asynchronous replicas might or might not be at the same state as the primary. When a primary replicates data to an asynchronous replica, the primary does not wait for the asynchronous replica to commit. An asynchronous replica still maintains the order of the transactions that are sent from the primary.
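The difference between the two replica types can be sketched with a minimal model (hypothetical classes, not the eXtreme Scale API): the primary's commit blocks until synchronous replicas have applied the change, while asynchronous replicas receive changes later through an ordered queue.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical model of primary-to-replica propagation.
class Replica {
    final List<String> log = new ArrayList<>();
    void apply(String txn) { log.add(txn); }          // replicas apply in arrival order
}

class Primary {
    final List<Replica> syncReplicas = new ArrayList<>();
    final List<Replica> asyncReplicas = new ArrayList<>();
    final Queue<String> asyncOutbox = new ArrayDeque<>(); // preserves transaction order
    final List<String> committed = new ArrayList<>();

    void commit(String txn) {
        for (Replica r : syncReplicas) r.apply(txn);  // commit blocks on sync replicas
        committed.add(txn);                           // primary commit completes here
        asyncOutbox.add(txn);                         // async replicas catch up later
    }

    void drainAsync() {                               // stand-in for a background thread
        String txn;
        while ((txn = asyncOutbox.poll()) != null)
            for (Replica r : asyncReplicas) r.apply(txn);
    }
}

public class ReplicationSketch {
    public static void main(String[] args) {
        Primary p = new Primary();
        Replica sync = new Replica(), async = new Replica();
        p.syncReplicas.add(sync);
        p.asyncReplicas.add(async);

        p.commit("t1");
        p.commit("t2");
        System.out.println(sync.log);   // [t1, t2] — already matches the primary
        System.out.println(async.log);  // [] — may lag behind
        p.drainAsync();
        System.out.println(async.log);  // [t1, t2] — same order as the primary
    }
}
```

Note that even while lagging, the asynchronous replica applies transactions in the order the primary sent them, as the text requires.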

Figure 1. Communication path between a primary shard and replica shards
A transaction communicates with a primary shard. The primary shard communicates with replica shards, which can exist in one or more Java Virtual Machines.

The memory cost of replication

To bring a replica online, a checkpoint map is created to copy the existing data from the primary. The main memory cost on the primary is then the memory for each entry that is modified after the checkpoint is taken: if the data changes heavily after the checkpoint, this operation costs memory proportional to the number of modified entries. After the checkpoint is copied to the replica, the changes are merged back into the checkpoint. The changed entries are not freed until they have been sent to all required replicas.
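A minimal sketch of this cost, using hypothetical classes rather than the product internals: writes that arrive while the snapshot is being copied are kept in a side map, so the extra memory grows with the number of modified entries, not with the size of the map.

```java
import java.util.HashMap;
import java.util.Map;

public class CheckpointSketch {
    static class CheckpointedMap {
        final Map<String, String> live = new HashMap<>();
        Map<String, String> checkpoint;      // frozen snapshot copied to the new replica
        Map<String, String> modifiedSince;   // extra memory: entries changed after checkpoint

        void beginCheckpoint() {
            checkpoint = new HashMap<>(live);
            modifiedSince = new HashMap<>();
        }

        void put(String key, String value) {
            live.put(key, value);
            if (modifiedSince != null) modifiedSince.put(key, value); // cost grows per entry
        }

        // Copy finished: merge changes back and release the snapshot. In the real
        // system the changed entries are freed only after all replicas have them.
        Map<String, String> endCheckpoint() {
            Map<String, String> changes = modifiedSince;
            checkpoint = null;
            modifiedSince = null;
            return changes;
        }
    }

    public static void main(String[] args) {
        CheckpointedMap map = new CheckpointedMap();
        map.put("a", "1");
        map.put("b", "2");
        map.beginCheckpoint();               // replica copy starts
        map.put("b", "3");                   // modified during the copy
        map.put("c", "4");
        System.out.println(map.endCheckpoint().size()); // 2: only b and c are tracked
    }
}
```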

Minimum synchronous replica shards

When a primary prepares to commit data, it checks how many synchronous replica shards voted to commit the transaction. If the transaction processes normally on a replica, that replica votes to commit; if something goes wrong on the replica, it votes not to commit. Before a primary commits, the number of synchronous replica shards that vote to commit must meet the minSyncReplicas setting from the deployment policy. When the number of synchronous replica shards that vote to commit is too low, the primary does not commit the transaction and an error results. This action ensures that the required number of synchronous replicas are available with the correct data. Synchronous replicas that encountered errors reregister to fix their state. For more information about reregistering, see Replica shard recovery.

The primary throws a ReplicationVotedToRollbackTransactionException error if too few synchronous replicas voted to commit.
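The vote check can be sketched as follows. The exception name matches the error the text describes; the surrounding class and method names are hypothetical stand-ins, not the product API.

```java
import java.util.List;

public class VoteCheckSketch {
    static class ReplicationVotedToRollbackTransactionException extends RuntimeException {
        ReplicationVotedToRollbackTransactionException(String message) { super(message); }
    }

    // Each synchronous replica votes true (commit) or false (roll back).
    static void commitIfQuorum(List<Boolean> votes, int minSyncReplicas) {
        long commitVotes = votes.stream().filter(v -> v).count();
        if (commitVotes < minSyncReplicas) {
            throw new ReplicationVotedToRollbackTransactionException(
                commitVotes + " of " + minSyncReplicas
                    + " required synchronous replicas voted to commit");
        }
        // Enough replicas voted to commit: the primary commits the transaction.
        // Replicas that voted not to commit reregister to fix their state.
    }

    public static void main(String[] args) {
        commitIfQuorum(List.of(true, true), 1);   // quorum met: commits
        try {
            commitIfQuorum(List.of(false), 1);    // too few votes: error results
        } catch (ReplicationVotedToRollbackTransactionException e) {
            System.out.println("rolled back: " + e.getMessage());
        }
    }
}
```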

Replication and Loaders

Normally, a primary shard writes changes synchronously through the Loader to a database, so the Loader and database stay in sync. When the primary fails over to a replica shard, however, the replica and the database might not be in sync. For example:

  • The primary can send the transaction to the replica and then fail before committing to the database.
  • The primary can commit to the database and then fail before sending to the replica.

Either failure leaves the replica one transaction ahead of or behind the database. This situation is not acceptable. eXtreme Scale uses a special protocol and a contract with the Loader implementation to solve this issue without two-phase commit. The protocol follows:

Primary side

  • Send the transaction along with the previous transaction outcomes.
  • Write to the database and try to commit the transaction.
  • If the database commits, then commit on eXtreme Scale. If the database does not commit, then roll back the transaction.
  • Record the outcome.
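The primary-side steps above can be sketched with hypothetical classes (this is not the Loader SPI): each send piggybacks the outcomes of the previous transactions, and the shard commit follows the database commit.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PrimarySideSketch {
    final Map<String, Boolean> pendingOutcomes = new HashMap<>(); // recorded, not yet sent
    final List<String> database = new ArrayList<>();              // stand-in for the database
    final List<String> shard = new ArrayList<>();                 // in-memory primary state
    final List<String> sentToReplica = new ArrayList<>();         // what the replica received

    void commit(String txnId, boolean databaseCommits) {
        // 1. Send the transaction along with the previous transaction outcomes.
        sentToReplica.add(txnId + " outcomes=" + pendingOutcomes);
        pendingOutcomes.clear();
        // 2. Write to the database and try to commit the transaction.
        if (databaseCommits) {
            database.add(txnId);
            shard.add(txnId);                  // 3. database committed: commit on the shard
            pendingOutcomes.put(txnId, true);  // 4. record the outcome for the next send
        } else {
            pendingOutcomes.put(txnId, false); // 3.+4. database failed: roll back, record it
        }
    }

    public static void main(String[] args) {
        PrimarySideSketch primary = new PrimarySideSketch();
        primary.commit("t1", true);
        primary.commit("t2", true);
        System.out.println(primary.sentToReplica.get(1)); // t2 outcomes={t1=true}
    }
}
```

The design choice to note: the outcome travels with the *next* transaction, so the replica always learns the fate of what it has buffered without an extra round trip.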

Replica side

  • Receive a transaction and buffer it.
  • For all outcomes sent with the transaction, commit any buffered transactions that committed and discard any that rolled back.

Replica side on failover

  • For all buffered transactions, provide the transactions to the Loader, which attempts to commit them.
  • The Loader must be written so that each transaction is idempotent.
  • If a transaction is already in the database, the Loader performs no operation.
  • If a transaction is not in the database, the Loader applies it.
  • After all transactions are processed, then the new primary can begin to serve requests.
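The replica-side and failover steps above can be sketched together. Class names are hypothetical; the essential point is the idempotence contract on the Loader, which makes replaying a possibly-already-committed transaction safe.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

public class ReplicaFailoverSketch {
    static class IdempotentLoader {
        final Set<String> database = new LinkedHashSet<>(); // committed transaction ids

        void apply(String txnId) {
            if (database.contains(txnId)) return;           // already in the database: no-op
            database.add(txnId);                            // not in the database: apply it
        }
    }

    static class ReplicaSide {
        final Set<String> buffered = new LinkedHashSet<>(); // received, outcome still unknown

        // Receive a transaction plus the outcomes of earlier transactions.
        void receive(String txnId, Map<String, Boolean> outcomes) {
            buffered.add(txnId);
            outcomes.forEach((id, committed) -> buffered.remove(id)); // commit or discard
        }

        // On failover: replay every still-buffered transaction through the Loader.
        void promote(IdempotentLoader loader) {
            buffered.forEach(loader::apply);                // safe because apply is idempotent
            buffered.clear();                               // new primary can now serve requests
        }
    }

    public static void main(String[] args) {
        IdempotentLoader loader = new IdempotentLoader();
        ReplicaSide replica = new ReplicaSide();
        replica.receive("t1", Map.of());
        replica.receive("t2", Map.of("t1", true)); // t1's outcome arrives with t2
        loader.database.add("t2");                 // primary committed t2 to the DB, then failed
        replica.promote(loader);                   // replaying t2 is a no-op
        System.out.println(loader.database);       // [t2]
    }
}
```

Whether the primary died before or after the database commit, the replay converges on the same database state, which is exactly the guarantee the protocol needs.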

This protocol ensures that the database is at the same level as the new primary state.