Design considerations for multi-master replication

When you implement multi-master replication, you must consider design aspects such as arbitration, linking, and performance.

Arbitration considerations in topology design

Change collisions might occur if the same records can be changed simultaneously in two places. Set up each catalog service domain to have approximately the same amount of processor, memory, and network resources. You might observe that catalog service domains that perform change collision handling (arbitration) use more resources than other catalog service domains. Collisions are detected automatically and are handled with one of two mechanisms:
  • Default collision arbiter: The default protocol is to use the changes from the lexically lowest-named catalog service domain. For example, if catalog service domains A and B generate a conflict for a record, the change from catalog service domain B is ignored: catalog service domain A keeps its version, and the record in catalog service domain B is changed to match the record from catalog service domain A. This behavior applies equally to applications in which users or sessions are normally bound to, or have affinity with, one of the data grids.
  • Custom collision arbiter: Applications can provide a custom arbiter. When a catalog service domain detects a collision, it starts the arbiter. For information about developing a useful custom arbiter, see Developing custom arbiters for multi-master replication.
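The default lexical rule can be modeled in a few lines. The following Python sketch is purely illustrative; the function name and the record representation are assumptions for this example, not part of the product API:

```python
def default_arbitrate(revision_a, revision_b):
    """Model of the default collision arbiter: the change from the
    lexically lowest-named catalog service domain wins, and the other
    domain's record is overwritten with the winner's version."""
    # Each revision is modeled as a (domain_name, record_value) pair.
    return min(revision_a, revision_b, key=lambda rev: rev[0])

# Domains "A" and "B" change the same record concurrently;
# domain A's change wins because "A" sorts before "B".
print(default_arbitrate(("A", "price=10"), ("B", "price=12")))
# → ('A', 'price=10')
```

A custom arbiter replaces this fixed rule with application logic, for example merging field values or preferring the most recent timestamp.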
For topologies in which collisions are possible, consider implementing a hub-and-spoke topology or a tree topology. These two topologies help avoid a cascade of collisions, which can arise in the following sequence:
  1. Multiple catalog service domains experience a collision.
  2. Each catalog service domain handles the collision locally, producing revisions.
  3. Those revisions collide with each other, resulting in revisions of revisions.

To avoid such cascades, choose a specific catalog service domain, called the arbitration catalog service domain, as the collision arbiter for a subset of catalog service domains. For example, a hub-and-spoke topology might use the hub as the collision handler. The spoke catalog service domains ignore any collisions that they detect; the hub catalog service domain creates the revisions, preventing unexpected collision revisions. The catalog service domain that is assigned to handle collisions must link to all of the domains for which it is responsible. In a tree topology, the internal parent domains handle collisions for their immediate children. In contrast, if you use a ring topology, you cannot designate one catalog service domain in the ring as the arbiter.

The following table summarizes the arbitration approaches that are most compatible with various topologies.
Table 1. Arbitration approaches. This table states whether application arbitration is compatible with various topologies.
Topology | Application arbitration? | Notes
A line of two catalog service domains | Yes | Choose one catalog service domain as the arbiter.
A line of three catalog service domains | Yes | The middle catalog service domain must be the arbiter. Think of the middle catalog service domain as the hub in a simple hub-and-spoke topology.
A line of more than three catalog service domains | No | Application arbitration is not supported.
A hub with N spokes | Yes | The hub, which has links to all spokes, must be the arbitration catalog service domain.
A ring of N catalog service domains | No | Application arbitration is not supported.
An acyclic, directed tree (n-ary tree) | Yes | Each parent node must arbitrate for its immediate children only.

Linking considerations in topology design

Ideally, a topology includes the minimum number of links while optimizing trade-offs among change latency, fault tolerance, and performance characteristics.
  • Change latency

    Change latency is determined by the number of intermediate catalog service domains a change must go through before arriving at a specific catalog service domain.

    A topology has the best change latency when it eliminates intermediate catalog service domains by linking every catalog service domain to every other catalog service domain. However, a catalog service domain must perform replication work in proportion to its number of links. For large topologies, the sheer number of links to be defined can cause an administrative burden.

    The speed at which a change is copied to other catalog service domains depends on additional factors, such as:
    • Processor and network bandwidth on the source catalog service domain
    • The number of intermediate catalog service domains and links between the source and target catalog service domain
    • The processor and network resources available to the source, target, and intermediate catalog service domains
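Since change latency grows with the number of intermediate domains, it can be approximated as the shortest-path hop count between two domains in the link graph. The following Python sketch is an illustrative model under that assumption; the names are hypothetical, not product APIs:

```python
from collections import deque

def hops(links, source, target):
    """Minimum number of links a change must traverse between two
    catalog service domains, computed by breadth-first search."""
    seen, queue = {source}, deque([(source, 0)])
    while queue:
        domain, dist = queue.popleft()
        if domain == target:
            return dist
        for neighbor in links.get(domain, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None  # no path: changes cannot propagate at all

# A line topology A <-> B <-> C <-> D: a change from A crosses
# three links (via intermediates B and C) before reaching D.
line = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
print(hops(line, "A", "D"))  # → 3
```

Fully linking every domain to every other domain reduces every hop count to 1, at the cost of more links (and replication work) per domain.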
  • Fault tolerance

    Fault tolerance is determined by how many paths exist between two catalog service domains for change replication.

    If there is only one path between a given pair of catalog service domains, the failure of any link on that path stops the propagation of changes between them. Likewise, if that single path passes through intermediate catalog service domains, changes are not propagated while any of those intermediate domains is down.

    Consider the line topology with four catalog service domains A, B, C, and D:

    Line topology

    Domain D does not receive changes from domain A if any of the following conditions holds:
    • Domain A is up and B is down
    • Domains A and B are up and C is down
    • The link between A and B is down
    • The link between B and C is down
    • The link between C and D is down
    In contrast, with a ring topology, each catalog service domain can receive changes from either direction.

    Ring topology

    For example, if a given catalog service domain in your ring topology is down, the two adjacent domains can still pull changes directly from each other.
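    This difference between the line and ring topologies can be checked with a small reachability model. The following Python sketch is illustrative only; the function and variable names are assumptions for this example:

```python
def reachable(links, source, target, down=()):
    """Check whether changes can still flow from source to target
    when the catalog service domains in `down` are unavailable."""
    seen, stack = set(down), [source]
    while stack:
        domain = stack.pop()
        if domain == target:
            return True
        if domain in seen:
            continue
        seen.add(domain)
        stack.extend(links.get(domain, ()))
    return False

line = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
ring = {"A": ["B", "D"], "B": ["A", "C"],
        "C": ["B", "D"], "D": ["C", "A"]}

print(reachable(line, "A", "D", down=["B"]))  # → False (line is cut)
print(reachable(ring, "A", "D", down=["B"]))  # → True (routes around B)
```

    In the line, losing domain B partitions A from C and D; in the ring, the change simply travels the other way around.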

    In a hub-and-spoke topology, all changes are propagated through the hub. Unlike the line and ring topologies, the hub-and-spoke design is therefore susceptible to a breakdown if the hub fails.

    Hub-and-spoke topology

    A single catalog service domain is resilient to a certain amount of service loss. However, larger failures such as wide network outages or loss of links between physical data centers can disrupt any of your catalog service domains.

  • Linking and performance

    The number of links that are defined on a catalog service domain affects performance. More links use more resources, and replication performance can drop as a result. The ability of a domain to retrieve changes through other domains effectively offloads it from replicating its transactions everywhere. The change distribution load on a domain is limited by the number of links it uses, not by how many domains are in the topology. This property provides scalability, because the domains in the topology share the burden of change distribution.

    A catalog service domain can retrieve changes indirectly through other catalog service domains. Consider a line topology with five catalog service domains.
    A <=> B <=> C <=> D <=> E
    • A pulls changes from B, C, D, and E through B
    • B pulls changes from A and C directly, and changes from D and E through C
    • C pulls changes from B and D directly, and changes from A through B and E through D
    • D pulls changes from C and E directly, and changes from A and B through C
    • E pulls changes from D directly, and changes from A, B, and C through D

    The distribution load on catalog service domains A and E is lowest, because they each have a link only to a single catalog service domain. Domains B, C, and D each have a link to two domains. Thus, the distribution load on domains B, C, and D is double the load on domains A and E. The workload depends on the number of links in each domain, not on the overall number of domains in the topology. Thus, the described distribution of loads would remain constant, even if the line contained 1000 domains.
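    The per-domain load in a line topology can be made concrete by counting links. The following Python sketch is an illustrative model, not product code, and the names are assumptions:

```python
def link_counts(domains):
    """Number of replication links per domain in a line topology:
    each endpoint has one link, every interior domain has two."""
    line = {domain: [] for domain in domains}
    for left, right in zip(domains, domains[1:]):
        line[left].append(right)
        line[right].append(left)
    return {domain: len(neighbors) for domain, neighbors in line.items()}

print(link_counts(list("ABCDE")))
# → {'A': 1, 'B': 2, 'C': 2, 'D': 2, 'E': 1}
# Endpoints A and E carry one link; B, C, and D carry two.
# A 1000-domain line gives the same per-domain figures: the load
# depends on each domain's links, not on the topology's size.
```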

Multi-master replication performance considerations

Take the following limitations into account when using multi-master replication topologies: