Catalog server quorums

Quorum is the minimum number of catalog servers that must be available to conduct placement operations for the grid. Unless quorum has been overridden, this minimum is the full set of catalog servers.

Important terms

The following is a list of terms related to quorum considerations for eXtreme Scale.

Topology

This section explains how IBM WebSphere eXtreme Scale operates across a network that includes unreliable components. Examples of such a network would include a network spanning multiple data centers.

IP address space

WebSphere eXtreme Scale requires a network where any addressable element on the network can connect to any other addressable element on the network unimpeded. This means WebSphere eXtreme Scale requires a flat IP address space, and it requires all firewalls to allow traffic to flow between the IP addresses and ports used by the Java virtual machines (JVMs) hosting elements of WebSphere eXtreme Scale.

Connected LANs

Each LAN is assigned a zone identifier for WebSphere eXtreme Scale purposes. WebSphere eXtreme Scale aggressively heartbeats JVMs within a single zone, and a missed heartbeat results in a failover event as long as the catalog service has quorum.

Catalog service grid and container servers

A grid is a collection of similar JVMs. A catalog service is a grid composed of catalog servers, and is fixed in size. However, the number of container servers is dynamic. Container servers can be added and removed on demand. In a three-data-center configuration, WebSphere eXtreme Scale requires one catalog service JVM per data center.

The catalog service grid uses a full quorum mechanism. This means that all members of the grid must agree on any action.
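
The full-quorum rule can be modeled with a short sketch (hypothetical code, not the product's implementation): quorum holds only while every member of the static catalog server set is reachable, unless an administrator overrides quorum to shrink the membership to the survivors.

```python
# Hypothetical model of the catalog service's full-quorum rule.
class CatalogQuorum:
    def __init__(self, members):
        self.members = set(members)      # static membership, fixed at startup
        self.reachable = set(members)

    def member_lost(self, name):
        self.reachable.discard(name)

    def member_rejoined(self, name):
        self.reachable.add(name)

    def override(self):
        # The administrator asserts the missing members have truly failed;
        # the surviving members now constitute the full membership.
        self.members = set(self.reachable)

    def has_quorum(self):
        # Full quorum: every member of the (possibly overridden) set is reachable.
        return self.reachable == self.members
```

For example, with three catalog servers, losing any one of them drops quorum until that server rejoins or quorum is overridden.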

Container server JVMs are tagged with a zone identifier. The grid of container JVMs is automatically broken into small core groups of JVMs. A core group includes only JVMs from the same zone; JVMs from different zones are never in the same core group.

A core group aggressively tries to detect the failure of its member JVMs. The container JVMs of a core group must never span multiple LANs connected by wide-area-network-style links. This means that a core group cannot include containers in the same zone that run in different data centers.
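
As an illustration of these grouping rules (the function and the group size limit are assumptions, not the product's internals), core group formation can be sketched as bucketing containers by zone and then splitting each zone into bounded groups:

```python
# Illustrative sketch: container JVMs are bucketed by zone, then each
# zone is split into core groups of a bounded size. A core group never
# mixes zones. The max_group_size value is an assumption.
from collections import defaultdict

def form_core_groups(containers, max_group_size=20):
    """containers: iterable of (jvm_name, zone) pairs."""
    by_zone = defaultdict(list)
    for jvm, zone in containers:
        by_zone[zone].append(jvm)
    groups = []
    for zone, jvms in by_zone.items():
        # split each zone's containers into bounded, same-zone groups
        for i in range(0, len(jvms), max_group_size):
            groups.append({"zone": zone, "members": jvms[i:i + max_group_size]})
    return groups
```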

Server life cycle

Catalog server startup

The catalog servers are started using the startOgServer command. The quorum mechanism is disabled by default. To enable quorum, either pass the -quorum enabled flag on the startOgServer command or add the enableQuorum=true property to the property file. All of the catalog servers must be given the same quorum setting.

# bin/startOgServer cat0 -serverProps objectGridServer.properties

objectGridServer.properties file
catalogClusterEndPoints=cat0:cat0.domain.com:6600:6601,
cat1:cat1.domain.com:6600:6601
catalogServiceEndPoints=cat0.domain.com:2809,cat1.domain.com:2809
enableQuorum=true

Container server startup

The container servers are started using the startOgServer command. When running a grid across data centers, the servers must use the zone tag to identify the data center in which they reside. Setting the zone on the grid servers allows WebSphere eXtreme Scale to monitor the health of the servers scoped to the data center, minimizing cross-data-center traffic.

# bin/startOgServer gridA0 -serverProps objectGridServer.properties -objectgridfile xml/objectgrid.xml -deploymentpolicyfile xml/deploymentpolicy.xml

objectGridServer.properties file
catalogServiceEndPoints=cat0.domain.com:2809,cat1.domain.com:2809
zoneName=ZoneA

Grid server shutdown

The grid servers are stopped using the stopOgServer command. When shutting down an entire data center for maintenance, pass the list of all the servers that belong to that zone. This allows a clean transition of state from the zone being torn down to the surviving zone or zones.

# bin/stopOgServer gridA0,gridA1,gridA2 -catalogServiceEndPoints cat0.domain.com:2809,cat1.domain.com:2809

Failure detection

WebSphere eXtreme Scale detects process death through abnormal socket closure events. The catalog service will be told immediately when a process terminates. A black out is detected through missed heartbeats. WebSphere eXtreme Scale protects itself against brown out conditions across data centers by using a quorum implementation.

Heartbeating implementation

This section describes how liveness checking is implemented in WebSphere eXtreme Scale.

Core group member heartbeating

The catalog service places container JVMs into core groups of a limited size. A core group tries to detect the failure of its members using two methods. If a JVM's socket is closed, that JVM is regarded as dead. Each member also heartbeats over these sockets at a rate determined by configuration. If a JVM does not respond to these heartbeats within a configured maximum period of time, the JVM is regarded as dead.
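
A simplified sketch of this two-pronged detection follows (the class, interval, and timeout values are illustrative, not an eXtreme Scale API):

```python
# Sketch of failure detection within a core group: a peer is regarded
# as dead when its socket closes abnormally or when it misses
# heartbeats past a configured timeout. Values are illustrative.
import time

class PeerMonitor:
    def __init__(self, timeout_seconds=30.0, clock=time.monotonic):
        self.timeout = timeout_seconds
        self.clock = clock
        self.last_heard = {}

    def heartbeat_received(self, peer):
        self.last_heard[peer] = self.clock()

    def socket_closed(self, peer):
        # Abnormal socket closure: regard the peer as dead immediately.
        self.last_heard[peer] = float("-inf")

    def failed_peers(self):
        now = self.clock()
        return [p for p, t in self.last_heard.items()
                if now - t > self.timeout]
```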

A single member of a core group is always elected to be the leader. The core group leader (CGL) is responsible for periodically telling the catalog service that the core group is alive and for reporting any membership changes to the catalog service. A membership change can be a JVM failing or a newly added JVM joining the core group.

If the core group leader cannot contact any member of the catalog service grid then it will continue to retry.

Catalog service grid heartbeating

The catalog service looks like a private core group with a static membership and a quorum mechanism. It detects failures the same way as a normal core group. However, the behavior is modified to include quorum logic. The catalog service also uses a less aggressive heartbeating configuration.

Core group heartbeating

The catalog service needs to know when container servers fail. Each core group is responsible for determining container JVM failure and reporting this to the catalog service through the core group leader. The complete failure of all members of a core group is also a possibility. If the entire core group has failed, it is the responsibility of the catalog service to detect this loss.

If the catalog service marks a container JVM as failed and the container is later reported as alive, the container JVM will be told to shut down its WebSphere eXtreme Scale container servers. A JVM in this state is not visible in xsadmin queries. Messages in the logs of the container JVM indicate that this has happened. These JVMs need to be manually restarted.

If a quorum loss event has occurred, heartbeating is suspended until quorum is reestablished.

Catalog service quorum behavior

Normally, the members of the catalog service have full connectivity. The catalog service grid is a static set of JVMs. WebSphere eXtreme Scale expects all members of the catalog service to be online at all times. The catalog service responds to container events only while it has quorum.

If the catalog service loses quorum, it will wait for quorum to be reestablished. While the catalog service does not have quorum, it will ignore events from container servers. Container servers will retry any requests rejected by the catalog server during this time as WebSphere eXtreme Scale expects quorum to be reestablished.

The following message indicates that quorum has been lost. Look for this message in your catalog service logs.

CWOBJ1254W: The catalog service is waiting for quorum.

WebSphere eXtreme Scale expects to lose quorum for the following reasons:

  • Catalog service JVM member fails
  • Network brown out
  • Data center loss

Stopping a catalog server instance using stopOgServer does not cause loss of quorum because the system knows the server instance has stopped, which is different from a JVM failure or brown out.

Quorum loss from JVM failure

A catalog server that fails causes quorum to be lost. In this case, quorum should be overridden as quickly as possible. The failed catalog server cannot rejoin the grid until quorum has been overridden.

Quorum loss from network brown out

WebSphere eXtreme Scale is designed to expect the possibility of brown outs. A brown out is a temporary loss of connectivity between data centers. Brown outs are usually transient and should clear within a matter of seconds or minutes. While WebSphere eXtreme Scale tries to maintain normal operation during the brown out period, a brown out is regarded as a single failure event. The failure is expected to be fixed, after which normal operation resumes with no WebSphere eXtreme Scale actions necessary.

A long-duration brown out can be classified as a black out only through user intervention. Overriding quorum on one side of the brown out is required for the event to be classified as a black out.

Catalog service JVM cycling

If a catalog server is stopped by using stopOgServer, the quorum requirement drops by one server, so the remaining servers still have quorum. Restarting the catalog server raises the quorum requirement back to the previous number.
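
The difference between a clean stop and a failure can be sketched as follows (an illustrative model, not product code): a clean stop shrinks the required membership, while a crash leaves it unchanged and so costs quorum.

```python
# Illustrative model: a clean stop (stopOgServer) shrinks the quorum
# requirement by one, so the survivors still have quorum; a restart
# restores the requirement. A crash leaves the requirement unchanged.
class QuorumTracker:
    def __init__(self, members):
        self.required = set(members)   # servers that must be reachable
        self.alive = set(members)

    def clean_stop(self, name):
        self.required.discard(name)
        self.alive.discard(name)

    def crash(self, name):
        self.alive.discard(name)       # requirement is unchanged

    def restart(self, name):
        self.required.add(name)
        self.alive.add(name)

    def has_quorum(self):
        return self.required <= self.alive
```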

Consequences of lost quorum

If a container JVM fails while quorum is lost, recovery does not take place until the brown out recovers or, in the case of a black out, the customer issues an override quorum command. WebSphere eXtreme Scale regards a quorum loss event combined with a container failure as a double failure, which is a rare event. Because of this, applications may lose write access to data that was stored on the failed JVM until quorum is restored, at which time normal recovery takes place.

Similarly, if you attempt to start a container during a quorum loss event, the container will not start.

Full client connectivity is allowed during quorum loss. If no container failures or connectivity issues happen during the quorum loss event then clients can still fully interact with the container servers.

If a brown out occurs then some clients may not have access to primary or replica copies of the data until the brown out clears.

New clients can still be started: because there should be a catalog service JVM in each data center, at least one catalog service JVM can be reached by a client even during a brown out event.

Quorum recovery

If quorum is lost for any reason, when quorum is reestablished, a recovery protocol is executed. When the quorum loss event occurs, all liveness checking for core groups is suspended and failure reports are also ignored. Once quorum is back then the catalog service does a liveness check of all core groups to immediately determine their membership. Any shards previously hosted on container JVMs reported as failed will be recovered at this point. If primary shards were lost then surviving replicas will be promoted to primaries. If replica shards were lost then additional replicas will be created on the survivors.
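
The recovery steps can be sketched as a placement pass (illustrative data structures and names; real placement is considerably more involved): promote a surviving replica wherever a primary was lost, then re-create lost replicas on survivors.

```python
# Illustrative sketch of post-quorum-recovery placement. Assumes a
# suitable surviving JVM exists for every replacement replica.
def recover_placement(partitions, failed, survivors):
    """partitions: dicts with 'primary' (str) and 'replicas' (list of str)."""
    actions = []
    for p in partitions:
        target = 1 + len(p["replicas"])        # shard count to restore
        p["replicas"] = [r for r in p["replicas"] if r not in failed]
        if p["primary"] in failed:
            # promote a surviving replica to primary
            p["primary"] = p["replicas"].pop(0)
            actions.append(("promote", p["primary"]))
        while 1 + len(p["replicas"]) < target:
            # create a replacement replica on a survivor not already used
            host = next(s for s in survivors
                        if s != p["primary"] and s not in p["replicas"])
            p["replicas"].append(host)
            actions.append(("new-replica", host))
    return actions
```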

Overriding quorum

Overriding quorum should be used only when a data center failure has occurred. Quorum loss due to a catalog service JVM failure or a network brown out should recover automatically once the catalog service JVM is restarted or the network brown out clears.

Administrators are the only ones with knowledge of a data center failure. WebSphere eXtreme Scale treats a brown out and a black out similarly. You must inform the eXtreme Scale environment of such failures by using the xsadmin command to override quorum. This tells the catalog service to assume that quorum is achieved with the current membership, and full recovery takes place. When issuing an override quorum command, you are guaranteeing that the JVMs in the failed data center have truly failed and will not recover.

The following list considers some scenarios for overriding quorum. Say you have three catalog servers: A, B, and C.
  • Brown out: Say you have a brown out where C is isolated temporarily. The catalog service will lose quorum and wait for the brown out to clear at which point C rejoins the catalog service grid and quorum is reestablished. Your application sees no problems during this time.
  • Temporary failure: Here, C fails and the catalog service loses quorum, so you must override quorum. Once quorum is reestablished, C can be restarted. C will rejoin the catalog service grid when it is restarted. The application sees no problems during this time.
  • Data center failure: You verify that the data center actually failed and that it has been isolated on the network. Then you issue the xsadmin override quorum command. The surviving two data centers do full recovery by replacing shards that were hosted in the failed data center. The catalog service is now running with a full quorum of A and B. The application may see delays or exceptions during the interval between the start of the black out and when quorum is overridden. Once quorum is overridden, the grid recovers and normal operation resumes.
  • Data center recovery: The surviving data centers are already running with quorum overridden. When the data center containing C is restarted, all JVMs in the data center must be restarted. Then C will rejoin the existing catalog service grid and quorum will revert to the normal situation with no user intervention.
  • Data center failure and brown out: The data center containing C fails. Quorum is overridden and recovery completes on the remaining data centers. If a brown out between A and B occurs, the normal brown out recovery rules apply. Once the brown out clears, quorum is reestablished and the necessary recovery from the quorum loss occurs.

Container behavior

This section describes how the container server JVMs behave while quorum is lost and recovered.

Containers host one or more shards. Shards are either primaries or replicas for a specific partition. The catalog service assigns shards to a container and the container will honor that assignment until new instructions arrive from the catalog service. This means that if a primary shard in a container cannot communicate with a replica shard because of a brown out then it will continue to retry until it receives new instructions from the catalog service.

Synchronous replica behavior

While the connection is broken the primary can accept new transactions as long as there are at least as many replicas online as the minsync property for the map set. If any new transactions are processed on the primary while the link to the synchronous replica is broken, the replica will be cleared and resynchronized with the current state of the primary when the link is reestablished.
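
This gate and the reattach behavior can be sketched as follows (a hypothetical model; `minsync` refers to the map set property mentioned above). For simplicity, the sketch always resynchronizes a reattaching replica, whereas the product only clears a replica if transactions were processed while the link was down.

```python
# Sketch of the synchronous-replication gate: a primary accepts new
# transactions only while at least `minsync` sync replicas are online.
class Primary:
    def __init__(self, sync_replicas, minsync):
        self.online = set(sync_replicas)
        self.minsync = minsync
        self.needs_resync = set()

    def replica_link_broken(self, replica):
        self.online.discard(replica)
        # Simplification: always resynchronize on reattach.
        self.needs_resync.add(replica)

    def replica_reattached(self, replica):
        resync = replica in self.needs_resync
        self.needs_resync.discard(replica)
        self.online.add(replica)
        return resync   # True: clear replica, copy current primary state

    def can_commit(self):
        # Accept new transactions only while enough sync replicas are online.
        return len(self.online) >= self.minsync
```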

Synchronous replication is strongly discouraged between data centers or over a WAN-style link.

Asynchronous replica behavior

While the connection is broken, the primary can accept new transactions. The primary buffers the changes up to a limit. If the connection with the replica is reestablished before that limit is reached, the replica is updated with the buffered changes. If the limit is reached, the primary destroys the buffered list, and when the replica reattaches it is cleared and resynchronized.
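
The bounded buffer can be sketched as follows (an illustrative model; the limit value and names are assumptions, not product internals):

```python
# Sketch of the asynchronous replication buffer: while the replica link
# is down the primary buffers changes up to a limit; past the limit the
# buffered list is destroyed and the replica must fully resynchronize
# when it reattaches.
class AsyncReplicaLink:
    def __init__(self, limit=1000):
        self.limit = limit
        self.buffer = []
        self.overflowed = False

    def record_change(self, change):
        if self.overflowed:
            return
        if len(self.buffer) >= self.limit:
            self.buffer = []           # destroy the buffered list
            self.overflowed = True
        else:
            self.buffer.append(change)

    def reattach(self):
        if self.overflowed:
            self.buffer, self.overflowed = [], False
            return "full-resync"       # clear and copy primary state
        pending, self.buffer = self.buffer, []
        return pending                 # replay buffered changes
```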

Client behavior

Clients are always able to connect to a catalog server to bootstrap to the grid whether the catalog service grid has quorum or not. The client tries to connect to any catalog server instance to obtain a route table and then interacts with the grid. Network conditions may prevent the client from interacting with some partitions. The client may connect to local replicas for remote data if it has been configured to do so. Clients will not be able to update data if the primary partition for that data is not available.

SSL disabled

You are strongly encouraged to specify a transaction timeout in the ObjectGrid descriptor XML file. The client will retry a single transaction until this timeout is exceeded. If a timeout is not specified, the client will retry indefinitely, which is not recommended, as the client thread would be blocked for an undetermined amount of time. The transaction timeout should be set to a multiple of the maximum expected transaction time.

You are also strongly encouraged to set the ORB request timeout in the orb.properties file. The client will block on a browned out socket for this maximum time. If the elapsed time from when the request is issued is still less than the transaction timeout, WebSphere eXtreme Scale will attempt the request again. The client will retry until the total elapsed time exceeds the transaction timeout. Thus, the maximum client block time is the transaction timeout plus the ORB request timeout.
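
The worst-case block time can be illustrated with a small sketch (purely illustrative; a hypothetical retry loop, not the product's client code): each attempt can block for at most the ORB request timeout, and attempts repeat while the elapsed time is still under the transaction timeout.

```python
# Sketch of the client retry timing rule described above. The worst
# case block is txn_timeout + orb_timeout, because the final attempt
# may start just before the transaction timeout expires.
def max_client_block(txn_timeout, orb_timeout):
    return txn_timeout + orb_timeout

def retry_request(attempt, txn_timeout, orb_timeout, clock):
    start = clock()
    while True:
        ok = attempt(orb_timeout)      # blocks at most orb_timeout
        if ok:
            return True
        if clock() - start >= txn_timeout:
            return False               # transaction timeout exceeded
```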

The ORB will block until the TCP stack closes the browned out socket if the ORB request timeout is not specified. This time depends on the TCP stack tuning and could be hours depending on TCP FIN_WAIT and other related parameters.

Quorum commands with xsadmin

This section describes xsadmin commands useful for quorum situations.

Querying quorum status

The quorum status of a catalog server instance can be interrogated using the xsadmin command.

xsadmin -ch cathost -p 1099 -quorumstatus

There are five possible outcomes.

  • Quorum is disabled: The catalog servers are running in quorum-disabled mode. This is a development mode or a single-data-center mode. It is not recommended for multi-data-center configurations.
  • Quorum is enabled and the catalog server has quorum: Quorum is enabled and the system is working normally.
  • Quorum is enabled but the catalog server is waiting for quorum: Quorum is enabled and quorum has been lost.
  • Quorum is enabled and the quorum is overridden: Quorum is enabled and quorum has been overridden.
  • Quorum status is outlawed: This occurs when a brown out splits the catalog service into two partitions, A and B, and catalog server A overrides quorum. When the network partition resolves, the server in the B partition is outlawed and requires a JVM restart. This status also occurs if the catalog JVM in B restarts during the brown out and the brown out then clears.

Overriding quorum

The xsadmin command can be used to override quorum. Any surviving catalog server instance can be used. All survivors are notified when one is told to override quorum. The syntax for this is as follows.

xsadmin -ch cathost -p 1099 -overridequorum

Diagnostic commands

  • Quorum status: As described in the previous section.
  • Coregroup list: This displays a list of all core groups. The members and leaders of the core groups are displayed.

    xsadmin -ch cathost -p 1099 -coregroups

  • Teardown servers: This command removes a server manually from the grid. This is not normally needed, because servers are automatically removed when they are detected as failed, but the command is provided for use with the help of IBM support.

    xsadmin -ch cathost -p 1099 -g Grid -teardown server1,server2,server3

  • Display route table: This command shows the current route table by simulating a new client connection to the grid. It also validates the route table by confirming that all container servers recognize their role in the route table, such as which type of shard they host for which partition.

    xsadmin -ch cathost -p 1099 -g myGrid -routetable

  • Display unassigned shards: If some shards cannot be placed on the grid, this command can be used to list them. This happens only when the placement service has a constraint preventing placement. For example, if you start JVMs on a single physical box while in production mode, only primary shards can be placed. Replicas remain unassigned until JVMs start on a second box, because the placement service only places replicas on JVMs with different IP addresses than the JVMs hosting the primary shards. Having no JVMs in a zone can also cause shards to be unassigned.

    xsadmin -ch cathost -p 1099 -g myGrid -unassigned

  • Set trace settings: This command sets the trace settings for all JVMs matching the filter specified on the xsadmin command. The setting only changes the trace settings until another command is used or the modified JVMs fail or stop.

    xsadmin -ch cathost -p 1099 -g myGrid -fh host1 -settracespec ObjectGrid*=event=enabled

    This enables trace for all JVMs on the box with the host name specified, in this case host1.

  • Checking map sizes: The map sizes command is useful for verifying that key distribution is uniform over the shards in the grid. If some containers have significantly more keys than others, it is likely that the hash function on the key objects has a poor distribution.

    xsadmin -ch cathost -p 1099 -g myGrid -m myMapSet -mapsizes myMap
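
The skew that uneven map sizes reveal can be illustrated with a small sketch (Python's built-in hash stands in for the Java hashCode; the functions and tolerance are illustrative):

```python
# Sketch of why -mapsizes can reveal a poor hash function: keys land in
# partitions by hash modulo the partition count, so a skewed hash
# concentrates keys in a few shards.
from collections import Counter

def partition_counts(keys, num_partitions):
    return Counter(hash(k) % num_partitions for k in keys)

def is_roughly_uniform(counts, num_partitions, tolerance=0.5):
    expected = sum(counts.values()) / num_partitions
    return all(abs(counts.get(p, 0) - expected) <= tolerance * expected
               for p in range(num_partitions))
```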