Quorum is the minimum number of catalog servers necessary to conduct placement operations for the grid. The minimum number is the full set of catalog servers unless quorum has been overridden.
The following is a list of terms related to quorum considerations for eXtreme Scale.
This section explains how IBM WebSphere eXtreme Scale operates across a network that includes unreliable components. Examples of such a network would include a network spanning multiple data centers.
IP address space
WebSphere eXtreme Scale requires a network where any addressable element can connect to any other addressable element unimpeded. This means WebSphere eXtreme Scale requires a flat IP address space, and it requires all firewalls to allow traffic to flow between the IP addresses and ports used by the Java virtual machines (JVMs) hosting elements of WebSphere eXtreme Scale.
Connected LANs
Each LAN is assigned a zone identifier for WebSphere eXtreme Scale's purposes. WebSphere eXtreme Scale aggressively heartbeats the JVMs within a single zone, and a missed heartbeat results in a failover event as long as the catalog service has quorum.
Catalog service grid and container servers
A grid is a collection of similar JVMs. A catalog service is a grid composed of catalog servers, and is fixed in size. However, the number of container servers is dynamic. Container servers can be added and removed on demand. In a three-data-center configuration, WebSphere eXtreme Scale requires one catalog service JVM per data center.
The catalog service grid uses a full quorum mechanism. This means that all members of the grid must agree on any action.
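The full quorum rule can be sketched as follows. This is an illustrative model, not product code; the function and server names are hypothetical.

```python
# Hypothetical sketch of a full-quorum check: a placement action is
# permitted only when every catalog server in the static grid is
# reachable, because all members must agree.

def has_full_quorum(catalog_servers, reachable):
    """Return True only if every catalog server is currently reachable."""
    return all(server in reachable for server in catalog_servers)

grid = {"cat0", "cat1", "cat2"}
print(has_full_quorum(grid, {"cat0", "cat1", "cat2"}))  # True: all members agree
print(has_full_quorum(grid, {"cat0", "cat2"}))          # False: cat1 unreachable
```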
Container server JVMs are tagged with a zone identifier. The grid of container JVMs is automatically broken into small core groups of JVMs. A core group only includes JVMs from the same zone; JVMs from different zones are never placed in the same core group.
A core group aggressively tries to detect the failure of its member JVMs. The container JVMs of a core group must never span multiple LANs connected by wide-area-network-style links. This means that a core group cannot have containers in the same zone running in different data centers.
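The grouping rule above (core groups are filled per zone and never mix zones) might be modeled like this. This is a sketch with made-up names, not the product's placement algorithm.

```python
# Illustrative sketch: container JVMs are partitioned into bounded core
# groups, and each group draws members from exactly one zone.

from collections import defaultdict

def form_core_groups(containers, max_group_size):
    """containers: list of (jvm_name, zone) pairs. Groups never span zones."""
    by_zone = defaultdict(list)
    for jvm, zone in containers:
        by_zone[zone].append(jvm)
    groups = []
    for zone, jvms in by_zone.items():
        # Split each zone's JVMs into chunks of at most max_group_size.
        for i in range(0, len(jvms), max_group_size):
            groups.append((zone, jvms[i:i + max_group_size]))
    return groups

containers = [("jvmA1", "ZoneA"), ("jvmA2", "ZoneA"), ("jvmA3", "ZoneA"),
              ("jvmB1", "ZoneB"), ("jvmB2", "ZoneB")]
for zone, members in form_core_groups(containers, max_group_size=2):
    print(zone, members)
```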
Catalog server startup
The catalog servers are started using the startOgServer command. The quorum mechanism is disabled by default. To enable quorum, either pass the -quorum enabled flag on the startOgServer command or add the enableQuorum=true property to the server properties file. All of the catalog servers must be given the same quorum setting.
# bin/startOgServer cat0 -serverProps objectGridServer.properties

objectGridServer.properties file:
catalogClusterEndPoints=cat0:cat0.domain.com:6600:6601,cat1:cat1.domain.com:6600:6601
catalogServiceEndPoints=cat0.domain.com:2809,cat1.domain.com:2809
enableQuorum=true
Container server startup
The container servers are started using the startOgServer command. When running a grid across data centers the servers must use the zone tag to identify the data center in which they reside. Setting the zone on the grid servers allows WebSphere eXtreme Scale to monitor health of the servers scoped to the data center, minimizing cross-data-center traffic.
# bin/startOgServer gridA0 -serverProps objectGridServer.properties -objectgridfile xml/objectgrid.xml -deploymentpolicyfile xml/deploymentpolicy.xml

objectGridServer.properties file:
catalogServiceEndPoints=cat0.domain.com:2809,cat1.domain.com:2809
zoneName=ZoneA
Grid server shutdown
The grid servers are stopped using the stopOgServer command. When shutting down an entire data center for maintenance, pass the list of all the servers that belong to that zone. This allows a clean transition of state from the zone being torn down to the surviving zone or zones.
# bin/stopOgServer gridA0,gridA1,gridA2 -catalogServiceEndPoints cat0.domain.com:2809,cat1.domain.com:2809
Failure detection
WebSphere eXtreme Scale detects process death through abnormal socket closure events. The catalog service will be told immediately when a process terminates. A black out is detected through missed heartbeats. WebSphere eXtreme Scale protects itself against brown out conditions across data centers by using a quorum implementation.
This section describes how liveness checking is implemented in WebSphere eXtreme Scale.
Core group member heartbeating
The catalog service places container JVMs into core groups of a limited size. A core group tries to detect the failure of its members using two methods. If a JVM's socket is closed, that JVM is regarded as dead. Each member also heartbeats over these sockets at a rate determined by configuration. If a JVM does not respond to these heartbeats within a configured maximum period of time, then the JVM is regarded as dead.
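The heartbeat-timeout half of this check can be sketched as below. The function name and the interval values are illustrative assumptions, not product configuration.

```python
# Hedged sketch of heartbeat-based failure detection: a member is marked
# dead when no heartbeat has arrived within the configured maximum
# silence interval.

def find_failed(last_heartbeat, now, max_silence):
    """Return members whose last heartbeat is older than max_silence seconds."""
    return {member for member, t in last_heartbeat.items() if now - t > max_silence}

last_heartbeat = {"jvm1": 100.0, "jvm2": 97.0, "jvm3": 88.0}
print(find_failed(last_heartbeat, now=100.0, max_silence=10.0))  # {'jvm3'}
```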
A single member of a core group is always elected to be the leader. The core group leader (CGL) is responsible for periodically telling the catalog service that the core group is alive and for reporting any membership changes. A membership change can be a JVM failing or a newly added JVM joining the core group.
If the core group leader cannot contact any member of the catalog service grid then it will continue to retry.
Catalog service grid heartbeating
The catalog service looks like a private core group with a static membership and a quorum mechanism. It detects failures the same way as a normal core group. However, the behavior is modified to include quorum logic. The catalog service also uses a less aggressive heartbeating configuration.
Core group heartbeating
The catalog service needs to know when container servers fail. Each core group is responsible for determining container JVM failure and reporting this to the catalog service through the core group leader. The complete failure of all members of a core group is also a possibility. If the entire core group has failed, it is the responsibility of the catalog service to detect this loss.
If the catalog service marks a container JVM as failed and the container is later reported as alive, the container JVM is instructed to shut down its WebSphere eXtreme Scale container servers. A JVM in this state is not visible in xsadmin queries. Messages in the logs of the container JVM indicate that this has happened. These JVMs need to be manually restarted.
If a quorum loss event has occurred, heartbeating is suspended until quorum is reestablished.
Normally, the members of the catalog service have full connectivity. The catalog service grid is a static set of JVMs. WebSphere eXtreme Scale expects all members of the catalog service to be online always. The catalog service will only respond to container events while the catalog service has quorum.
If the catalog service loses quorum, it will wait for quorum to be reestablished. While the catalog service does not have quorum, it will ignore events from container servers. Container servers will retry any requests rejected by the catalog server during this time as WebSphere eXtreme Scale expects quorum to be reestablished.
The following message indicates that quorum has been lost. Look for this message in your catalog service logs.
CWOBJ1254W: The catalog service is waiting for quorum.
WebSphere eXtreme Scale expects to lose quorum for the following reasons:
Stopping a catalog server instance using stopOgServer does not cause loss of quorum because the system knows the server instance has stopped, which is different from a JVM failure or brown out.
Quorum loss from JVM failure
A catalog server that fails causes quorum to be lost. In this case, quorum should be overridden as quickly as possible. The failed catalog server cannot rejoin the grid until quorum has been overridden.
Quorum loss from network brown out
WebSphere eXtreme Scale is designed to expect the possibility of brown outs. A brown out is when a temporary loss of connectivity occurs between data centers. This is usually transient in nature and brown outs should clear within a matter of seconds or minutes. While WebSphere eXtreme Scale tries to maintain normal operation during the brown out period, a brown out is regarded as a single failure event. The failure is expected to be fixed and then normal operation resumes with no WebSphere eXtreme Scale actions necessary.
A long-duration brown out can be classified as a black out only through user intervention. Overriding quorum on one side of the brown out is required in order for the event to be classified as a black out.
Catalog service JVM cycling
If a catalog server is stopped by using stopOgServer, then the quorum requirement drops by one server, so the remaining servers still have quorum. Restarting the catalog server raises the quorum requirement back to the previous number.
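The difference between a clean stop and a crash can be modeled as follows. This is an illustrative sketch with invented class and method names, not product code.

```python
# Illustrative model of the clean-shutdown rule: a server stopped with
# stopOgServer lowers the quorum requirement, so the survivors keep
# quorum; a JVM crash leaves the requirement unchanged.

class CatalogGrid:
    def __init__(self, servers):
        self.required = set(servers)   # servers that count toward quorum
        self.online = set(servers)

    def clean_stop(self, server):      # stopOgServer: lowers the requirement
        self.required.discard(server)
        self.online.discard(server)

    def restart(self, server):         # rejoining raises the requirement again
        self.required.add(server)
        self.online.add(server)

    def crash(self, server):           # JVM failure: requirement unchanged
        self.online.discard(server)

    def has_quorum(self):
        return self.required <= self.online

grid = CatalogGrid(["cat0", "cat1", "cat2"])
grid.clean_stop("cat1")
print(grid.has_quorum())   # True: the requirement dropped with the server
grid.crash("cat2")
print(grid.has_quorum())   # False: a crash does not lower the requirement
```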
Consequences of lost quorum
If a container JVM were to fail while quorum is lost, recovery does not take place until the brown out clears or, in the case of a black out, an administrator issues an override quorum command. WebSphere eXtreme Scale regards a quorum loss event combined with a container failure as a double failure, which is a rare event. Applications may lose write access to data that was stored on the failed JVM until quorum is restored, at which time normal recovery takes place.
Similarly, if you attempt to start a container during a quorum loss event, the container will not start.
Full client connectivity is allowed during quorum loss. If no container failures or connectivity issues happen during the quorum loss event then clients can still fully interact with the container servers.
If a brown out occurs then some clients may not have access to primary or replica copies of the data until the brown out clears.
New clients can be started, as there should be a catalog service JVM in each data center so at least one catalog service JVM can be reached by a client even during a brown out event.
Quorum recovery
If quorum is lost for any reason, a recovery protocol is executed when quorum is reestablished. When the quorum loss event occurs, all liveness checking for core groups is suspended and failure reports are ignored. Once quorum is reestablished, the catalog service performs a liveness check of all core groups to immediately determine their membership. Any shards previously hosted on container JVMs reported as failed are recovered at this point. If primary shards were lost, surviving replicas are promoted to primaries. If replica shards were lost, additional replicas are created on the survivors.
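The promotion step of this recovery protocol can be sketched per partition. This is a simplified illustration with hypothetical names; the real placement logic is internal to the catalog service.

```python
# Hypothetical sketch of one recovery step: for each partition, drop the
# shards on failed JVMs, and if the primary was lost, promote a
# surviving replica to primary.

def recover_partition(shards, failed_jvms):
    """shards: list of (jvm, role) with role 'primary' or 'replica'."""
    survivors = [(jvm, role) for jvm, role in shards if jvm not in failed_jvms]
    if not any(role == "primary" for _, role in survivors):
        for i, (jvm, role) in enumerate(survivors):
            if role == "replica":
                survivors[i] = (jvm, "primary")   # promote first surviving replica
                break
    return survivors

shards = [("jvm1", "primary"), ("jvm2", "replica"), ("jvm3", "replica")]
print(recover_partition(shards, failed_jvms={"jvm1"}))
# [('jvm2', 'primary'), ('jvm3', 'replica')]
```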
Overriding quorum
Override quorum only when a data center failure has occurred. Quorum loss due to a catalog service JVM failure or a network brown out should recover automatically once the catalog service JVM is restarted or the brown out clears.
Administrators are the only ones with knowledge of a data center failure. WebSphere eXtreme Scale treats a brown out and a black out similarly. You must inform the eXtreme Scale environment of such failures by using the xsadmin command to override quorum. This tells the catalog service to assume that quorum is achieved with the current membership, and full recovery takes place. When issuing an override quorum command, you are guaranteeing that the JVMs in the failed data center have truly failed and will not recover.
This section describes how the container server JVMs behave while quorum is lost and recovered.
Containers host one or more shards. Shards are either primaries or replicas for a specific partition. The catalog service assigns shards to a container and the container will honor that assignment until new instructions arrive from the catalog service. This means that if a primary shard in a container cannot communicate with a replica shard because of a brown out then it will continue to retry until it receives new instructions from the catalog service.
If a network brown out occurs and a primary shard loses communication with the replica then it will retry the connection until the catalog service provides new instructions.
Synchronous replica behavior
While the connection is broken the primary can accept new transactions as long as there are at least as many replicas online as the minsync property for the map set. If any new transactions are processed on the primary while the link to the synchronous replica is broken, the replica will be cleared and resynchronized with the current state of the primary when the link is reestablished.
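The minsync rule above amounts to a simple check before accepting a write. This is an illustrative sketch; the function name is invented, and minsync mirrors the map-set property named in the text.

```python
# Hedged sketch of the minsync rule: during a broken replica link the
# primary keeps accepting transactions only while at least minsync
# synchronous replicas remain online.

def can_accept_writes(online_sync_replicas, minsync):
    return online_sync_replicas >= minsync

print(can_accept_writes(online_sync_replicas=1, minsync=1))  # True
print(can_accept_writes(online_sync_replicas=0, minsync=1))  # False
```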
Synchronous replication is strongly discouraged between data centers or over a WAN-style link.
Asynchronous replica behavior
While the connection is broken the primary can accept new transactions. The primary will buffer the changes up to a limit. If the connection with the replica is reestablished before that limit is reached then the replica is updated with the buffered changes. If the limit is reached, then the primary destroys the buffered list and when the replica reattaches then it is cleared and resynchronized.
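The bounded buffering behavior can be sketched as below. The class and method names are assumptions for illustration; the actual buffer limit and wire protocol are internal to the product.

```python
# Illustrative sketch of the bounded change buffer: while the replica
# link is down the primary queues changes; if the limit is exceeded, the
# buffered list is destroyed and the replica must fully resynchronize.

class AsyncReplicaLink:
    def __init__(self, buffer_limit):
        self.buffer_limit = buffer_limit
        self.buffer = []
        self.overflowed = False

    def record_change(self, change):
        if self.overflowed:
            return                    # buffer already destroyed
        self.buffer.append(change)
        if len(self.buffer) > self.buffer_limit:
            self.buffer = []          # destroy the buffered list
            self.overflowed = True    # replica must be cleared and resynced

    def on_reconnect(self):
        if self.overflowed:
            return "full-resync"
        pending, self.buffer = self.buffer, []
        return pending                # replay buffered changes to the replica

link = AsyncReplicaLink(buffer_limit=3)
for change in ["put k1", "put k2"]:
    link.record_change(change)
print(link.on_reconnect())   # ['put k1', 'put k2']: buffered changes replayed

link2 = AsyncReplicaLink(buffer_limit=1)
for change in ["put k1", "put k2", "put k3"]:
    link2.record_change(change)
print(link2.on_reconnect())  # full-resync: limit exceeded, buffer destroyed
```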
Clients are always able to connect to the catalog server to bootstrap to the grid whether the catalog service grid has quorum or not. The client will try to connect to any catalog server instance to obtain a route table and then interact with the grid. Network connectivity may prevent the client from interacting with some partitions due to network setup. The client may connect to local replicas for remote data if it has been configured to do so. Clients will not be able to update data if the primary partition for that data is not available.
You are strongly encouraged to specify a transaction timeout in the ObjectGrid descriptor XML file. The client will retry a single transaction until this timeout is exceeded. If a timeout is not specified, the client will retry indefinitely, which is not recommended, as the client thread would be blocked for an undetermined amount of time. The transaction timeout should be set to a multiple of the maximum expected transaction time.
You are also strongly encouraged to set the ORB request timeout in the orb.properties file. The client will block on a browned out socket for this maximum time. If the elapsed time from when the request is issued is still less than the transaction timeout, WebSphere eXtreme Scale will attempt the request again. The client will retry until the total elapsed time exceeds the transaction timeout. Thus, the maximum client block time is the transaction timeout plus the ORB request timeout.
The ORB will block until the TCP stack closes the browned out socket if the ORB request timeout is not specified. This time depends on the TCP stack tuning and could be hours depending on TCP FIN_WAIT and other related parameters.
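The timing rule in the two paragraphs above can be made concrete with a small worked example. The function name and the timeout values are illustrative, not product defaults.

```python
# Worked sketch of the client blocking bound: a request issued just
# before the transaction timeout expires can still block for a full ORB
# request timeout, so the worst case is the sum of the two.

def max_client_block_time(transaction_timeout, orb_request_timeout):
    return transaction_timeout + orb_request_timeout

# Example: 30 s transaction timeout, 10 s ORB request timeout.
print(max_client_block_time(transaction_timeout=30, orb_request_timeout=10))  # 40
```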
This section describes xsadmin commands useful for quorum situations.
Querying quorum status
The quorum status of a catalog server instance can be interrogated using the xsadmin command.
xsadmin -ch cathost -p 1099 -quorumstatus
There are five possible outcomes.
Overriding quorum
The xsadmin command can be used to override quorum. Any surviving catalog server instance can be used. All survivors are notified when one is told to override quorum. The syntax for this is as follows.
xsadmin -ch cathost -p 1099 -overridequorum
Diagnostic commands
xsadmin -ch cathost -p 1099 -coregroups
xsadmin -ch cathost -p 1099 -g Grid -teardown server1,server2,server3
xsadmin -ch cathost -p 1099 -g myGrid -routetable
xsadmin -ch cathost -p 1099 -g myGrid -unassigned
xsadmin -ch cathost -p 1099 -g myGrid -fh host1 -settracespec ObjectGrid*=event=enabled
This enables trace for all JVMs on the box with the host name specified, in this case host1.
xsadmin -ch cathost -p 1099 -g myGrid -m myMapSet -mapsizes myMap