An external agent should be provided that can be hosted separately from the appliances, ideally on a completely separate network, potentially in a separate datacentre. The appliances and the external agent form a 'cluster' in which all members are treated as equals, except that agents do not process any actual traffic. For an appliance to be active, it must be able to maintain connections to a majority of the nodes in the cluster (whether they are appliances or agents). For an appliance to take over processing from another appliance, it must first verify that it can connect to a majority of nodes in the cluster. If an appliance loses the ability to communicate with a majority of the cluster, it must shut down its queue managers (as the other appliance may have become active and started processing messages for those QMs).
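A minimal sketch of the intended quorum behaviour follows, assuming a three-member cluster and a placeholder connectivity probe; the member names and functions are illustrative only, not appliance code.

```python
# Minimal sketch (not appliance code): a node only activates or takes over
# when it can reach a strict majority of cluster members, itself included.
# Member names and the reachability check are hypothetical placeholders.

CLUSTER = ["appliance-a", "appliance-b", "agent-1"]   # hypothetical member list

def reachable(member: str) -> bool:
    """Placeholder for a real connectivity probe (TCP connect, heartbeat, etc.)."""
    return member != "appliance-b"                     # pretend appliance-b is down

def has_quorum(me: str) -> bool:
    visible = {m for m in CLUSTER if m == me or reachable(m)}
    return len(visible) > len(CLUSTER) // 2             # strict majority

def decide(me: str, currently_active: bool) -> str:
    if has_quorum(me):
        return "stay active" if currently_active else "may take over"
    # Lost the majority: the other appliance may now be active, so shut down
    # local queue managers to avoid split-brain processing.
    return "shut down local QMs"

if __name__ == "__main__":
    print(decide("appliance-a", currently_active=False))
```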
It should be possible to support multiple agent nodes in the cluster as long as the total number of nodes is an odd number. This ensures that patching can be performed without risking unnecessary outages due to loss of quorum on the remaining nodes (for example, if there are only 3 nodes in the cluster, taking the agent offline for patching leaves the cluster unable to fail over until patching completes).
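The arithmetic behind the odd-number requirement can be illustrated as follows (a worked example only, assuming the extra nodes are agents): with 5 nodes, a single surviving appliance can still see a quorum of 3 even while one agent is offline for patching, whereas with 3 nodes it cannot.

```python
# Illustrative arithmetic only: majority size and failover tolerance while one
# agent is offline for patching, for 3-node and 5-node clusters.

def majority(n: int) -> int:
    return n // 2 + 1

for total in (3, 5):
    quorum = majority(total)
    during_patch = total - 1                      # one agent taken down for patching
    # If the peer appliance then fails, the surviving appliance sees itself
    # plus the remaining agents:
    visible_to_survivor = during_patch - 1
    status = "possible" if visible_to_survivor >= quorum else "blocked"
    print(f"{total} nodes: quorum={quorum}, failover during patching {status}")
```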
Note that this solution would allow some duplicate processing to occur if the appliances are not configured to persist every log write synchronously; where log writes are required to be synchronous, this problem is prevented. It does, however, mean that the cluster state must be resolved before an appliance can cease replication of logs in a situation where the other appliance has become inaccessible.
In addition, I believe there is value in understanding some of the intelligence that has been built into PowerHA, such as preventing a node from becoming live if it is unable to ping its nearest hop on the network path to the other cluster node, as a method of detecting local adapter or local network failures.
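As a rough illustration of that kind of check (a sketch only; the first-hop address, the use of the system ping command, and its Linux-style options are all assumptions):

```python
# Sketch of a PowerHA-style sanity check: refuse to go live if the first hop
# towards the peer cannot be reached, since that usually indicates a local
# adapter or local network failure rather than a peer failure.

import subprocess

NEAREST_HOP = "192.0.2.1"   # hypothetical first-hop router towards the other node

def nearest_hop_reachable(address: str, timeout_s: int = 2) -> bool:
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), address],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def may_become_live() -> bool:
    if not nearest_hop_reachable(NEAREST_HOP):
        # The local network looks broken; becoming live here risks split-brain.
        return False
    return True

if __name__ == "__main__":
    print("may become live:", may_become_live())
```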
Also, HA heartbeating between appliances will need to be tunable to allow management of:
- Frequency of heartbeats
- Number of missed heartbeats before a cluster member is 'in-doubt'
- Number of missed heartbeats before a cluster member is evicted.
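A sketch of what such tunables might look like follows; the parameter names are hypothetical and do not correspond to any existing appliance configuration.

```python
# Hypothetical shape of the heartbeat tunables listed above.

from dataclasses import dataclass

@dataclass
class HeartbeatConfig:
    interval_ms: int = 1000          # frequency of heartbeats
    in_doubt_after_misses: int = 3   # missed beats before a member is 'in-doubt'
    evict_after_misses: int = 10     # missed beats before a member is evicted

    def in_doubt_timeout_ms(self) -> int:
        return self.interval_ms * self.in_doubt_after_misses

    def eviction_timeout_ms(self) -> int:
        return self.interval_ms * self.evict_after_misses

cfg = HeartbeatConfig(interval_ms=500, in_doubt_after_misses=4, evict_after_misses=20)
print(cfg.in_doubt_timeout_ms(), cfg.eviction_timeout_ms())
```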
Changes to cluster state should result in an event message to a queue. Possibly a non-MQ interface (e.g. SNMP) should be provided as well, so that cases where MQ is halted can still result in alerts to external monitoring solutions.
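One possible shape for this dual alerting path is sketched below; both senders are placeholders (the MQ put and the SNMP trap are simulated), and the point is simply that the fallback path does not depend on MQ being available.

```python
# Sketch: publish cluster-state changes as an event message where possible,
# with an out-of-band fallback (e.g. an SNMP trap) when MQ itself is halted.

import json, time

def send_mq_event(payload: dict) -> None:
    """Placeholder: put the payload on an event queue (fails if MQ is halted)."""
    raise RuntimeError("MQ is halted")        # simulate the failure case

def send_snmp_trap(payload: dict) -> None:
    """Placeholder: emit the same information via a non-MQ interface."""
    print("SNMP trap:", json.dumps(payload))

def notify_cluster_state_change(old: str, new: str) -> None:
    payload = {"event": "cluster_state_change", "from": old, "to": new,
               "timestamp": time.time()}
    try:
        send_mq_event(payload)
    except Exception:
        send_snmp_trap(payload)               # the alert still reaches monitoring

notify_cluster_state_change("normal", "degraded")
```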
If the Fibre Channel interface allows access to SAN disk, there may also be value in including a heartbeat-via-SAN mechanism as an alternative or complement to TCP heartbeating between hosts. In this case, the SAN becomes a form of witness host. This places some burden on the customer to implement a SAN topology that makes sense in the context of HA.
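The following sketch illustrates the witness idea using an ordinary file as a stand-in for a reserved region of shared SAN disk; the path, node names, and staleness threshold are all assumptions.

```python
# Sketch of heartbeat-via-SAN as a complement to TCP heartbeats: each node
# periodically writes its own timestamp to shared storage and reads its peer's
# to judge liveness even when the TCP path between nodes is down.

import json, os, time

WITNESS_PATH = "/tmp/san_witness.json"        # stand-in for a shared SAN block
STALE_AFTER_S = 10

def write_heartbeat(node: str) -> None:
    data = {}
    if os.path.exists(WITNESS_PATH):
        with open(WITNESS_PATH) as f:
            data = json.load(f)
    data[node] = time.time()
    with open(WITNESS_PATH, "w") as f:
        json.dump(data, f)

def peer_alive_via_san(peer: str) -> bool:
    if not os.path.exists(WITNESS_PATH):
        return False
    with open(WITNESS_PATH) as f:
        data = json.load(f)
    return time.time() - data.get(peer, 0) < STALE_AFTER_S

write_heartbeat("appliance-a")
print("peer alive via SAN witness:", peer_alive_via_san("appliance-b"))
```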
There would also need to be a mechanism to override the cluster logic, to designate a specific host as active in a situation where the cluster is degraded and all QMs have been shut down.
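A minimal sketch of such an override follows, assuming a hypothetical 'force active' administrative action (not an existing appliance feature) that bypasses the quorum check and records who invoked it.

```python
# Sketch of an administrative override: when the cluster is degraded and all
# QMs are down, an operator explicitly designates one host as active.

import time

override_state = {"forced_active": None, "set_at": None, "set_by": None}

def force_active(host: str, operator: str) -> None:
    # Record the decision for audit, then bypass the normal quorum check.
    override_state.update(forced_active=host, set_at=time.time(), set_by=operator)
    print(f"{operator} forced {host} active, bypassing quorum")

def clear_override() -> None:
    override_state.update(forced_active=None, set_at=None, set_by=None)

force_active("appliance-a", "mqadmin")
```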
Finally, the solution should handle scenarios where the MQ appliances have lost connectivity to each other but both are still able to contact the quorum agent. In this case, the agent should include logic to arbitrate between the nodes and determine whether either node should be evicted.
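One way the agent might arbitrate is sketched below, using a simple single-holder lease as an assumption (not existing appliance behaviour): whichever appliance the agent is already backing keeps the active role, and the other is told to stand down.

```python
# Sketch of agent-side arbitration when both appliances can reach the agent
# but not each other: grant a single "active" lease and evict the other node.

lease_holder = None                 # which appliance the agent currently backs

def arbitrate(requesting_node: str) -> str:
    global lease_holder
    if lease_holder is None or lease_holder == requesting_node:
        lease_holder = requesting_node
        return "you may stay/become active"
    # The other appliance already holds the lease; this one must stand down.
    return "stand down: peer holds the active lease"

print("appliance-a:", arbitrate("appliance-a"))
print("appliance-b:", arbitrate("appliance-b"))
```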
A number of updates to the MQ Appliance HA capability are under consideration which would address aspects of this RFE, including 3-node quorum support and additional tuning options. Note that SAN support in the appliance is no longer considered a strategic capability (and is omitted from the M2002 hardware), so this will not form part of any long-term solution.