Status Future consideration
Created by Guest
Created on Oct 15, 2015

Quorum/Tiebreaker capability to protect M2000 appliance HA from split-brain during network failure events

An external agent should be provided that can be hosted separately from the appliances, ideally on a completely separate network and potentially in a separate datacentre. The appliances and the external agent form a 'cluster' in which all nodes are treated as equal members, except that agents do not process any actual traffic. To be active, an appliance must maintain connections to a majority of the nodes in the cluster (whether they are appliances or agents). Before taking over processing from another appliance, an appliance must first verify that it can connect to a majority of nodes in the cluster. If an appliance loses the ability to communicate with a majority of the cluster, it must shut down its queue managers (QMs), because the other appliance may have become active and started processing messages for those QMs.
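The majority rule described above can be sketched as follows. This is only an illustration of the requested behaviour, not an existing MQ Appliance feature: the function names and action strings are hypothetical, and it assumes that a node counts itself towards the majority.

```python
def has_quorum(reachable_peers: int, cluster_size: int) -> bool:
    """Return True if this node plus the peers it can reach form a strict majority.

    Assumption: the node counts itself as one of the members it can "see".
    """
    if cluster_size < 1 or not 0 <= reachable_peers < cluster_size:
        raise ValueError("reachable_peers must be between 0 and cluster_size - 1")
    visible = reachable_peers + 1  # include this node itself
    return visible > cluster_size // 2


def decide_action(is_active: bool, reachable_peers: int, cluster_size: int) -> str:
    """Map the quorum check onto the behaviour requested in this idea."""
    if has_quorum(reachable_peers, cluster_size):
        # An active appliance keeps running; a standby is allowed to take over.
        return "stay_active" if is_active else "may_take_over"
    # Without a majority, an active appliance must stop its queue managers,
    # and a standby must not start them.
    return "shut_down_queue_managers" if is_active else "stay_standby"
```

For a cluster of two appliances and one agent, an appliance that can reach only the agent still has quorum (2 of 3), while an appliance that can reach nobody does not (1 of 3) and must shut down its queue managers.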

It should be possible to support multiple agent nodes in the cluster, as long as the total number of nodes is odd. This ensures that patching can be performed without risking unnecessary outages due to loss of quorum on the remaining nodes. For example, if there are only 3 nodes in the cluster, taking the agent offline for patching leaves the cluster unable to fail over until patching completes.
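The arithmetic behind the odd-node recommendation can be checked with a short sketch. The helper names are hypothetical, and strict-majority quorum is assumed.

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def failures_tolerated(cluster_size: int) -> int:
    """Number of nodes that can be lost while the rest still hold quorum."""
    return cluster_size - quorum_size(cluster_size)


def spare_failures_during_patching(cluster_size: int) -> int:
    """Failures still tolerated while one node is deliberately offline."""
    return failures_tolerated(cluster_size) - 1


for n in (3, 4, 5):
    print(n, quorum_size(n), failures_tolerated(n), spare_failures_during_patching(n))
```

With 3 nodes, one failure is tolerated, so patching the agent uses up the entire margin and no failover is possible until it returns. Moving to 4 nodes does not help, because the quorum size rises to 3. With 5 nodes, one node can be patched while the cluster still tolerates one unplanned failure.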

Note that this solution would allow some duplicate processing to occur if the appliances are not set up to require synchronous persistence of every log write; where log writes are required to be synchronous, this problem is prevented. It does mean, however, that the cluster state must be resolved before an appliance can cease replication of logs when the other appliance has become inaccessible.

In addition, I believe there is value in understanding some of the intelligence that has been built into PowerHA – such as preventing nodes from becoming live if they are unable to ping their nearest hop in the network path to the other cluster node as a method of detecting local adapter or local network failures.

Also, HA heartbeat intervals between appliances will need to be tunable to allow management of:

- Frequency of heartbeats
- Number of missed heartbeats before a cluster member is 'in-doubt'
- Number of missed heartbeats before a cluster member is evicted
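As a rough illustration of how these three tunables might interact, the sketch below classifies a cluster member from the time since its last heartbeat. The class, field names, and default values are all hypothetical and do not correspond to any existing appliance setting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeartbeatConfig:
    """Hypothetical tunables; the defaults are illustrative only."""
    interval_seconds: float = 1.0      # frequency of heartbeats
    in_doubt_after_missed: int = 3     # missed heartbeats before 'in-doubt'
    evict_after_missed: int = 10       # missed heartbeats before eviction

    def __post_init__(self) -> None:
        if self.interval_seconds <= 0:
            raise ValueError("interval_seconds must be positive")
        if not 0 < self.in_doubt_after_missed <= self.evict_after_missed:
            raise ValueError("need 0 < in_doubt_after_missed <= evict_after_missed")


def member_state(seconds_since_last_heartbeat: float, cfg: HeartbeatConfig) -> str:
    """Classify a member as 'healthy', 'in-doubt', or 'evicted'."""
    # Whole heartbeat intervals that have elapsed without a heartbeat arriving.
    missed = int(seconds_since_last_heartbeat // cfg.interval_seconds)
    if missed >= cfg.evict_after_missed:
        return "evicted"
    if missed >= cfg.in_doubt_after_missed:
        return "in-doubt"
    return "healthy"
```

Keeping the 'in-doubt' and eviction thresholds separate lets an administrator raise an early warning on a flaky link without triggering failover until the outage is sustained.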

Changes to cluster state should result in an event message to a queue. Possibly a non-MQ interface (e.g. SNMP) should be designated as well so that cases where MQ is halted can still result in alerts to external monitoring solutions.
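One way to picture the requirement for a second, non-MQ alert path is a publisher that tries every configured sink independently, so that a failure of the MQ sink does not suppress the alert on the other interface. The sketch below uses plain callables as stand-ins; no real MQ or SNMP API is shown, and all names are hypothetical.

```python
from typing import Callable, Dict, Iterable, List

Event = Dict[str, str]
Sink = Callable[[Event], None]


def publish_state_change(event: Event, sinks: Iterable[Sink]) -> List[bool]:
    """Deliver the event to every sink; return one success flag per sink."""
    results: List[bool] = []
    for sink in sinks:
        try:
            sink(event)
            results.append(True)
        except Exception:
            # A failing sink (for example, a halted queue manager) must not
            # prevent delivery to the remaining sinks.
            results.append(False)
    return results
```

A caller would pass one sink that puts the event on a queue and one that raises an alert through the non-MQ interface, then log any sink whose flag is False.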

If Fibre Channel connectivity allows access to SAN disk, there may also be value in including heartbeat-via-SAN mechanisms as an alternative or complement to TCP heartbeating between hosts. In this case, the SAN becomes a form of witness host. This places some burden on the customer to implement a SAN topology that makes sense in the context of HA.

There would also need to be a mechanism to override the cluster logic, to designate a specific host as active in a situation where the cluster is degraded and all QMs have been shut down.

Finally, the solution should handle scenarios where the MQ appliances have lost connectivity between themselves, but both are still able to contact the quorum agent. In this case, the agent should include logic to arbitrate between the nodes and determine if either node should be evicted.
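A minimal sketch of such arbitration follows. The idea does not specify a tie-breaking policy, so preferring the currently active appliance (to avoid a needless failover) is an assumption made here, and the function and constant names are hypothetical.

```python
from __future__ import annotations

KEEP, EVICT, UNKNOWN = "keep", "evict", "unknown"


def arbitrate(active: str, standby: str, agent_sees: set[str],
              peer_link_up: bool) -> dict[str, str]:
    """Decide, from the agent's point of view, which appliance should survive.

    Assumed policy: when the appliances cannot see each other, keep the
    incumbent active appliance if the agent can still reach it.
    """
    if peer_link_up:
        # The appliances can still talk to each other; nothing to arbitrate.
        return {active: KEEP, standby: KEEP}
    if active in agent_sees:
        # Covers the scenario above: both reach the agent, but not each other.
        return {active: KEEP, standby: EVICT}
    if standby in agent_sees:
        return {active: EVICT, standby: KEEP}
    # The agent can reach neither appliance, so it cannot make a decision.
    return {active: UNKNOWN, standby: UNKNOWN}
```

For example, `arbitrate("mq1", "mq2", {"mq1", "mq2"}, peer_link_up=False)` keeps `mq1` and evicts `mq2`, so only one appliance continues to run the queue managers.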

Idea priority High
RFE ID 78428
RFE URL
RFE Product IBM MQ
  • Guest
    Nov 13, 2018

    A number of updates to MQ Appliance HA capability are under consideration that would address aspects of this RFE, including 3-node quorum support and additional tuning options. Note that SAN support in the appliance is no longer considered a strategic capability (and is omitted from the M2002 hardware), so it will not form part of any long-term solution.