MQ and operating system products provide lots of options to assist with availability
– Many interact and can work well in conjunction with one another
• But it’s the whole stack which is important …
– Ensure your application works in these environments
• Decide which failures you need to protect against
– And the potential effects of those failures
Techniques and technologies to ensure availability of messaging
• Anything that can cause an outage is significant
– e.g. an overloaded system
• You can have the best HA technology in the world, but you have to manage it correctly
Single Points of Failure
• With no redundancy or fault tolerance, a failure of any component can lead to a loss of availability
• Every component is critical. The system relies on the:
– Power supply, system unit, CPU, memory
– Disk controller, disks, network adapter, network cable
– …and so on
• Various techniques have been developed to tolerate failures:
– UPS or dual supplies for power loss
– RAID for disk failure
– Fault-tolerant architectures for CPU/memory failure
• The objective is to achieve 24×7 availability of messaging. Applications should be processing messages continuously, regardless of any failures in any component. This presentation concentrates on the MQ and MB element, but they are not the only areas to think about.
• Availability is not necessarily the same as ensuring processing of each and every message. In some situations, some limited message loss is acceptable provided that availability is maximised. For example, a message might expire or be superseded during an outage. Here, the important thing is to ensure that messages are still getting through.
• Service Level Agreements (SLAs) should define what level of availability your applications and services should provide. The level of availability is often measured by the number of 9s.
• HA solutions should increase availability given scheduled or unscheduled downtime. Scheduled downtime is more common than unscheduled. Availability issues usually involve a multitude of hardware and software systems.
• Avoid application awareness of availability solutions and aim to have little or no code in the application managing the environment. That’s a task better left to systems administrators.
• The applications also need to be resilient to failures, since the messages will only flow if the applications are available to produce and consume them.
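The "number of 9s" measure mentioned in the notes above can be turned into a downtime budget. The following is a minimal sketch, assuming a 365-day year and that "N nines" means an availability of 1 − 10^−N; the function name is illustrative only.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def downtime_minutes_per_year(nines: int) -> float:
    """Return the downtime allowed per year for an availability of N nines.

    For example, 3 nines means 99.9% availability, i.e. 0.1% unavailability.
    """
    unavailability = 10 ** (-nines)
    return unavailability * MINUTES_PER_YEAR


if __name__ == "__main__":
    for n in range(2, 6):
        print(f"{n} nines: {downtime_minutes_per_year(n):.2f} minutes/year")
```

An SLA target of three nines therefore leaves roughly 8.8 hours a year for scheduled and unscheduled downtime combined.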
• Sharing cluster queues on multiple queue managers prevents a queue from being a SPOF
• Cluster workload algorithm automatically routes traffic away from failed queue managers
• Although queue manager clustering does provide some facilities useful in maintaining availability of messaging, it is
primarily a workload distribution feature. It is simple to deploy extra processing power in the cluster to process more messages.
• If a queue manager in a cluster fails, the failure can be mitigated by other cluster queue managers hosting instances of the cluster queues. Messages are marooned on the failed queue manager until it restarts, but messaging through the cluster is still operational.
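The idea of routing traffic away from a failed queue manager can be illustrated with a toy model. This is only a sketch of the principle, assuming a plain round-robin over the available instances; it is not the MQ cluster workload algorithm, which takes more factors into account. All names here are hypothetical.

```python
from typing import Callable


def route_messages(
    messages: list[str],
    instances: list[str],
    is_available: Callable[[str], bool],
) -> list[tuple[str, str]]:
    """Assign each message to an available queue manager, round-robin.

    `instances` lists the queue managers hosting an instance of the cluster
    queue; `is_available` is a hypothetical health check supplied by the caller.
    """
    targets = [qm for qm in instances if is_available(qm)]
    if not targets:
        raise RuntimeError("no available queue manager hosts the cluster queue")
    return [(msg, targets[i % len(targets)]) for i, msg in enumerate(messages)]
```

With three instances and one of them down, new messages are spread over the remaining two. Messages already on the failed queue manager are not covered by this model; as noted above, they stay there until it restarts.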
Failover is the automatic switching of availability of a service
– For MQ, the “service” is a queue manager
• Traditionally the preserve of an HA cluster, such as HACMP
– Data accessible on all servers
– Equivalent or at least compatible servers
• Common software levels and environment
– Sufficient capacity to handle workload after failure
• Workload may be rebalanced after failover requiring spare capacity
– Startup processing of queue manager following the failure
• MQ offers two ways of configuring for failover:
– Multi-instance queue managers
– HA clusters
Requirement to access data
– Networked storage for a multi-instance queue manager
– Shared disks for an HA cluster, usually “switchable” between the servers
• Requirement for client connectivity
– IP address takeover (IPAT) is generally a feature of failover environments
– If a queue manager changes IP address, intelligent routers can hide this or MQ
network configuration can be defined with alternative addresses
• Servers must be equivalent
– Common software levels – or at least compatible, to allow for progressive
upgrade of the servers
– Common environments – paths, userids, security
• Sufficient capacity to handle workload
– Often, workload will be redistributed following a failover. Often, the systems are
configured for mutual takeover where the workload following failover is doubled
since the surviving servers must handle the traffic intended for both.
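The capacity requirement can be checked with simple arithmetic before a failure happens. The sketch below assumes that load and capacity are expressed in the same unit (for example messages per second) and that the failed server's load is spread evenly over the survivors; both are simplifying assumptions, and the function name is illustrative.

```python
def can_absorb_failure(
    loads: dict[str, float],
    capacities: dict[str, float],
    failed: str,
) -> bool:
    """Return True if the surviving servers can carry the failed server's load.

    The failed server's load is assumed to be split evenly across survivors.
    """
    survivors = [server for server in loads if server != failed]
    if not survivors:
        return False
    extra = loads[failed] / len(survivors)
    return all(loads[s] + extra <= capacities[s] for s in survivors)
```

In a two-server mutual takeover configuration this reduces to the rule in the notes: each server must be able to carry the combined load of both.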
Failover times are made up of three parts:
– Time taken to notice the failure
• Heartbeat missed
• Bad result from status query
– Time taken to establish the environment before activating the service
• Switching IP addresses and disks, and so on
– Time taken to activate the service
• This is queue manager restart
• Failover involves a queue manager restart
– Nonpersistent messages, nondurable subscriptions discarded
• For fastest times, ensure that queue manager restart is fast
– No long running transactions, for example
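The three parts of the failover time simply add up, which makes it easy to see where tuning effort pays off. The following is a minimal sketch with made-up numbers; the class and field names are hypothetical and the values are not measurements.

```python
from dataclasses import dataclass


@dataclass
class FailoverEstimate:
    """Rough failover time budget, in seconds."""

    detect_s: float   # time to notice the failure (missed heartbeat, bad status)
    prepare_s: float  # time to switch IP addresses, disks, and so on
    restart_s: float  # time for queue manager restart

    def total(self) -> float:
        return self.detect_s + self.prepare_s + self.restart_s

    def largest_part(self) -> str:
        parts = {
            "detect": self.detect_s,
            "prepare": self.prepare_s,
            "restart": self.restart_s,
        }
        return max(parts, key=parts.get)


if __name__ == "__main__":
    estimate = FailoverEstimate(detect_s=10, prepare_s=5, restart_s=30)
    print(estimate.total(), estimate.largest_part())
```

If restart dominates, as in this illustrative case, avoiding long-running transactions is likely to help more than shortening the heartbeat interval.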
Queue manager data can be placed in networked storage
– Data is available to multiple machines concurrently
– Networked storage can be NAS or a cluster file system
• Already have SAN support
– Protection against concurrently starting two instances of a queue manager
using the same queue manager data
– On Windows, support for Windows network drives (SMB)
– On Unix variants, support for Posix-compliant filesystems with lease-based file locking
• NFS v4 has been tested by IBM
• Some customers have a “no local disk” policy for queue manager data
– This is an enabler for some virtualized deployments
– Allows simple switching of queue manager to another server following a failure
While not directly an HA technology, this is an enabler for customers who want to place all of the data remote from their servers such that it becomes possible to replace one server with another in the event of a failure.
• Support has been added for networked storage for queue manager data and logs. Previously, it’s been supported for error and trace directories, and for installation binaries.
• On Windows, we support Windows network drives (SMB).
• On Unix platforms, we support Posix-compliant filesystems that support lease-based file locking. The lease-based locking ensures that files unlock when the server running a queue manager fails. This rules out NFS v3 for use in an HA environment because the file locks are not released automatically for some failures and this will prevent failover.
• On Unix, we have provided a test program (amqmfsck) which checks out the filesystem’s behaviour. If the tests do not pass, a queue manager using the filesystem will not behave correctly.
Multi-instance Queue Managers
• Basic failover support without HA cluster
• Two instances of a queue manager on different machines
– One is the “active” instance, other is the “standby” instance
– Active instance “owns” the queue manager’s files
• Accepts connections from applications
– Standby instance monitors the active instance
• Applications cannot connect to the standby instance
• If active instance fails, standby performs queue manager restart and becomes active
• Instances are the SAME queue manager – only one set of queue manager data
– Queue manager data is held in networked storage
“Basic failover”: no coordination with other resources like disks, IP
addresses, databases, user applications. There is also no sophisticated
control over where the queue managers run and move to (like a 3-node HA
cluster, for example). Finally, once failover has occurred, it is necessary to
manually start a new standby instance.
• Architecturally, this is essentially the same as an existing HACMP/VCS
setup, with the data shared between systems. It does not give anything
“stronger” in terms of availability – but we do expect the typical takeover
time to be significantly less. And it is much simpler to administer.
• Just as with a configuration using an HA cluster, the takeover is in essence
a restart of the queue manager, so nonpersistent messages are discarded,
queue manager channels go into retry, and so on.
• All queue manager administration must be performed on the active instance
• dspmq enhanced to display instance information
– dspmq issued on hosta
– On hosta, there’s a standby instance
– The active instance is on hostb
$ hostname
hosta
$ dspmq -x
QMNAME(MIQM) STATUS(Running as standby)
    INSTANCE(hostb) MODE(Active)
    INSTANCE(hosta) MODE(Standby)
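Monitoring scripts sometimes need to know which host currently holds the active instance. The sketch below parses text in the INSTANCE(...) MODE(...) form shown in the dspmq -x example above. It assumes the output describes a single queue manager and that the format matches that example; the function names are illustrative.

```python
import re
from typing import Optional

# Matches pairs such as "INSTANCE(hostb) MODE(Active)"
INSTANCE_RE = re.compile(r"INSTANCE\(([^)]*)\)\s+MODE\(([^)]*)\)")


def parse_instances(dspmq_output: str) -> dict[str, str]:
    """Map each host name to its instance mode (for example Active or Standby)."""
    return {host: mode for host, mode in INSTANCE_RE.findall(dspmq_output)}


def active_host(dspmq_output: str) -> Optional[str]:
    """Return the host running the active instance, or None if there is none."""
    for host, mode in parse_instances(dspmq_output).items():
        if mode == "Active":
            return host
    return None


SAMPLE = """QMNAME(MIQM) STATUS(Running as standby)
    INSTANCE(hostb) MODE(Active)
    INSTANCE(hosta) MODE(Standby)"""

if __name__ == "__main__":
    print(active_host(SAMPLE))
```

Because the pattern does not depend on line breaks, it also works if the output has been flattened onto one line.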
MQ traditionally made highly available using an HA cluster
– IBM PowerHA for AIX (formerly HACMP), Veritas Cluster Server, Microsoft Cluster Server, HP Serviceguard, …
• HA clusters can:
– Coordinate multiple resources such as application server, database
– Consist of more than two machines
– Fail over more than once without operator intervention
– Take over the IP address as part of failover
– Cover more use cases than multi-instance queue managers
The disks in an HA cluster are switchable shared disks
– Not networked storage as used by multi-instance queue managers
In an HA cluster, queue manager data and logs are placed on a shared disk
– Disk is switched between machines during failover
• The queue manager has its own “service” IP address
– IP address is switched between machines during failover
– Queue manager’s IP address remains the same after failover
• The queue manager is defined to the HA cluster as a resource dependent on the shared disk and the IP address
– During failover, the HA cluster will switch the disk, take over the IP address and then start the queue manager
• The collection of servers that makes up a failover environment is known as
a cluster. The servers are typically referred to as nodes.
• One node runs an application or service such as a queue manager, while
the HA cluster monitors its health. The following example is called a cold
standby setup because the other nodes are not running workload of their
own. The standby node is ready to accept the workload being performed
by the active node should it fail.
• A shared disk is a common approach to transferring state information
about the application from one node to another, but is not the only solution.
In most systems, the disks are not accessed concurrently by both nodes,
but are accessible from either node, which take turns to “own” each disk or
set of disks. In other systems the disks are concurrently visible to both (all)
nodes, and lock management software is used to arbitrate read or write access.
• Alternatively, disk mirroring can be used instead of shared disk. An
advantage of this is increased geographical separation, but latency limits
the distance that can be achieved. But for reliability, any synchronous disk
writes must also be sent down the wire before being confirmed.
Creating a QM in an HA cluster
• Create filesystems on the shared disk, for example
– /MQHA/QM1/data for the queue manager data
– /MQHA/QM1/log for the queue manager logs
On one of the nodes:
• Mount the filesystems
• Create the queue manager
– crtmqm -md /MQHA/QM1/data -ld /MQHA/QM1/log QM1
• Print out the configuration information for use on the other nodes
– dspmqinf -o command QM1
On the other nodes:
• Mount the filesystems
• Add the queue manager’s configuration information
– addmqinf -s QueueManager -v Name=QM1 -v Prefix=/var/mqm -v DataPath=/MQHA/QM1/data/QM1 -v Directory=QM1
Active – Standby Setup
Both machines have their own MQ installation and share the MQ data on the shared storage. When the queue manager is started with strmqm -x, the first instance obtains the lock on the file system and becomes active. A second instance started the same way reports that the active instance is running elsewhere and becomes the standby; dspmq -x then shows it as “Running as standby”.
If the active instance fails, the lock is released, and the standby instance takes control of the file system and becomes the active instance.
Failover from active to standby takes some time, and clients must wait until the queue manager is fully active.
Once the new active instance is fully available, clients can reconnect and resume exchanging messages.
If an application loses its connection to a queue manager, what does it do?
– End abnormally
– Handle the failure and retry the connection
– Reconnect automatically in a WebSphere application container
• WebSphere Application Server contains logic to reconnect
– Use MQ automatic client reconnection
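The "handle the failure and retry the connection" option from the list above can be sketched as a retry loop with exponential backoff. This is a generic illustration, not MQ client code: `connect` is a hypothetical callable supplied by the application, and it is assumed to raise ConnectionError while the queue manager is still restarting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def connect_with_retry(
    connect: Callable[[], T],
    attempts: int = 5,
    initial_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `connect` until it succeeds or the attempts are used up.

    The wait between attempts doubles each time, capped at `max_delay`.
    The last ConnectionError is re-raised if every attempt fails.
    """
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts:
                raise
            sleep(delay)
            delay = min(delay * 2, max_delay)
    raise ValueError("attempts must be at least 1")
```

The total retry window should be longer than the expected failover time, otherwise the application gives up before the standby instance has become active. Where MQ automatic client reconnection or the application server's own reconnect logic is available, it avoids having this code in the application at all.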
The setup described above is active-standby with two servers. An active-active setup can be built on the same two servers by running two multi-instance queue managers, with each server hosting the active instance of one queue manager and the standby instance of the other. If a failover happens, both queue managers run on the same server, so make sure each server has enough capacity to handle the load of both queue managers.
So overall we have seen three types of MQ HA:
- Active-standby with shared IP, shared storage and HA cluster technology
- Active-standby (multi-instance setup)
- Active-active (two active-standby pairs, one active instance per server)