HiveMQ: High Availability Through Replication and Failover
Understanding Availability
Have you ever been frustrated by connectivity loss while trying to find a place to enjoy a good meal after a lengthy trip to another country? A smartphone suddenly becomes a useless accessory, leaving you to your own devices, that is, eyes and legs. Having dietary restrictions may complicate your search further: finding the right place to eat transforms from a 20-minute adventure into a standalone questline.
We experience a somewhat similar level of frustration when a WhatsApp or Telegram message with the location of a meeting is stubbornly stuck on the phone, unwilling to get through to your friend. If you have changed your plans while on the train and suggested a different place to meet, the cost of message delay might vary from tiny to substantial. If the original location was in a different part of the city, then every additional minute of delay will cost your friend quite a bit of time to correct their course. They may end up waiting for a train in the opposite direction or taking a detour to find an appropriate highway exit.
What do these situations have in common aside from frustration? In both, the service that you relied upon was unavailable. It might not have been unavailable in general, but it stopped responding to your queries and thus had a negative impact on the course of your day. Hardly a single person remains calm when their mobile app just sits there, idly waiting for the service or connection to come back.
Luckily, most services and apps do not fall into the mission-critical category. That means that even if you have to wait for the service a bit longer, the delay won't result in life-threatening outcomes.
Something more than anyone's frustration might be at stake when components of public infrastructure, like highways and railroads, become unavailable. The reasons vary: there might be long-awaited construction works to increase the throughput of a highway, train drivers might be on strike, or a track might get damaged by an earthquake or a fallen tree.
Unavailability of the public transportation service will likely lead to more dramatic consequences than unavailability of the messenger app. A stuck train may cost a customer meeting whereas an ambulance taking a detour may cost a life. The weight of expectations on the public infrastructure is substantial. Expectations, however, differ depending on the area of application and even on a specific individual.
Can we even say what it means for a service or an application to be available?
Application Availability
Availability is one of those terms that are used widely while carrying an aura of technical credibility. Yet to be a true technical term, availability depends too much on the individual perception of the person using a service or an app.
Try it yourself. Would you consider Google search available if it returns results in under several milliseconds? Most likely, yes; it takes hundreds of milliseconds for a neural impulse to propagate through your brain. What if Google returns the result in 150 milliseconds, or maybe half a second? You would likely notice the delay. In that sense, the result of your search is not available instantaneously, so in your perception, the availability of Google search dropped. And what if you got distracted by catching a bus or a train after submitting your search query? Then you likely won't notice even multiple seconds of delay.
Defining the availability of a system is thus tricky: it depends on the category of the application as well as on the expectations of the end users.
Some applications face real people as their users, for example, mobile apps, search engines, and online retail. Other applications, like databases and MQTT brokers, are generally not exposed to individuals. These software infrastructure components do not perform any standalone function; they are incorporated into applications such as a search engine or an infotainment system for connected vehicles.
Since infrastructure software is generally not exposed to the end user and does not cover application or service needs end-to-end, it is quite natural to define the expectations of these pieces of software explicitly. This can be done in various ways, for example, in the form of time budgets. For a specific operation, like passing a message through an MQTT broker from the moment it enters until it exits to be sent to the subscriber, one could allocate 5 milliseconds. This defines the latency of a message passing through the broker.
Most infrastructure software is shared among multiple connections whose count is not hard-limited. Hence, if the incoming load is high enough (by whatever definition of 'enough'), the system may run out of its limited resources and will have to postpone processing of some of the load. Most of the load would still perfectly satisfy the imposed time budget, whereas a few messages might get delayed for a bit longer. Therefore, such latency goals are frequently defined with respect to some percentage of the load. For example, one could say that 99% of messages should pass through the broker in under 5 milliseconds.
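As a toy illustration of such a percentile-based latency goal, the sketch below (with made-up latency samples; not HiveMQ code) checks whether the p99 latency stays within a 5-millisecond budget using the nearest-rank method:

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the value below which `pct` percent of samples fall."""
    ordered = sorted(samples)
    rank = max(0, math.ceil(pct / 100.0 * len(ordered)) - 1)
    return ordered[rank]

# Hypothetical per-message broker latencies in milliseconds.
latencies_ms = [1.2, 0.8, 2.5, 1.1, 4.9, 0.7, 3.3, 1.9, 0.9, 12.0]

budget_ms = 5.0
p99 = percentile(latencies_ms, 99)
print(f"p99 = {p99} ms, within budget: {p99 < budget_ms}")
```

Note how a single slow outlier (12.0 ms) is enough to blow the p99 budget for this small sample, even though 9 out of 10 messages were well under 5 ms.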
But even when defining such refined objectives, one has to be mindful of usage patterns. In the case of an MQTT broker, which is a stateful application, subscribers may go offline but ask to keep their session so that messages matching their subscriptions are queued and wait for the subscriber to come back online. If a subscriber goes offline and returns in, say, an hour, its queued messages will be delivered, but what would their latency be? By the above definition, on the order of an hour. But would that mean that the subscriber actually experienced service unavailability? Not at all: it was offline for that hour, so from its perspective the messages were received as soon as it was back online. Such nuances have to be considered when defining availability for infrastructure software. In that respect, defining availability in terms of X nines (for example, five nines, which translates to roughly 5 minutes of downtime per year) might be considered up for debate by the users of such infrastructure.
In most cases, this definition suffers from being vendor-specific; that is, the notion of what it means for a system to be available depends on the interpretation that the data infrastructure vendor (for example, a cloud services provider) adopts. Even if your application users experience high latencies because of a temporary network hiccup and therefore consider the application unavailable, according to the data infrastructure provider, the service was still available because only a total outage counts as unavailability, not slow packet delivery.
Defining availability for normal operations has many shades of gray. It is use-case-dependent and, beyond that, perception-dependent. There is, however, a universal condition of a software system that leads to reduced availability regardless of the definition. This condition is a fault of the system or one of its components.
Enterprise software may be subject to various kinds of faults. Broadly speaking, these can be grouped into software faults, hardware faults, and network disruptions. For a piece of software deployed on a single server, either a software or a hardware fault renders it entirely unavailable until the software is restarted or the server is swapped for a functioning one. Network disruptions can also affect the availability of an application deployed on a single server, for example a web application: a network service disruption between the client and the server application may result in delayed processing of user messages or their ultimate loss.
To achieve fault tolerance against the software and hardware issues, the software is made distributed[1]; that is, multiple instances of the same program are executed on multiple servers. Since for most applications being available means correct and in-time processing of some data, not only is the application executed in multiple instances but also multiple copies of the data are created and allocated on the application servers. This approach to achieve high availability through tolerating faults of software and hardware is known as replication.
Replication for Fault Tolerance and High Availability
Replication is effectively an implementation of redundancy in software systems.
The idea of redundancy is nothing new. Living organisms have redundancy in their subsystems. For example, the human body has two kidneys but it can still function even if one of them can no longer perform its function. Redundancy is heavily used in reliability engineering for mission-critical systems such as flight control and energy grid.
The word ‘redundancy’ has a sour aftertaste to it. If something is redundant, then it should be removed, right? The truth is that keeping some systems or their components redundant — that is, multiple components performing the same function — leads to higher reliability of the system as a whole. In case one of the components fails, its redundant counterpart can ensure that the system continues to function as a whole until the system can safely be brought to maintenance. Redundancy is necessary when the cost of subsystem failure is high, either in monetary terms or in terms of human lives.
The idea of redundancy is taken a step further in the engineering of reliable and highly available software systems. Unlike in hardware systems such as an airplane or biological systems such as the human body, one can fairly quickly provision an additional instance of software and, more than that, an additional piece of virtual infrastructure (a virtual machine) in the cloud.
Another difference is that some applications[2] accumulate data and modify their own state. Such an application gradually changes over time as it receives inputs that are stored and processed. This means that redundancy in software systems has to be continuously maintained in the sense that all instances of the software system keep the same state even though that state changes over time. This is what differentiates pure redundancy from replication: the 'sameness' of multiple software instances has to be actively maintained as data changes and instances are added or removed.
So far, we have seen two components of availability for stateful applications: the availability of the application service and the availability of the data. Both are important, but not always to the same extent. To keep the message delivery service available, it is replicated on multiple servers; even if all of them but one go down, the service can still be maintained, depending on the volume of load. In contrast, if the data is replicated on only two servers and both of them go down at the same time, the data becomes unavailable regardless of how many more servers with application instances remain.
In the picture below, you can see an application with 4 instances spread across 4 servers. It stores the data of two clients, client 1 and client 2. This data is partitioned and replicated. The data for client 1 landed on servers 1 and 3; the data for client 2 is on servers 2 and 3. If, in an unfortunate turn of events, servers 1 and 3 crash, all the accumulated data for client 1 is lost even though servers 2 and 4 can continue to operate and store new data. Note that the data for client 2 is not lost in this case.
This specific example shows that our system deployed on 4 servers can survive only one arbitrary server crash without becoming data-unavailable. If we had put the data for client 1 on another server, say, server 4 or 2, then the application could have survived the simultaneous crash of servers 1 and 3 without losing data. In general, if we want an application to continue serving the user fully and not lose data even in case of F faults (where F is some natural number), then we need to replicate both the data and the application at least F + 1 times. This number is called the replication factor, RF.
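The reasoning above can be checked mechanically. The sketch below (hypothetical placements, not HiveMQ code) computes the largest number of simultaneous crashes F that a given replica placement is guaranteed to survive; with RF copies it comes out to RF - 1:

```python
from itertools import combinations

def data_available(replica_servers, crashed):
    """True if at least one server holding a replica survives the crashes."""
    return bool(set(replica_servers) - set(crashed))

def max_guaranteed_faults(replica_servers, all_servers):
    """Largest F such that ANY F simultaneous crashes leave the data available."""
    for f in range(len(all_servers), 0, -1):
        if all(data_available(replica_servers, crashed)
               for crashed in combinations(all_servers, f)):
            return f
    return 0

servers = [1, 2, 3, 4]
client1_replicas = [1, 3]  # RF = 2, as in the example above
print(max_guaranteed_faults(client1_replicas, servers))  # prints 1, i.e. RF - 1
```

Brute-forcing every crash combination is obviously only feasible for toy cluster sizes, but it makes the F + 1 rule concrete: two replicas guarantee surviving one crash, three replicas guarantee surviving two, and so on.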
When an application is stateful, i.e., it holds data and modifies it as a result of operations (for example, message queues in an MQTT broker), it is not enough to start multiple instances on multiple nodes and be done. As soon as load comes in and changes the state of one of these instances, you effectively have two different applications instead of one. Hence, the data needs to be updated on the other instances as well so that, upon failure, the application as a whole can continue to operate as if nothing happened and no data was lost.
Maintaining the 'sameness' of the software across multiple instances boils down to maintaining the 'sameness' of the data that the software operates on. The challenge here is that the speed of light imposes a hard limit on how fast such updates can propagate through a distributed system. If the data on one instance is updated and the user got their confirmation that the data was updated, this does not necessarily mean that the replica of this data on another server was updated at the same time. The replica might be updated with only a small delay, but that delay can still be large enough for a client that reconnects to another server to miss its own update.
To mitigate such inconsistencies that may be viewed as data loss by some client applications, it is common to introduce roles for replicas and add synchronization mechanisms as part of data flow through the distributed system.
Replica Roles
There are in general two kinds of operations on the data that an application can perform: read and write. Write operations change the data. An example of a write operation could be an addition of a message to the queue of some subscriber. Both removing this message from the queue and marking it as read (!) are also write operations since they modify the state of the data.
The order of write operations matters for some applications. In general, it is not expected that a message will be removed from the queue before it is written to it. Well, not unless we replicate the queue.
Imagine an MQTT broker deployed on two servers with a replication factor of two (RF = 2), meaning that every incoming message is replicated on both servers. Each node has a single publishing client connected to it. The replicated persistent session of some offline subscriber is stored on both servers, and the subscriber is subscribed to the topics of both publisher 1 and publisher 2. Now imagine both publishers send a message at the same time. Because the subscriber is subscribed to both topics, both messages need to end up in its queue. The message from publisher 1 lands on node A first and is placed at the head of the queue there. In contrast, the message from publisher 2 lands on node B first and is thus placed at the head of the queue on node B. Then both messages are replicated and appended to the end of the queue on the other node. With that, we have two versions of the subscriber's queue! Depending on which node the subscriber connects to, the order of messages it sees will differ: either m1, m2 or m2, m1.
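This divergence is easy to reproduce in a few lines. The toy sketch below (not broker code) appends each node's locally received message first and the replicated one second:

```python
# Each node's view of the same subscriber queue.
queue_node_a, queue_node_b = [], []

# m1 arrives at node A first, m2 arrives at node B first.
queue_node_a.append("m1")   # local arrival on A
queue_node_b.append("m2")   # local arrival on B

# Each message is then replicated to the other node and appended there.
queue_node_a.append("m2")   # replicated from B
queue_node_b.append("m1")   # replicated from A

print(queue_node_a)  # ['m1', 'm2']
print(queue_node_b)  # ['m2', 'm1'] -- two versions of the same queue
```

Without a single replica deciding the order of writes, both interleavings are "correct" from each node's local point of view, which is exactly the problem.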
Such nondeterminism is generally frowned upon in data infrastructure systems, and here it would violate MQTT ordering guarantees. There are multiple ways to make the order of writes consistent across replicas. Some of these are more performant for message-oriented middleware systems (such as an MQTT broker), while others are less so. We will focus on those that perform well in high-throughput applications with a relatively stable configuration and low cross-node latency, e.g., deployments in data centers. To alleviate this nondeterminism, we introduce replica roles; specifically, a coordinator (also known as a master or a primary) and a follower (aka backup). Notably, these roles are only introduced in connection with the data or state that the system stores.
A coordinator is responsible for some subset of the data (called a shard or a partition), in the sense that only the replica with this role is allowed to modify this data directly. The followers may only perform a modification based on a command issued by the coordinator. The application instance with the coordinator role thus acts as a sequencer of all the writes that modify the given subset of the data.
The followers 'follow' the coordinator in the sense that they try to catch up with the state of the coordinator as soon as possible. This can be done in a push or a pull fashion; the major difference lies in which node performs more work and in the timeliness of updates. In this blog, we mean the push model, where the coordinator proactively sends modification requests to the followers. This model incurs more work on the coordinator but tends to reduce the time required to modify the data across the cluster.
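A minimal sketch of this push-based coordinator model might look as follows (the class names are illustrative assumptions, not HiveMQ internals): the coordinator assigns a sequence number to every write and pushes it to the followers, so all replicas apply writes in the same order:

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.log = []  # ordered list of (sequence_number, operation)

    def apply(self, seq, op):
        self.log.append((seq, op))

class Coordinator(Replica):
    def __init__(self, name, followers):
        super().__init__(name)
        self.followers = followers
        self.next_seq = 0

    def write(self, op):
        """Sequence the write locally, then push it to every follower."""
        seq = self.next_seq
        self.next_seq += 1
        self.apply(seq, op)              # modify own state first
        for follower in self.followers:  # push model: coordinator sends updates
            follower.apply(seq, op)
        return seq

f1, f2 = Replica("follower-1"), Replica("follower-2")
coord = Coordinator("coordinator", [f1, f2])
coord.write("enqueue m1")
coord.write("enqueue m2")
# Every replica now holds the writes in the same order.
assert coord.log == f1.log == f2.log
```

Because all writes for the partition funnel through a single sequencer, the two-publishers scenario from the previous section can no longer produce divergent queue orders.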
In contrast to having dedicated master nodes, introducing the coordinator role per data partition avoids the single point of failure (SPoF) issue that some distributed systems suffer from. Remember that we've made the coordinator responsible only for a subset of the data. Should the current coordinator go down, its role is passed on to one of the followers. This process is known as failover.
Since the followers closely track the data state of the coordinator, one of them can pick up from where the coordinator left off almost immediately. Of course, if, by coincidence, the coordinator and all its followers go down at the same time, then the data becomes unavailable regardless of how many more application instances are still operational. Putting this unlikely scenario aside, let's consider what happens when one of the replicas fails, whether a coordinator or a follower. Does that mean the replication factor for the given subset of data decreases, leaving the system able to handle one less failure from then on?
Should one of the replicas fail while other application instances remain in the system, one of those instances takes over responsibility for the piece of data that the failed instance was responsible for. The data then needs to be sent over from the remaining replicas (followers or the coordinator, depending on the design of the system) to the newly responsible instance to match the replication factor configured for the system. With that, the replication group, i.e., the set of replicas for the given subset of data, remains unchanged in size. This becomes true as soon as the new addition to the replication group catches up with its peers.
Let's see how this design is realized in HiveMQ MQTT Broker, a highly available, shared-nothing, stateful distributed system capable of maintaining hundreds of millions of concurrent connections and millions of MQTT messages per second.
Replication in HiveMQ Broker
There are two general kinds of replication in HiveMQ Broker: live replication and topology-change-induced replication. The topology-change-induced replication is triggered by the topology (composition) change in the cluster, such as adding or removing an instance of this application. It is characterized by substantial data transfer when compared to live replication. In contrast, live replication is performed for every piece of data that enters the cluster, most notably, subscriptions and PUBLISH packets.
Live Replication
To maintain high availability, HiveMQ Broker preserves RF replicas of every data item in the cluster as long as that remains possible, that is, as long as the number of servers is at least RF. If the count of nodes drops below the replication factor, HiveMQ Broker continues to operate and accept data to maintain continuity for the applications that rely on it (application availability). To guarantee that a piece of data is present in the cluster with the required replication factor (data availability), HiveMQ Broker waits to collect all replication acknowledgements before acknowledging to the client that the data has been persisted in the cluster. The diagram below shows how an incoming message is replicated in the broker.
Once the publishing client has sent a PUBLISH packet and HiveMQ has determined that there is a subscriber interested in it (step 1), the HiveMQ instance that received it (node 4 in the diagram above) determines which nodes store the session of the subscriber, picks the coordinator (aka master) node among them (node 1 in this case), and sends the PUBLISH to it (step 2).
Once the leader of the partition has persisted the PUBLISH for delivery, it replicates the message to the followers (nodes 2 and 3) in step 3. Once the leader has collected the acknowledgements of the followers (step 4), it confirms to the node that originally received the message and is connected to the client that the message has been persisted to disk (step 5). In turn, the connected node (node 4) confirms to the client that the message has been persisted in the HiveMQ cluster with the desired level of guarantees.
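The five steps above can be sketched as a simple simulation (node names taken from the diagram; the function and its event log are illustrative assumptions, not HiveMQ's API). The key point is that the client acknowledgement is sent only after ALL replication acknowledgements have been collected:

```python
def live_replicate(publish, receiving_node, coordinator, followers):
    """Simulate the live replication steps for one incoming message."""
    events = []
    events.append(f"step 1: {receiving_node} receives {publish}")
    events.append(f"step 2: {receiving_node} forwards {publish} to {coordinator}")
    acks = []
    for follower in followers:
        events.append(f"step 3: {coordinator} replicates {publish} to {follower}")
        acks.append(follower)  # step 4: the follower acknowledges
    # The coordinator must wait for ALL acks before confirming persistence.
    assert len(acks) == len(followers)
    events.append(f"step 4: {coordinator} collected {len(acks)} acks")
    events.append(f"step 5: {coordinator} confirms persistence to {receiving_node}")
    events.append(f"finally: {receiving_node} acknowledges to the client")
    return events

for line in live_replicate("PUBLISH", "node 4", "node 1", ["node 2", "node 3"]):
    print(line)
```

Waiting for every follower's acknowledgement is what turns "the coordinator has the message" into "the cluster has RF copies of the message" before the client is told anything.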
As we can see, live replication requires multiple steps before the client gets a confirmation that the data has been modified. That is how HiveMQ Broker ensures high availability of the data present in the cluster. Once the client receives an acknowledgement that the data is stored in the cluster, only the simultaneous loss of RF nodes would lead to guaranteed data loss.
Although the guarantees of the described live replication might seem overly strict for less demanding use cases, such as streaming metrics into a database, use cases that require bidirectional messaging and have side effects in the real world (like opening and closing car doors) benefit from such guarantees.
Failover
The masterless cluster architecture of HiveMQ Broker, where each data partition maps to its own 'leader', lends itself naturally to failover. The data in HiveMQ Broker is partitioned to ensure horizontal scalability. When configuring a HiveMQ cluster, one chooses the replication factor (RF), which is the number of times the same data partition is stored in the cluster, each copy on a different node. A replication factor of 3, for example, means that every data partition is present on 3 servers, one of which becomes the leader for the partition to ensure the ordering of writes, while the 2 others become followers. Every node in the HiveMQ cluster is a leader for some data partitions and a follower for others.
What happens if one of these servers goes down? If it is one of the followers, the HiveMQ cluster detects that the corresponding data partitions are under-replicated and, if the number of remaining servers allows it, sends the corresponding data to the node that is next in the so-called hashring — a concept that we've explored in great depth in the article Upscaling with MQTT Workloads Through Data Partitioning. In the meantime, while the replication is ongoing, the cluster serves traffic as usual.
The failure of a partition master is also not the end of the world. If a partition master fails, there are still RF - 1 remaining followers with the same or slightly lagging data. HiveMQ Broker takes advantage of this fact by promoting the follower that is next to the crashed master in the hashring to be the new master. That way, the broker ensures that failover to the new master happens as quickly as possible, allowing the cluster to maintain high data availability. In addition, HiveMQ Broker starts replicating the missing data to another node (a new follower) to mitigate the temporary under-replication as soon as possible.
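Under stated assumptions (made-up ring positions; the real hashring is described in the linked article), the placement and promotion logic can be sketched as follows: the leader of a partition is the first live node at or after the partition's ring position, and when it dies the next live node on the ring takes over:

```python
import bisect

# Hypothetical nodes with hypothetical positions on the hashring.
RING = sorted([("node-A", 10), ("node-B", 40), ("node-C", 70), ("node-D", 90)],
              key=lambda n: n[1])

def replicas(partition_pos, rf, dead=()):
    """Walk the ring clockwise from the partition's position, skipping dead nodes."""
    names = [name for name, _ in RING]
    points = [point for _, point in RING]
    # First node whose ring position is at or after the partition's position.
    start = bisect.bisect_left(points, partition_pos) % len(names)
    chosen = []
    for i in range(len(names)):
        node = names[(start + i) % len(names)]
        if node not in dead:
            chosen.append(node)
        if len(chosen) == rf:
            break
    return chosen  # chosen[0] is the leader, the rest are followers

print(replicas(35, rf=3))                    # leader is node-B
print(replicas(35, rf=3, dead={"node-B"}))   # node-C is promoted to leader
```

Note that when the leader dies, its immediate successor on the ring was already a follower holding the partition's data, which is why the promotion can happen almost instantly, while the node that newly joins the replica set at the tail still has to receive the data.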
Conclusion
Availability and fault tolerance set the lower bound on the number of servers in a cluster. Even if the use case or the performance of the application is such that one server would suffice, fault tolerance drives the number of replicas up to match the desired service and data availability guarantees. Replication in modern systems is a necessary evil that accounts for the fact that hardware is not perfectly reliable: disks get corrupted, network outages happen, software updates can wreak havoc, and data centers can catch fire or be destroyed in an earthquake. Replicating the service and the data is the only known way to cope with the challenges that the real world poses.
Having data partitions replicated in the cluster allows HiveMQ MQTT Broker to provide continuous availability even if the master node for a given data partition becomes unavailable (crashed or cut off by a network partition). The failover from the old master to a new one selected among the followers happens seamlessly and does not cause major disruption to the traffic passing through the cluster. Since HiveMQ Broker distributes data partitions across the servers such that no single server contains more than one replica of the same partition, hardware outages are something HiveMQ Broker can tolerate while continuing to provide service and keeping the data safe.
References
[1] Or, more specifically, redundant.
[2] Let’s call them stateful applications.
Dr. Vladimir Podolskiy
Dr. Vladimir Podolskiy is a Senior Distributed Systems Specialist at HiveMQ. He is a software professional who possesses practical experience in designing dependable distributed software systems requiring high availability and applying machine intelligence to make software autonomous. He enjoys writing about MQTT, automation, data analytics, distributed systems, and software engineering.