Masterless Clustering of MQTT Broker for Business Continuity
The HiveMQ platform offers a diverse set of integrated software components to move and transform data. The platform supports business continuity and integrates with existing systems in domains ranging from connected vehicles to smart manufacturing.
The beating heart of the HiveMQ platform is a clustered MQTT broker, a shared-nothing distributed system with a masterless architecture. The unique design of the HiveMQ broker allows it to support enterprise-grade availability standards, ensuring uninterrupted operations even in the case of infrastructure failures. In the following paragraphs we will delve into how key architectural choices contribute to the HiveMQ broker's ability to maintain the data flow without interruptions.
Let’s start with the basics.
High Availability with Clustered MQTT Broker
What is a clustered broker and why does it matter to have one? The second question helps to answer the first, so let's focus on it first. Clustered brokers emerged to address the market demand for reliable data delivery, that is, delivery without data loss and within a reasonable amount of time. Regardless of how good modern software and hardware are, this promise can hardly be fulfilled by any broker running on a single server. If there is an infrastructure outage, or something goes awry in the underlying systems-level software such as a virtualization platform or an operating system, the service will be gone and the business process will stall. Although modern MQTT brokers can store data on disk[1] to avoid data loss upon restart, end users will still notice the downtime if the broker runs on a single server.
The capability of a software system to continuously serve load while fulfilling its defined quality requirements despite failures and outages is frequently referred to as availability. Availability is achieved by designing the system to tolerate various kinds of faults, that is, to provide the desired level of service even if something goes terribly wrong.
What would one do to make a system available? The first answer that may pop into one's mind is to avoid faults and bugs. However desirable, this 'solution' is highly implausible. Effectively, one would need to ensure that the entire hardware and software stack, as well as the surrounding infrastructure (such as the power grid), is 100% reliable and will not succumb to any kind of failure, including earthquakes, floods, and cosmic radiation. Sounds like a tall order, doesn't it? Because it is.
A far cheaper and considerably more manageable alternative is widely employed in mission-critical systems such as aircraft and nuclear power plant control systems: redundancy. Mission-critical systems implement redundancy by running several duplicates of the same device in parallel, performing the same set of operations.
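To see why redundancy pays off, consider a back-of-the-envelope calculation (assuming, for simplicity, that servers fail independently, which real deployments only approximate). If a single server is available a fraction a of the time, then n redundant servers are all down at once only with probability (1 - a)^n:

```latex
A(n) = 1 - (1 - a)^n
% e.g. a = 0.99: A(1) = 99%    (roughly 3.7 days of downtime per year)
%                A(2) = 99.99% (roughly 53 minutes of downtime per year)
```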
Flight control systems, for example, may employ two or three processor pairs to perform airplane control tasks. Even older natural systems such as the human body exploit this idea by offering two hands instead of just one. So, to get higher availability, one has to add servers running the same software and make sure that the data is replicated across them, that is, each server holds a copy of the same data. In that case, even if a server goes down, another one is ready to serve within a short amount of time. This switch from a failed server to a healthy one is called failover. The systems that relied on the crashed server continue to operate because they reconnect to a healthy one holding the data that they need.
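From the client's perspective, failover can be as simple as an automatic reconnect. Below is a minimal sketch using the open-source HiveMQ MQTT Client for Java; the host name and client identifier are made up, and we assume a load balancer or DNS entry in front of the cluster so the client does not need to know individual servers:

```java
import com.hivemq.client.mqtt.mqtt5.Mqtt5AsyncClient;
import com.hivemq.client.mqtt.mqtt5.Mqtt5Client;

public class FailoverClient {
    public static void main(String[] args) {
        // The client connects through a single address; behind it, any healthy
        // cluster node can take over if the one serving us crashes.
        Mqtt5AsyncClient client = Mqtt5Client.builder()
                .identifier("sensor-42")              // hypothetical client id
                .serverHost("broker.example.com")     // hypothetical cluster entry point
                .serverPort(1883)
                // Reconnect automatically with exponential backoff: this is the
                // client-side half of failover.
                .automaticReconnectWithDefaultConfig()
                .buildAsync();

        client.connect().whenComplete((connAck, throwable) -> {
            if (throwable == null) {
                System.out.println("Connected; a node failure now triggers a transparent reconnect");
            }
        });
    }
}
```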
Redundancy comes with a price tag. Instead of operating just one program, one would have to operate multiple copies of the same program that share the data. Effectively, any software system has to become distributed in order to cope with all kinds of possible outages and service disruptions. At the same time, the distributed system, like the HiveMQ broker, needs to maintain a perception of a single highly available and reliable program. In fact, the use cases relying on the broker only care about the services that it provides and the guarantees on different characteristics of these services, such as the number of times the same message is delivered to the client. The fact that the HiveMQ broker has a degree of redundancy to ensure fault-tolerance is completely transparent to users.
To sum things up, the broker needs to be distributed but at the same time pretend that it is just a single program. One may be tempted to call this behavior disingenuity, but in distributed systems design it is called clustering.
Clustering means running multiple copies of the same system that jointly maintain the desired level of functionality, even in the face of failures, while acting as a single entity. Thus, the HiveMQ broker supports clustering to achieve high availability via redundancy.
Stateful Broker for Higher Durability
To ensure message delivery from publishers to subscribers with the guarantees specified in the protocol, an MQTT broker has to be stateful. A stateful software system keeps and maintains its data even when clients disconnect (persistent sessions) and failures occur (a server going down). The opposite of a stateful system is a stateless one, which does not keep information about the connection: once the connection is gone, the information about it is lost. This is useful for some applications, such as simple business-card-like websites whose sole purpose is to provide the same information to every visitor. By design, stateless systems are easier to support and maintain than stateful software.
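To make 'state' tangible, MQTT exposes it to clients through persistent sessions. The sketch below, again using the HiveMQ MQTT Client for Java with illustrative names and values, asks the broker to keep the session, and with it the subscription and any queued QoS 1 messages, for an hour after a disconnect:

```java
import com.hivemq.client.mqtt.datatypes.MqttQos;
import com.hivemq.client.mqtt.mqtt5.Mqtt5BlockingClient;
import com.hivemq.client.mqtt.mqtt5.Mqtt5Client;

public class PersistentSession {
    public static void main(String[] args) {
        Mqtt5BlockingClient client = Mqtt5Client.builder()
                .identifier("order-processor-1")   // stable id: the broker keys the session by it
                .serverHost("broker.example.com")  // hypothetical host
                .buildBlocking();

        client.connectWith()
                .cleanStart(false)            // resume existing session state, if any
                .sessionExpiryInterval(3600)  // broker keeps the session for 1 hour after disconnect
                .send();

        // The broker now stores this subscription and queues matching messages
        // while the client is disconnected -- that stored data is "state".
        client.subscribeWith()
                .topicFilter("orders/#")
                .qos(MqttQos.AT_LEAST_ONCE)
                .send();
    }
}
```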
The HiveMQ broker is stateful. It keeps information about the connections, including the messages to be delivered. Designers of stateful software systems follow one of two major roads to achieve fault tolerance. One option is to put the data into some decoupled shared storage, ideally another distributed system that supports redundancy. The other option is to keep the state on every server that runs the software, effectively incorporating a distributed storage layer into the broker. Both approaches have their benefits and drawbacks and are chosen according to the specifics of the applications relying on the broker.
The shared storage approach is generally implemented by persisting the data in some third-party storage system that is separate from the MQTT broker. For example, one may utilize Amazon S3 or more specialized solutions like Amazon DynamoDB to persist the replicated state. Selecting the shared storage approach helps to accelerate horizontal scaling of the system (i.e. adding and removing servers), for example, when autoscaling to match an increase or reduction in load. However, following this route will likely reduce availability because the functionality of the system now depends on yet another component and on connectivity to it. Minimizing the moving parts is a practical and well-known strategy for increasing the resilience of a complex system. In addition, reads and writes to the decoupled shared storage may not be sufficiently fast for a given use case, and the speed of these operations may vary a lot depending on the cloud platform and the shared storage of choice. For a self-hosted deployment, one would end up operating two systems instead of one, which adds a visible operational burden [2].
Mindful of its customers' diverse use cases and strict requirements, HiveMQ designed its MQTT broker to follow another well-trodden path in systems software design: a shared-nothing distributed architecture with the distributed storage layer embedded into the broker itself. A shared-nothing architecture requires the data to be available locally on each server (its attached storage) [3]. This is a boon for both the availability and the performance of the system during normal operation, which is how the system spends most of its time.
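The contrast between the two designs can be sketched in code. The interface and class names below are invented for illustration and do not mirror HiveMQ's internals:

```java
// Invented abstraction: where does the broker's replicated state live?
interface MessageStore {
    void persist(String clientId, byte[] message);
}

// Shared-storage approach: every write crosses the network to an external
// system (e.g. a DynamoDB-like service), adding a dependency and latency.
final class SharedStorageStore implements MessageStore {
    private final RemoteStorageClient remote; // hypothetical client for the external store

    SharedStorageStore(RemoteStorageClient remote) { this.remote = remote; }

    @Override
    public void persist(String clientId, byte[] message) {
        remote.put(clientId, message); // network round trip; fails if the store is unreachable
    }
}

// Shared-nothing approach: the write lands on local disk first; replication
// to other brokers happens inside the cluster itself.
final class LocalDiskStore implements MessageStore {
    @Override
    public void persist(String clientId, byte[] message) {
        writeToLocalDisk(clientId, message); // fast local I/O, no external dependency
    }

    private void writeToLocalDisk(String clientId, byte[] message) { /* local file write */ }
}

interface RemoteStorageClient { void put(String key, byte[] value); }
```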
As with almost everything in engineering, the advantages of shared-nothing architecture with the embedded distributed storage layer come with a catch.
First of all, one has to account for longer cluster composition changes, since data has to be sent over the network and persisted to disk whenever a server is added to or removed from the cluster. The HiveMQ broker uses a clever data distribution algorithm that reduces this time, on average, to a practically tolerable minimum of a few seconds to several minutes, depending on the load.
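This article does not spell out HiveMQ's exact algorithm, but consistent hashing is a well-known technique with exactly this property: when a server joins or leaves, only the keys adjacent to it on the hash ring need to move. A simplified sketch (production-grade rings add virtual nodes and stronger hash functions):

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Minimal consistent-hash ring: nodes and keys are hashed onto the same
// circle; a key belongs to the first node clockwise from its hash.
final class HashRing {
    private final SortedMap<Integer, String> ring = new TreeMap<>();

    void addNode(String node) { ring.put(hash(node), node); }
    void removeNode(String node) { ring.remove(hash(node)); }

    String nodeFor(String key) {
        SortedMap<Integer, String> tail = ring.tailMap(hash(key));
        // Wrap around to the ring's first node when the key hashes past the last node.
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
    }

    private int hash(String s) {
        return s.hashCode() & 0x7fffffff; // toy hash; real rings use stronger hashes
    }
}
```

With such a ring, adding a fourth node to a three-node cluster relocates only the keys that now hash between the new node and its predecessor, roughly a quarter of the data rather than all of it, which keeps the amount of data to transfer bounded.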
The next catch is that keeping all the necessary data locally may strain the storage resources available on the server. In practice, this rarely happens, since the majority of the data that materializes on the broker's drives is transient, meaning that it stays only temporarily in subscribers' queues and then leaves the broker. On top of that, the capacity of modern hard drives surpasses the volumes that even the most demanding use cases are capable of generating at the time of writing.
Last but not least, shared-nothing architecture is notorious for the toll it takes to keep the data consistent. Remember redundancy? We have replicated the broker and the data across multiple machines, but we want it to behave as a single entity. So, we want all the machines to have exactly the same data at each point in time. Unfortunately, we live in a world where the speed of light is finite, hence immediate updates across multiple locations are not possible [4]. Implementing and maintaining shared-nothing distributed systems at an enterprise level of quality is a challenging engineering exercise, but the numerous advantages that this architecture offers in support of our customers' business continuity make HiveMQ follow this thorny path. A very peculiar challenge that one faces on this path is keeping the data as consistent as possible for as long as possible.
Data Consistency: An Elephant in the Room
When introducing replication for fault tolerance, one stumbles upon the problem of keeping the data consistent and correctly reflecting, on all the servers, the order of the operations that change it. What does consistency mean in distributed systems and why is it important? To avoid a lengthy academic discussion, let's look at a plain example of a distributed system: a crossroads with traffic lights. Imagine an intersection of two roads, one going from north to south, the other from east to west, both with two lanes. The crossroads is equipped with four traffic lights, each facing one of the roads before the junction (see the image below). The arrows starting at the traffic lights depict the directions that they face.
If the distributed system of four traffic lights functions correctly, then:
- The state of the two traffic lights that face North and South is consistent, specifically, it is the same at every moment in time.
- The state of the two traffic lights that face West and East is consistent, specifically, it is the same at every moment in time.
- The states of the two pairs of traffic lights, the pair managing the North-South road and the pair managing the West-East road, are consistent with each other in the sense that when one pair shows green, the other shows red.
These conditions define what it means for the state of traffic lights on the crossroads to be consistent if we consider the crossroads and the traffic lights to be a distributed system.
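Expressed as code, these conditions are simply invariants over the replicated state. Here is a toy model (not broker code) that checks them:

```java
enum Light { RED, GREEN }

// Toy model: the four traffic lights as one piece of replicated state.
record Crossroads(Light north, Light south, Light west, Light east) {

    // The three consistency conditions from the list above.
    boolean isConsistent() {
        return north == south   // condition 1: North-South pair agrees
            && west == east     // condition 2: West-East pair agrees
            && north != west;   // condition 3: the two pairs show opposite colors
    }
}

// new Crossroads(RED, RED, GREEN, GREEN).isConsistent() -> true
// new Crossroads(RED, RED, RED, GREEN).isConsistent()   -> false (conditions 2 and 3 violated)
```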
Now, what does it mean for this system to be inconsistent? Let's consider an example where both the first and the second conditions are violated; as a consequence, the third condition is violated too. For instance, the West-facing traffic light could show red while the East-facing traffic light shows green. If the East-West road were isolated, this wouldn't be a severe issue: it would merely mean that the cars coming from the West are stuck in a traffic jam while the cars coming from the East proceed safely. However, if there is a crossroads and we set the North-facing traffic light to red and, more importantly, the South-facing traffic light to green, then we may get a collision of cars coming from the East with cars coming from the South (see the image below).
As one can see, inconsistency in the traffic lights regulating a crossroads can have pretty dire consequences. Similar (although, in most cases, less dramatic) effects can take place in stateful software systems that process data if consistency is violated. Specifically, if the replicated data is being processed by two servers independently (perhaps the network between them is down, so they don't see each other), then the data may end up in a different state depending on which server it is on. Consistency is especially important in banking applications, where the consequence of inconsistency could be a double withdrawal from an account, or the bank effectively generating money for its customers out of thin air.
Having taken a peek into the consistency problem, let’s return to the HiveMQ broker and study how it ensures consistency of the data that passes through it.
Masterless Clustering for Better Performance and Higher Resilience
In the distributed shared-nothing architecture that the HiveMQ broker employs, consistency of the data can be achieved in multiple ways. One way is to designate a server or a set of servers to carry the so-called sequencing responsibility. The sequencer orders operations on the data, most importantly the writes that modify it, but it can also order reads, depending on the use case. The idea is that all the operations that pass through the sequencer will be applied in the same order on all the replicas of the data in the distributed system.
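A sequencer can be pictured as a single point that stamps every write with a monotonically increasing number, which replicas then apply in stamp order. A bare-bones illustration (not any specific product's design):

```java
import java.util.concurrent.atomic.AtomicLong;

// Every write in the system funnels through this single component,
// which assigns it a position in one total order.
final class Sequencer {
    private final AtomicLong nextSeq = new AtomicLong();

    SequencedWrite order(String key, byte[] value) {
        return new SequencedWrite(nextSeq.getAndIncrement(), key, value);
    }
}

// Replicas apply writes strictly in ascending seq order, so all of them
// converge to the same state regardless of network delivery order.
record SequencedWrite(long seq, String key, byte[] value) {}
```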
Although the architecture with a designated sequencer is intuitively clear, it has a few significant drawbacks. Firstly, it introduces additional latency to every operation that passes through it. Next, it becomes a bottleneck in the system, since at least all the writes have to pass through it, and a failure of one of the sequencer nodes would affect the whole system. Last but not least, having a separate sequencer results in additional operational burden. Whenever one sees special-purpose servers in a system, one needs to be aware that the system also inherits all the listed disadvantages. In the figure below, the special-purpose master nodes fulfill the role of the sequencer and order all the data modification operations that are replicated on the followers.
Modern distributed systems, such as Kafka and the HiveMQ broker, attempt to get rid of external sequencers (master nodes) by embedding the sequencing functionality into every server and allowing each to perform it on a subset of the data. Effectively, this makes every HiveMQ broker instance self-reliant, since each can perform independent sequencing of operations. This kind of design is known as masterless architecture.
The masterless architecture of a distributed system can be implemented in multiple ways, but all of them boil down to solving some form of the well-studied consensus problem. One of the most robust ways is to seek consensus on every single write operation that passes through the cluster. Although in some applications this might be acceptable, in high-throughput, low-latency MQTT use cases seeking agreement on every write operation turns out to be unacceptably slow.
As a practical and feasible alternative, the HiveMQ broker embraces the model of partial synchrony, which is largely applicable to applications deployed in data centers with high-bandwidth networks. This model implies that most of the time the system behaves as a synchronous system, that is, every message sent arrives at its destination within some fixed amount of time. This assumption turns out to hold in practice and thus allows the consensus problem to be solved more efficiently. Adopting partial synchrony allows kicking the consensus ball down the road: instead of running an instance of the consensus protocol for every operation, the system may seek consensus on the cluster composition or on the roles of the nodes (which node is responsible for which data range). The crux of this approach is that consensus must be performed only when the cluster composition changes. Such changes occur on a timescale that is orders of magnitude larger than the intervals between message arrivals. For example, scaling the deployment up by adding new broker instances and down by removing them to accommodate spikes in load may happen in the morning and evening hours (twice a day) for end-user applications, whereas thousands of messages continue to arrive every second.
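Conceptually, consensus then produces a rarely changing membership 'view' that ordinary writes merely read, as in the sketch below (the names are invented and the real machinery is far more involved):

```java
import java.util.List;

// Agreed upon via consensus only when servers join or leave; the epoch
// lets nodes detect stale views.
record MembershipView(long epoch, List<String> nodes) {}

final class Cluster {
    private volatile MembershipView view =
            new MembershipView(1, List.of("node-a", "node-b", "node-c"));

    // Hot path: no consensus round, just a read of the current agreed view.
    void write(String key, byte[] value) {
        MembershipView v = view; // route the write to the node responsible for `key` under `v`
        // ... send the write to its master; the master replicates it to followers ...
    }

    // Cold path: a consensus round runs, e.g. during twice-a-day scaling.
    void onMembershipChange(MembershipView agreed) {
        view = agreed;
    }
}
```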
HiveMQ implements the masterless architecture by performing consensus only on cluster membership, in the sense of which servers hold which data they are responsible for (as a master). Once all the nodes agree on the cluster composition, there is no need to run consensus for every operation. Every server becomes independent in its operations over some range of identifiers, for example, unique client identifiers (or, more broadly, keys). That means every HiveMQ instance serves as a de facto master over a range of keys, so every server is effectively a master and none of them is the master for all the keys. Unless the replication factor is set to 1 (which means no replication and, thus, no fault tolerance), every node combines the role of master for one range of keys with the role of follower for other ranges of keys. A follower does not serve writes issued by clients directly; only the master of the corresponding identifier range can replicate writes to it. The picture below schematically shows a masterless HiveMQ cluster with a replication factor of 3 (each data entity has 3 copies in the cluster), that is, every server is a master for one range of data and a follower for two other ranges of data.
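Putting the pieces together, replica placement for a replication factor of 3 can be derived from a hash ring like the one sketched earlier: the first node clockwise from a key is its master, and the next two distinct nodes are its followers. Again, this is a simplified illustration rather than HiveMQ's actual code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

final class ReplicatedRing {
    private final SortedMap<Integer, String> ring = new TreeMap<>();

    void addNode(String node) { ring.put(hash(node), node); }

    // Walk the ring clockwise from the key's position, wrapping around,
    // until replicationFactor distinct nodes are collected.
    List<String> replicasFor(String key, int replicationFactor) {
        List<String> replicas = new ArrayList<>();
        List<String> clockwise = new ArrayList<>(ring.tailMap(hash(key)).values());
        clockwise.addAll(ring.values()); // wrap around the ring
        for (String node : clockwise) {
            if (!replicas.contains(node)) {
                replicas.add(node);
                if (replicas.size() == replicationFactor) break;
            }
        }
        return replicas; // replicas.get(0) is the master, the rest are followers
    }

    private int hash(String s) { return s.hashCode() & 0x7fffffff; } // toy hash
}
```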
The masterless architecture allows the cluster to grow indefinitely in size. Even though, for the purpose of clarity, we do talk about HiveMQ instances as if they had master and follower roles, it is important to understand that in the masterless architecture these roles are temporary and bound to specific key ranges; they are not universal. The main implication is that regardless of which servers crash, as long as at least one server remains, the HiveMQ broker will continue to operate. Thus, the masterless broker architecture remains the best choice for use cases that rely on guaranteed continuous data movement.
References
[1] A common alternative in the older generation of brokers is storing the messages in main memory. Although keeping messages in main memory allows for very low latency, the implication is that the data will inevitably be lost upon a broker restart. The HiveMQ broker offers durability by storing both the messages and the subscriptions on disk.
[2] The added operational toll and the cost of running two systems instead of one were two major determinants for Kafka to drop ZooKeeper in favor of its own embedded KRaft protocol.
[3] This does not mean that shared-nothing systems store all of their data on every server. In practice, the data is first partitioned (split into subsets) and then replicated. For the sake of simplicity, we do not discuss partitioning in the broker.
[4] We will not dive into quantum entanglement effects and their applicability to distributed systems in this article.
Dr. Vladimir Podolskiy
Dr. Vladimir Podolskiy is a Senior Distributed Systems Specialist at HiveMQ. He is a software professional who possesses practical experience in designing dependable distributed software systems requiring high availability and applying machine intelligence to make software autonomous. He enjoys writing about MQTT, automation, data analytics, distributed systems, and software engineering.