HiveMQ Now Delivers 80% Higher MQTT Throughput
Infrastructure costs tend to creep up over time. As your use case scales and puts more load on the application, it becomes increasingly painful to add yet another beefy EC2 instance that blows your already strained budget. With all the money you spend on licenses or subscriptions for various kinds of middleware, such as an MQTT broker, it is natural to expect this software to be highly efficient in its resource usage. Unfortunately, many middleware vendors tend to focus excessively on features and integrations while neglecting efficiency altogether.
Therefore, we are proud to announce that, as of HiveMQ 4.18, your many-core instances (32+ cores) can handle almost double the load they handled before the update. For the same infrastructure cost, you now get an additional 80% of throughput. If you previously had to spend $10,100 per month to run a cluster of 10 on-demand AWS m6a.8xlarge [1] instances for a use case with a total of 540,000 messages per second pumped through the cluster, then with HiveMQ 4.18 you can cut your cloud spending down to roughly $6,055 per month [2]: at roughly 1.8 times the per-node throughput, six instances can now carry the load that previously required ten. This increase might appear unbelievably high, so we want to showcase the improvement and what led to it.
Improving Intra-Cluster Messaging to Boost Throughput
Our HiveMQ 4.18 benchmark with 20,000 clients achieved a total of 270,000 QoS 1 PUBLISH packets per second pumped through a cluster of 3 m6a.8xlarge AWS EC2 instances, compared to a mere 150,000 messages per second on 4.17. This 1.8x increase in throughput, visible in the graphs below, became possible due to several optimizations in the core cluster transport of the HiveMQ broker. The evaluation was conducted using our in-house load-testing tool, HiveMQ Swarm, which is capable of effortlessly running high-scale load-testing scenarios.
To get a better picture of what our engineering team did to deliver these improvements for large-scale use cases, we need to shed light on the core of the HiveMQ broker: the cluster transport.
Cluster transport refers to all intra-cluster communication (intra meaning inside, within). This communication differs from the communication between client and broker because it serves a different purpose. The broker implements the MQTT protocol to communicate with clients and to provide them with the guarantees and features of the protocol. Within the cluster, however, nodes talk to each other quite differently, since the purpose of intra-cluster communication is to ensure the highest possible reliability and to cope with failures of nodes or of the network, neither of which is a concern of MQTT. The picture below emphasizes this distinction between node-to-node and node-to-client communication.
The HiveMQ broker implements a shared-nothing distributed architecture to ensure high availability and reduce the impact of node failures. As a result, multiple requests are sent within the cluster for every message that arrives from a client or is sent to one, the major cause being data replication. In addition, intra-cluster communication carries streams for cluster membership management (heartbeats, node discovery messages, and so on), metrics collection, client session updates, and various other maintenance functions.
Reliability of intra-cluster messaging is ensured by running all communication over TCP [3] and adding an extra layer on top of it to survive connection resets caused by temporary communication issues in the network. This extra layer functions similarly to TCP but keeps its guarantees across multiple instances of the connection between the same pair of nodes. It does so by retransmitting messages that have not been acknowledged by their respective receiver.
In normal situations, messages travel smoothly to their destination without needing to be sent again; data center networks usually have plenty of capacity to handle this. However, during busy periods when lots of data is being sent, the network might get overwhelmed, and some messages might take longer to process. If the sender doesn't receive confirmation that a message was received, it might assume the receiver didn't get it and try to send it again. Eventually, when the network is working well again, the receiver confirms that it got the message by replying to the sender. This approach works because busy periods are temporary; they eventually end, and the network goes back to its normal, less busy state. A real-world example is when many people use connected car services during their morning commute, causing a high load, but later in the day, when they park their cars, the load goes back to normal.
This is what happens in theory. In practice, we saw something completely different during one of our many performance evaluations of the broker under high load. As the diagrams below show, retransmissions turned out to consume a whopping 16-20% of total cluster throughput! Not content with these results, we started to drill into what could have caused this astonishing surge in retransmissions.
To get down to the technical root cause of this behavior, let’s dive deeper into how cluster transport works in the HiveMQ broker.
Every cluster node (every broker instance) can act both as a sender and as a receiver of messages. Similarly to TCP, the sender assigns a steadily increasing number, known as a sequence number, to each message and stores these numbers in a local data structure called the sent table. On the other side of the communication channel, the receiver (another node) adds these same sequence numbers, which it extracts from the messages, to its own data structure, the received table. If the receiver encounters a gap in the sequence numbers of received messages, for example, it received messages 1, 2, and 5 but not messages 3 and 4, it requests a retransmission of the missing messages. The receiver also acknowledges the highest sequence number up to which it has received all messages (in this example, 2).
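To make the receiver's side of this mechanism more concrete, here is a minimal, simplified sketch of how a received table might track sequence numbers, detect gaps, and compute the cumulative acknowledgement. All class and method names are illustrative; this is not the actual HiveMQ cluster transport code.

```java
import java.util.SortedSet;
import java.util.TreeSet;

/**
 * Simplified receiver-side bookkeeping: tracks sequence numbers of received
 * messages, reports gaps (missing sequence numbers), and exposes the highest
 * sequence number up to which everything has arrived.
 * Illustrative only; not the actual HiveMQ implementation.
 */
class ReceivedTable {

    private final SortedSet<Long> received = new TreeSet<>();
    private long highestContiguous = 0; // all messages with seqNo <= this value have arrived

    /** Records an incoming sequence number and advances the cumulative acknowledgement. */
    void onMessage(long seqNo) {
        received.add(seqNo);
        // Advance the contiguous prefix as far as the recorded numbers allow.
        while (received.contains(highestContiguous + 1)) {
            received.remove(highestContiguous + 1);
            highestContiguous++;
        }
    }

    /** Sequence number to acknowledge: everything up to and including it was received. */
    long cumulativeAck() {
        return highestContiguous;
    }

    /** Sequence numbers that are missing and should be requested for retransmission. */
    SortedSet<Long> missing() {
        SortedSet<Long> gaps = new TreeSet<>();
        long expected = highestContiguous + 1;
        for (long seqNo : received) {
            for (long gap = expected; gap < seqNo; gap++) {
                gaps.add(gap);
            }
            expected = seqNo + 1;
        }
        return gaps;
    }
}
```

With the example from above, after calling `onMessage(1)`, `onMessage(2)`, and `onMessage(5)`, `cumulativeAck()` returns 2 and `missing()` returns {3, 4}, which is exactly what the receiver would acknowledge and request for retransmission.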
To send a message to another broker instance, a sender node first adds the message to a local send queue. If this queue is full, the sender has to wait until room frees up, that is, until some other message from the queue is finally sent.
On the other end, received messages are added to a local receive queue on the receiver node. When this queue is full, incoming messages are dropped.
Once the messages are polled from the receive queue, acknowledgements are added to the send queue to notify the sender that its messages were successfully received.
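The queue behavior described above can be pictured with standard bounded queues: the sender waits when the send queue is full, the receiver drops anything that does not fit into the receive queue, and acknowledgements travel through the same send queue as regular messages. This is a deliberately simplified sketch of the pre-4.18 flow, with all names and sizes chosen for illustration only.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/**
 * Simplified model of the original queue behavior: a bounded send queue that
 * makes the sender wait, and a bounded receive queue that drops messages once
 * it is full. Illustrative only; not the actual HiveMQ cluster transport code.
 */
class ClusterQueues {

    private final BlockingQueue<Object> sendQueue = new ArrayBlockingQueue<>(10_000);
    private final BlockingQueue<Object> receiveQueue = new ArrayBlockingQueue<>(10_000);

    /** Sender side: waits until the bounded send queue has room. */
    void enqueueOutgoing(Object message) throws InterruptedException {
        sendQueue.put(message); // blocks while the queue is full
    }

    /** Receiver side: a full receive queue means the message is dropped. */
    void enqueueIncoming(Object message) {
        boolean accepted = receiveQueue.offer(message); // returns false instead of blocking
        if (!accepted) {
            // The dropped message shows up as a gap on the receiver
            // and eventually triggers a retransmission by the sender.
        }
    }

    /** Processing loop: acknowledgements travel through the same send queue. */
    void processIncoming() throws InterruptedException {
        Object message = receiveQueue.take();
        Object ack = buildAckFor(message);
        sendQueue.put(ack); // blocks if the send queue is full
    }

    private Object buildAckFor(Object message) {
        return "ACK"; // placeholder
    }
}
```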
In high-load scenarios, the send queue can quickly fill up. The intention behind this design was to prevent the broker from running out of memory. However, because acknowledgements pass through the same send queue, the unintended consequence of this design choice is that the receiver cannot add acknowledgements to the send queue once it is full:
Not seeing an acknowledgement for the message it sent, the sender node attempts to send the message again after the timeout elapses.
As a result, the receive queue is pushed to its limit and ends up dropping messages, thus producing even more retransmissions.
Essentially, the unintended consequence of these design choices is a self-reinforcing feedback loop that produces more retransmissions because the receiver cannot cope with the vast number of retransmissions already issued.
To break this cycle, we redesigned the acknowledgement mechanism to send acknowledgements asynchronously in a separate task. As a result, the receive queue is drained much more consistently. We also dropped the upper limit of the receive queue and modified the overload protection to take the current size of the receive queue into account. This avoids indefinite growth of the receive queue (which could lead to running out of memory) while keeping messages that have already been received but whose acknowledgement has not yet arrived at the sender node. The modified design is shown in the picture below.
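A minimal sketch of the reworked flow could look like the following: acknowledgements are handed to a dedicated task instead of competing with regular messages for space in the bounded send queue, and the receive queue is unbounded but feeds the overload protection. Class names, methods, and the threshold are illustrative assumptions, not the actual HiveMQ implementation.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/**
 * Simplified model of the reworked behavior: an unbounded receive queue whose
 * size feeds the overload protection, and acknowledgements sent asynchronously
 * from a dedicated task instead of through the shared send queue.
 * Illustrative only; not the actual HiveMQ cluster transport code.
 */
class ReworkedReceiver {

    private static final int OVERLOAD_THRESHOLD = 100_000; // illustrative value

    private final ConcurrentLinkedQueue<Object> receiveQueue = new ConcurrentLinkedQueue<>();
    private final ExecutorService ackSender = Executors.newSingleThreadExecutor();

    /** Incoming messages are no longer dropped; overload protection throttles the load instead. */
    void enqueueIncoming(Object message) {
        receiveQueue.add(message);
        if (receiveQueue.size() > OVERLOAD_THRESHOLD) { // a real implementation would keep a counter
            signalOverloadProtection();
        }
    }

    /** Draining the queue and acknowledging are decoupled from the send queue. */
    void processIncoming() {
        Object message = receiveQueue.poll();
        if (message != null) {
            handle(message);
            ackSender.execute(() -> sendAck(message)); // ack goes out on its own task
        }
    }

    private void handle(Object message) { /* hand over to the broker's processing pipeline */ }

    private void sendAck(Object message) { /* write the acknowledgement to the peer node */ }

    private void signalOverloadProtection() { /* back-pressure hook, illustrative */ }
}
```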
With this design implemented, retransmissions plummeted to zero in 4.18, and the network bandwidth is now used solely for pumping the actual messages. This new design allows our customers to enjoy an additional 80% of throughput at no additional cost.
The only thing to be careful about when moving from HiveMQ 4.17 to 4.18 on deployments with many cores is to ensure that downstream systems, such as databases and analytics applications, can handle the increased load they will receive from the broker. If they cannot, the subscribing clients can limit the number of QoS 1 and QoS 2 PUBLISH messages they are willing to process using the Receive Maximum property of the CONNECT packet.
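For example, with the open-source HiveMQ MQTT Client for Java, a subscribing client could cap the number of unacknowledged QoS 1 and QoS 2 PUBLISH messages it accepts by setting the Receive Maximum restriction when connecting. This is a sketch; the host name, client identifier, and limit are placeholders you would adapt to your deployment.

```java
import com.hivemq.client.mqtt.mqtt5.Mqtt5BlockingClient;
import com.hivemq.client.mqtt.mqtt5.Mqtt5Client;

public class ThrottledSubscriber {

    public static void main(String[] args) {
        Mqtt5BlockingClient client = Mqtt5Client.builder()
                .identifier("throttled-subscriber")   // placeholder client id
                .serverHost("broker.example.com")     // placeholder broker host
                .buildBlocking();

        // Receive Maximum limits how many QoS 1/2 PUBLISH messages the broker
        // may have in flight to this client before waiting for acknowledgements.
        client.connectWith()
                .restrictions()
                    .receiveMaximum(100)              // placeholder limit
                    .applyRestrictions()
                .send();
    }
}
```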
Experiencing the Benefits of Higher Throughput
HiveMQ’s cluster communication enhancements are poised to make a significant impact on the efficiency and cost-effectiveness of business operations. With an 80% increase in total throughput, these improvements go beyond mere optimizations; they translate directly into tangible benefits for organizations with the most demanding use cases.
For an enterprise architect tasked with supporting hundreds of thousands to millions of connected devices on a tight time budget while keeping operational costs reasonable, this means finally having the right tools and technology for a difficult task. Whether you’re managing complex enterprise systems or overseeing mission-critical IoT deployments like connected car entertainment, HiveMQ equips you with the means to achieve more while spending less.
To learn more about using MQTT in production and how the engineering team improves the HiveMQ broker to fit your most demanding use cases, subscribe to our newsletter below. The presented results and this blog are the outcome of the HiveMQ SAR team’s collective efforts, and we’re excited to share our insights with you.
References
[1] Although m6a.8xlarge EC2 instances offer 128 GiB of memory each, the throughput improvement comes from the 32 virtual cores available on these instances. The sheer amount of memory available on these instances is not necessary to achieve the same result.
[2] This is a back-of-the-napkin computation; you should evaluate your use case in a production environment.
[3] The HiveMQ broker can also run over UDP, but this is not recommended for use cases that require reliable delivery of messages.
Dr. Vladimir Podolskiy
Dr. Vladimir Podolskiy is a Senior Distributed Systems Specialist at HiveMQ. He is a software professional who possesses practical experience in designing dependable distributed software systems requiring high availability and applying machine intelligence to make software autonomous. He enjoys writing about MQTT, automation, data analytics, distributed systems, and software engineering.