Cracking MQTT Performance with Automation: Challenges and Approaches
Every new generation of software has its flaws, and MQTT brokers are no exception. For an application engineer whose application relies on the broker, there is hardly a bigger disappointment than seeing throughput dip after a successful rollout of the version with a long-awaited feature. Or the CPU maxing out under the same load. Add to this the hassle of deploying larger instances and inflating your infrastructure budget. It is all the more painful if the software is mission-critical and you cannot simply drop it. Sounds familiar?
Knowing what reliable performance means to our users, HiveMQ engineers continuously assess HiveMQ Platform using automated system benchmarking: a rich set of tools and benchmarking cases enabling a 360-degree evaluation of the platform in all things performance.
Without further ado, let’s dive into the challenges that HiveMQ engineers had to solve on the thorny path of introducing automated system benchmarking, and the benefits that they and the customers reaped as a result.
Hand in Hand: Efficiency and Performance
With the era of exponential growth behind them, businesses strive to attain operational efficiency. Although the exact definition might be nuanced and even muddied, the core idea is producing more with less. Achieving operational efficiency is tightly bound to increasing performance.
Performance characterizes the capability of a system, be it an organization, a manufacturing line, a car, or a piece of computer software, to perform its task. Performance is always relative to the use case: it cannot be defined both universally and unambiguously at the same time.
Consider software that pushes metrics from a fleet of vehicles to the cloud. Any piece of software can be characterized by a multitude of parameters[1]. But should it be? Would all of these parameters matter equally? Would memory usage be a useful performance proxy for the connected vehicles application? Most likely not. High memory usage on its own does not tell us much. If the application shoulders millions of connections and maxes out memory, that is probably fine. However, if memory is maxed out and you see only a timid hundred connections on the dashboard, you might start asking questions.
Not every characteristic with a performance flavor can characterize performance in the context of your use case.
Continuing with the connected cars example, we assume that an MQTT broker delivers the data[2]. Depending on the use case, some performance parameters of the broker may become more important than others. Let’s take a bird’s-eye view of performance first.
System Performance vs Client Performance, or Apples vs Oranges
When thinking about the performance of a client-server application such as an MQTT broker, it helps to roughly split metrics into two pots: system performance and client performance. To make the cut clear, let’s build on the connected vehicles example.
Imagine that the MQTT broker is used as a central hub supporting bidirectional communication between the clients connected to it. An application running on top benefits from MQTT’s high throughput, high reliability, and low network footprint. An application owner will be concerned with the total throughput of the broker, that is, how much traffic can be transferred per unit of time. A large client load makes this concern more pronounced. An application user, on the other hand, won’t care about the total throughput. What matters to the user is the time it takes the broker to deliver their message.
Throughput characterizes the performance of the system at large. Contrast that with the message delay, which is relevant only to a specific client. Similarly, on the system level, the application owner would be concerned with how much compute, storage, and network resources the software consumes to offer the desired throughput. A user, on the other hand, who might well be a separate team within a larger organization, would eye the dropped messages metric closely.
At the beginning of my system design career, I was tempted to think of system performance metrics as more important simply due to the inherent ‘largeness’ factor. In fact, they are not. Despite the division, performance metrics make sense when viewed holistically and in the context of the use case. If a system achieves millions of messages in throughput but fails to deliver every other message, it is a no-go as a backbone for a road traffic monitoring system.
HiveMQ engineers ensure that performance characteristics that matter to the customers are continuously improved and never degrade. That is a challenge because HiveMQ Platform is used in industries ranging from automotive to smart manufacturing. To deliver on our promise, we’ve added automated system benchmarking to our Quality Assurance portfolio.
Automating system benchmarking is vital to continuously get a reliable, scalable, and highly available system out the door and straight into customers’ hands. But how hard could evaluating performance really be? Let’s dive in.
Performance Measurement: The Hard Parts
Measuring performance is no easy feat. Challenges abound, ranging from deploying the systems to reproducing and analyzing the results. Plenty of performance engineers in the software industry struggle daily to get the automation processes and benchmarks right.
One of the first obstacles when measuring the performance of an MQTT broker is the unreliability of publicly available benchmarks. Most public benchmarks are remarkably bursty and short: workload is generated for a few minutes, and then the benchmark stops. Such benchmarks fail to represent the production workloads that HiveMQ engineers come across daily. No professional will use the results of such benchmarks to engineer a better MQTT broker. Production use cases are a different kind of beast, as they run, more or less, indefinitely. Using the results of public benchmarks would be akin to optimizing for a tiny fraction of the time that the broker is in actual use.
HiveMQ engineers design benchmarks to represent the production load of the customers who happily share their setup and load patterns with us. These benchmarks subject the broker to continuous load with occasional fluctuations. HiveMQ engineers avoid optimizing for non-existent workloads at customers’ expense.
Having set foot in the world of performance testing, we immediately stumble upon load generators incapable of making the system under test[3] break a sweat. But why is it important to drive software to its limits? HiveMQ Platform is horizontally scalable, which means ever-increasing load can be handled by simply throwing more HiveMQ instances at it. Theoretically, the only limit to horizontal scalability is monetary. Scaling the system under test demands that the load testing tool scale as well. A load tester running on a timid, lonely c5.large AWS instance with 2 virtual cores and 4 GiB of RAM is extremely unlikely to challenge a HiveMQ Broker cluster on a good three c5a.8xlarge AWS instances with 32 virtual cores and 64 GiB of RAM each. To accurately load-test a horizontally scalable distributed broker, the load tester itself has to be horizontally scalable and, by extension, distributed. Unfortunately, most public benchmarking tools and evaluation scripts on GitHub do not offer such luxury. They leave the user to develop the orchestration framework from the ground up. This might not be the best way to spend your time if you simply want credible performance metrics for your use case.
Scaling the system under test demands that the load testing tool scale along with it. HiveMQ developed and uses HiveMQ Swarm, a highly scalable and robust load testing tool that distributes thousands upon thousands of load-generating MQTT clients across a fleet of VMs.
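To give a feel for what load generation involves, here is a minimal, single-process sketch of a publishing fleet written with the Eclipse Paho Python client. It is an illustration only, not how HiveMQ Swarm works: the broker address, topic layout, and client counts are placeholder assumptions, and a real distributed test spreads far more clients across many machines.

```python
# Minimal single-process load sketch using the Eclipse Paho client (paho-mqtt 1.x API).
# Placeholder broker address, topics, and counts; NOT a HiveMQ Swarm scenario.
import threading
import time

import paho.mqtt.client as mqtt

BROKER_HOST = "broker.example.com"   # assumed address of the system under test
CLIENT_COUNT = 50                    # a real test distributes orders of magnitude more clients
PUBLISH_INTERVAL_S = 0.1             # 10 PUBLISH packets per second per client

def run_publisher(client_id: str) -> None:
    client = mqtt.Client(client_id=client_id)   # paho-mqtt 2.x also requires a CallbackAPIVersion
    client.connect(BROKER_HOST, 1883)
    client.loop_start()                          # network loop runs in a background thread
    while True:
        # 128-byte dummy payload at QoS 1, mimicking a small telemetry message
        client.publish(f"vehicles/{client_id}/telemetry", payload=b"\x00" * 128, qos=1)
        time.sleep(PUBLISH_INTERVAL_S)

for i in range(CLIENT_COUNT):
    threading.Thread(target=run_publisher, args=(f"load-client-{i}",), daemon=True).start()

time.sleep(300)  # keep the load running; production-like tests run far longer
```

Even a toy script like this exhausts the CPU and network of a single small VM long before a clustered broker notices, which is exactly why the load generator itself has to be distributed.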
With the tooling chosen, the next step is deployment. Rare is the engineer who is prepared from the outset to go through all the intimidating chores of making the system run. I surely was not. It took some practice, some reading, and answering quite a few questions to get there. First of all, where do you deploy? The cloud? If yes, which one? AWS, Azure, Google Cloud, OpenStack, and Alibaba are all at our service. Going for all of them at once might be a costly pursuit, to say nothing of a possible lack of expertise. Terraform and Kubernetes only make it somewhat easier[4]. Correctly setting up all the security configurations, networking, and connections between the load-generating fleet and the broker can keep one busy for a while. And once the cloud setup is done, is it really all over? What do we do with the non-cloud, on-premises environments that some customers still use? Ignore them or reproduce them? There are hundreds of open questions when one delves deeper into the fascinating world of deployment environments and their configurations.
Over the years, HiveMQ engineers developed and continuously refined a rich set of tools built on top of Terraform to automate deployment of the broker, the load testing fleet, and the monitoring infrastructure. These testing facilities are started with a single command, allowing the engineers to focus on the load scenarios and configurations to test.
Once the last tantrum of deployment is dealt with, an engineer faces the challenge of sizing. The engineer has to match the hardware that the broker will run on with the requirements of the load test. Evaluating throughput for millions of connections with bulky messages of tens of kilobytes on infrastructure with inadequate network performance will ruin the results. If a hardware resource becomes a limiting factor (a.k.a. a bottleneck), it might hide software performance issues instead of revealing them. To mitigate that, an engineer runs multiple iterations of the same test to get a feel for where the hardware bottlenecks of their system are. The results guide the selection of the appropriate infrastructure. A performance dip during a test caused by the subpar performance of the disk drives is far from helpful or, more accurately, is harmful[5].
Performance bottlenecks in the underlying infrastructure produce misleading benchmarking results. HiveMQ engineers run preliminary tests to determine the right size of the infrastructure on which to deploy the tests. Emphasizing reliability and speed, HiveMQ Broker leans heavily on disk and network performance to keep up with its processing speed.
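A quick back-of-the-envelope calculation often catches sizing mistakes before any test is deployed. The numbers below are purely illustrative assumptions, not HiveMQ sizing guidance:

```python
# Back-of-the-envelope network sizing check (illustrative assumptions only).
connections = 1_000_000                 # simulated MQTT clients
message_size_bytes = 10 * 1024          # "bulky" 10 KiB payloads
messages_per_client_per_minute = 1      # each client publishes once a minute

inbound_bytes_per_s = connections * message_size_bytes * messages_per_client_per_minute / 60
inbound_gbit_per_s = inbound_bytes_per_s * 8 / 1e9

nic_gbit_per_s = 10                     # assume a 10 Gbit/s NIC on the broker instance
print(f"Inbound traffic alone: {inbound_gbit_per_s:.2f} Gbit/s "
      f"({inbound_gbit_per_s / nic_gbit_per_s:.0%} of the assumed NIC)")
# Fan-out to subscribers multiplies the outbound traffic on top of this,
# so the network, rather than the broker, could easily become the bottleneck.
```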
Our preparations are not complete until the evaluated setup can produce performance data. An engineer gets information about the test outcome by instrumenting the software. In practice, this boils down to adding code that captures the metrics of interest. Formulating precise questions about the broker's performance helps in finding the right metrics. How many PUBLISH messages sit in the queue? How many subscriptions are created with the broker and kept on it? How many messages were fired at the broker, and how many of them were delivered? Engineers add metrics to the software to answer these questions. Such broker- and test-specific information cannot be obtained from generic OS-level metrics. Despite similarities in MQTT brokers’ functionality, their metric sets may differ a lot, both in naming and in how they are measured. Some report throughput for inbound messages only, others sum inbound with outbound MQTT traffic (double accounting), and yet others add acknowledgements that carry no payload, which muddies the numbers even further.
Without observing the full performance picture, one cannot make informed performance optimizations. HiveMQ engineers have instrumented the broker with hundreds of metrics. Originally introduced to support HiveMQ customers, some of these metrics are now also used for assessing the performance of HiveMQ Platform and continuously improving it in a data-driven way.
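As an illustration of what instrumenting for such questions might look like, here is a small Python sketch using the prometheus_client library. It is a toy example with made-up metric names; HiveMQ Broker itself is a Java application with its own, documented metric set.

```python
# Toy instrumentation sketch with made-up metric names (not HiveMQ Broker's actual metrics).
from prometheus_client import Counter, Gauge, start_http_server

PUBLISH_RECEIVED = Counter(
    "mqtt_publish_received_total", "Inbound PUBLISH packets accepted by the broker")
PUBLISH_DELIVERED = Counter(
    "mqtt_publish_delivered_total", "Outbound PUBLISH packets delivered to subscribers")
QUEUED_MESSAGES = Gauge(
    "mqtt_queued_messages", "PUBLISH messages currently waiting in client queues")

def on_publish_received() -> None:
    PUBLISH_RECEIVED.inc()       # count inbound traffic separately ...
    QUEUED_MESSAGES.inc()        # ... from outbound traffic to avoid double accounting

def on_publish_delivered() -> None:
    PUBLISH_DELIVERED.inc()
    QUEUED_MESSAGES.dec()

if __name__ == "__main__":
    start_http_server(9100)      # expose the metrics for a scraper to collect
```

Keeping inbound and outbound counters separate is precisely what avoids the double-accounting pitfall mentioned above.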
It is not enough to produce the results; they have to be stored somewhere. Metrics, including the ones used for performance, are represented as time series. Similar to sensor readings, each measurement in a time series is a discrete point in time, and a collection of such points provides insight into the system’s behavior over time. Consider starting HiveMQ Broker. Since the broker has to perform initial configuration at boot, we should expect intense resource usage showing up as spikes in CPU utilization and perhaps disk accesses. After some time, these metrics subside in the absence of load. Connecting clients, subscribing to topics, and publishing messages will again drive the metrics up. Observing changes over time allows engineers to discover potential for improvement.
Measuring performance is half the story; observations need to be stored for analysis. Every benchmark run that HiveMQ engineers perform results in vast amounts of data stored in InfluxDB, a time series database with rich querying functionality. Seamless integration with Grafana, a visualization tool with an aptitude for building graphs, makes it the default choice for storing performance data.
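For a taste of what shipping such measurements looks like, here is a short sketch using the official influxdb-client Python library; the URL, token, bucket, and field names are placeholder assumptions rather than HiveMQ’s actual setup.

```python
# Writing one benchmark data point to InfluxDB (placeholder connection details and names).
from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="benchmarks")
write_api = client.write_api(write_options=SYNCHRONOUS)

point = (
    Point("broker_metrics")                 # measurement name
    .tag("node", "broker-1")                # which cluster node produced the reading
    .field("publish_rate", 12500.0)         # inbound PUBLISH packets per second
    .field("queued_messages", 318)          # messages waiting in client queues
)
write_api.write(bucket="benchmark-run-42", record=point)
client.close()
```

Grafana then queries series like these directly to render the performance dashboards.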
Some software systems are destined to exist for years or even decades, perhaps even centuries. HiveMQ Broker is well within this category by now. A lengthy life cycle poses the challenge of retaining the history of performance results. Historical data offers baselines to evaluate how new changes affect software performance. This knowledge allows engineers to reduce the feasibility risks of making changes to a mature product such as HiveMQ Broker.
Historical data on software performance enables informed decisions on how to evolve software. HiveMQ engineers implemented automation to store the results of performance tests in a condensed form, available for easy access in Amazon Athena. The analytics capabilities of Amazon Athena support the engineers in evaluating how newer versions of HiveMQ Broker fare compared to older ones.
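As a hedged sketch of how such historical comparisons can be queried, the snippet below submits a query to Amazon Athena via boto3. The database, table, and column names are invented for illustration; the actual schema HiveMQ uses is not shown here.

```python
# Submitting an analytical query to Amazon Athena via boto3 (invented schema for illustration).
import boto3

athena = boto3.client("athena", region_name="eu-central-1")

response = athena.start_query_execution(
    QueryString="""
        SELECT broker_version, AVG(publish_throughput) AS avg_throughput
        FROM benchmark_results
        WHERE scenario = 'fan_in_telemetry'
        GROUP BY broker_version
        ORDER BY broker_version
    """,
    QueryExecutionContext={"Database": "benchmarks"},
    ResultConfiguration={"OutputLocation": "s3://example-benchmark-results/athena/"},
)
print("Started query:", response["QueryExecutionId"])
```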
Benchmarks on the shared hardware of cloud providers are prone to all kinds of noise[6]. Virtual machines of different tenants run on the same physical server, competing for its resources, most notably for processing time. Cluster and VM schedulers at hyperscalers, such as Google’s Borg[7], operate on the assumption that no deployment needs all the processing time all the time (pun intended). Indeed, even the most demanding user-facing applications are subject to varying load, with peaks followed by droughts. Provisioning capacity as if the load were always at its peak is wasteful and would increase the provider’s costs. The oversubscribed processing capacity of hyperscalers makes the performance results of a single benchmark execution unreliable. Even on an isolated system, we would get varying performance results due to the nondeterminism inherent in OS scheduling and packet delivery over the network.
Scientists came up with the solution a long time ago: just repeat the experiment. Repetition won’t yield identical results; each new run will produce a slightly different number. In one instance of the test, the collected metric could be 75%, in another 79%, in yet another 71% or maybe even 68%. Instead of one result, we have a collection. Although we cannot measure absolute performance this way, we can average the results or talk about performance distributions and bounds. It is often enough to know that the metric stays between two values most of the time. This knowledge helps both customers and engineers shape expectations about the system.
Benchmarking software in the cloud is prone to the noisy neighbor problem[8]. To be confident in their software design decisions, HiveMQ engineers run the same benchmark multiple times to collect bounds and variance of performance metrics.
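Using the example numbers from above, a few lines of Python show how repeated runs turn into a range and a spread rather than a single, possibly misleading, figure:

```python
# Summarizing repeated benchmark runs instead of trusting a single execution.
import statistics

cpu_utilization_runs = [75, 79, 71, 68]   # the same test repeated four times, in percent

mean = statistics.mean(cpu_utilization_runs)
spread = statistics.stdev(cpu_utilization_runs)
low, high = min(cpu_utilization_runs), max(cpu_utilization_runs)

print(f"mean={mean:.1f}%, stdev={spread:.1f}, observed range=[{low}%, {high}%]")
# Reporting a range and a spread sets more honest expectations than quoting any single run.
```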
Benchmarking results of any kind are meaningless without interpretation. Analysis is what endows the results with actionable meaning, and analysis can be automatic or manual.
The allure of automatic analysis lies in its speed and minimal manual effort. The most straightforward approach to automating the analysis stage is to use verification rules. A verification rule could be as simple as: ‘If the broker’s throughput per core drops below 1000 MQTT PUBLISH packets per second, raise an alarm!’ Automatic verification can also backfire in a very ugly way: it has the potential both to overreact and to underestimate performance regressions. This happens because at some point the automation needs to make a binary go/no-go decision on the evaluated code change. To raise an alarm or not to raise: that is the question. Even with statistical techniques in place, one has to make a judgment call on what an appropriate range for the metric is. The common practice is to set such parameters conservatively, that is, to err on the side of introducing false positives[9]: alarms go off even if the performance of the system is actually fine. The other extreme, manual analysis, is more comprehensive but also more time consuming. Errors may slip into it as well, since attention is a limited resource.
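To make the rule above concrete, here is a minimal sketch of such a go/no-go check. It illustrates the idea only and is not HiveMQ’s actual verification pipeline.

```python
# Minimal go/no-go verification rule (illustrative threshold, not HiveMQ's real pipeline).
THROUGHPUT_PER_CORE_FLOOR = 1000   # MQTT PUBLISH packets per second, set conservatively

def run_passes(publish_rate_per_s: float, cores: int) -> bool:
    """Return True if the run passes, False if an alarm should be raised."""
    return publish_rate_per_s / cores >= THROUGHPUT_PER_CORE_FLOOR

# A conservative floor errs on the side of false positives: this run raises an alarm
# even though 875 packets/s per core might be perfectly acceptable for the use case.
if not run_passes(publish_rate_per_s=28_000, cores=32):
    print("ALARM: throughput per core below the floor, manual analysis required")
```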
What do we do then if both methods have their own challenges? Combine them!
Benchmarking results have to be interpreted to decide on software improvements. HiveMQ engineers use automatic verification on all kinds of metrics to signal when additional manual analysis is required. The automatic verification is based on a solid statistical methodology pioneered by Netflix, a company whose profit and loss are tightly bound to the performance of its video streaming software[10]. Early design stages and prototypes are scrutinized with manual performance tests and analysis of the performance dashboards in Grafana.
HiveMQ engineers successfully solved numerous challenges of performance benchmarking by rolling out automated system benchmarks. These benchmarks allow performance insights to be harvested as fast as possible, iterating through solution prototypes at breakneck speed. As the end goal is to assure the reliability of customer deployments, including from the performance standpoint, automated benchmarking is a major enabler of innovation in the core technology of HiveMQ Platform.
Stay tuned for part 2 of this blog, where we will discuss the implementation of automated system benchmarks and how they help with performance testing of an MQTT broker.
References
[1] That does not mean that it should be characterized, though.
[2] This is frequently a sound assumption given the superior scalability and reliability of the protocol, as well as its low communication footprint.
[3] The system under test (SUT) is the system that is being tested to evaluate the correctness of its operation.
[4] Some may find this statement controversial, and rightfully so. The learning curve for Kubernetes and Terraform is indeed steep. But, once the experience is acquired, the deployments become less painful.
[5] An actual issue that HiveMQ engineers ran into once with one of the major hyperscalers.
[6] https://arxiv.org/abs/2210.15315
[7] https://research.google/pubs/large-scale-cluster-management-at-google-with-borg/
[9] A false positive is when something is determined to be true when in fact it is not (an incorrect interpretation).
[10] Netflix explains their approach clearly in their tech blog: https://netflixtechblog.com/fixing-performance-regressions-before-they-happen-eab2602b86fe
Dr. Vladimir Podolskiy
Dr. Vladimir Podolskiy is a Senior Distributed Systems Specialist at HiveMQ. He is a software professional who possesses practical experience in designing dependable distributed software systems requiring high availability and applying machine intelligence to make software autonomous. He enjoys writing about MQTT, automation, data analytics, distributed systems, and software engineering.