Monitoring HiveMQ: A Comprehensive Guide
In our blog post covering our customer onboarding process, we briefly discussed how a customer can monitor HiveMQ after going live and touched on the importance of knowing what to track in order to achieve measurable success. Since the HiveMQ MQTT platform is a complex, distributed system, it requires monitoring and alerting.
Your monitoring and alerting setup will, of course, depend entirely on your use case and requirements. For example, a use case with no growth plans, very stable behavior, and low usage presents different requirements from a high-usage, continually growing one. This article dives into what must be done after a new production environment goes live.
List of Required Capabilities for Operational Personnel
The first thing to consider is the set of capabilities needed to perform this job efficiently, which also characterizes the best type of person or role for it.
Below are some of the traits required:
Knowledge about MQTT (explore HiveMQ University to learn about HiveMQ training and certifications)
Availability
Basic knowledge of how Java works
Understanding of distributed systems
Knowledge of the use case
Ability to set up the monitoring system (e.g., Prometheus + Grafana)
End-to-end infrastructure knowledge
Resource Consumption
As with any other system, monitoring resource consumption is required to understand what is happening.
CPU & Memory
The current setup of the environment discussed in this article is a cluster with 5 nodes. As you can see from the image below, this cluster has a very healthy level of CPU consumption.
We can see a couple of spikes, and higher usage at the beginning of each hour (this pattern is specific to the use case).
com_hivemq_system_os_global_cpu_total_total
com_hivemq_system_os_global_memory_total - com_hivemq_system_os_global_memory_available
It’s also perfectly acceptable to see higher consumption than what is shown here. The level of consumption can also differ from node to node. Although the levels should generally be similar, there are cases where differences do occur.
For instance:
Using DNS (round-robin) to route traffic
Using a health API
Differences in the architectural infrastructure
Uneven distribution of shared-subscription backend clients
What we don’t want is to have a continually high level of CPU/heap/memory consumption for an extended period of time. If the level of consumption remains high for a long time or frequently reaches high levels, that is a good indication that you might need to scale up.
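As a sketch of how such an alert could look, the expression below combines the two memory metrics shown above to flag when used memory exceeds 85% of total; the threshold is only an assumed starting point and should be tuned to your use case, and in a Prometheus alerting rule you would typically pair it with a for: duration so that only sustained high usage triggers it.
# Used memory above 85% of total on any node (threshold is an assumption)
(com_hivemq_system_os_global_memory_total - com_hivemq_system_os_global_memory_available)
  / com_hivemq_system_os_global_memory_total > 0.85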
Heap
Heap memory, although also a resource, is closely tied to the Garbage Collector. As you can see from the image below, it is pretty common to see a lot of spikes in heap memory. This is because the Garbage Collector kicks in at specific intervals and cleans up memory.
Two patterns are important to watch for:
A continuous increase in consumption (even if over a long period)
An average consumption higher than 80%
If either of these two patterns is not addressed, it can lead to the loss of a node.
com_hivemq_jvm_memory_heap_max
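As a rough sketch of the 80% rule above, and assuming your deployment also exports a com_hivemq_jvm_memory_heap_used gauge alongside the maximum shown here (verify the exact metric name in your Prometheus instance), average heap usage could be checked like this:
# Average heap usage over the last 15 minutes above 80% of the maximum (metric name and threshold are assumptions)
avg_over_time(com_hivemq_jvm_memory_heap_used[15m]) / com_hivemq_jvm_memory_heap_max > 0.8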
Tied to the Use Case
We will mention this a couple of times because it is indeed crucial: Knowing your use case is the best way to understand what is happening.
For instance, it doesn’t make sense to focus on monitoring retained message metrics if you don’t use retained messages.
Another good example is the connected car use case, where you have the rush hour patterns in the morning and evening, but much less traffic (and resource consumption) at night.
General
Apart from that, some metrics should always be monitored.
Connections/Disconnections
Every MQTT interaction starts with a TCP connection, so it’s essential to track the level of incoming connection attempts in your systems as well as your disconnects. Keeping an eye on both will allow you to identify problems that may originate elsewhere. An example is a reconnect storm after a network route to the cluster was severed and re-established.
rate(com_hivemq_messages_incoming_connect_count[1m])
rate(com_hivemq_messages_incoming_disconnect_count[1m])
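To catch a reconnect storm like the one described above, a simple alert sketch could flag an unusually high cluster-wide connect rate; the threshold of 500 connects per second is purely illustrative and should be derived from your own baseline.
# Cluster-wide MQTT CONNECT rate; 500/s is an illustrative threshold, not a recommendation
sum(rate(com_hivemq_messages_incoming_connect_count[1m])) > 500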
Concurrent Connections
Monitoring the current number of connections is also a good way to compare the expected number of clients for your use case with the number actually connected to the cluster. Here, the metric com.hivemq.networking.connections.current represents the number of standing connections to the MQTT brokers. Depending on the use case, this number may remain constant for months or even years, or it may follow a daily pattern of rising and falling according to usage by end devices. A factory with a fixed number of machines is a likely example of the former, while mobile clients such as vehicles may produce the latter usage pattern.
com_hivemq_networking_connections_current
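If you know roughly how many clients should be online, a sketch of a “missing fleet” check could compare the cluster-wide total against that expectation; the fleet size of 100000 and the 90% factor below are hypothetical values for illustration only.
# Fewer than 90% of an assumed fleet of 100000 clients are connected (both numbers are hypothetical)
sum(com_hivemq_networking_connections_current) < 0.9 * 100000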
Subscribe/Unsubscribe
In MQTT, for a client to receive a published packet, it first needs to subscribe to the respective topics.
delta(com_hivemq_messages_incoming_subscribe_count[1m])
rate(com_hivemq_messages_incoming_unsubscribe_count[1m])
Publish Incoming/Outgoing
Remember that, in MQTT, there is a difference between PUBLISH packets getting in and getting out. In the Publish/Subscribe pattern, the number of messages coming in may not be equal to the messages going out.
Although the differences are small, you can see in the graphs below that the number of incoming messages is smaller than the number of outgoing messages.
If, on the other hand, you observe fewer outgoing messages, that is most likely due to not having enough clients subscribing to the respective topics.
Depending on the use case, however, it may be entirely valid to see a large difference between incoming and outgoing messages on a deployment. Imagine a single backend service publishing one message to which all field clients are subscribed. This again confirms that knowing your use case and what numbers to expect is key to gaining value from your monitoring setup.
delta(com_hivemq_messages_incoming_publish_count[1m])
delta(com_hivemq_messages_outgoing_publish_count[1m])
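One way to spot the “missing subscribers” situation described above is to compare outgoing to incoming publishes cluster-wide; whether a factor of 0.5 is alarming is entirely use-case dependent, so treat it as a placeholder.
# Outgoing publishes fall below half of incoming publishes (the 0.5 factor is a placeholder)
sum(delta(com_hivemq_messages_outgoing_publish_count[5m]))
  < 0.5 * sum(delta(com_hivemq_messages_incoming_publish_count[5m]))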
QoS Levels
In most cases, the QoS level used in a packet is linked to the type of message being sent.
For instance, telemetry data will most likely be sent with QoS 0, while a command message will use QoS 1 or even QoS 2.
In the images below, you can see exactly this difference being applied.
delta(com_hivemq_messages_incoming_publish_qos_0_count[1m])
delta(com_hivemq_messages_incoming_publish_qos_1_count[1m])
delta(com_hivemq_messages_incoming_publish_qos_2_count[1m])
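If you want a single number for how much of your traffic uses QoS 0, a sketch of the share of incoming QoS 0 publishes, relative to all incoming publishes using the metric from the previous section, could look like this:
# Share of incoming publishes sent with QoS 0 over the last 5 minutes
sum(delta(com_hivemq_messages_incoming_publish_qos_0_count[5m]))
  / sum(delta(com_hivemq_messages_incoming_publish_count[5m]))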
Dropped Messages
Dropped messages are very specific to the use case, so you also need to monitor them and the reasons they were dropped.
delta(com_hivemq_messages_dropped_message_too_large_count[1m])
delta(com_hivemq_messages_dropped_not_writable_count[1m])
delta(com_hivemq_messages_dropped_qos_0_memory_exceeded_count[1m])
delta(com_hivemq_messages_dropped_queue_full_count[1m])
These queries will show you the number of messages being dropped, grouped by the reason they were dropped. Some of these reasons include “message queue full” (hinting at the client’s queue reaching its limits), “message too large” (the individual message size exceeded parameters and was dropped), or “not writable” (the client’s network socket is not accepting traffic).
These metrics are among the first to look at when messages are not arriving as expected even though connection numbers look healthy. If, for example, you are handling a Smart Manufacturing use case with stable connections and a low message volume, and drops still appear, you should dig deeper to understand what is going on.
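Since any drop usually deserves a look, a minimal alert sketch could simply fire whenever one of the drop counters shown above increases; the 5-minute window is an assumption.
# Any queue-full drops in the last 5 minutes; build analogous expressions for the other drop reasons
sum(delta(com_hivemq_messages_dropped_queue_full_count[5m])) > 0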
Kafka
If Kafka plays an integral part in your end-to-end message flow, HiveMQ’s Enterprise Extension for Kafka offers a whole set of additional metrics once it is activated. The most obvious, and of immediate interest, is the number of messages flowing through this route. Again, knowing what to expect gives us the chance to react to extraordinary situations and to set sensible alarms that get the administrators’ attention.
Kafka to MQTT
MQTT to Kafka
Overload Protection
HiveMQ has a built-in Overload Protection mechanism to prevent unexpected spikes in usage from depleting the resources available to the cluster and ultimately leading to an outage.
You can observe each broker’s Overload Protection level using the metric com.hivemq.overload.protection.level. It is worth noting that short spikes on individual or even several cluster members can be a normal occurrence during operations. Seeing spikes during topology changes (such as a rolling upgrade) or large client influxes (after a network outage somewhere between clients and brokers) is also expected. Longer-running increases warrant investigation.
com_hivemq_overload_protection_level
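Because short spikes are expected, an alert sketch should focus on sustained overload; in a Prometheus alerting rule you would typically pair an expression like the following with a for: duration (for example 10 minutes, an assumed value) so that brief blips don’t page anyone.
# Any broker reporting a non-zero overload protection level
max(com_hivemq_overload_protection_level) > 0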
Once HiveMQ detects overload, it throttles those clients responsible for a disproportionate amount of load. The number of affected clients can be observed using: com.hivemq.overload.protection.clients.backpressure.active
com_hivemq_overload_protection_clients_backpressure_active
There is also a scenario where Overload Protection kicks in because messages are being queued due to delays in third-party systems, for example, an external database that stores permissions or handles authentication. You can see this in the Extension Executor Tasks: a clear indicator is a rise in queued extension executor tasks while overall system load (CPU/memory) remains low.
com_hivemq_extension_task_executors_queued
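A hedged way to surface this “slow external system” pattern is to watch whether the extension executor queue keeps growing; the 5-minute window below is an arbitrary choice.
# Extension executor task queue has grown over the last 5 minutes
sum(delta(com_hivemq_extension_task_executors_queued[5m])) > 0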
Garbage Collector
As a Java application, HiveMQ’s process is subject to regular garbage collection by the JVM. This process runs automatically and removes unneeded objects from memory.
There are two types:
Young Generation (com_hivemq_jvm_garbage_collector_G1_Young_Generation_time[1m])
Old Generation (com_hivemq_jvm_garbage_collector_G1_Old_Generation_time[1m])
“Young Generation” will show many spikes, which is expected; there is no reason to worry, as this is a normal process.
“Old Generation,” on the other hand, is a bit different and warrants investigation.
If your deployment is experiencing spikes in Old Generation garbage collection, you should avoid doing anything that could influence or interfere with the broker.
Automated tasks (such as Backup creation) or others that create high strain on the brokers (such as Rolling Upgrades) should be completely avoided at this point.
delta(com_hivemq_jvm_garbage_collector_G1_Young_Generation_time[1m]) + delta(com_hivemq_jvm_garbage_collector_G1_Old_Generation_time[1m])
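Since Old Generation collections are the ones that warrant investigation, a simple sketch could alert whenever they accumulate any collection time at all; depending on your JVM settings, occasional short Old Generation collections may still be harmless, so treat this as a starting point.
# Time spent in Old Generation collections over the last 5 minutes
sum(delta(com_hivemq_jvm_garbage_collector_G1_Old_Generation_time[5m])) > 0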
Logging Errors
Besides writing ERROR level messages to its log, HiveMQ will also increase the count of the metric com.hivemq.logging.error.total each time it does so. While WARN level messages may occur during regular operation, ERROR level messages are always a reason to take a closer look at what caused them.
sum(increase(com_hivemq_logging_error_total[1m]))
The graph will show the number of errors, as well as the types of errors that have been occurring.
At this point, you need to go to the logs and understand what is going on.
For additional help, you can go to the documentation or even create a support ticket.
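Because ERROR log lines are always worth a look, a minimal alert sketch can simply fire on any increase of the counter shown above; the 5-minute window is an assumption.
# Any new ERROR log entries in the last 5 minutes
sum(increase(com_hivemq_logging_error_total[5m])) > 0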
During Upgrades
During upgrades, the first thing you need to do is ensure that resource consumption is stable.
Whenever possible, schedule topology changes for time windows where you expect the least traffic on your deployment.
The least disruptive way to perform Rolling Upgrades is first to add the new node, then wait for synchronization and After Join Cleanups to complete. Once done, an old node is removed, and the cycle is repeated until all brokers have been replaced.
If you are forced to perform a Rolling Upgrade while your cluster is under high stress, it is helpful to give the process extra time and wait for com.hivemq.internal.singlewriter.topic.tree.remove.local.queued to reach 0.
The log message will tell you when replication is complete, meaning all data has made it to the necessary targets (so if you lose a broker, the data is still safe). The After Join Cleanup message tells you when the node is done putting all the data it received into the right places.
com_hivemq_internal_singlewriter_topic_tree_remove_local_queued
Once this happens, you can remove a node and repeat the process.
At the same time, you need to monitor the cluster nodes and what each node is reporting.
com_hivemq_cluster_nodes_count
Each broker should report the same number of cluster nodes before any further topology changes are performed. Any difference indicates an issue with clustering and requires further investigation.
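During a rolling upgrade, both conditions described above can be checked directly in Prometheus; these are sketches of ad-hoc checks rather than alerts.
# Replication/cleanup backlog has drained on all brokers
sum(com_hivemq_internal_singlewriter_topic_tree_remove_local_queued) == 0
# All brokers report the same cluster size (returns a result only when they agree)
min(com_hivemq_cluster_nodes_count) == max(com_hivemq_cluster_nodes_count)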
Scaling the Use Case
Long-term monitoring enables meaningful evaluation of the current environment and prediction of future requirements for the environment. Meeting these requirements with adequate scale is foundational to operating a healthy and uninterrupted deployment.
In autoscaling, horizontal scaling is usually configured based on resource consumption: if a resource passes a certain threshold, another node is added to the cluster.
We advise against autoscaling, as adding brokers to an already strained deployment comes with an initial front-heavy load as the cluster reorganizes its data. Proactive scaling is the best and safest option.
HiveMQ has a masterless approach, meaning that no single node is responsible for the whole cluster; instead, each node replicates its data to other nodes. Understanding the use case, and working proactively, is the best course of action for monitoring your HiveMQ deployment.
Francisco Menéres
Francisco Menéres is Senior Customer Success Manager – EMEA at HiveMQ. Francisco excels at helping customers achieve their business goals by bringing a unique perspective and a proactive approach to problem-solving. His ability to see the big picture allows him to develop effective strategies and drive success, whether he’s working with people one-on-one or leading a team.