Monitoring HiveMQ: A Comprehensive Guide
In our blog post covering our customer onboarding process, we briefly discussed how a customer can monitor HiveMQ after going live and touched on the importance of knowing what to track in order to achieve measurable success. Since the HiveMQ MQTT platform is a complex, distributed system, it requires monitoring and alerting.
Your monitoring and alerting setup will, of course, depend entirely on your use case and requirements. For example, a use case with no growth plans, very stable behavior, and low usage presents different requirements from a high-usage, continually growing one. This article dives into what must be done after a new production environment goes live.
List of Required Capabilities for Operational Personnel
The first thing to consider is the set of capabilities needed to perform this job efficiently, which also characterizes the best type of person or role for it.
Below are some of the traits required:
Knowledge about MQTT (explore HiveMQ University to learn about HiveMQ training and certifications)
Availability
Basic knowledge of how Java works
Understanding of distributed systems
Knowledge of the use case
Ability to set up the monitoring system (e.g., Prometheus + Grafana)
End-to-end infrastructure knowledge
Resource Consumption
As with any other system, monitoring resource consumption is required to understand what is happening.
CPU & Memory
The current setup of the environment discussed in this article is a cluster with 5 nodes. As you can see from the image below, this cluster has a very healthy level of CPU consumption.
We can see a couple of spikes, and higher usage at the beginning of each hour (this pattern is specific to the use case).
com_hivemq_system_os_global_cpu_total_total
com_hivemq_system_os_global_memory_total - com_hivemq_system_os_global_memory_available
It’s also perfectly acceptable to see higher consumption than what is shown here. The level of consumption can also differ from node to node. Although the levels should generally be similar, there are cases where differences do occur.
For instance:
Using DNS (round-robin) to route traffic
Using a health API
Differences in the architectural infrastructure
Uneven distribution of shared-subscription backend clients
What we don’t want is to have a continually high level of CPU/heap/memory consumption for an extended period of time. If the level of consumption remains high for a long time or frequently reaches high levels, that is a good indication that you might need to scale up.
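As a sketch of how such an alert could look, the expression below combines the two memory metrics shown above to flag when used memory exceeds 85% of total; the threshold is only an assumed starting point and should be tuned to your use case, and in a Prometheus alerting rule you would typically pair it with a for: duration so that only sustained high usage triggers it.
# Used memory above 85% of total on any node (threshold is an assumption)
(com_hivemq_system_os_global_memory_total - com_hivemq_system_os_global_memory_available)
  / com_hivemq_system_os_global_memory_total > 0.85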
Heap
Heap memory, although also a resource, is closely tied to the Garbage Collector. As you can see from the image below, it is pretty common to see a lot of spikes in heap memory. This is because the Garbage Collector kicks in at specific intervals and cleans up memory.
Two patterns are important to watch for:
A continuous increase in consumption (even if over a long period)
An average consumption higher than 80%
If either of these two patterns is not addressed, it can lead to the loss of a node.
com_hivemq_jvm_memory_heap_max
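As a rough sketch of the 80% rule above, and assuming your deployment also exports a com_hivemq_jvm_memory_heap_used gauge alongside the maximum shown here (verify the exact metric name in your Prometheus instance), average heap usage could be checked like this:
# Average heap usage over the last 15 minutes above 80% of the maximum (metric name and threshold are assumptions)
avg_over_time(com_hivemq_jvm_memory_heap_used[15m]) / com_hivemq_jvm_memory_heap_max > 0.8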
Tied to the Use Case
We will mention this a couple of times because it is indeed crucial: Knowing your use case is the best way to understand what is happening.
For instance, it doesn’t make sense to focus on monitoring retained message metrics if you don’t use retained messages.
Another good example is the connected car use case, where you have the rush hour patterns in the morning and evening, but much less traffic (and resource consumption) at night.
General
Apart from that, some metrics should always be monitored.
Connections/Disconnections
Every MQTT interaction starts with a TCP connection, so it’s essential to track the level of incoming connection attempts in your systems as well as your disconnects. Keeping an eye on both will allow you to identify problems that may originate elsewhere. An example is a reconnect storm after a network route to the cluster was severed and re-established.
rate(com_hivemq_messages_incoming_connect_count[1m])
rate(com_hivemq_messages_incoming_disconnect_count[1m])
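To catch a reconnect storm like the one described above, a simple alert sketch could flag an unusually high cluster-wide connect rate; the threshold of 500 connects per second is purely illustrative and should be derived from your own baseline.
# Cluster-wide MQTT CONNECT rate; 500/s is an illustrative threshold, not a recommendation
sum(rate(com_hivemq_messages_incoming_connect_count[1m])) > 500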
Concurrent Connections
Monitoring the current number of connections is also a good way to compare the expected number of clients for your use case with the number actually connected to the cluster. Here, the metric com.hivemq.networking.connections.current represents the number of standing connections to the MQTT brokers. Depending on the use case, this number may remain constant for months or even years, or it may follow a daily pattern of rising and falling according to usage by end devices. A factory with a fixed number of machines is a likely example of the former, while mobile clients such as vehicles may produce the latter usage pattern.
com_hivemq_networking_connections_current
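If you know roughly how many clients should be online, a sketch of a “missing fleet” check could compare the cluster-wide total against that expectation; the fleet size of 100000 and the 90% factor below are hypothetical values for illustration only.
# Fewer than 90% of an assumed fleet of 100000 clients are connected (both numbers are hypothetical)
sum(com_hivemq_networking_connections_current) < 0.9 * 100000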
Subscribe/Unsubscribe
In MQTT, for a client to receive a published packet, it first needs to subscribe to the respective topics.
delta(com_hivemq_messages_incoming_subscribe_count[1m])
rate(com_hivemq_messages_incoming_unsubscribe_count[1m])
Publish Incoming/Outgoing
Remember that, in MQTT, there is a difference between PUBLISH packets getting in and getting out. In the Publish/Subscribe pattern, the number of messages coming in may not be equal to the messages going out.
Although the differences are small, you can see in the graphs below that the number of incoming messages is smaller than the number of outgoing messages.
If, on the other hand, you observe fewer outgoing messages, that is most likely due to not having enough clients subscribing to the respective topics.
Depending on the use case, however, it may be entirely valid to see a large difference between incoming and outgoing messages on a deployment. Imagine a single backend service publishing one message to which all field clients are subscribed. This again confirms that knowing your use case and what numbers to expect is key to gaining value from your monitoring setup.
delta(com_hivemq_messages_incoming_publish_count[1m])
delta(com_hivemq_messages_outgoing_publish_count[1m])
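One way to spot the “missing subscribers” situation described above is to compare outgoing to incoming publishes cluster-wide; whether a factor of 0.5 is alarming is entirely use-case dependent, so treat it as a placeholder.
# Outgoing publishes fall below half of incoming publishes (the 0.5 factor is a placeholder)
sum(delta(com_hivemq_messages_outgoing_publish_count[5m]))
  < 0.5 * sum(delta(com_hivemq_messages_incoming_publish_count[5m]))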
QoS Levels
In most cases, the QoS level used in a packet is linked to the type of message being sent.
For instance, telemetry data will most likely be sent with QoS 0, while a command message will use QoS 1 or even QoS 2.
In the images below, you can see exactly this difference being applied.
delta(com_hivemq_messages_incoming_publish_qos_0_count[1m])
delta(com_hivemq_messages_incoming_publish_qos_1_count[1m])
delta(com_hivemq_messages_incoming_publish_qos_2_count[1m])
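If you want a single number for how much of your traffic uses QoS 0, a sketch of the share of incoming QoS 0 publishes, relative to all incoming publishes using the metric from the previous section, could look like this:
# Share of incoming publishes sent with QoS 0 over the last 5 minutes
sum(delta(com_hivemq_messages_incoming_publish_qos_0_count[5m]))
  / sum(delta(com_hivemq_messages_incoming_publish_count[5m]))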
Dropped Messages
Dropped messages are very specific to the use case, so you also need to monitor them and the reasons they were dropped.
delta(com_hivemq_messages_dropped_message_too_large_count[1m])
delta(com_hivemq_messages_dropped_not_writable_count[1m])
delta(com_hivemq_messages_dropped_qos_0_memory_exceeded_count[1m])
delta(com_hivemq_messages_dropped_queue_full_count[1m])
These queries will show you the number of messages being dropped, grouped by the reason they were dropped. Some of these reasons include “message queue full” (hinting at the client’s queue reaching its limits), “message too large” (the individual message size exceeded parameters and was dropped), or “not writable” (the client’s network socket is not accepting traffic).
These metrics are among the first to look at when messages are not arriving as expected even though connection numbers look healthy. If, for example, you are handling a Smart Manufacturing use case with stable connections and a low message volume, and drops still appear, you should dig deeper to understand what is going on.
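Since any drop usually deserves a look, a minimal alert sketch could simply fire whenever one of the drop counters shown above increases; the 5-minute window is an assumption.
# Any queue-full drops in the last 5 minutes; build analogous expressions for the other drop reasons
sum(delta(com_hivemq_messages_dropped_queue_full_count[5m])) > 0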
Kafka
If Kafka plays an integral part in your end-to-end message flow, HiveMQ’s Enterprise Extension for Kafka offers a whole set of additional metrics once it is activated. The most obvious, and of immediate interest, is the number of messages flowing through this route. Again, knowing what to expect gives us the chance to react to extraordinary situations and to set sensible alarms that get the administrators’ attention.
Kafka to MQTT
MQTT to Kafka
Overload Protection
HiveMQ has a built-in Overload Protection mechanism to prevent unexpected spikes in usage from depleting the resources available to the cluster and ultimately leading to an outage.
You can observe each broker’s Overload Protection level using the metric com.hivemq.overload.protection.level. It is worth noting that short spikes on individual or even several cluster members can be a normal occurrence during operations. Seeing spikes during topology changes (such as a rolling upgrade) or large client influxes (after a network outage somewhere between clients and brokers) is also expected. Longer-running increases warrant investigation.
com_hivemq_overload_protection_level
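Because short spikes are expected, an alert sketch should focus on sustained overload; in a Prometheus alerting rule you would typically pair an expression like the following with a for: duration (for example 10 minutes, an assumed value) so that brief blips don’t page anyone.
# Any broker reporting a non-zero overload protection level
max(com_hivemq_overload_protection_level) > 0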
Once HiveMQ detects overload, it throttles those clients responsible for a disproportionate amount of load. The number of affected clients can be observed using: com.hivemq.overload.protection.clients.backpressure.active
com_hivemq_overload_protection_clients_backpressure_active
There is also a scenario where Overload Protection kicks in because messages are being queued due to delays in third-party systems, for example, an external database that stores permissions or handles authentication. You can see this in the Extension Executor Tasks: a clear indicator is a rise in queued extension executor tasks while overall system load (CPU/memory) remains low.
com_hivemq_extension_task_executors_queued
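A hedged way to surface this “slow external system” pattern is to watch whether the extension executor queue keeps growing; the 5-minute window below is an arbitrary choice.
# Extension executor task queue has grown over the last 5 minutes
sum(delta(com_hivemq_extension_task_executors_queued[5m])) > 0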
Garbage Collector
As a Java application, HiveMQ’s process is subject to regular garbage collection by the JVM. This process runs automatically and removes unneeded objects from memory.
There are two types:
Young Generation (com_hivemq_jvm_garbage_collector_G1_Young_Generation_time[1m])
Old Generation (com_hivemq_jvm_garbage_collector_G1_Old_Generation_time[1m])
“Young Generation” will show many spikes, which is expected; there is no reason to worry, as this is a normal process.
“Old Generation,” on the other hand, is a bit different and warrants investigation.
If your deployment is experiencing spikes in Old Generation garbage collection, you should avoid doing anything that could influence or interfere with the broker.
Automated tasks (such as Backup creation) or others that create high strain on the brokers (such as Rolling Upgrades) should be completely avoided at this point.
delta(com_hivemq_jvm_garbage_collector_G1_Young_Generation_time[1m]) + delta(com_hivemq_jvm_garbage_collector_G1_Old_Generation_time[1m])
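Since Old Generation collections are the ones that warrant investigation, a simple sketch could alert whenever they accumulate any collection time at all; depending on your JVM settings, occasional short Old Generation collections may still be harmless, so treat this as a starting point.
# Time spent in Old Generation collections over the last 5 minutes
sum(delta(com_hivemq_jvm_garbage_collector_G1_Old_Generation_time[5m])) > 0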
Logging Errors
Besides writing ERROR level messages to its log, HiveMQ will also increase the count of the metric com.hivemq.logging.error.total each time it does so. While WARN level messages may occur during regular operation, ERROR level messages are always a reason to take a closer look at what caused them.
sum(increase(com_hivemq_logging_error_total[1m]))
The graph will show the number of errors, as well as the types of errors that have been occurring.
At this point, you need to go to the logs and understand what is going on.
For additional help, you can go to the documentation or even create a support ticket.
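Because ERROR log lines are always worth a look, a minimal alert sketch can simply fire on any increase of the counter shown above; the 5-minute window is an assumption.
# Any new ERROR log entries in the last 5 minutes
sum(increase(com_hivemq_logging_error_total[5m])) > 0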
During Upgrades
During upgrades, the first thing you need to do is ensure that resource consumption is stable.
Whenever possible, schedule topology changes for time windows where you expect the least traffic on your deployment.
The least disruptive way to perform Rolling Upgrades is first to add the new node, then wait for synchronization and After Join Cleanups to complete. Once done, an old node is removed, and the cycle is repeated until all brokers have been replaced.
If you are forced to perform a Rolling Upgrade while your cluster is under high stress, it is helpful to give the process extra time and wait for com.hivemq.internal.singlewriter.topic.tree.remove.local.queued to reach 0.
The log message will tell you when replication is complete, meaning all data has made it to the necessary targets (so if you lose a broker, the data is still safe). The After Join Cleanup message tells you when the node is done putting all the data it received into the right places.
com_hivemq_internal_singlewriter_topic_tree_remove_local_queued
Once this happens, you can remove a node and repeat the process.
At the same time, you need to monitor the cluster nodes and what each node is reporting.
com_hivemq_cluster_nodes_count
Each broker should report the same number of cluster nodes before any further topology changes are performed. Any difference indicates an issue with clustering and requires further investigation.
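During a rolling upgrade, both conditions described above can be checked directly in Prometheus; these are sketches of ad-hoc checks rather than alerts.
# Replication/cleanup backlog has drained on all brokers
sum(com_hivemq_internal_singlewriter_topic_tree_remove_local_queued) == 0
# All brokers report the same cluster size (returns a result only when they agree)
min(com_hivemq_cluster_nodes_count) == max(com_hivemq_cluster_nodes_count)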
Scaling the Use Case
Long-term monitoring enables meaningful evaluation of the current environment and prediction of future requirements for the environment. Meeting these requirements with adequate scale is foundational to operating a healthy and uninterrupted deployment.
In autoscaling, horizontal scaling is usually configured based on resource consumption: if a resource passes a certain threshold, another node is added to the cluster.
We advise against autoscaling, as adding brokers to an already strained deployment comes with an initial front-heavy load as the cluster reorganizes its data. Proactive scaling is the best and safest option.
HiveMQ has a masterless approach, meaning that no single node is responsible for the whole cluster; instead, each node replicates its data to other nodes. Understanding the use case, and working proactively, is the best course of action for monitoring your HiveMQ deployment.
Francisco Menéres
Francisco Menéres is Senior Customer Success Manager – EMEA at HiveMQ. Francisco excels at helping customers achieve their business goals by bringing a unique perspective and a proactive approach to problem-solving. His ability to see the big picture allows him to develop effective strategies and drive success, whether he’s working with people one-on-one or leading a team.