Managing IoT Device State Within MQTT
As the scale of IoT solutions grows, understanding the current state of client devices & services is not just a feature of these solutions but a necessity for efficiency and responsiveness. A common yet flawed approach to this recurring architectural challenge is to create an external system, some form of dedicated ‘device state service’, for storing and querying that information. While functional, this method introduces its own complexities and dependencies, making it less than ideal. Fortunately, the flexible and open nature of the MQTT protocol and fully compliant MQTT brokers like HiveMQ provide a reliable and scalable way to implement device state discovery, with no external services required.
Identifying the Antipattern in IoT Device Connectivity
Let’s first consider the commonly applied solution of a Device State Service, an external function that might retain the current state of devices in a persistent database and make available an API for querying that state. While initially this seems like a reasonable and functional solution to a common challenge when working with distributed IoT solutions, it should be considered an antipattern.
Antipatterns are frequently used solutions to common architectural challenges that appear to be effective but can lead to negative consequences.
The overhead of maintaining separate systems, higher risk of data inconsistency, and increased latency — not to mention introducing a new dependency — can impede the effectiveness of IoT solutions.
This solution might work well during development and testing but fail to scale with production workloads.
Proposing an MQTT-centric Solution
Where possible, we look to identify MQTT-native techniques to solve these recurring architectural challenges and document proven patterns that operate at scale. Implementing device state by leveraging the existing capabilities of MQTT assures us that the solution will scale along with our broker.
It’s worth noting that specifications built on top of MQTT, such as Eclipse’s Sparkplug specification for Industrial IoT, have considerations built into them for addressing device state (see Sparkplug Session State Management). As these specifications are a collection of design & architectural choices defined on top of MQTT, it is possible to pick elements that we need for other projects without having to adopt the complete specification. HiveMQ has an excellent whitepaper on building a specification on top of MQTT that highlights this approach.
A simple yet highly effective and scalable approach to implementing device state in MQTT is to combine a device-specific topic, retained messages, and the Last Will and Testament feature of MQTT. When a device connects to the broker, it registers a Will in its CONNECT packet: a retained message with a payload indicating an "offline" status, which the broker publishes to the device-specific topic in the event of an ungraceful disconnect.
After connecting, the device publishes a retained message to that same device-specific topic, with a payload that indicates an "online" status. The message indicating the "online" status is retained by the broker and pushed to any subscribers of that topic, including those that may subscribe in the future (Diagram 1). If the device were to disconnect ungracefully, the broker would then publish the Will message to the device-specific topic, superseding the previously retained message, and pushing it to any subscribers (Diagram 2).
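The two messages involved in this pattern can be sketched as plain data. This is a minimal illustration, not a client implementation: the `Message` type, the `devices/<id>/state` topic layout, and the choice of QoS 1 are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Message:
    topic: str
    payload: str
    qos: int
    retain: bool


def state_topic(device_id: str) -> str:
    # Hypothetical topic layout: one state topic per device.
    return f"devices/{device_id}/state"


def presence_messages(device_id: str) -> Tuple[Message, Message]:
    """Return (will, birth) for a device.

    The Will is registered in the CONNECT packet and published by the
    broker only on an ungraceful disconnect. The birth message is
    published by the device itself right after connecting.
    Both target the same topic and both are retained.
    """
    topic = state_topic(device_id)
    will = Message(topic, "offline", qos=1, retain=True)
    birth = Message(topic, "online", qos=1, retain=True)
    return will, birth


will, birth = presence_messages("sensor-42")
print(will)
print(birth)
```

Because both messages share one topic and are retained, whichever was published last is what a new subscriber receives.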
A Scalable Solution to Seamless IoT Device State Management
Let’s take a more detailed look at how this solution works by investigating each component.
Device-Specific Topics & Wildcard Subscriptions
For a solution like this to work, and as a general best practice, it is recommended to design topic structures that include the device’s unique identifier at some level of the topic hierarchy. This enables many use cases and design patterns that are critical to building a scalable IoT solution. Whether for state management, as this blog post covers, or for granular authorization, device commands, or Over-The-Air firmware updates, having a way to uniquely identify a device by the topic hierarchy is valuable. Examples might include using the Vehicle Identification Number in the topic structure of a connected car solution (cc/v1/uniqueVIN/state) or the Edge Node Descriptor of Sparkplug in IIoT settings, which is a combination of the Group ID and the EoN Node ID (spBv1.0/groupID/NBIRTH/eonNodeID).
To increase the effectiveness of this technique, wildcard subscriptions can be used if a given service needs to be aware of the current state of all devices in the hierarchy. In our earlier connected car example, a subscriber to the topic cc/v1/+/state would have the state changes of ALL vehicles pushed to it.
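To make the wildcard behavior concrete, here is a small sketch of how a topic filter is compared against a topic name, level by level. The function names are hypothetical, and the matcher is simplified: it ignores the special handling of topics beginning with `$`.

```python
def topic_matches(topic_filter: str, topic: str) -> bool:
    """Simplified MQTT topic filter matching for '+' and '#'."""
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for i, level in enumerate(filter_levels):
        if level == "#":
            # Multi-level wildcard: valid only as the last filter level,
            # and it matches the parent level plus everything below it.
            return i == len(filter_levels) - 1
        if i >= len(topic_levels):
            return False
        if level != "+" and level != topic_levels[i]:
            return False
    return len(filter_levels) == len(topic_levels)


def vin_from_state_topic(topic: str) -> str:
    # Assumes the cc/v1/<VIN>/state layout from the example above.
    return topic.split("/")[2]


print(topic_matches("cc/v1/+/state", "cc/v1/VIN123/state"))      # True
print(topic_matches("cc/v1/+/state", "cc/v1/VIN123/telemetry"))  # False
print(vin_from_state_topic("cc/v1/VIN123/state"))                # VIN123
```

A service subscribed to cc/v1/+/state can use the third topic level to attribute each state change to a vehicle.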
MQTT Retained Messages
In order to ensure that the device’s state is available to new subscribers, even if they weren’t subscribed at the time of the change in state, we use retained messages in conjunction with our device-specific topics. The MQTT protocol only allows for one message to be retained per topic, which ensures that only the most recently updated state is held. Any clients that subscribe to that device’s topic, or a wildcard topic that contains it — even after the initial message has been published — will receive the retained message. The device itself is responsible for publishing its “online” status after it makes a successful connection while the “offline” status is handled by the broker, through the Will mechanism.
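The retained-message rules can be modeled with a small in-memory store. This is only a toy model of broker behavior for illustration, not how any particular broker is implemented; the `RetainedStore` class and its method names are hypothetical, and subscriptions are limited to exact topic names to keep it short.

```python
from typing import Dict, Optional


class RetainedStore:
    """Toy model: at most one retained message per topic."""

    def __init__(self) -> None:
        self._retained: Dict[str, bytes] = {}

    def publish(self, topic: str, payload: bytes, retain: bool) -> None:
        if not retain:
            return  # non-retained messages are never stored
        if payload:
            # A newer retained message replaces the older one.
            self._retained[topic] = payload
        else:
            # A retained message with an empty payload clears the topic.
            self._retained.pop(topic, None)

    def on_subscribe(self, topic: str) -> Optional[bytes]:
        # A new subscriber immediately receives the retained message, if any.
        return self._retained.get(topic)


store = RetainedStore()
store.publish("cc/v1/uniqueVIN/state", b"online", retain=True)
store.publish("cc/v1/uniqueVIN/state", b"offline", retain=True)
late_subscriber_sees = store.on_subscribe("cc/v1/uniqueVIN/state")
print(late_subscriber_sees)
```

The late subscriber only ever sees the most recent state, which is the property the pattern relies on.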
MQTT Will Message
To update the status of a device that has been disconnected, perhaps due to network failure, broker action, or other ungraceful disconnection, we rely on the Will mechanism of MQTT, sometimes also known as Last Will and Testament (LWT). As part of the initial CONNECT packet, a device can set the Will flag and specify the topic for the Will message, its QoS level, whether it should be retained, and the payload. In the event of an ungraceful disconnection, the broker must publish the Will message to the defined topic, from which it is pushed to all subscribers. The published and retained Will message, with its payload indicating the “offline” status of the device, then supersedes any previous state. If the device reconnects, it publishes its “online” status, which in turn supersedes the Will message.
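The broker-side decision described here can be sketched as follows. This is a simplified model with hypothetical names (`Will`, `on_connection_closed`); it treats every graceful DISCONNECT as discarding the Will, whereas MQTT 5 additionally lets a client request Will publication on disconnect via reason code 0x04.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Will:
    topic: str
    payload: str


def on_connection_closed(will: Optional[Will], graceful: bool) -> Optional[Tuple[str, str]]:
    """Return the (topic, payload) the broker should publish, if any."""
    if graceful or will is None:
        return None  # Will is discarded on a clean DISCONNECT
    return (will.topic, will.payload)


retained: Dict[str, str] = {}  # stands in for the broker's retained store
topic = "cc/v1/uniqueVIN/state"
will = Will(topic, "offline")

retained[topic] = "online"  # device publishes its state after CONNECT
published = on_connection_closed(will, graceful=False)  # network failure
if published is not None:
    retained[published[0]] = published[1]
state_after_drop = retained[topic]

retained[topic] = "online"  # device reconnects and publishes again
state_after_reconnect = retained[topic]
print(state_after_drop, state_after_reconnect)
```

Note that after a graceful disconnect nothing is published, so the retained "online" state would remain unless the device publishes its own "offline" message first.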
Further Enhancements & Advanced Use Cases
Thus far, examples have used a simplistic “online” or “offline” status as the payload for the state messages. However, these could be enhanced to implement additional functionality. A mechanism to include a timestamp with the Will payload could provide some useful context. Similarly, adding a reason code for the ungraceful disconnection could help in troubleshooting. Having the device publish a graceful disconnect message prior to sending its DISCONNECT packet would distinguish between planned and unplanned disconnections. Each of these would require some client-side code or a broker-side extension to implement, as we have done with the HiveMQ Sparkplug Aware extension.
An enhanced "offline" status message might look like this:
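As a sketch only, assuming a JSON payload with hypothetical field names (`status`, `reason`, `timestamp`), such a message could be built like this. Keep in mind that a Will payload is fixed when the device connects, so a timestamp set by the client reflects connection time unless a broker-side extension rewrites it at publication.

```python
import json
from datetime import datetime, timezone


def offline_payload(reason: str, when: datetime) -> str:
    """Build a hypothetical enriched 'offline' status payload."""
    return json.dumps(
        {
            "status": "offline",
            "reason": reason,               # e.g. "unexpected" or "planned"
            "timestamp": when.isoformat(),  # ISO 8601, timezone-aware
        },
        sort_keys=True,
    )


example = offline_payload(
    "unexpected", datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
)
print(example)
```

The same function could serve both the Will (reason "unexpected") and a message the device publishes itself before a planned disconnect (reason "planned").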
Similarly, the “online” status payload could include additional information about the device, such as a firmware version, model number or even a payload schema. The Sparkplug B specification is a fantastic example of these features being implemented in a robust and scalable manner. You can read more about how they are implemented with Sparkplug here:
HiveMQ Sparkplug Essentials - Session State Management
HiveMQ Sparkplug Essentials - Payload Structures
HiveMQ Sparkplug Essentials - Operational Behavior
Another advanced use case might be a circumstance where multiple HiveMQ clusters are in use and the state of devices must be replicated across broker clusters. With the described MQTT-native solution, this replication is possible without additional complication, thanks to offerings like the HiveMQ Enterprise Bridge Extension. Publishing an identifier for the connected cluster as part of the “online” status payload could further enhance this use case.
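An enriched “online” payload carrying device details and a cluster identifier might be sketched as follows. All field names and values here are hypothetical and chosen purely for illustration; they are not taken from Sparkplug or any HiveMQ product.

```python
import json
from typing import Any, Dict


def online_payload(firmware: str, model: str, cluster_id: str) -> str:
    """Build a hypothetical enriched 'online' status payload."""
    return json.dumps(
        {
            "status": "online",
            "firmware": firmware,
            "model": model,
            "cluster": cluster_id,  # which broker cluster the device is on
        },
        sort_keys=True,
    )


def parse_state(payload: str) -> Dict[str, Any]:
    # Subscriber side: decode the payload back into a dictionary.
    return json.loads(payload)


state = parse_state(online_payload("1.4.2", "gateway-x", "eu-west-1"))
print(state["status"], state["cluster"])
```

A consumer of bridged state messages could use the cluster field to tell where a device is currently connected.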
Conclusion
Adopting an MQTT-centric approach for managing device states offers a more integrated, near-real-time, and scalable solution compared to external systems. This strategy not only simplifies the architecture but also enhances the reliability and responsiveness of IoT systems. We encourage IoT developers and architects to explore this approach in their solutions, and we welcome any feedback or questions on this topic.
Magnus McCune
Magnus is a Principal Architect at HiveMQ. He is a passionate technologist with a proven background solving complex business and technical challenges through the design, implementation and operationalization of cloud and edge technologies. His expertise extends to network, cloud, & infrastructure architecture, cloud-native solutions design and large-scale automation projects.