Simplify debugging to remove complexity from embedded system development

January 16, 2023 · 4 min read

CEO @ Luos

This article was originally written for TechCrunch (available with a TechCrunch+ subscription).

The complexity associated with the development of embedded systems is increasing rapidly. For instance, it is estimated that the average complexity of software projects in the automotive industry has grown 300% over the past decade.

Today, every piece of hardware is driven by software, and most hardware is composed of multiple electronic boards running synchronized applications. Devices keep gaining features, and every added feature increases development and debugging complexity. A University of Cambridge report found that developers spend up to 50% of their programming time debugging. But there are practical ways to make debugging embedded systems less complex. Let’s explore them.

Earlier is better

Bugs will pop up throughout your product's lifetime: in development, in testing, and in the field. Resolving a bug late can cost as much as 15 times more, and it leads to user frustration and to the challenges of updating embedded devices that are already in production. By comparison, identifying bugs in the early stages of your product lets you track them and prioritize them by severity before other dependencies and variables are introduced later in the lifecycle, which makes them easier and cheaper to solve.

Manage versioning

To properly replicate a bug, you need to be able to have the device in the exact same state as when it happened. With embedded devices, there are three distinct variables to look at when issues crop up.

The first one is the software version, that is, the version of each feature, such as motor software v1.2.4 or sensor filtering v4.3.2. This applies to the code you build but also to dependencies such as imported libraries.

The second one is the board version, more specifically the design of the board. Board designs change constantly: components are added, removed, or moved.

The third one is manufacturing: which assembly line made the board, and when. This element is not tracked by a version but by a unique serial number assigned to every single board. When designing the code, anything that references one of these three elements must be made a variable. To manage this versioning granularity, you need a registry. Open-source PlatformIO is a great tool to achieve this.
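To make the three variables concrete, here is a minimal sketch of a record that captures them. The names `DeviceState`, `make_state`, and `same_configuration` are hypothetical and only illustrate the idea; they are not part of PlatformIO or any other tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceState:
    """Everything needed to put a device back in the state where a bug occurred."""
    serial_number: str             # unique per manufactured board
    board_revision: str            # version of the board design
    software_versions: tuple = ()  # sorted (feature, version) pairs


def make_state(serial_number, board_revision, software_versions):
    """Build a hashable state from a {feature: version} mapping."""
    return DeviceState(serial_number, board_revision,
                       tuple(sorted(software_versions.items())))


def same_configuration(a, b):
    """True when two devices share board design and software, ignoring the serial."""
    return (a.board_revision, a.software_versions) == \
           (b.board_revision, b.software_versions)
```

Two boards built from the same design and flashed with the same feature versions compare as the same configuration, while their serial numbers still tell them apart.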

Operationalize the replicability

Once you can fully define a given device state, you need to be able to replicate it on a local device so that you can debug. For that, you need a script that pulls the required versions, compiles the right binaries, and pushes them to your product. Here is a code snippet containing a script I use for one of my projects.
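As an illustration only (this is not the script referred to above), a replication script could be sketched as follows. It assumes the firmware lives in a git repository where each release is tagged, that the project builds with PlatformIO, that the `pio` command is on your PATH, and that each board revision has its own environment in `platformio.ini`. The tag and environment names are hypothetical.

```python
import subprocess


def replication_commands(git_tag, pio_env):
    """Return the commands that rebuild and flash one firmware version."""
    return [
        ["git", "checkout", git_tag],                   # pull out the required version
        ["pio", "run", "-e", pio_env],                  # compile the binaries
        ["pio", "run", "-e", pio_env, "-t", "upload"],  # push them to the board
    ]


def replicate(git_tag, pio_env, run=subprocess.run):
    """Run each command in order; check=True stops at the first failure."""
    for command in replication_commands(git_tag, pio_env):
        run(command, check=True)
```

Calling `replicate("v1.2.4", "board_rev_c")` would check out that tag, build it for that environment, and upload it. Passing a different `run` callable lets you dry-run the sequence without touching a board.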

Additionally, when you have a bug, you must find the simplest configuration that reproduces it, which limits the area of code to inspect. By managing your features independently, you can easily enable and disable each of them in your code. The best way to achieve this is to compile each feature as a library, where each feature has an init and a loop function, Arduino style, that can be called from the main file.
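The shape of that pattern can be sketched in a few lines. On a real product each feature would be a C or C++ library; Python is used here only to show the structure, and the `Feature` class and `run_main` function are hypothetical names.

```python
class Feature:
    """One independently managed feature exposing init() and loop()."""

    def __init__(self, name):
        self.name = name
        self.initialized = False
        self.iterations = 0

    def init(self):
        self.initialized = True

    def loop(self):
        self.iterations += 1


def run_main(features, enabled, cycles):
    """Main file: init, then loop over only the enabled features."""
    active = [feature for feature in features if feature.name in enabled]
    for feature in active:
        feature.init()
    for _ in range(cycles):
        for feature in active:
            feature.loop()
    return active
```

Reducing `enabled` to a single feature name gives you the smallest configuration in which to try to reproduce the bug.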

Trace everything

Now, it's debugging time. But your debugging will only be efficient if you have the right information. Traces, which log the low-level events of the program's execution, are a must-have here. Both hardware and software features must generate traces for everything they do. Tools such as open-source Memfault or Freedom Robotics can help you get there.

While your device should constantly generate traces, they should be saved and sent back to you automatically only when an issue occurs, so that you can analyze them. To capture as many anomalies as possible, you must anticipate their types. With embedded devices they take different shapes: the cause might be a software issue, but it could also be a hardware issue such as overheating, water damage, or broken components. The sky's the limit with embedded systems. For instance, one of our users is building articulated robot arms that perform sensitive maintenance operations in nuclear facilities, which exposes the hardware to high levels of radiation that can affect both hardware and software in random ways.
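One common way to trace constantly but send only on an issue is a fixed-size ring buffer that is flushed when an anomaly is reported. The sketch below assumes that approach; the `TraceBuffer` class and its `send` callback are hypothetical and do not mirror the API of any tool named above.

```python
from collections import deque


class TraceBuffer:
    """Keeps the most recent trace events in memory and flushes them on anomaly."""

    def __init__(self, capacity, send):
        self._events = deque(maxlen=capacity)  # oldest events drop off automatically
        self._send = send                      # callable that ships a list of events

    def trace(self, timestamp_ms, source, message):
        self._events.append((timestamp_ms, source, message))

    def report_anomaly(self, timestamp_ms, kind):
        """Record the anomaly, send a snapshot of the buffer, then clear it."""
        self.trace(timestamp_ms, "anomaly", kind)
        self._send(list(self._events))
        self._events.clear()
```

The capacity decides how much history precedes each anomaly report, so it is a trade-off between memory and context.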

Ensure timeline consistency

Another key component of traces is timing. Embedded devices are often made of multiple boards with multiple inputs, such as sensors and user input, and outputs, such as motors and screens, so you need to track timing to reconstruct a timeline of what happened. The tracking needs to happen at the millisecond, sometimes nanosecond, level, and each timestamp needs to be precisely aligned with those of other components. Because a device can have several microcontroller units (MCUs) that start at different times, are clocked at different frequencies, and run at different temperatures, timelines across components can drift.

There are two ways to ensure traces can be timed correctly. The first is to synchronize time across the different boards, so that data carries a coherent timestamp across nodes, by sending dedicated synchronization messages. The more the clocks drift, the more frequently you need to send those messages. But because a synchronization message needs a predictable latency to guarantee the accuracy of the timestamp, devices generally need to stop every other operation on the network so that the latency is always the same. This can be problematic for some products.
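The link between drift and message frequency is simple arithmetic. As a rough sketch, assuming the two clocks drift apart at a constant rate expressed in parts per million (the function name and parameters are hypothetical):

```python
def max_sync_period_s(max_error_s, relative_drift_ppm):
    """Longest interval between sync messages that keeps clock error within bounds.

    Two clocks drifting apart at `relative_drift_ppm` parts per million
    accumulate `relative_drift_ppm * 1e-6` seconds of error per second.
    """
    if max_error_s <= 0 or relative_drift_ppm <= 0:
        raise ValueError("both arguments must be positive")
    return max_error_s / (relative_drift_ppm * 1e-6)
```

For example, tolerating 1 ms of error between clocks that drift apart at 50 ppm means resynchronizing roughly every 20 seconds. Tightening the tolerance to 1 µs brings that down to roughly every 20 ms, which is where stopping the network for each message starts to hurt.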

A second, newer way of doing it, pioneered by a paper from the University of California, Berkeley, is to embrace latency and compute timelines based on it. Latency is a sum of delays. By measuring delays across the product, latency can be calculated, and a timeline can be reconstructed.

sourceLatency = communicationDate - dataGenerationDate
targetLatency = dataConsumingDate - communicationDate
networkLatency = propagationTime + IRQraise
totalLatency = sourceLatency + targetLatency + networkLatency

The advantage of that method is that it constantly produces consistent results, without having to worry about the frequency of synchronization messages and without needing to stop every other operation on the network. I wrote a detailed paper on how to implement this methodology for embedded purposes.
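The formulas above translate almost directly into code. The following is a minimal sketch, not the implementation from the paper. It assumes all values are integer nanoseconds, that the source and the target each read the communication date on their own clock (so it appears as two parameters), and that the propagation time and the interrupt delay are known for the network in use.

```python
def total_latency_ns(data_generation, communication_sent,
                     communication_received, data_consuming,
                     propagation_time, irq_raise):
    """Sum the delays between data generation and data consumption.

    The first two timestamps come from the source clock and the next two from
    the target clock; only differences taken on the same clock are used.
    """
    source_latency = communication_sent - data_generation
    target_latency = data_consuming - communication_received
    network_latency = propagation_time + irq_raise
    return source_latency + target_latency + network_latency


def generation_time_on_target_ns(data_consuming, total_latency):
    """Place the data generation event on the target's own timeline."""
    return data_consuming - total_latency
```

Because each difference is taken on a single clock, the two nodes never need to agree on an absolute time. The target can date the data on its own timeline by subtracting the total latency from the moment it consumed the data.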

Look for bug trends

Finally, with embedded projects, issues often come from a specific part or assembly of the implementation. That is why keeping track of your bug history is important: it lets you identify, as quickly as possible, trends pointing to problematic areas or to a set of devices sharing a specific set of versions. Open-source tools such as Luos or Freedom Robotics can help you accurately monitor the events that occur in your embedded system so that you can resolve them more easily and quickly.
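Spotting such trends can start with something as simple as counting bug reports per combination of versions. This is a minimal sketch, assuming each bug report is a dictionary with the hypothetical fields `software_version`, `board_revision`, and `assembly_line`:

```python
from collections import Counter


def bug_trends(bug_reports,
               keys=("software_version", "board_revision", "assembly_line")):
    """Count bug reports per combination of the given fields, most frequent first."""
    counts = Counter(tuple(report[key] for key in keys) for report in bug_reports)
    return counts.most_common()
```

Passing a single key, such as `keys=("assembly_line",)`, shows whether one manufacturing line stands out regardless of the software it runs.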
