Error Recovery

lib/v2.1 Keep the program going.

Status

This Document

Half-baked draft.

Library

Currently, the run-time error handling in lib/v2.1 is modified as described in the sections Recovery Concepts and Modules below. Module Main configures the lib/v2.0 behaviour (“print and halt”).

The current work focuses on allowing and supporting a straightforward full restart of the system.

Overview

A hallmark of a control program is that it keeps going even under error conditions; in other words, it attempts to recover from errors at run-time. Just printing an error message and halting cannot be an option, considering that the responsibility of a control system is to, well, control a controlled system – that is, to keep its state within viable boundaries.1

During development and testing, printing an error message, plus possibly error analysis information such as a stack trace, and then halting at the error point is useful. In the field, logging the error, attempting to recover, and issuing an alarm is the more useful approach. The error logs should provide sufficient useful information to analyse the problem.

Recovery Concepts

  1. Catch the run-time error with one or more exception handlers. Compiler-inserted checks use the SVC exception to signal the problem, while faults detected by the MCU use the corresponding system exceptions.
  2. Collect a minimal but useful set of data to identify and document the run-time error right when it occurs, eg. to determine corrective actions, or later when analysing logs.
  3. Trigger a PendSV exception to take it from there. This exception invokes an error handler that is supplied by the control program (see the sketch below).
  4. Due to tail chaining, the error handler finds the stack as created by the primary run-time error exception handler.

The possible and useful actions by the PendSV error handler depend on the purpose and nature of the control system as well as the controlled system. The approach here is to provide mechanisms and tools to implement this variety, not fixed policies and solutions. Default or example error handling implementations are planned, though.
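
As an illustration of steps 1 to 3, here is a minimal sketch in C for a Cortex-M MCU. It is not the lib/v2.1 implementation; the names ErrorRecord and ErrorCatch are made up for the example, and only the ICSR register address and the PENDSVSET bit are standard Cortex-M definitions.

    #include <stdint.h>

    /* Cortex-M Interrupt Control and State Register (ICSR) */
    #define SCB_ICSR        (*(volatile uint32_t *) 0xE000ED04u)
    #define ICSR_PENDSVSET  (1u << 28)          /* write 1 to pend PendSV */

    typedef struct {            /* hypothetical minimal error record */
        uint32_t stackedPc;     /* program counter at the error point */
        uint32_t errorCode;     /* eg. SVC number or fault status */
    } ErrorRecord;

    static volatile ErrorRecord errorRecord;

    /* called from the SVC or fault handler with the address of the stacked
       exception frame: r0 r1 r2 r3 r12 lr pc xpsr */
    void ErrorCatch(const uint32_t *frame, uint32_t code)
    {
        errorRecord.stackedPc = frame[6];       /* stacked PC */
        errorRecord.errorCode = code;
        SCB_ICSR = ICSR_PENDSVSET;              /* step 3: trigger PendSV */
        /* on exception return, PendSV tail-chains and finds the same frame */
    }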

Modules

With lib/v2.1, the run-time error handling in module RuntimeErrors is minimised to only catch the error, collect the essential error data, and trigger the program-specific error handler, as outlined above.

Further data collection, in particular creating a stack trace and collecting the stacked registers’ contents, is delegated to the installed error handler. Module Stacktrace provides corresponding facilities, formerly implemented in RuntimeErrors.
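
To illustrate what the installed handler has to do to reach the stacked registers, here is a generic Cortex-M0+ sketch in C with GCC inline assembly; it is not the Stacktrace code. The handler checks EXC_RETURN bit 2 to find out whether the frame was pushed onto the process or the main stack, and then reads the stacked registers from that frame.

    #include <stdint.h>

    /* the exception frame layout is r0 r1 r2 r3 r12 lr pc xpsr */
    void HandleError(const uint32_t *frame)
    {
        uint32_t pc = frame[6];     /* where the error occurred */
        uint32_t lr = frame[5];     /* the caller, ie. the first stack trace entry */
        (void) pc; (void) lr;       /* log, print, walk the stack further, ... */
    }

    __attribute__((naked)) void PendSV_Handler(void)
    {
        __asm volatile(
            "movs r0, #4            \n"     /* EXC_RETURN bit 2: which stack? */
            "mov  r1, lr            \n"
            "tst  r0, r1            \n"
            "beq  1f                \n"
            "mrs  r0, psp           \n"     /* frame on the process stack */
            "b    2f                \n"
            "1:                     \n"
            "mrs  r0, msp           \n"     /* frame on the main stack */
            "2:                     \n"
            "ldr  r1, =HandleError  \n"
            "bx   r1                \n"
        );
    }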

Module RuntimeErrorsOut provides a correspondingly extended error handler. It implements the lib/v2.0 default “catch, print, and halt” behaviour, which is useful for development and testing.

Error Causes

We have already covered software run-time errors via SVC and MCU faults. However, there are additional error causes to consider.

  • Watchdog: a watchdog is a hardware timer that needs to be reset (or reloaded) by the software to prevent it from “firing” (or biting?). A watchdog serves to catch the situation where the software does not advance, usually due to an infinite loop or a deadlock. On the RP2040 and RP2350, “firing” means a direct system, sub-system, or chip reset, fully implemented in hardware. Which parts of the system, or which sub-systems, are reset can be configured. Only the RP2350 can do chip-level resets. A minimal feeding loop is sketched after this list.

  • Dead threads, or dead code: program code that should be executed, but is not. This can be caused, for example, by a thread being starved because higher priority threads take precedence, or by a program state that prevents the code from being run. This can be tricky to detect, and must be solved in software, for example with each essential section of the program setting a mark when it runs, and an audit process detecting any missing mark (also sketched below), or with essential threads setting up individual time-outs to ensure they will be scheduled again.
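
As an illustration of the reload mechanism for the first point, here is a minimal feeding loop in C using the Raspberry Pi Pico SDK’s watchdog functions; it is not part of the library described here, and the 500 ms timeout is an arbitrary example value.

    #include <stdbool.h>
    #include "hardware/watchdog.h"

    int main(void)
    {
        if (watchdog_caused_reboot()) {
            /* the last reset was the watchdog firing: log it, maybe escalate */
        }
        watchdog_enable(500, true);     /* fire after 500 ms without a reload */
        for (;;) {
            /* control loop */
            watchdog_update();          /* reload ("feed") the watchdog */
        }
    }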
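
The mark-and-audit idea from the second point could look roughly like this; all names are made up for the sketch, and in a real system the audit would typically also be the place that feeds the watchdog.

    #include <stdbool.h>

    #define NUM_SECTIONS 3
    static volatile bool alive[NUM_SECTIONS];   /* one mark per essential section */

    /* each essential section/thread calls this when it has run */
    static inline void MarkAlive(unsigned section) { alive[section] = true; }

    /* audit, run once per audit period: true if every section has checked in */
    static bool AuditAndClear(void)
    {
        bool allAlive = true;
        for (unsigned i = 0; i < NUM_SECTIONS; i++) {
            if (!alive[i]) { allAlive = false; }    /* this section is dead */
            alive[i] = false;                       /* re-arm for the next period */
        }
        return allAlive;    /* if false: trigger a recovery action */
    }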

Recovery Actions

Recovering from a run-time error means that the control system autonomously attempts to get itself back into a defined state that allows it to continue to work.

Possible actions include:

  • Full reset, ie. restart the whole control program anew.
  • Partial reset, ie. restart only a thread (control process), or groups of threads.

Both resets, or restarts, can be done either

  • with a clean slate regarding the state (or state history), ie. erase and re-initialise any state information, or
  • from a known state saved before the run-time error.

The RPs also allow resetting specific peripherals only, hence a process could attempt to recover this way when it, say, reads wrong data from a sensor.
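
For example, with the Pico SDK’s hardware_resets functions (again, not the library discussed here, and assuming its reset_block/unreset_block_wait helpers), cycling a single peripheral block through reset looks like this; which block to reset would be decided by the error handler.

    #include "hardware/resets.h"

    /* put the I2C0 block into reset, then release it and wait until it is ready */
    static void ResetI2C0(void)
    {
        reset_block(RESETS_RESET_I2C0_BITS);
        unreset_block_wait(RESETS_RESET_I2C0_BITS);
        /* the peripheral must be re-configured before it is used again */
    }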

These possible actions can form an escalation path, starting with the least invasive and progressing to the most invasive in case the problem cannot be solved, for example (a possible encoding is sketched after the list):

  1. Reset a peripheral only.
  2. Restart a thread from a known state.
  3. Restart a thread from a clean state.
  4. Restart the system from a known state.
  5. Restart the system completely anew.
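
One simple way to encode such an escalation ladder is a counter of how often the same problem has recurred, mapped to the next, more invasive action; the enum and function below are only a sketch.

    typedef enum {      /* escalation levels, least to most invasive */
        ResetPeripheral,
        RestartThreadKnownState,
        RestartThreadCleanState,
        RestartSystemKnownState,
        RestartSystemCleanState
    } RecoveryAction;

    /* pick the next action based on how often this error has recurred */
    RecoveryAction NextAction(unsigned recurrences)
    {
        if (recurrences >= (unsigned) RestartSystemCleanState) {
            return RestartSystemCleanState;     /* cap at the last resort */
        }
        return (RecoveryAction) recurrences;    /* zero recurrences: least invasive */
    }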

Obviously, we can go crazy with all these possibilities, but we need to take into account that all that recovery code can get pretty hairy, in particular when trying to restart from known states, which may depend on the states of cooperating threads, and so on. Trying to be “too clever” here may actually result in a less robust system.

Often, simply restarting the control system with a clean slate is the most effective and robust action. We just need to make sure that the system does not enter an infinite restart-loop, which means the control program must supervise its own restarts – which is true for any recovery action taken.
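
Such a restart supervisor can be as simple as a counter that survives the restart. The sketch below assumes hypothetical PersistentRead/PersistentWrite helpers for that storage; on the RP2040 and RP2350, the watchdog scratch registers, which survive a watchdog reset but not a power-up, could serve this purpose.

    #include <stdint.h>

    /* hypothetical persistent storage that survives a restart */
    uint32_t PersistentRead(void);
    void     PersistentWrite(uint32_t value);

    #define MAX_RESTARTS 3u

    /* called early at start-up, before normal operation resumes */
    void CheckRestartBudget(void)
    {
        uint32_t restarts = PersistentRead();
        if (restarts >= MAX_RESTARTS) {
            /* give up: put the controlled system into a stable default state,
               a degraded mode, or shut down safely (see the footnote) */
        } else {
            PersistentWrite(restarts + 1);
        }
    }

    /* called once the system has been running correctly for some time */
    void ConfirmRecovered(void)
    {
        PersistentWrite(0);     /* reset the restart budget */
    }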

However, there may be cases where the state of the controlled system has been calculated and updated over a long time period, such as the state vector of a vehicle moving using inertial navigation. Throwing away that state cannot be the first option, since the vehicle would completely “forget” its current position. In such a situation, our control system must have a safe and secure way to store and keep its state, including consistency checks and all that.
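
A minimal way to make such a saved state trustworthy is to store it together with a marker word and a checksum, and to accept it on restart only if both verify. The record layout below is an assumption made for the sketch, not a prescribed format.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define STATE_MAGIC 0x53544131u     /* layout/version marker ("STA1") */

    typedef struct {
        uint32_t magic;
        float    position[3];           /* example state: a navigation state vector */
        float    velocity[3];
        uint32_t checksum;              /* over all preceding fields */
    } SavedState;

    /* toy checksum (rotate and xor); a real system would use a CRC */
    static uint32_t Checksum(const void *data, size_t len)
    {
        const uint8_t *p = data;
        uint32_t sum = 0;
        for (size_t i = 0; i < len; i++) {
            sum = (sum << 1 | sum >> 31) ^ p[i];
        }
        return sum;
    }

    /* accept the saved state only if marker and checksum are consistent */
    bool RestoreState(const SavedState *saved, SavedState *out)
    {
        if (saved->magic != STATE_MAGIC) { return false; }
        if (saved->checksum != Checksum(saved, offsetof(SavedState, checksum))) {
            return false;
        }
        *out = *saved;
        return true;
    }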


  1. Naturally, at some point recovery from errors may become impossible, and the control system has to give up, but not without attempting to leave the controlled system in a stable default state, functioning in a degraded fashion, or shutting down safely if that’s an option, but avoiding the Big Disaster. ↩︎

Updated: 2025-05-04