Error Recovery

lib/v2.1 Keep the program going.

Status

This Document

Half-baked draft.

Library

Currently, the run-time error handling in lib/v2.1 is modified as described in the sections Recovery Concepts and Modules below. Module Main configures the lib/v2.0 behaviour (“print and halt”).

The current work focuses on allowing and supporting a straightforward full restart of the system.

Overview

A hallmark of a control program is that it keeps going even under error conditions; in other words, it attempts to recover from errors at run-time. Just printing an error message and halting cannot be an option, considering that the responsibility of a control system is to, well, control a controlled system – that is, to keep its state within viable boundaries.1

During development and testing, printing an error message, plus possibly error analysis information such as a stack trace, and then halting at the error point is useful. In the field, logging the error, attempting to recover, and issuing an alarm is the more useful approach. The error logs should provide sufficient useful information to analyse the problem.

Recovery Concepts

  1. Catch the run-time error with one or more exception handlers. Compiler-inserted checks use the SVC exception to signal the problem, while faults detected by the MCU use the corresponding system exceptions.
  2. Collect a minimal but useful set of data to identify and document the run-time error right when it occurs, eg. to determine corrective actions, or later when analysing logs.
  3. Trigger a PendSV exception to take it from there. This exception invokes an error handler that is supplied by the control program (see the sketch below).
  4. Due to tail chaining, the error handler finds the stack as created by the primary run-time error exception handler.

The possible and useful actions by the PendSV error handler depend on the purpose and nature of the control system as well as the controlled system. The approach here is to provide mechanisms and tools to implement this variety, not fixed policies and solutions. Default or example error handling implementations are planned, though.
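
As an illustration of steps 1 to 3, here is a minimal sketch in C for a Cortex-M MCU. It is not the lib/v2.1 implementation; the names ErrorRecord and ErrorCatch are made up for the example, and only the ICSR register address and the PENDSVSET bit are standard Cortex-M definitions.

    #include <stdint.h>

    /* Cortex-M Interrupt Control and State Register (ICSR) */
    #define SCB_ICSR        (*(volatile uint32_t *) 0xE000ED04u)
    #define ICSR_PENDSVSET  (1u << 28)          /* write 1 to pend PendSV */

    typedef struct {            /* hypothetical minimal error record */
        uint32_t stackedPc;     /* program counter at the error point */
        uint32_t errorCode;     /* eg. SVC number or fault status */
    } ErrorRecord;

    static volatile ErrorRecord errorRecord;

    /* called from the SVC or fault handler with the address of the stacked
       exception frame: r0 r1 r2 r3 r12 lr pc xpsr */
    void ErrorCatch(const uint32_t *frame, uint32_t code)
    {
        errorRecord.stackedPc = frame[6];       /* stacked PC */
        errorRecord.errorCode = code;
        SCB_ICSR = ICSR_PENDSVSET;              /* step 3: trigger PendSV */
        /* on exception return, PendSV tail-chains and finds the same frame */
    }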

Modules

With lib/v2.1, the run-time error handling in module RuntimeErrors is minimised to only catch the error, collect the essential error data, and trigger the program-specific error handler, as outlined above.

Further data collection, in particular creating a stack trace and collecting the stacked registers’ contents, is delegated to the installed error handler. Module Stacktrace provides corresponding facilities, formerly implemented in RuntimeErrors.
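
To illustrate what the installed handler has to do to reach the stacked registers, here is a generic Cortex-M0+ sketch in C with GCC inline assembly; it is not the Stacktrace code. The handler checks EXC_RETURN bit 2 to find out whether the frame was pushed onto the process or the main stack, and then reads the stacked registers from that frame.

    #include <stdint.h>

    /* the exception frame layout is r0 r1 r2 r3 r12 lr pc xpsr */
    void HandleError(const uint32_t *frame)
    {
        uint32_t pc = frame[6];     /* where the error occurred */
        uint32_t lr = frame[5];     /* the caller, ie. the first stack trace entry */
        (void) pc; (void) lr;       /* log, print, walk the stack further, ... */
    }

    __attribute__((naked)) void PendSV_Handler(void)
    {
        __asm volatile(
            "movs r0, #4            \n"     /* EXC_RETURN bit 2: which stack? */
            "mov  r1, lr            \n"
            "tst  r0, r1            \n"
            "beq  1f                \n"
            "mrs  r0, psp           \n"     /* frame on the process stack */
            "b    2f                \n"
            "1:                     \n"
            "mrs  r0, msp           \n"     /* frame on the main stack */
            "2:                     \n"
            "ldr  r1, =HandleError  \n"
            "bx   r1                \n"
        );
    }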

Module RuntimeErrorsOut provides a correspondingly extended error handler. It implements the lib/v2.0 default “catch, print, and halt” behaviour, which is useful for development and testing.

Error Causes

We have already covered software run-time errors via SVC and MCU faults. However, there are additional error causes to consider.

  • Watchdog: a watchdog is a hardware timer that needs to be reset (or reloaded) by the software to prevent it from “firing” (or biting?). A watchdog serves to catch the situation where the software does not advance, usually due to an infinite loop or a deadlock. On the RP2040 and RP2350, “firing” means a direct system, sub-system, or chip reset, fully implemented in hardware. Which parts of the system, or which sub-systems, are reset can be configured. Only the RP2350 can do chip-level resets. A minimal feeding loop is sketched after this list.

  • Dead threads, or dead code: program code that should be executed, but is not. This can be caused, for example, by a thread being starved because higher priority threads take precedence, or by a program state that prevents the code from being run. This can be tricky to detect, and must be solved in software, for example with each essential section of the program setting a mark when it runs, and an audit process detecting any missing mark (also sketched below), or with essential threads setting up individual time-outs to ensure they will be scheduled again.
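
As an illustration of the reload mechanism for the first point, here is a minimal feeding loop in C using the Raspberry Pi Pico SDK’s watchdog functions; it is not part of the library described here, and the 500 ms timeout is an arbitrary example value.

    #include <stdbool.h>
    #include "hardware/watchdog.h"

    int main(void)
    {
        if (watchdog_caused_reboot()) {
            /* the last reset was the watchdog firing: log it, maybe escalate */
        }
        watchdog_enable(500, true);     /* fire after 500 ms without a reload */
        for (;;) {
            /* control loop */
            watchdog_update();          /* reload ("feed") the watchdog */
        }
    }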
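
The mark-and-audit idea from the second point could look roughly like this; all names are made up for the sketch, and in a real system the audit would typically also be the place that feeds the watchdog.

    #include <stdbool.h>

    #define NUM_SECTIONS 3
    static volatile bool alive[NUM_SECTIONS];   /* one mark per essential section */

    /* each essential section/thread calls this when it has run */
    static inline void MarkAlive(unsigned section) { alive[section] = true; }

    /* audit, run once per audit period: true if every section has checked in */
    static bool AuditAndClear(void)
    {
        bool allAlive = true;
        for (unsigned i = 0; i < NUM_SECTIONS; i++) {
            if (!alive[i]) { allAlive = false; }    /* this section is dead */
            alive[i] = false;                       /* re-arm for the next period */
        }
        return allAlive;    /* if false: trigger a recovery action */
    }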

Recovery Actions

Recovering from a run-time error means that the control system autonomously attempts to get itself back into a defined state that allows it to continue to work.

Possible actions include:

  • Full reset, ie. restart the whole control program anew.
  • Partial reset, ie. restart only a thread (control process), or groups of threads.

Both resets, or restarts, can be done either

  • with a clean slate regarding the state (or state history), ie. erase and re-initialise any state information, or
  • from a known state saved before the run-time error.

The RPs also allow resetting specific peripherals only, hence a process could attempt to recover this way when it, say, reads wrong data from a sensor.
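
For example, with the Pico SDK’s hardware_resets functions (again, not the library discussed here, and assuming its reset_block/unreset_block_wait helpers), cycling a single peripheral block through reset looks like this; which block to reset would be decided by the error handler.

    #include "hardware/resets.h"

    /* put the I2C0 block into reset, then release it and wait until it is ready */
    static void ResetI2C0(void)
    {
        reset_block(RESETS_RESET_I2C0_BITS);
        unreset_block_wait(RESETS_RESET_I2C0_BITS);
        /* the peripheral must be re-configured before it is used again */
    }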

These possible actions can form an escalation path, starting with the least invasive and progressing to the most invasive in case the problem cannot be solved, for example (a possible encoding is sketched after the list):

  1. Reset a peripheral only.
  2. Restart a thread from a known state.
  3. Restart a thread from a clean state.
  4. Restart the system from a known state.
  5. Restart the system completely anew.
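
One simple way to encode such an escalation ladder is a counter of how often the same problem has recurred, mapped to the next, more invasive action; the enum and function below are only a sketch.

    typedef enum {      /* escalation levels, least to most invasive */
        ResetPeripheral,
        RestartThreadKnownState,
        RestartThreadCleanState,
        RestartSystemKnownState,
        RestartSystemCleanState
    } RecoveryAction;

    /* pick the next action based on how often this error has recurred */
    RecoveryAction NextAction(unsigned recurrences)
    {
        if (recurrences >= (unsigned) RestartSystemCleanState) {
            return RestartSystemCleanState;     /* cap at the last resort */
        }
        return (RecoveryAction) recurrences;    /* zero recurrences: least invasive */
    }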

Obviously, we can go crazy with all these possibilities, but we need to take into account that all that recovery code can get pretty hairy, in particular when trying to restart from known states, which may depend on the states of cooperating threads, and so on. Trying to be “too clever” here may actually result in a less robust system.

Often, simply restarting the control system with a clean slate is the most effective and robust action. We just need to make sure that the system does not enter an infinite restart-loop, which means the control program must supervise its own restarts – which is true for any recovery action taken.
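
Such a restart supervisor can be as simple as a counter that survives the restart. The sketch below assumes hypothetical PersistentRead/PersistentWrite helpers for that storage; on the RP2040 and RP2350, the watchdog scratch registers, which survive a watchdog reset but not a power-up, could serve this purpose.

    #include <stdint.h>

    /* hypothetical persistent storage that survives a restart */
    uint32_t PersistentRead(void);
    void     PersistentWrite(uint32_t value);

    #define MAX_RESTARTS 3u

    /* called early at start-up, before normal operation resumes */
    void CheckRestartBudget(void)
    {
        uint32_t restarts = PersistentRead();
        if (restarts >= MAX_RESTARTS) {
            /* give up: put the controlled system into a stable default state,
               a degraded mode, or shut down safely (see the footnote) */
        } else {
            PersistentWrite(restarts + 1);
        }
    }

    /* called once the system has been running correctly for some time */
    void ConfirmRecovered(void)
    {
        PersistentWrite(0);     /* reset the restart budget */
    }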

However, there may be cases where the state of the controlled system has been calculated and updated over a long time period, such as the state vector of a vehicle moving using inertial navigation. Throwing away that state cannot be the first option, since the vehicle would completely “forget” its current position. In such a situation, our control system must have a safe and secure way to store and keep its state, including consistency checks and all that.
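
A minimal way to make such a saved state trustworthy is to store it together with a marker word and a checksum, and to accept it on restart only if both verify. The record layout below is an assumption made for the sketch, not a prescribed format.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define STATE_MAGIC 0x53544131u     /* layout/version marker ("STA1") */

    typedef struct {
        uint32_t magic;
        float    position[3];           /* example state: a navigation state vector */
        float    velocity[3];
        uint32_t checksum;              /* over all preceding fields */
    } SavedState;

    /* toy checksum (rotate and xor); a real system would use a CRC */
    static uint32_t Checksum(const void *data, size_t len)
    {
        const uint8_t *p = data;
        uint32_t sum = 0;
        for (size_t i = 0; i < len; i++) {
            sum = (sum << 1 | sum >> 31) ^ p[i];
        }
        return sum;
    }

    /* accept the saved state only if marker and checksum are consistent */
    bool RestoreState(const SavedState *saved, SavedState *out)
    {
        if (saved->magic != STATE_MAGIC) { return false; }
        if (saved->checksum != Checksum(saved, offsetof(SavedState, checksum))) {
            return false;
        }
        *out = *saved;
        return true;
    }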


  1. Naturally, at some point recovery from errors may become impossible, and the control system has to give up, but not without attempting to leave the controlled system in a stable default state, functioning in a degraded fashion, or shutting down safely if that’s an option, but avoiding the Big Disaster. ↩︎

Updated: 2025-05-04