Two Processor Cores – Oberon RTK

Introduction

Aspects

With two processor cores available, certain aspects of creating programs and library modules need specific consideration, including:

concurrent execution of the same code at the same time by both cores
memory segmentation and allocation between the cores
data sharing and separation
module initialisation
use of hardware resources, such as peripheral devices

Basics

Generally speaking, the code needs to be re-entrant and thread-safe, meaning that the two cores can execute the same code at the same time without any impact on control flow and results, or on the operation of the hardware devices (peripherals).

For any procedure, this means at minimum:

don’t hold state between calls, and
don’t use non-local variables as temporary storage, eg. string buffers

In other words, the crucial question is: can this procedure be called at any time from both cores, and also – not directly related to the two-core architecture – can this procedure be interrupted at any time, with the “interrupter” calling it as well? Such an interruption could be caused by a hardware interrupt, or by a preemptive scheduler, time-sliced or not. For interrupt handlers, we can define restrictions for what is allowed; for library modules in general, we should not need to stipulate such restrictions.

Temporary storage should be

allocated on the stack, or
provided by the caller via the API

If a procedure’s state needs to be retained, the caller needs to provide the corresponding storage via the API. For state not pertaining to a single procedure, see next point.
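The difference between shared and caller-provided temporary storage can be sketched as follows. This is a host-runnable model in C, not RTK code (RTK itself is written in Oberon), and the names `int_to_str_shared` and `int_to_str` are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* NOT re-entrant: the result lives in a module-level buffer that a second
   caller (the other core, an interrupt handler) can overwrite at any time. */
static char shared_buf[12];

const char *int_to_str_shared(uint32_t value) {
    size_t i = sizeof shared_buf - 1;
    shared_buf[i] = '\0';
    do {
        shared_buf[--i] = (char)('0' + value % 10u);
        value /= 10u;
    } while (value != 0u && i > 0);
    return &shared_buf[i];
}

/* Re-entrant: the caller provides the result storage via the API, and all
   working variables are locals on the stack of the calling core or thread.
   Returns the number of characters written (excluding the terminator),
   or 0 if the caller's buffer is too small. */
size_t int_to_str(uint32_t value, char *buf, size_t buf_len) {
    char tmp[10];                       /* a 32-bit value has at most 10 digits */
    size_t n = 0;
    assert(buf != NULL);
    do {
        tmp[n++] = (char)('0' + value % 10u);
        value /= 10u;
    } while (value != 0u);
    if (n + 1 > buf_len) return 0;
    for (size_t i = 0; i < n; i++) buf[i] = tmp[n - 1 - i];
    buf[n] = '\0';
    return n;
}
```

Two concurrent calls of `int_to_str` cannot disturb each other as long as each caller passes its own buffer, while two concurrent calls of `int_to_str_shared` write to the same twelve bytes.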

Synchronisation

Related to state, another issue is (potentially) shared data and devices. We need to ensure that calls from both cores (or interrupts) do not

corrupt the shared data, or
interfere with the physical operation of a device.

This can be achieved by:

convention and program design: we assign specific data or a specific device to a core, that is, only one core is allowed to mutate and use it
mutual lockout: we ensure that the shared data is never accessed by one core while the other still mutates it, or a device is never used by two cores at the same time.

Depending on the nature of our control program and the modules, convention and program design can mean:

implemented in software: the procedure detects which core is calling it, and selects the corresponding data
enforced in software: generate a runtime error, since its violation is a programming error (or malicious code)
a rule that programmers abide by: “just don’t do it”

The mutual lockout (or synchronisation) can be internal or external:

internal: the procedure itself claims a lock and releases it, transparent to the caller;
external: we require the caller to claim a lock before calling, and then release it when done.

With internal synchronisation, only a single call can be protected; with external synchronisation, the protection can span several procedure calls, for example to lock a serial device while several output procedures are called to complete the desired output.
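The two variants can be sketched as follows, as a host-runnable model in C rather than RTK's Oberon. The `Device` type and all procedure names are hypothetical. C11 `atomic_flag` stands in for the lock only so that the sketch runs on a host; on the RP2040 itself, the hardware spinlocks of the SIO block would be a more natural primitive.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a serial device: a lock plus an output buffer. */
typedef struct {
    atomic_flag lock;
    char out[64];
    size_t len;
} Device;

static void claim(Device *d) {
    /* spin until the flag was clear before our test-and-set: we own the lock */
    while (atomic_flag_test_and_set_explicit(&d->lock, memory_order_acquire)) { }
}

static void release(Device *d) {
    atomic_flag_clear_explicit(&d->lock, memory_order_release);
}

/* Unprotected primitive: only correct while the caller holds the lock. */
static void put_string_raw(Device *d, const char *s) {
    assert(d != NULL && s != NULL);
    while (*s != '\0' && d->len < sizeof d->out - 1) d->out[d->len++] = *s++;
    d->out[d->len] = '\0';
}

/* Internal synchronisation: the lock is claimed and released inside the
   procedure, transparent to the caller. It protects this one call only. */
void put_string(Device *d, const char *s) {
    claim(d);
    put_string_raw(d, s);
    release(d);
}

/* External synchronisation: the lock brackets several calls, so the whole
   line is written without the other core interleaving its own output. */
void put_labelled_value(Device *d, const char *label, const char *value) {
    claim(d);
    put_string_raw(d, label);
    put_string_raw(d, ": ");
    put_string_raw(d, value);
    put_string_raw(d, "\n");
    release(d);
}
```

Note that a spinning lock like this only arbitrates between cores: an interrupt handler that tries to claim a lock already held by the code it interrupted on the same core would spin forever.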

Before we look at different modules in the RTK framework against the above background, let’s have a look at the memory organisation.

Memory Segmentation and Access

The RP2040 is a bare metal device, and unless we install an operating system, we are confronted with all the nitty-gritty details of memory management from the ground up. We have to design and implement appropriate solutions for our two-core programs. We need stacks, heaps, vector tables, and some space for module variables.

In general, each core of the RP2040 can access the whole memory space.¹ This includes SRAM, flash memory, and peripheral device addresses. For the program code, both cores must be able to access the same address space, be it in flash memory or SRAM.

Regarding volatile data storage, a processor core does not “know” about heap, or even stack memory; it’s the compiler and linker that instruct the CPU accordingly. Yes, the processor has specific instructions that facilitate using a stack, for example, since its designers assumed that one will be used, but we could write whole programs without using a stack or heap.

Of course we don’t want to go back to the stone ages, and not use a stack. Or two stacks, with two cores. Therefore we need to segment the available SRAM. At the minimum, each core requires a stack, with a stack starting address that provides sufficient space to grow downwards without interfering with the other, or the heap.

The initial stack address is provided

for core 0 during the boot sequence, read from the binary code file,
for core 1 when it gets activated by core 0: one of the values that core 0 must send to core 1 through the inter-core FIFO is the initial stack address.
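As an illustration of the second point, the following C model builds the word sequence that the RP2040 datasheet describes for launching core 1: three synchronisation words, then the vector table address, the initial stack address, and the entry point, each of which core 1 echoes back. Whether MultiCore.InitCoreOne proceeds exactly like this is an assumption; the names `build_launch_sequence`, `launch`, and `echo_core1` are hypothetical, and the FIFO is replaced by a function pointer so the sketch runs on a host.

```c
#include <assert.h>
#include <stdint.h>

enum { LAUNCH_WORDS = 6 };

/* Fills seq with the words core 0 sends through the inter-core FIFO. */
void build_launch_sequence(uint32_t seq[LAUNCH_WORDS], uint32_t vector_table,
                           uint32_t stack_start, uint32_t entry) {
    assert(seq != 0);
    seq[0] = 0;               /* synchronisation words */
    seq[1] = 0;
    seq[2] = 1;
    seq[3] = vector_table;    /* core 1 vector table address */
    seq[4] = stack_start;     /* core 1 initial stack address */
    seq[5] = entry;           /* address where core 1 starts executing */
}

/* Model of the FIFO exchange: send one word, return what core 1 echoes. */
typedef uint32_t (*Exchange)(uint32_t word);

/* Sends the sequence word by word and starts over whenever the echo does
   not match. Returns the number of restarts that were needed. */
int launch(const uint32_t seq[LAUNCH_WORDS], Exchange exchange) {
    int restarts = 0;
    int i = 0;
    while (i < LAUNCH_WORDS) {
        if (exchange(seq[i]) == seq[i]) {
            i++;
        } else {
            i = 0;
            restarts++;
        }
    }
    return restarts;
}

/* Simulated, well-behaved core 1 that echoes every word it receives. */
uint32_t echo_core1(uint32_t word) { return word; }
```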

Heap memory space and the module variables space are completely defined in software.

Memory Map

Astrobe for Cortex-M0 lets you slice and dice the memory map in many ways, using the configuration options – within the physical confines of the device. You specify the address range for the program code, and the one for data. Astrobe does not yet have options for more than one core (stay tuned), so we provide a solution via module Config, which extends the options for the other core.

The chosen memory map for SRAM and flash memory looks thusly:²

  SRAM:
  +---------------------------+ 020040000H = CoreOneStackStart
  |    core 1 stack           |
  |                           |
 ~~~                         ~~~
  |                           |
  |    core 1 heap            | 020030200H = CoreOneHeapStart
  +---------------------------+
  |                           |
  |    core 1 vector table    | 020030000H = CoreOneDataStart = LinkOptions.DataEnd
  +---------------------------+
  |    module data            |
  |                           |
  +---------------------------+ CoreZeroStackStart = LinkOptions.StackStart
  |    core 0 stack           |
  |                           |
  |                           |
 ~~~                         ~~~
  |                           |
  |                           |
  |    core 0 heap            | 020000200H = CoreZeroHeapStart = LinkOptions.HeapStart
  +---------------------------+
  |                           |
  |    core 0 vector table    | 020000000H = CoreZeroDataStart = LinkOptions.DataStart
  +---------------------------+

  Flash memory:
  +---------------------------+ 010200000H = CodeEnd = LinkOptions.CodeEnd
  |                           |
  |                           |
 ~~~   code (shared)         ~~~
  |                           |
  |                           | 010000100H = CodeStart = LinkOptions.CodeStart
  +---------------------------+
  |    boot code phase 2      | 010000000H
  +---------------------------+
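To get a feeling for the sizes involved, the addresses in the map above can be turned into byte counts. The following C sketch only does this arithmetic (Oberon's `020040000H` is `0x20040000` in C); the macro and function names are made up for illustration, and the values are the ones shown in the map, which may differ from those actually in use.

```c
#include <assert.h>
#include <stdint.h>

/* Addresses copied from the memory map. */
#define CORE_ZERO_DATA_START  0x20000000u
#define CORE_ZERO_HEAP_START  0x20000200u
#define CORE_ONE_DATA_START   0x20030000u
#define CORE_ONE_HEAP_START   0x20030200u
#define CORE_ONE_STACK_START  0x20040000u
#define CODE_START            0x10000100u
#define CODE_END              0x10200000u

/* Derived sizes in bytes. */
#define VECTOR_TABLE_SIZE    (CORE_ZERO_HEAP_START - CORE_ZERO_DATA_START)
#define CORE_ONE_HEAP_STACK  (CORE_ONE_STACK_START - CORE_ONE_HEAP_START)
/* core 0 heap and stack plus the module data share this range */
#define CORE_ZERO_REGION     (CORE_ONE_DATA_START - CORE_ZERO_HEAP_START)
#define CODE_SIZE            (CODE_END - CODE_START)

static_assert(VECTOR_TABLE_SIZE == 512u, "vector table size");
static_assert(CORE_ONE_HEAP_STACK == 65024u, "core 1 heap and stack");

/* Bytes left between a heap growing upwards and a stack growing downwards
   within the same region; 0 means the two have met or crossed. */
uint32_t free_between(uint32_t heap_top, uint32_t stack_ptr) {
    return stack_ptr > heap_top ? stack_ptr - heap_top : 0u;
}
```

With these values, core 1 has 65,024 bytes for its heap and stack together, and each vector table occupies 512 bytes.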

Using the kernel, the stack space for one core looks like this:

  +---------------------------+
  |    main stack (MSP)       |
  +---------------------------+
  |    thread 0 stack (PSP)   |
  +---------------------------+
  |    thread 1 stack (PSP)   |
  +---------------------------+
  |    thread 2 stack (PSP)   |
  +---------------------------+
 ~~~                         ~~~
  +---------------------------+
  |    thread n stack (PSP)   |
  +---------------------------+
  |                           |
  |                           |

The main stack (via MSP) is first used for initialisation, and then for exception handling. The threads use their own stacks, using the PSP.

Let’s focus on SRAM, and have a look at the different parts of the memory map.

Stack Memory

Each core has to get its own stack memory space; there’s no way around it. It’s the storage for local variables. We’ve mentioned above that temporary data should be kept on the stack to avoid any sharing of buffers and the like. Entries on the stack can point to non-local storage space though, via VAR and (value) POINTER parameters.

Heap Memory

The two cores could share a common heap, but this would require coordination, or arbitration, for the dynamic allocation of memory via NEW. Hence, each core gets its own heap memory space. Note that each core can access memory allocated on the other core’s heap anyway, provided it gets a corresponding POINTER.

Module Data

These are the VARs declared at the module level. Subject to data visibility and accessibility defined for modules, module data is shared between cores.

Vector Table

Each core gets its own exception vector table. They could share one, as each core can be given the address of the vector table (MCU.SCB_VTOR). But exception handling is done by each core separately, based on its interrupt signals and settings of priorities etc., and having each core install exception vectors in a shared table just adds unnecessary complexity.

Device Registers

Most of the RP2040’s device registers have fully atomic set, clear, and xor aliases, using masks, hence they can be mutated without the need for a read-modify-write cycle, making writing the corresponding procedures much easier.
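The effect of the aliases can be modelled on a host as follows. The alias offsets are the ones given in the RP2040 datasheet (XOR at +01000H, set at +02000H, clear at +03000H relative to the register address); the `Reg` type and `reg_write` are a hypothetical C simulation, not RTK code. On the hardware, each alias write is a single bus write.

```c
#include <assert.h>
#include <stdint.h>

/* Alias offsets relative to a register's normal address. */
#define ALIAS_XOR 0x1000u
#define ALIAS_SET 0x2000u
#define ALIAS_CLR 0x3000u

/* Host model of one device register. */
typedef struct { uint32_t value; } Reg;

/* Models a write of mask to the register via the given alias offset;
   offset 0 is the plain register address. */
void reg_write(Reg *r, uint32_t alias, uint32_t mask) {
    assert(r != 0);
    switch (alias) {
        case 0:         r->value  = mask;  break;  /* plain write */
        case ALIAS_XOR: r->value ^= mask;  break;  /* toggle the masked bits */
        case ALIAS_SET: r->value |= mask;  break;  /* set the masked bits */
        case ALIAS_CLR: r->value &= ~mask; break;  /* clear the masked bits */
        default: assert(0 && "unknown alias"); break;
    }
}
```

Because the bit operation happens inside the device, two cores that change different bits of the same register through the aliases cannot lose each other's update, which they could with a software read-modify-write sequence.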

Program Initialisation

Here’s the initialisation sequence for example program NoBusyWaiting:

04CE4H  0F7FBFB8EH  bl.w     LinkOptions..init
04CE8H  0F7FBFB94H  bl.w     MCU2..init
04CECH  0F7FBFBC8H  bl.w     Config..init
04CF0H  0F7FBFC06H  bl.w     Resets..init
04CF4H  0F7FBFD26H  bl.w     GPIO..init
04CF8H  0F7FBFD4CH  bl.w     PowerOn..init
04CFCH  0F7FBFEC0H  bl.w     Clocks..init
04D00H  0F7FBFF52H  bl.w     MAU..init
04D04H  0F7FCF98EH  bl.w     Memory..init
04D08H  0F7FCF9A2H  bl.w     LED..init
04D0CH  0F7FCFE5EH  bl.w     RuntimeErrors..init
04D10H  0F7FDFABEH  bl.w     Error..init
04D14H  0F7FDFAFAH  bl.w     TextIO..init
04D18H  0F7FDFD52H  bl.w     Texts..init
04D1CH  0F7FDFF3CH  bl.w     ResData..init
04D20H  0F7FEFB52H  bl.w     RuntimeErrorsOu..init
04D24H  0F7FEFC8CH  bl.w     UARTdev..init
04D28H  0F7FEFCCCH  bl.w     UARTstr..init
04D2CH  0F7FEFD44H  bl.w     Coroutines..init
04D30H  0F7FEFD6EH  bl.w     SysTick..init
04D34H  0F7FFF994H  bl.w     Kernel..init
04D38H  0F7FFF9E6H  bl.w     UARTkstr..init
04D3CH  0F7FFFA8CH  bl.w     Terminals..init
04D40H  0F7FFFB74H  bl.w     Out..init
04D44H  0F7FFFBBCH  bl.w     Main..init
04D48H  0F7FFFC6CH  bl.w     MultiCore..init
04D4CH  0F7FFFD6EH  bl.w     Exceptions..init
04D50H  0F7FFFDE4H  bl.w     Timers..init
04D54H  0F7FFFF68H  bl.w     NoBusyWaitingC1..init
04D58H  0F7FFFFBEH  bl.w     NoBusyWaitingC0..init

The bodies of all modules are run in sequence as determined by the Astrobe build system – by core 0. Core 1 is not even awakened yet. Hence, no data corruption can happen due to concurrent data access until NoBusyWaitingC0..init is executed and calls NoBusyWaitingC0.run, which in turn wakes up core 1 by passing NoBusyWaitingC1.Run to MultiCore.InitCoreOne.

In order to maintain the correctness of the program, core 1 should only ever be activated from the program for core 0, that is, only after the complete module initialisation sequence has finished.

Let’s have a look at the different RTK framework modules and consider which memory is being allocated, and where, and how it is used.

Pure Hardware Access

Some modules operate the devices directly without creating data structures that represent abstractions of the hardware: Resets, GPIO, PowerOn, Clocks, LED. Any state is held directly in hardware, and the corresponding access procedures only use local storage on the stack, if any. Access by different cores is coordinated by program design (agreement, convention). Access restrictions from one core or the other for specific hardware resources could be implemented and enforced in software, though, depending on the requirements regarding run-time robustness.

A special case in this category are the device register addresses that are the same for each core, but actually operate on different hardware devices.³ One core does not “see” the other core’s registers in that address range. They include the system tick, system control block, and memory protection unit. Currently available related modules are Exceptions⁴ and SysTick.

Memory Allocation

About Recursion and Dynamic Memory Allocation

Like recursion, dynamic memory allocation is often not recommended (or even forbidden) by programming guidelines for control programs. The problem of course is: what does an unsupervised embedded program do in the case of a stack overflow due to recursion, or when all available heap memory is used? And that assumes the stack overflow is even detected, and does not just corrupt data.

I would strongly agree with not using recursion, as it can be replaced by iteration, but I think dynamic memory allocation can be allowed as follows:

RTK uses RECORDs to describe and represent hardware devices, and POINTERs to these records can easily be “passed around” and used as procedure parameters to select a specific instance of a device, such as a UART. Such RECORDs are only allocated if the program will actually use the device.
We know exactly which hardware devices will be used in our program, hence we can allocate all the needed device RECORDs at the start-up of the program. After the initialisation of the program, NEW is not required anymore, and we can even lock down the heap thereafter. During initialisation, we can ASSERT(p # NIL) for hard checks, since we’re catching program design or implementation errors, not run-time errors.
The same holds for other data structures, eg. in the kernel: create at start-up, assert for NIL.
If we use the kernel, the same holds, mutatis mutandis, for stack allocation.

So:

no recursion
allocate all heap and stack memory during program initialisation
possibly lock all allocation after the initialisation, eg. with Memory.LockHeaps.

Heap Memory and Stack Allocators

Module Memory implements the memory allocators for both cores. The data to manage these storage areas is held in two identical data structures, one per core, in the module data space. The access procedures check on which core they are running by enquiring the core number from the hardware (MCU.SIO_CPUID), then select the corresponding management data structure, and only ever mutate data in that RECORD. No conflict is possible between the cores.
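The selection scheme can be modelled as follows. This is a host-runnable C sketch of the idea, not the implementation of module Memory: the simple bump allocator, the `HeapState` type, and all names are assumptions, and the core number is simulated by a variable where the real code would read it from the hardware (MCU.SIO_CPUID).

```c
#include <assert.h>
#include <stdint.h>

enum { NUM_CORES = 2 };

/* Management data for one core's heap. */
typedef struct {
    uint32_t heap_top;    /* next free address */
    uint32_t heap_limit;  /* first address no longer available to the heap */
} HeapState;

/* Two identical structures in the module data space, one per core. */
static HeapState heaps[NUM_CORES];

/* Simulated core number, set by the test harness. */
static unsigned current_core;

static unsigned core_id(void) {
    assert(current_core < NUM_CORES);
    return current_core;
}

/* Sets up the heap of the calling core. */
void heap_init(uint32_t start, uint32_t limit) {
    HeapState *h = &heaps[core_id()];
    assert(start <= limit);
    h->heap_top = start;
    h->heap_limit = limit;
}

/* Allocates size bytes, rounded up to a multiple of four, from the heap of
   the calling core. Only that core's structure is read or mutated.
   Returns the block's address, or 0 if the request cannot be met. */
uint32_t heap_alloc(uint32_t size) {
    HeapState *h = &heaps[core_id()];
    uint32_t addr;
    if (size > UINT32_MAX - 3u) return 0;
    size = (size + 3u) & ~3u;
    if (size > h->heap_limit - h->heap_top) return 0;
    addr = h->heap_top;
    h->heap_top += size;
    return addr;
}
```

Since each core only ever indexes its own entry in `heaps`, no lock is needed between the cores, even though both entries live in shared module data.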

With cooperatively scheduled threads, or implicit threads with a “main loop program”, this arrangement also suffices to avoid conflict between threads. With a preemptive scheduler it suffices if all allocations are done during initialisation. Interrupts should not be allowed to allocate memory in any case.

Read-only Data

Some modules, among them Astrobe’s LinkOptions, as well as Config, set variables during initialisation that will never change during the run-time of a program. This kind of “lifetime” read-only variable is safe to access from both cores at any time.

While the exported variables of LinkOptions and Config are read-only enforced by the compiler, there are other data structures that are read-only by convention, see UARTdev.Device below.

Kernel

All data structures for the kernel data, as well as for threads and their coroutines, are created at start-up. Each core runs its own kernel and scheduler, and each thread is allocated to a core, hence all related data are strictly separated from each other. Access employs the same scheme as module Memory. All kernel data can only be accessed using the corresponding procedures (full encapsulation).

Data Structures Representing Hardware Devices, eg. UARTdev.Device

Module UARTdev, for example, defines a RECORD UARTdev.Device that represents one UART instance in software, and is accessed using a POINTER. The corresponding memory is allocated at start-up. The IO procedures in modules UARTstr and UARTkstr take a UARTdev.Device as parameter to access the specific hardware device. While it would be possible to make the fields of UARTdev.Device only accessible via corresponding procedures, a compromise was chosen for performance reasons, namely to make the address values for the transmit, receive, and flag registers directly accessible by making them public.

Of course, this makes these register addresses also open for modification by all modules that use UARTdev.Device, such as UARTstr and UARTkstr. The solution here is a programmer guideline: don’t do it. As long as we don’t expect any malicious code, this is sufficient. More protection is always possible, but at the cost of more code complexity as well as run-time overhead.

While the procedures in UARTstr and UARTkstr are re-entrant – they only use local storage, and two concurrent calls can happily share the same instance of UARTdev.Device in read-only mode – the use of the actual hardware of course is not. The two calls would attempt to write and read the same hardware registers at the same time, resulting in havoc. Hence, the program design must ensure that the two cores do not attempt to use the same UART at the same time.

See Text Output and Input for a possible solution, which can be tweaked in many ways to fit a program’s needs. The software is re-entrant; the synchronised access to the hardware peripherals is solved by program design, including the use of signals, semaphores, or messages.

Run-time Error Handling

Only one run-time error can happen at any one time per core.⁵ Module RuntimeErrors provides a separate set of module variables per core for the data structures that hold error and fault data. It uses the same access separation mechanics as module Memory.

Module RuntimeErrorsOut, for printing the error data collected by RuntimeErrors, uses the TextIO.Writer channel infrastructure (Text Output and Input), ie. part re-entrant procedures, part program design to access the output peripheral device.

See Runtime Errors.

Summary

The RTK framework’s procedures are written in a re-entrant fashion:
- no state
- temporary data local on the stack
Data structures are separated per core, and accessed only from software on that core, which is implemented in software:
- memory management (heap, stacks)
- run-time errors
- kernel
The driver software for hardware peripheral devices is re-entrant, ie. the cores can use the same procedures at the same time, while access to the peripheral hardware devices themselves is synchronised by program design, for example:
- UART devices
- SPI devices
- GPIO via SIO
All memory allocation is done at program start-up, either during the module initialisation, or by the programs for the two cores.
Only wake up core 1 once all modules are initialised.
Shared read-only data is
- defined at start-up, and cannot be mutated: LinkOptions, Config
- defined when the data structures for devices are set up, and not mutated by their drivers by program design: eg. UARTdev, UARTstr, UARTkstr

¹ Unless we implement some protection scheme using the Memory Protection Unit (MPU).
² The actual values used may be different when you read this, but the principle remains.
³ All registers that start with M0PLUS_.
⁴ Exception (pun intended): Exceptions.SetNMI, which uses registers that are part of the system configuration address block. There we have one register per core to configure their respective NMIs (SYSCFG_PROC0_NMI_MASK, SYSCFG_PROC1_NMI_MASK).
⁵ Depending on the design and implementation, this can be different with a time-slicing thread scheduler.