CodeLoading

Introduction

I am working on a task system that allows running procedures with microsecond (us) timing, complementing the kernel’s threads, which usually run on a millisecond basis. For this, the RP2040’s timer device is used, in particular its alarm feature. The concept is that each core owns one of the four available alarms, and the corresponding alarm interrupt handler manages a queue of task procedures, which have been put there by a thread, to be run microseconds in the future, e.g. for fine-grained sensor readings or actuator operation. The task procedures are then run by the alarm interrupt handler.

With microsecond timing, our code is literally running against the clock, since the management of the aforementioned task queue requires some processing (apart from the run time of the task procedure itself). When I explored the limits of the timing, i.e. the minimum elapsed time into the future a task can be scheduled, and the minimum time between task executions, the results were somewhat baffling at first. Until I realised that the limits were given by the speed at which the instructions could be loaded from the program flash memory. Duh, I know.

Executing Code from Flash Memory

The RP2040 uses SoC-external flash memory, which is read serially using SSI/SPI. The Execute in Place (XIP) functionality allows instructions to be loaded as if the serially connected flash memory were a linear address space, where addresses are translated into serial reads, transparently to the CPU and the bus.

The SSI interface runs at 1/4 of the CPU clock speed, and uses QSPI (four parallel data lines). With a 125 MHz system clock, the serial interface runs at 31.25 MHz, and takes four SSI clocks to transfer a 16-bit instruction. So without any caching, reading one instruction from flash takes (1 / 31.25 MHz) * 4 = 128 nanoseconds in the best case.
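
Spelled out, with a 125 MHz system clock:

  SSI clock: 125 MHz / 4 = 31.25 MHz, i.e. 32 ns per SSI clock
  16-bit instruction over 4 data lines: 16 / 4 = 4 SSI clocks
  best-case instruction fetch: 4 * 32 ns = 128 ns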

The RP2040 provides a 16k on-chip SRAM cache for the flash memory, to alleviate the long loading times directly from flash. I could not find any in-depth description of the caching mechanics; it works transparently from a programmer’s point of view, much like XIP itself. Cached code should load as fast as if executed from SRAM, but obviously the cache has to “warm up” (terminology from the RP2040 datasheet), meaning that the first executions of a piece of code go through the lengthy reads from flash. Also, we generally cannot know if a certain piece of code is in the cache, or will be loaded directly from flash.

The flash cache can be ignored by reading from a different, mirrored address range, which comes in handy to determine the loading performance directly from flash memory: on the RP2040, the uncached mirror of the cached XIP range at base 010000000H sits at base 013000000H, which is exactly the address change used for the uncached test run below. This “raw” flash loading performance is particularly interesting and relevant for interrupt handlers, which are usually meant to react quickly to external events. They will show degraded performance on their first execution, i.e. before their code is cached, as will interrupt handlers that are not triggered frequently enough to keep their code in the cache (eviction).

If we require guaranteed code loading performance, we need to run time-critical interrupt handlers from SRAM, not flash. Which is, in fact, what the RP2040 datasheet recommends; even from different SRAM addresses for each core, if both cores use the same interrupt handler code, to avoid loading congestion and wait states.

Test Program Description

The purpose of this test example program is to measure and demonstrate the effects of

  • executing/loading from flash, with caching,
  • executing/loading from flash, without caching, and
  • executing/loading from SRAM.

The test procedure runs a series of loops. It is coded to yield a run time long enough to be measured in microseconds using the timer device, while at the same time avoiding that caching happens too quickly.

Each run is timed with microsecond precision, and the code loads via XIP and from SRAM are counted, using the corresponding bus performance counters of the RP2040.

To demonstrate how other procedures can be called from procedures executing in SRAM, the incrementing of the loop variable is artificially done via a separate procedure. As explained below, code running from SRAM must call other procedures via procedure variables. To keep the loop logic as simple as possible, there are two loop procedures: one to be used when the loop code is not loaded into SRAM (loop), the other when it is executed from SRAM (loopForRam). They are identical, apart from the call mechanics. The number of instructions, and thus instruction loads, differs slightly between these two loop procedures, since calling via a procedure variable takes a few more instructions.
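
For illustration, a minimal sketch of the two variants; apart from loop and loopForRam, the names are hypothetical, so please check the actual module source for the real code:

  VAR inc: PROCEDURE(VAR x: INTEGER); (* module-level procedure variable *)

  PROCEDURE increment(VAR x: INTEGER);
  BEGIN
    x := x + 1
  END increment;

  PROCEDURE loop(iter: INTEGER); (* run from flash: direct, PC-relative call *)
    VAR i: INTEGER;
  BEGIN
    i := 0;
    WHILE i < iter DO increment(i) END
  END loop;

  PROCEDURE loopForRam(iter: INTEGER); (* copied to SRAM: indirect call via 'inc' *)
    VAR i: INTEGER;
  BEGIN
    i := 0;
    WHILE i < iter DO inc(i) END
  END loopForRam;

with inc := increment assigned in the module’s initialisation code, i.e. before loopForRam is copied to, and run from, SRAM.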

Please refer to the module source code.

Executing Code from SRAM

Note that Astrobe for Cortex-M0 does not yet “officially” support the RP2040, and I don’t know if there will ever be direct support for executing code from SRAM. The concepts and approach used here work for me.

Also, my considerations and method of executing code from SRAM may only work for the cases I have examined and tested. I may well not have considered certain cases.

A procedure compiled with the Astrobe compiler is “self-contained” with regard to relocating it to a different memory section, such as SRAM, as follows:

  • all local variables on the stack are addressed independently of the program counter (PC) value, that is, independently of the address of the code in memory;
  • the module variables are addressed via “redirecting” constant values that are stored directly after the procedure in code memory, from where they are read using PC-relative addressing, hence if we copy these constant values to SRAM together with the procedure code, they will be used correctly;
  • calls to module-local procedures use PC-relative addressing, so unless we reconstruct the corresponding memory layout for the relevant procedures in SRAM, the call will fail. While it would be possible to create a working memory layout in SRAM, it is easier, and more maintainable, to use module-level procedure variables for all procedures called from SRAM, which will be addressed correctly as per the previous point;
  • calls to module-external procedures also need to be done via module-level procedure variables.

Note that we consider running code directly from SRAM for exceptional cases only, such as reaction-time-sensitive interrupt handlers, hence the above condition of using procedure variables for all called procedures is not that onerous.
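
The same mechanics apply to module-external procedures. A minimal sketch, with a hypothetical imported module Ext providing a procedure Work:

  MODULE M;
    IMPORT Ext;
    VAR work: PROCEDURE; (* module-level procedure variable *)

    PROCEDURE handler; (* to be copied to, and run from, SRAM *)
    BEGIN
      work (* indirect call: the target address is read via the module variable *)
    END handler;

  BEGIN
    work := Ext.Work (* assigned once, by init code still running from flash *)
  END M.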

Copying Code from Flash Memory to SRAM

Module MemoryExt provides a facility, CopyProc, to copy a procedure to SRAM. It scans the flash program memory, starting from the start address of the procedure, for the next instruction push { ... lr}, which marks the start of the next procedure, or of the module’s init code, respectively, and then copies the corresponding code values to SRAM, including the aforementioned constant values right after the procedure’s executable code.
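
For illustration, a minimal sketch of the scanning idea, not the actual implementation in MemoryExt: in Thumb code, push {..., lr} is encoded as 0B5xxH, so its upper byte identifies a procedure prologue. The sketch reads word-aligned 32-bit values via SYSTEM.GET (IMPORT SYSTEM) and checks both contained halfwords; it does not guard against data words that happen to match the pattern.

  CONST PushLr = 0B5H; (* upper byte of Thumb 'push {..., lr}' *)

  PROCEDURE findNextProc(addr: INTEGER): INTEGER;
  (* return the address of the next 'push {..., lr}' after 'addr' *)
    VAR w, found: INTEGER;
  BEGIN
    found := 0;
    addr := (addr DIV 4) * 4; (* word-align *)
    REPEAT
      INC(addr, 4);
      SYSTEM.GET(addr, w); (* one word = two 16-bit instruction slots *)
      IF w MOD 10000H DIV 100H = PushLr THEN (* lower halfword *)
        found := addr
      ELSIF ROR(w, 24) MOD 100H = PushLr THEN (* upper halfword *)
        found := addr + 2
      END
    UNTIL found # 0;
    RETURN found
  END findNextProc;

The code from the procedure’s start address up to, but not including, the found address is then copied to SRAM.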

The memory layout parameters for Astrobe, extended via module Config, only use the lower 256k of the RP2040’s SRAM (four 64k banks, banks 0 to 3). The two 4k banks above that had been intentionally left unused up to now. They are now used to store program code in SRAM: bank 4 for core 0, bank 5 for core 1. Config has been extended accordingly.
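
For orientation, the base addresses of these two banks as per the RP2040 datasheet; the constant names here are illustrative, the actual names in Config may differ:

  CONST
    CodeRamCore0 = 020040000H; (* SRAM bank 4, 4k, code for core 0 *)
    CodeRamCore1 = 020041000H; (* SRAM bank 5, 4k, code for core 1 *)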

Instructions and Results

Now, finally, the measurements and results.

Run from Flash, Cached

As stored in the repository, the test program is configured to run from flash memory, with caching, if you have Astrobe’s memory parameters configured as explained here. The system clock frequency is configured to 125 MHz.

Building and running the program should print to the serial terminal:

run  iter   xip-acc   ram-acc   t[us]   t/xip-acc[ns]   t/ram-acc[ns]
  0    16       247         0      29         117             n/a
  1    16       247         0       5          20             n/a
  2    16       247         0       5          20             n/a
  3    16       247         0       5          20             n/a

  • run: test run number
  • iter: number of iterations of the measured loops
  • xip-acc: number of code loads from flash memory via XIP
  • ram-acc: number of code loads from SRAM
  • t[us]: total run time in microseconds
  • t/xip-acc[ns]: run time per XIP access in nano-seconds
  • t/ram-acc[ns]: run time per RAM access in nano-seconds

Note that the timing precision is only microseconds; the nanosecond values are just (microseconds * 1000) DIV x, e.g. for run 0 above: (29 * 1000) DIV 247 = 117 ns per XIP access. Also, there’s an inherent imprecision of 1 microsecond when reading the timer value.

Since we don’t run any code from SRAM here, the measured values for ram-acc are zero. In the runs from SRAM, the measured values for xip-acc will be zero. The test code uses the same output columns for all runs, from flash and SRAM, for easier comparison.

In the above results, the effect of the caching is clearly visible: the first run takes significantly longer, since its code must still be loaded from flash. Note that even during the first run some caching already takes place, due to the while loops and the called procedure. Hence, the load time per XIP access is not yet the worst case, which we will see with the non-cached access.

Run from Flash, not Cached

Now let’s ignore the flash cache. In the Astrobe configuration, change

Code Range: 010000100H, 010200000H
to
Code Range: 013000100H, 013200000H

Re-building and running the program should print to the serial terminal:

run  iter   xip-acc   ram-acc   t[us]   t/xip-acc[ns]   t/ram-acc[ns]
  0    16       247         0      99         400             n/a
  1    16       247         0      99         400             n/a
  2    16       247         0      99         400             n/a
  3    16       247         0      99         400             n/a

The code load and thus execution times without caching are very long. An interrupt handler running from flash for the first time, or after it has been evicted from the cache due to non-use, will have to cope with these worst-case “raw” flash loading times.

An instruction loading time of 400 nanoseconds (0.4 us) demonstrates nicely what “running against the clock” means in a microsecond timing domain, as outlined in the Introduction above. The forthcoming (or so I hope) task system cannot rely on an alarm interrupt handler in flash memory if we require timing in the sub-50-microsecond range (a preliminary estimate at this point).

Run from SRAM, Flash not Cached

To load the test code into SRAM, set the module constant RAM = TRUE.
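
That is, in module CodeLoading (the exact declaration may differ in the source):

  CONST RAM = TRUE; (* copy the test procedure to SRAM and execute it there *)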

Re-building and running the program should print to the serial terminal:

run  iter   xip-acc   ram-acc   t[us]   t/xip-acc[ns]   t/ram-acc[ns]
  0    16         0       311       5         n/a              16
  1    16         0       311       6         n/a              19
  2    16         0       311       5         n/a              16
  3    16         0       311       5         n/a              16

The load times for the test procedure are consistently low, even with the cache ignored, and comparable to the cached loads above. Of course, the rest of the program outside the measured procedure suffers from the long uncached load times.

Run from SRAM, Flash Cached

Just for completeness, let’s run the test procedure from SRAM, but with the flash cache in use, i.e. reverse the above memory configuration change in Astrobe, but leave RAM = TRUE.

run  iter   xip-acc   ram-acc   t[us]   t/xip-acc[ns]   t/ram-acc[ns]
  0    16         0       311       6         n/a              19
  1    16         0       311       6         n/a              19
  2    16         0       311       5         n/a              16
  3    16         0       311       5         n/a              16

As expected, the results are the same as with the flash cache ignored (apart from the aforementioned unavoidable microsecond timing imprecision).

Bottom Line

If we need a guaranteed reaction time for an interrupt handler, we had better install it in, and execute it from, SRAM. Loading from flash is rather slow. For an interrupt handler that is triggered on a regular basis, such as a millisecond SysTick, we can be confident, but not certain, that its code is in the on-chip cache, from where it will be executed as fast as if it were installed in SRAM. If even the first execution of an interrupt handler must be fast, it cannot reside in flash.

Caveat

Module RuntimeErrors will currently not produce valid stack traces for code in SRAM, and will report the SRAM address containing the faulty instruction, not the flash memory address.

Output Terminal

See Set-up, one-terminal set-up.

Build and Run

Build module CodeLoading with Astrobe, and create and upload the UF2 file using abin2uf2. Note the changes to the Astrobe settings described above for certain measurements (ignoring the flash cache).

Set Astrobe’s memory options as listed, and the library search path as explained.

Repository