CodeLoading – Oberon RTK

Introduction

When working in tight timing situations, such as microsecond alarm scheduling or low-latency interrupt handling, the rate at which instructions are loaded for execution becomes relevant. This program measures instruction loading speeds for different memory configurations and set-ups.

Executing Code from Flash Memory

The RP2040 and RP2350¹ use SoC-external flash memory, which is read serially using SSI/SPI. The Execute in Place (XIP) functionality allows instructions to be loaded as if the serially connected flash memory were a linear address space: addresses are translated into serial reads, transparently to the CPU and the bus.

The SSI interface runs at 1/4 of the CPU clock speed, and uses QSPI (four parallel data lines). With a 125 MHz clock, the serial interface runs at 31.25 MHz, and takes four SPI clocks to transfer a 16-bit instruction. So without any caching, reading one instruction from flash takes 4 * (1 / (125 MHz / 4)) = 4 * 32 ns = 128 nanoseconds in the best case.
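The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the best-case data transfer time using the figures quoted in the text; the constant names are chosen here for illustration, and any command or address overhead of a real flash read is deliberately left out.

```python
# Best-case time to transfer one 16-bit instruction from QSPI flash,
# without caching. All figures are the ones quoted in the text.
CPU_CLOCK_HZ = 125_000_000   # system clock
SSI_DIVIDER = 4              # serial clock = system clock / 4
DATA_LINES = 4               # QSPI: four bits per serial clock
INSTRUCTION_BITS = 16        # one 16-bit instruction

ssi_clock_hz = CPU_CLOCK_HZ / SSI_DIVIDER                  # 31.25 MHz
clocks_per_instruction = INSTRUCTION_BITS // DATA_LINES    # 4 serial clocks
fetch_time_ns = clocks_per_instruction / ssi_clock_hz * 1e9

print(f"serial clock: {ssi_clock_hz / 1e6} MHz")
print(f"best-case transfer time: {fetch_time_ns:.0f} ns")
```

Changing CPU_CLOCK_HZ or SSI_DIVIDER shows how the best case scales with the clock set-up.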

Both MCUs provide a 16 kB on-chip SRAM cache for the flash memory, to alleviate the long loading times directly from flash. Cached code should load as fast as if executed from SRAM, but obviously the cache has to “warm up”, meaning that the first executions of a piece of code go through the lengthy reads from flash. Also, we generally cannot know if a certain piece of code is in the cache, or will be loaded directly from flash. The 16 kB cache can only mirror a small part of the 2 or 4 MB flash memory spaces.
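To get a feeling for how little of the flash the cache can hold at any one time, the ratio can be computed directly. This is a small sketch using only the sizes given above; the function name is made up for this example.

```python
CACHE_BYTES = 16 * 1024  # 16 kB flash cache

def cache_coverage(flash_mb):
    """Fraction of a flash memory of flash_mb megabytes that fits in the cache."""
    return CACHE_BYTES / (flash_mb * 1024 * 1024)

for flash_mb in (2, 4):
    print(f"{flash_mb} MB flash: cache holds {cache_coverage(flash_mb):.2%}")
```

In both cases the cache holds less than one percent of the flash contents, which is why infrequently executed code cannot be assumed to be cached.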

The flash cache can be bypassed, either by reading from a different, mirrored address, or by disabling the cache, which comes in handy to determine the loading performance directly from flash memory. This “raw” flash loading performance is particularly relevant for interrupt handlers, which are usually used to react quickly to external events. They will show degraded performance on their first execution, i.e. before their code is cached, and also whenever they are not triggered frequently enough to keep their code in the cache (eviction).
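As an illustration of the mirrored address approach, the following sketch maps a cached XIP address to its uncached alias. The base addresses and window sizes are assumptions taken from my reading of the XIP address maps in the RP2040 and RP2350 datasheets (the XIP_NOCACHE_NOALLOC alias); they are not taken from the test program, which may well use a different mechanism, so please check them against the datasheet before relying on them.

```python
# Assumption: address map values as listed in the RP2040/RP2350 datasheets.
XIP_BASE = 0x1000_0000  # cached XIP window, both MCUs

# Hypothetical lookup table: (uncached alias base, size of one alias window)
UNCACHED_ALIAS = {
    "RP2040": (0x1300_0000, 0x0100_0000),
    "RP2350": (0x1400_0000, 0x0400_0000),
}

def uncached_alias(mcu, cached_addr):
    """Translate an address in the cached XIP window to the uncached alias."""
    alias_base, window = UNCACHED_ALIAS[mcu]
    offset = cached_addr - XIP_BASE
    if not 0 <= offset < window:
        raise ValueError(f"{cached_addr:#010x} is not in the cached XIP window")
    return alias_base + offset

print(f"{uncached_alias('RP2040', 0x1000_1234):#010x}")
print(f"{uncached_alias('RP2350', 0x1000_1234):#010x}")
```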

If we require guaranteed code loading performance, we need to run time-critical interrupt handlers from SRAM, not flash memory. This is, in fact, what the RP2040 datasheet recommends. It even recommends using different RAM addresses for each core if both cores use the same interrupt handler code, to avoid bus contention and wait states.

A further possibility is to “pre-cache” a procedure, i.e. to read its code shortly before it is executed, so that it is loaded into the cache.

The RP2350 also allows cache lines to be pinned to defined flash memory (XIP) addresses, so that accessing these addresses is guaranteed to read from the cache. This option is not covered in this test program.

Test Program Description

The purpose of this test example program is to measure and demonstrate the effects of

- executing/loading from flash, with caching,
- executing/loading from flash, without caching,
- executing/loading from SRAM, and
- executing/loading after pre-caching.

The test procedure runs a series of loops. It is coded to run long enough to be measured in microseconds using the timer device, while at the same time preventing caching from taking effect too quickly.

Each run is timed with microsecond precision, and the code loads via XIP and from SRAM are counted, using the corresponding bus performance counters of the RP2040 and RP2350.

To demonstrate how other procedures can be called from procedures executed from SRAM, the loop variable is, artificially, incremented via a separate procedure. As explained below, procedures in SRAM must be called via a procedure variable. To keep the loop logic as simple as possible, there are two loop procedures, one to be called when not loading the loop code into SRAM (loop), the other when executed from SRAM (loopForRam). They are identical, apart from the call mechanics. The number of instructions, and thus instruction loads, differs slightly between these two loop procedures, since calling a procedure variable takes a few more instructions.

The test program is implemented in two different modules, one for each MCU, since the flash memory access is different for the RP2040 and the RP2350 (see links at the end).

Executing Code from SRAM

Astrobe for RP2040 and for RP2350 do not support executing code from SRAM. The concepts and approach used here work for me for this test program. Also, my considerations and method of executing code from SRAM may only work for the cases I have examined and tested. I may well not have considered certain cases.

A procedure compiled with the Astrobe compiler is “self-contained”, regarding relocating it to a different memory section, such as SRAM, as follows:

- all local variables on the stack are addressed independently of the program counter (PC) value, that is, the address of the code memory;
- the module variables are addressed via constant values that are stored directly after the procedure in code memory, hence if we copy these constant values to SRAM together with the procedure code, they will be used correctly;
- calls to module-local procedures use PC-relative addressing, so unless we reconstruct the corresponding memory layout for the relevant procedures in SRAM, the call will fail. While it would be possible to create a working memory layout in SRAM, it’s easier, and more maintainable, to use module-level procedure variables for all procedures in SRAM, which will be addressed correctly as per the previous point;
- calls to module-external procedures also need to be done via module-level procedure variables.

Note that we consider running code directly from SRAM for exceptional cases only, such as reaction-time sensitive interrupt handlers, hence the above condition of using procedure variables for all called procedures is not that onerous.

Copying Code from Flash Memory to SRAM

Module MemoryExt provides a facility, CopyProc, to copy a procedure to SRAM. Based on the address of the procedure to be copied, it uses the metadata in the resource block at the end of the program to determine the address of the next procedure in the module, or of the module’s init code. Everything up to that address is copied, which includes the aforementioned constant values right after the procedure’s executable code.

The memory layout parameters for Astrobe, extended via module Config, only use the lower 256k of the RP2040’s SRAM (four 64k banks, bank 0 to 3), and the lower 512k of the RP2350’s SRAM (eight 64k banks, 0 to 7), respectively. The two 4k banks above are used to store program code in SRAM: for the RP2040, bank 4 for core 0 and bank 5 for core 1; for the RP2350, bank 8 for core 0 and bank 9 for core 1.
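The resulting base addresses of the two 4k code banks follow from the bank sizes given above. The sketch below assumes the SRAM base address 0x20000000 as stated in the datasheets; the addresses actually used are defined in module Config, which is not reproduced here, and the function name is made up for this example.

```python
SRAM_BASE = 0x2000_0000  # assumption: SRAM base address per the datasheets
KB = 1024

def code_bank_bases(num_64k_banks):
    """Base addresses of the two 4k banks that sit above the 64k banks."""
    first = SRAM_BASE + num_64k_banks * 64 * KB
    return first, first + 4 * KB

for mcu, banks in (("RP2040", 4), ("RP2350", 8)):
    core0, core1 = code_bank_bases(banks)
    print(f"{mcu}: core 0 code bank at {core0:#010x}, core 1 at {core1:#010x}")
```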

Cf. Config.

Pre-caching

Module MemoryExt provides a facility, CacheProc, to load a procedure’s code into cache memory.

Test Cases

Now, finally, the measurements and results. The code for the RP2040 was built using Astrobe for RP2040, for the RP2350 using Astrobe for Cortex-M4.

Run from Flash, Cached (Test Case: RunFromFlashCached)

Set TestCase = RunFromFlashCached, re-build and run. The results are printed to the serial terminal.

RP2040:
run  iter   xip-acc   ram-acc   t[us]   t/xip-acc[ns]   t/ram-acc[ns]
 0    16       247         0      30         121             n/a
 1    16       247         0       4          16             n/a
 2    16       247         0       4          16             n/a
 3    16       247         0       4          16             n/a

run: test run number
iter: number of iterations of the measured loops
xip-acc: number of code loads from flash memory via XIP
ram-acc: number of code loads from SRAM
t[us]: total run time in microseconds
t/xip-acc[ns]: run time per XIP access in nanoseconds
t/ram-acc[ns]: run time per RAM access in nanoseconds

Note that the timing precision is only microseconds (us), the nanoseconds (ns) values are just (microseconds * 1000) DIV x. Also, there’s an inherent imprecision of 1 microsecond when reading the timer value.
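To make the derived columns concrete, here is a small sketch of that calculation, checked against the first RP2040 table above. The function names are illustrative only and do not come from the test program; the bounds function simply applies the 1 microsecond reading imprecision mentioned above in both directions.

```python
def ns_per_access(t_us, accesses):
    """Integer nanoseconds per access: (microseconds * 1000) DIV accesses."""
    if accesses == 0:
        return None  # shown as "n/a" in the tables
    return (t_us * 1000) // accesses

def ns_per_access_bounds(t_us, accesses, timer_error_us=1):
    """Range of the per-access value if the timer reading is off by 1 us."""
    low = ns_per_access(max(t_us - timer_error_us, 0), accesses)
    high = ns_per_access(t_us + timer_error_us, accesses)
    return low, high

print(ns_per_access(30, 247))        # run 0 of the RP2040 cached case
print(ns_per_access(4, 247))         # runs 1 to 3
print(ns_per_access_bounds(4, 247))  # spread caused by a 1 us timer error
```

For the short run times of the cached cases, the bounds show that a single microsecond of timer imprecision already shifts the per-access value noticeably.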

Since we don’t run any code from SRAM here, the measured values for ram-acc are zero. In the runs from SRAM, the measured values for xip-acc will be zero. The test code uses the same output columns for all test cases, from flash memory and SRAM, for easier comparison.

In the above results, the effect of the caching is clearly visible, with the first run taking significantly longer to load. Note that also for the first run, some caching already takes place, due to the while loops and the called procedure. Hence, the load time per XIP-access is not the worst case, which we will see for the non-cached access.

RP2350, with config option "Thumb Code" = checked:
run  iter   xip-acc   ram-acc   t[us]   t/xip-acc[ns]   t/ram-acc[ns]
 0    16       300         0      19          63             n/a
 1    16       300         0       9          30             n/a
 2    16       300         0       5          16             n/a
 3    16       300         0       4          13             n/a

RP2350, with config option "Thumb Code" = unchecked:
run  iter   xip-acc   ram-acc   t[us]   t/xip-acc[ns]   t/ram-acc[ns]
 0    16       363         0      26          71             n/a
 1    16       364         0       4          10             n/a
 2    16       364         0       3           8             n/a
 3    16       364         0       3           8             n/a

Run from Flash, Not Cached (Test Case: RunFromFlashUncached)

Now let’s ignore the flash cache, set TestCase = RunFromFlashUncached, re-build and run.

RP2040:
run  iter   xip-acc   ram-acc   t[us]   t/xip-acc[ns]   t/ram-acc[ns]
 0    16       247         0      99         400             n/a
 1    16       247         0      99         400             n/a
 2    16       247         0      99         400             n/a
 3    16       247         0      99         400             n/a

The code load and thus execution times without caching are very long. An interrupt handler running from flash for the first time, or after it has been evicted from the cache due to non-use, will have to cope with these worst case “raw” flash loading times.

An instruction loading time of 400 nanoseconds (0.4 us) demonstrates nicely what “running against the clock” means in a microseconds timing domain.
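What this means for a latency budget can be estimated with simple arithmetic. The sketch below uses the measured per-access times from the RP2040 tables above; the handler size of 50 code loads is a purely hypothetical figure chosen for illustration.

```python
def handler_load_time_us(code_loads, ns_per_load):
    """Time spent on loading code alone, in microseconds."""
    return code_loads * ns_per_load / 1000

HANDLER_LOADS = 50  # hypothetical handler size, not a measured value

uncached_us = handler_load_time_us(HANDLER_LOADS, 400)  # measured, uncached
cached_us = handler_load_time_us(HANDLER_LOADS, 16)     # measured, cached

print(f"uncached: {uncached_us} us, cached: {cached_us} us")
```

Under these assumptions the uncached handler spends 20 microseconds on code loading alone, compared to less than one microsecond when its code is cached.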

RP2350, with config option "Thumb Code" = checked:
run  iter   xip-acc   ram-acc   t[us]   t/xip-acc[ns]   t/ram-acc[ns]
 0    16       299         0     110         367             n/a
 1    16       299         0     110         367             n/a
 2    16       299         0     109         364             n/a
 3    16       299         0     110         367             n/a

RP2350, with config option "Thumb Code" = unchecked:
run  iter   xip-acc   ram-acc   t[us]   t/xip-acc[ns]   t/ram-acc[ns]
 0    16       364         0     125         343             n/a
 1    16       364         0     124         340             n/a
 2    16       364         0     125         343             n/a
 3    16       364         0     125         343             n/a

Run from SRAM (Test Case: RunFromRam)

To load the test code into SRAM, set TestCase = RunFromRam, re-build and run.

RP2040:
run  iter   xip-acc   ram-acc   t[us]   t/xip-acc[ns]   t/ram-acc[ns]
 0    16         0       311       6         n/a              19
 1    16         0       311       5         n/a              16
 2    16         0       311       5         n/a              16
 3    16         0       311       5         n/a              16

The load times for the test procedure are consistently low, comparable to the cached loads above.

RP2350, with config option "Thumb Code" = checked:
run  iter   xip-acc   ram-acc   t[us]   t/xip-acc[ns]   t/ram-acc[ns]
 0    16         0       396       5         n/a              12
 1    16         0       396       5         n/a              12
 2    16         0       396       5         n/a              12
 3    16         0       396       5         n/a              12

RP2350, with config option "Thumb Code" = unchecked:
run  iter   xip-acc   ram-acc   t[us]   t/xip-acc[ns]   t/ram-acc[ns]
 0    16         0       496       5         n/a              10
 1    16         0       496       5         n/a              10
 2    16         0       496       5         n/a              10
 3    16         0       496       5         n/a              10

Run Pre-Cached (Test Case: RunPreCached)

Set TestCase = RunPreCached, re-build and run.

RP2040:
run  iter   xip-acc   ram-acc   t[us]   t/xip-acc[ns]   t/ram-acc[ns]
 0    16       247         0       5          20             n/a
 1    16       247         0       4          16             n/a
 2    16       247         0       4          16             n/a
 3    16       247         0       4          16             n/a

RP2350, with config option "Thumb Code" = checked:
run  iter   xip-acc   ram-acc   t[us]   t/xip-acc[ns]   t/ram-acc[ns]
 0    16       300         0       4          13             n/a
 1    16       300         0       4          13             n/a
 2    16       300         0       4          13             n/a
 3    16       300         0       4          13             n/a

RP2350, with config option "Thumb Code" = unchecked:
run  iter   xip-acc   ram-acc   t[us]   t/xip-acc[ns]   t/ram-acc[ns]
 0    16       364         0       5          13             n/a
 1    16       364         0       4          10             n/a
 2    16       364         0       4          10             n/a
 3    16       364         0       4          10             n/a

Bottom Line

If we need a guaranteed reaction time for an interrupt handler, we had better install it in, and execute it from, SRAM. Loading from flash is rather slow. For an interrupt handler that is triggered on a regular basis, such as a millisecond SysTick, we can be confident (but not certain) that its code is in the on-chip cache, from where it will be executed as fast as if installed in SRAM.
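To put “rather slow” into numbers, the per-access times of the last run (run 3) of each test case can be set side by side. The values in the sketch below are copied from the result tables above; given the microsecond timer precision, the ratios should be read as rough indications only.

```python
# Time per access in ns, run 3 of each test case, from the tables above.
measured = {
    "RP2040": {"uncached": 400, "cached": 16, "sram": 16},
    "RP2350, Thumb Code checked": {"uncached": 367, "cached": 13, "sram": 12},
    "RP2350, Thumb Code unchecked": {"uncached": 343, "cached": 8, "sram": 10},
}

def slowdown(row):
    """How many times slower an uncached flash load is than a load from SRAM."""
    return row["uncached"] / row["sram"]

for config, row in measured.items():
    print(f"{config}: uncached flash is about {slowdown(row):.0f}x slower than SRAM")
```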

If the first execution of an interrupt handler must be fast, it cannot reside in flash. In general, if you measure execution times of any program right after reset, you’ll notice the caching effect on all the code.

Pre-caching is a simple way to load code into the cache, but it may be evicted if not used for some time. This cannot happen with code in SRAM, obviously.

The possibility of pinning cache lines to specific XIP addresses to ensure the corresponding code is in the cache was not explored and measured here.

Caveat

Module RuntimeErrors will currently not produce valid stack traces for code in SRAM.

Output Terminal

See Set-up, one-terminal set-up.

Build and Run

Repository

libv2:
- for RP2040/Pico: CodeLoading.mod
- for RP2350/Pico2: CodeLoading.mod
lib (v1): this version is no longer being maintained:
- CodeLoading.mod

¹ There are RP2350 variants that have flash memory on the chip, but the one used on the Pico 2 does not.