KernelPerf

Overview

This document and the accompanying test program KernelPerf explore and evaluate a few key performance characteristics of kernel-v1.

The following aspects are measured and documented:

  • basic behaviour overview
  • scheduler timing and overhead
  • thread context switch time and overhead
  • behaviour with threads extending the scheduler tick time
  • different thread and scheduler tick configurations

Program Description

Structure

The test program KernelPerf implements:

  • 16 threads, of which
    • threads 1 to 15 execute the same code with a simple counting loop (sketched below)
    • thread 0 prints to the terminal
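
As a point of reference, the per-thread code for threads 1 to 15 is essentially a counting loop followed by a yield back to the kernel. A minimal sketch, assuming Kernel.Next (the kernel-v1 call referenced later in this document) is parameterless; the constant and procedure names are placeholders, not the actual source:

  MODULE ThreadSketch;
  (* Sketch only: the shape of threads 1 to 15 in KernelPerf. *)
  (* MaxCount and threadCode are hypothetical names for illustration. *)
    IMPORT Kernel;

    CONST MaxCount = 100000; (* placeholder; tuned so the loop nearly fills one tick *)

    PROCEDURE threadCode;
      VAR i: INTEGER;
    BEGIN
      REPEAT
        i := 0;
        WHILE i < MaxCount DO INC(i) END; (* burn CPU time: close to "full load" *)
        Kernel.Next (* hand control back; resume when scheduled again on a later tick *)
      UNTIL FALSE
    END threadCode;

  END ThreadSketch.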

The example program directory contains versions of modules Kernel and UARTkstr that operate GPIO pins so the timing can be measured with an oscilloscope. Apart from that, these modules are unchanged.

Test Cases

In addition, program module KernelPerf is structured into test cases, which can be selected using KernelPerf.TestCase. This does all the required set-up and selects the code paths in the different procedures.
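
One plausible shape of this selection, sketched with a compile-time constant and placeholder branch contents (whether the actual module selects the test case exactly this way is not shown in this document):

  MODULE TestCaseSketch;
  (* Sketch only: selecting code paths per test case at compile time. *)
  (* The constant name mirrors KernelPerf.TestCase; the branch contents *)
  (* and the procedure name are placeholders. *)

    CONST TestCase = 0; (* 0 .. 12 *)

    PROCEDURE configure;
    BEGIN
      IF TestCase = 0 THEN
        (* base settings: 8 ms tick, 16 threads, near full load *)
      ELSIF (TestCase = 1) OR (TestCase = 2) THEN
        (* larger counter limits: threads overrun the tick period *)
      ELSIF (TestCase >= 3) & (TestCase <= 8) THEN
        (* string output via UARTstr or UARTkstr with 30, 40, or 50 characters *)
      ELSE
        (* test cases 9 to 12: different thread and tick set-ups *)
      END
    END configure;

  END TestCaseSketch.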

Measurements Output

The timing behaviour is captured with an oscilloscope, measuring GPIO pins that are driven by module Kernel specifically for this program. We’ll explain the meaning of the four corresponding signal traces below.
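
For orientation, setting or clearing such a measurement pin is a single register write. A minimal sketch, assuming an RP2040 target and its SIO set/clear registers; the pin number and procedure names are placeholders, and the actual instrumented modules may do this differently:

  MODULE GpioSketch;
  (* Sketch only: driving a measurement pin via the RP2040 SIO registers. *)
  (* The pin is assumed to have been configured as SIO output elsewhere. *)
    IMPORT SYSTEM;

    CONST
      SIO_GPIO_OUT_SET = 0D0000014H;  (* write 1-bits to set outputs *)
      SIO_GPIO_OUT_CLR = 0D0000018H;  (* write 1-bits to clear outputs *)
      MeasurePin = 16;                (* hypothetical pin for one scope trace *)

    PROCEDURE High; (* eg. right before the scheduler starts its work *)
    BEGIN
      SYSTEM.PUT(SIO_GPIO_OUT_SET, {MeasurePin})
    END High;

    PROCEDURE Low; (* eg. right after the scheduler is done *)
    BEGIN
      SYSTEM.PUT(SIO_GPIO_OUT_CLR, {MeasurePin})
    END Low;

  END GpioSketch.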

Even though we print to the terminal, the output is neither needed nor meaningful. In fact, given the sheer amount of output, you may even prefer to disconnect the terminal program.

Base Test Settings

  • The kernel runs with a tick time of 8 milliseconds (ms).
  • Each thread is set to run with the same period as the tick time, ie. each thread must be scheduled to run on every tick.
  • The counter loops in threads 1 to 15 are set to nearly fill the available time between two kernel ticks, hence, we have close to a “full load” situation.

Set-up differences, if any, are described in the relevant test cases.

Test Case 0

This is the main test case to explore several aspects.

Overview

The screenshots may be clearer when viewed full-screen.

Signal traces, with a high value meaning:

  • yellow: any thread is running
  • red: the scheduler is running
  • blue: a thread is being put on the run-queue, low: status checking
  • green: thread 0 is running

We see the scheduler running every 8 ms (red), and the 16 threads each scheduled and running within the 8 ms (yellow), with some reserve. See below for an expanded view of the other signal traces.

Scheduler

With 16 threads, it takes the scheduler about 58 microseconds (us) to process them all (red), ie. to determine their status and to put all 16 of them on the run-queue (blue).

“Slotting” the threads into the run-queue (blue = high) takes about 35 microseconds for 16 threads of the same priority, with each additional thread adding about 0.2 us, starting from 0.6 us for thread 0, since the linked list is traversed linearly. The status checking time is constant at about 1.3 us per thread (blue = low; times were measured with the screen cursors, so they are give or take a bit).
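
The linear behaviour follows from inserting each ready thread into a priority-ordered singly linked list. A generic sketch of such an insertion, assuming a larger prio value means higher priority (this is not the kernel’s actual code; type and field names are illustrative):

  MODULE RunQueueSketch;
  (* Sketch only: priority-ordered insertion into a singly linked run-queue. *)
  (* Threads of equal priority keep their order of insertion. *)

    TYPE
      Thread = POINTER TO ThreadDesc;
      ThreadDesc = RECORD
        prio: INTEGER;
        next: Thread
      END;

    VAR head: Thread; (* highest priority first; assumed initialised to NIL *)

    PROCEDURE SlotIn(t: Thread);
      VAR pred, cur: Thread;
    BEGIN
      pred := NIL; cur := head;
      (* walk past all queued threads of equal or higher priority; each element *)
      (* visited adds a small, constant amount of time *)
      WHILE (cur # NIL) & (cur.prio >= t.prio) DO
        pred := cur; cur := cur.next
      END;
      t.next := cur;
      IF pred = NIL THEN head := t ELSE pred.next := t END
    END SlotIn;

  END RunQueueSketch.

With 16 threads of equal priority, each insertion walks past all threads already queued, which matches the roughly 0.2 us added per thread; a thread with higher priority than all queued ones is placed at the head immediately, which is what test case 11 below shows.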

With an 8 ms scheduler tick time, 58 us is 58/8,000 = 0.725% overhead. With fewer threads to put on the run-queue during one tick, or with longer tick times, this overhead goes down, but with shorter tick periods it goes up, since the 58 us are constant for this thread and scheduler set-up (see below).

Thread Context Switch

A thread context switch takes about 1.25 microseconds, measured from right before Kernel.Next until right after, ie. including all “down time” for the threads.

That is, the 16 threads waste about 16 * 1.25 us = 20 us for context switching per tick period, which is about 0.25% with an 8 ms scheduler tick period.

Start-up: Flash Caching

More of an aside: if we record the signals right after a hardware reset of the MCU, we can see how flash caching kicks in (red, blue).

Test Cases 1 and 2

These test cases document the behaviour when the threads together use more run time than available within one scheduler tick. To achieve this, we increase the counter loop limit for threads 1 to 15.

Two screenshots, showing increasingly longer run times.

With longer active run times, the threads start to push the invocation of the scheduler out, extending the effective tick time. Hence, the overall timing falls apart, but the scheduling still works. Nothing “breaks”; the program also continues under these error conditions. If this situation is temporary, it may or may not be acceptable for the specific use case. Obviously, if it’s permanent, we need to re-design our program.

Test Cases 3 to 8

These test cases look at the string output of thread 0.

  • Test cases 3, 4, 5: output via UARTstr.PutString, as configured in the standard Main library module, with different string lengths of 30, 40, and 50 characters.
  • Test cases 6, 7, 8: output via UARTkstr.PutString, as re-configured in the program module KernelPerf, with the same test strings.

Test Cases 3, 4, and 5

Three screenshots, one for each string length (30, 40, or 50 chars).

Test case 3, writing 30 characters to the UART, exceeds neither the corresponding FIFO capacity nor the time between scheduler ticks that the UART requires to transmit all characters at 38,400 Baud (see Main). Writing 32 characters, ie. the FIFO capacity, would push the limits, since they require about 8.3 ms to be transmitted at this baud rate.

Test cases 4 and 5, with 40 and 50 characters, respectively, exceed both the FIFO capacity and the time between scheduler ticks that the UART requires to transmit, hence the busy-waiting of UARTstr.PutString kicks in and extends the time between scheduler runs.
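
As a cross-check of these numbers, assuming the usual 10 bits per character on the wire (start bit, eight data bits, stop bit): 30 * 10 / 38,400 is about 7.8 ms, which still fits into one 8 ms tick; 32 * 10 / 38,400 is about 8.3 ms; and 40 or 50 characters take about 10.4 ms or 13.0 ms, respectively, ie. more than one tick period.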

Test Cases 6, 7, and 8

Three screenshots, one for each string length (30, 40, or 50 chars).

Test case 6 behaves just like test case 3, since we exceed neither the FIFO capacity nor the time between scheduler ticks that the UART requires to transmit.

Test cases 7 and 8 show how the scheduler timing is not impacted even with the longer strings, since UARTkstr.PutString suspends thread 0, and awaits the FIFO empty hardware flag to continue with the rest of the output string. Looking at the screenshot closely reveals the suspended states of thread 0 as missing green spikes.
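
The two output strategies can be contrasted in a sketch. All names below are placeholders; in particular, the procedure that suspends the calling thread stands in for the actual kernel-v1 primitive, whose name is not given in this document:

  MODULE UartOutSketch;
  (* Sketch only: string output without busy-waiting on a full TX FIFO. *)
  (* None of the names below are the actual UARTkstr or kernel-v1 identifiers. *)

    PROCEDURE txFifoFull(): BOOLEAN;
    BEGIN
      RETURN FALSE (* placeholder: would read the UART's flag register *)
    END txFifoFull;

    PROCEDURE putChar(ch: CHAR);
    BEGIN (* placeholder: would write ch to the UART's data register *)
    END putChar;

    PROCEDURE awaitTxFifoEmpty;
    BEGIN (* placeholder: stands for the kernel call that suspends the *)
          (* calling thread until the FIFO-empty hardware flag is set  *)
    END awaitTxFifoEmpty;

    PROCEDURE PutString(s: ARRAY OF CHAR);
      VAR i: INTEGER;
    BEGIN
      i := 0;
      WHILE (i < LEN(s)) & (s[i] # 0X) DO
        IF txFifoFull() THEN
          (* UARTstr.PutString busy-waits at this point, stalling everything *)
          (* and delaying the next scheduler run; UARTkstr.PutString instead *)
          (* suspends the thread and resumes it on the FIFO-empty flag:      *)
          awaitTxFifoEmpty
        END;
        putChar(s[i]); INC(i)
      END
    END PutString;

  END UartOutSketch.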

Test Cases 9, 10, 11, and 12

To conclude, four different thread and tick frequency set-ups.

Test Case 9

Eight of 16 threads run at half the scheduler tick frequency, eight at tick frequency.

Three screenshots: the overview and two scheduler details, one when all 16 threads are scheduled, one when only eight threads are scheduled.

The overview confirms that half of the threads are only scheduled every other tick.

The scheduler details show the same behaviour regarding the run time for putting the threads on the run-queue (blue = high). The run time for state scanning (blue = low) shows that all threads are checked, even if not put on the run-queue, ie. the gaps between blue = high get wider when only eight threads are scheduled. The overall scheduler time in the latter case goes down to about 32 microseconds.

Test Case 10

Only eight threads, running at the scheduler tick frequency.

Two screenshots, overview and scheduler detail.

Compared to test case 9, the run time spent scanning all installed threads for their state (blue = low) goes down. The run time for putting them on the run-queue (blue = high) is the same. The overall scheduler run time is about 22 microseconds.

Test Case 11

Each of the 16 threads is installed with a priority one higher than the previously installed one.

Two screenshots, overview and scheduler detail.

Threads with a higher thread id, ie. the ones scanned for their state later in the scheduler loop, have higher priority, so they are inserted at the beginning of the queue, and the run time for putting each of them on the run-queue remains the same. The overall scheduler run time is about 32 microseconds.

Note how thread 0 (green) now runs last in the cycle since it has the lowest priority.

Test Case 12

The scheduler tick period is set to 4 milliseconds, with corresponding reductions of the max counter values for threads 1 to 15, so they still fit the execution window.

Two screenshots, overview and scheduler detail.

The overall scheduler run time remains the same with shorter tick periods. That is, the scheduler overhead increases: 58 out of 4,000 microseconds is about 1.5%.

Summary

We can summarise the results as follows:

  • The scheduler takes 58 microseconds to run, with

    • 16 threads to check, and
    • 16 threads to put on the run queue

    each scheduler run (tick).

  • With a scheduler tick time of 8 milliseconds that’s a 0.7% overhead.

  • Other thread set-ups or scheduler tick frequencies result in overheads of 0.3 to 1.5% in the above test cases. With a mix of different thread periods and priorities we can expect about 1% scheduler overhead (rule of thumb, needs confirmation for each use case if critical).

  • The thread context switch time is 1.25 microseconds. That is, the 16 threads waste about 16 * 1.25 us = 20 us for context switching, which is about 0.25% with an 8 ms scheduler tick period.

  • The scheduler degrades gracefully in case the cumulative run time of all threads on the ready-queue exceeds the tick time.

  • Busy-waiting can be replaced by awaiting a “continue” hardware flag to be set or cleared.

  • The scheduler’s run time for putting the ready threads on the run-queue could be optimised (for example with head and tail pointers into the run-queue for each priority, see the sketch after this list), but with a scheduler overhead of around 1% the added complexity may not be justified.

  • Bottom line, I am pretty satisfied with this kernel: low overhead, fast context switch, degrading gracefully on timing errors, simple and lightweight in concept and implementation.
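
To illustrate the optimisation mentioned above: with one head and tail pointer pair per priority, enqueuing becomes a constant-time append, at the cost of a little more state and code. Again a generic sketch, not kernel-v1 code, with an assumed number of priority levels and illustrative names:

  MODULE PrioQueueSketch;
  (* Sketch only: per-priority head and tail pointers make enqueuing a *)
  (* constant-time append. *)

    CONST NumPrios = 4; (* assumed number of priority levels *)

    TYPE
      Thread = POINTER TO ThreadDesc;
      ThreadDesc = RECORD
        prio: INTEGER; (* 0 .. NumPrios - 1, larger = higher priority *)
        next: Thread
      END;

    VAR head, tail: ARRAY NumPrios OF Thread;

    PROCEDURE SlotIn(t: Thread);
    (* append in constant time, independent of how many threads are queued *)
    BEGIN
      t.next := NIL;
      IF head[t.prio] = NIL THEN head[t.prio] := t ELSE tail[t.prio].next := t END;
      tail[t.prio] := t
    END SlotIn;

    PROCEDURE Take(): Thread;
    (* remove the first thread of the highest non-empty priority *)
      VAR p: INTEGER; t: Thread;
    BEGIN
      t := NIL; p := NumPrios - 1;
      WHILE (p >= 0) & (head[p] = NIL) DO DEC(p) END;
      IF p >= 0 THEN
        t := head[p]; head[p] := t.next;
        IF head[p] = NIL THEN tail[p] := NIL END
      END;
      RETURN t
    END Take;

    PROCEDURE Init;
      VAR p: INTEGER;
    BEGIN
      FOR p := 0 TO NumPrios - 1 DO head[p] := NIL; tail[p] := NIL END
    END Init;

  BEGIN
    Init
  END PrioQueueSketch.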

Output Terminal

See Set-up, one-terminal set-up.

As mentioned above, the output is neither used nor meaningful, and due to the amount of output, you may even prefer to disconnect the terminal program.

Build and Run

Build module KernelPerf with Astrobe, and create and upload the UF2 file using abin2uf2.

Set Astrobe’s memory options as listed, and the library search path as explained. The program uses kernel-v1, so the corresponding module directory must be on the search path.

Repository