Multitasking, like many services an operating system provides, is something we take for granted so much that it can feel mundane. With our powerful smartphones and computers, the idea of a computer not being able to juggle hundreds of processes feels alien. I think it’s features like this that make computers incredibly useful, but also make them feel so complicated and magical.
It’s hard to play around with the code that implements multitasking, and it’s not obvious how to implement it yourself without building a whole OS. I’m a firm believer that you don’t truly understand something until you’ve implemented it yourself, so I wanted to write an article that lets people play around with a simple thread implementation. In this post, we’ll implement simple threads in a normal C program (not an operating system).
This scheduler is going to rely heavily on the functions setjmp() and longjmp(). They feel a bit magical, so I want to first describe what they do, and spend a little time demystifying how they do it.
The function setjmp() is a way of recording information about where a program is in its execution, so that you can later jump back to that point. You give it a variable of type jmp_buf, in which it will store that information. setjmp() returns 0 when it is first called.

Later on, you can use the function longjmp(jmp_buf, value) to immediately begin execution back at the point where you called setjmp(). To your program, it will look like setjmp() returned a second time. The value argument you pass to longjmp() will be returned this time, to help differentiate the second return. Here’s an example to help illustrate:
#include <stdio.h>
#include <setjmp.h>

jmp_buf saved_location;

int main(int argc, char **argv)
{
    if (setjmp(saved_location) == 0) {
        printf("We have successfully set up our jump buffer!\n");
    } else {
        printf("We jumped!\n");
        return 0;
    }

    printf("Getting ready to jump!\n");
    longjmp(saved_location, 1);
    printf("This will never execute...\n");
    return 0;
}
If you compile and run this program, you will get the following output:
We have successfully set up our jump buffer!
Getting ready to jump!
We jumped!
Wild! It’s like a goto statement, but it can even be used to jump outside of a function. It’s also a lot more difficult to read than a goto, since it looks like a regular function call. If your code used setjmp() and longjmp() liberally, it would be incredibly confusing for anyone (including yourself) to read.
Like with goto, the common advice is to avoid setjmp() and longjmp(). However, just like with goto, there are some times where it can be useful to use them sparingly, and in a consistent way. A scheduler needs to be able to switch contexts, and so we’ll have to use these functions responsibly. Most importantly, we’ll hide the use of these functions from our API, so that users of our scheduler won’t have to deal with that kind of complexity.
The setjmp() and longjmp() functions aren’t designed to support just any kind of jumping around, however. They’re designed for a pretty particular use case. Imagine that you are doing something complicated, like making an HTTP request. This will involve a complicated set of function calls, and if any of them fail, you’ll need to return a special error code from each one of them. This leads to code like this, everywhere you call a function (possibly dozens of times):
int rv = do_function_call();
if (rv != SUCCESS) {
    return rv;
}
The idea of setjmp() and longjmp() is that you can use setjmp() to save your place just before starting something complex. Then, you could centralize all of your error handling into one place:
int rv;
jmp_buf buf;
if ((rv = setjmp(buf)) != 0) {
    /* handle errors here */
    return;
}
do_complicated_task(buf, args...);
If any function involved in do_complicated_task() fails, it would just longjmp(buf, error_code). This means that every function within do_complicated_task() can assume that every function call is a success, which means you can get rid of that error handling code for each function call. (In practice, this is almost never done, but that’s a separate blog post.)
The big idea here is that longjmp() only allows you to jump out of deeply nested functions. You can’t jump back into a deeply nested function which you had formerly jumped out of. Here’s an illustration of the stack when you jump out of a function. The asterisk (*) marks the stack pointer which setjmp() stored.
        | Stack before longjmp    | Stack after longjmp
        +-------------------------+---------------------
 stack  | main() (*)              | main()
 grows  | do_http_request()       |
 down   | send_a_header()         |
   |    | write_bytes()           |
   v    | write() - fails!        |
You can see that we only move back up the stack, and so there is no risk of data corruption. On the other hand, imagine if you wanted to jump between tasks. If you call setjmp() and then return, do some other stuff, and then attempt to resume what you were doing before, you’ll have a problem:
        | Stack at setjmp() | Stack later      | Stack after longjmp()
        +-------------------+------------------+----------------------
 stack  | main()            | main()           | main()
 grows  | do_task_one()     | do_task_two()    | do_task_two()
 down   | subtask()         | subtask()        | subtask()
   |    | foo()             |                  | ???
   v    | bar() (*)         |           (*)    | ???         (*)
The stack pointer which setjmp() saved will point at a stack frame which no longer exists, and may have been overwritten at some point with other data. When you try to longjmp() back into the function you have already returned from, you’ll start experiencing some really weird behavior that will probably crash your program.
The moral of this story is that, if you want to use setjmp() and longjmp() to jump between complex tasks like this, you need to make sure each task has its own separate stack. This completely eliminates the problem, because when longjmp() resets the stack pointer, it will swap stacks for you, and no stack overwriting will take place.
That was a bit of a long diversion, but equipped with this knowledge, we should be able to implement userspace threads. To start out, I found it quite helpful to design the API which should be used to initialize, create, and run the threads. Doing this ahead of time really helps understand what we’re trying to build!
void scheduler_init(void);
void scheduler_create_task(void (*func)(void*), void *arg);
void scheduler_run(void);
These functions will be used to initialize the scheduler, add tasks, and then eventually begin running tasks in the scheduler. Once we start scheduler_run(), it will run until all tasks are completed. For tasks which are running, they will have the following APIs:
void scheduler_exit_current_task(void);
void scheduler_relinquish(void);
The first function will exit the task. A task could also exit by returning from its function, so this is simply a convenience. The second function is how our threads will tell the scheduler to let another task run for a bit. When a task calls scheduler_relinquish(), it may be suspended for a while as other tasks run, but eventually the function will return and the task can continue running.
To give a concrete example of the API, here’s a hypothetical use of this API, which we’ll use to test our scheduler:
#include <stdlib.h>
#include <stdio.h>

#include "scheduler.h"

struct tester_args {
    char *name;
    int iters;
};

void tester(void *arg)
{
    int i;
    struct tester_args *ta = (struct tester_args *)arg;
    for (i = 0; i < ta->iters; i++) {
        printf("task %s: %d\n", ta->name, i);
        scheduler_relinquish();
    }
    free(ta);
}

void create_test_task(char *name, int iters)
{
    struct tester_args *ta = malloc(sizeof(*ta));
    ta->name = name;
    ta->iters = iters;
    scheduler_create_task(tester, ta);
}

int main(int argc, char **argv)
{
    scheduler_init();
    create_test_task("first", 5);
    create_test_task("second", 2);
    scheduler_run();
    printf("Finished running all tasks!\n");
    return EXIT_SUCCESS;
}
In this example, we create two tasks which run the same function, but they’ll use different arguments so that we can trace their execution separately. Each task iterates a set number of times. Each iteration, it prints out a message and then lets another task run. We would expect to see something like this as the output of this program:
task first: 0
task second: 0
task first: 1
task second: 1
task first: 2
task first: 3
task first: 4
Finished running all tasks!
To implement this API, we’ll need some sort of internal representation of a task, so let’s go ahead and put together fields we’ll need:
struct task {
    enum {
        ST_CREATED,
        ST_RUNNING,
        ST_WAITING,
    } status;

    int id;

    jmp_buf buf;

    void (*func)(void*);
    void *arg;

    struct sc_list_head task_list;

    void *stack_bottom;
    void *stack_top;
    int stack_size;
};
Let’s go through the fields one by one. All tasks should be in the “created” state as soon as they’re created. Once a task starts executing, it will be in the “running” status, and if a task ever needed to wait for some asynchronous operation, it could be placed in the “waiting” state. The id field is just a unique identifier for the task. buf contains the data for when we longjmp() to resume the task. func and arg are passed to scheduler_create_task() and are necessary for starting the task. The task_list field is necessary to implement a doubly linked list of all tasks. The stack_bottom, stack_top, and stack_size fields all relate to the separate stack allocated for this task. The “bottom” is the address returned by malloc(), but the “top” is a pointer to the address directly above the region of memory. Since the x86 stack grows downward, we will need to set the stack pointer to stack_top rather than stack_bottom.
Given this, we can implement the scheduler_create_task() function:
void scheduler_create_task(void (*func)(void *), void *arg)
{
    static int id = 1;
    struct task *task = malloc(sizeof(*task));
    task->status = ST_CREATED;
    task->func = func;
    task->arg = arg;
    task->id = id++;
    task->stack_size = 16 * 1024;
    task->stack_bottom = malloc(task->stack_size);
    task->stack_top = task->stack_bottom + task->stack_size;
    sc_list_insert_end(&priv.task_list, &task->task_list);
}
Using a static int ensures that each time the function is called, the id field increments to a new number. Everything else should be self-explanatory, except the sc_list_insert_end() function, which simply adds the struct task to the global list. The global list is stored within a second structure, which contains all the private scheduler data. This structure is presented below, along with its initialization function:
struct scheduler_private {
    jmp_buf buf;
    struct task *current;
    struct sc_list_head task_list;
} priv;

void scheduler_init(void)
{
    priv.current = NULL;
    sc_list_init(&priv.task_list);
}
The task_list field is used to refer to the list of tasks (unsurprisingly). The current field is used to store the currently executing task (or NULL if none is currently running). Most importantly, the buf field will be used to jump into the code of scheduler_run():
enum {
    INIT=0,
    SCHEDULE,
    EXIT_TASK,
};

void scheduler_run(void)
{
    /* This is the exit path for the scheduler! */
    switch (setjmp(priv.buf)) {
    case EXIT_TASK:
        scheduler_free_current_task();
        /* fall through */
    case INIT:
    case SCHEDULE:
        schedule();
        /* if schedule() returns, there's nothing else to do and we exit */
        return;
    default:
        fprintf(stderr, "Uh oh, scheduler error\n");
        return;
    }
}
As soon as the scheduler_run() function is called, we set the setjmp() buffer so we can always return to this function. The first time, 0 (INIT) is returned, and we immediately call schedule(). Subsequently, we can pass the SCHEDULE or EXIT_TASK constants into longjmp(), which will trigger different behaviors. Let’s ignore the EXIT_TASK case for now, and go directly into the implementation of schedule():
static void schedule(void)
{
    struct task *next = scheduler_choose_task();

    if (!next) {
        return;
    }

    priv.current = next;
    if (next->status == ST_CREATED) {
        /*
         * This task has not been started yet. Assign a new stack
         * pointer, run the task, and exit it at the end.
         */
        register void *top = next->stack_top;
        asm volatile(
            "mov %[rs], %%rsp \n"
            : [ rs ] "+r" (top) ::
        );

        /*
         * Run the task function
         */
        next->status = ST_RUNNING;
        next->func(next->arg);

        /*
         * The stack pointer should be back where we set it. Returning
         * would be a very, very bad idea. Let's instead exit.
         */
        scheduler_exit_current_task();
    } else {
        longjmp(next->buf, 1);
    }
    /* NO RETURN */
}
First, we call an internal function to select the next task which should be run. This is going to be a simple round-robin scheduler, so it just chooses the next ready task in the task list. If this function returns NULL, then we have no more tasks to run, and we return. Otherwise, we need to either start the task running (if it is in the ST_CREATED state) or resume running it.
To start a created task, we use an x86_64 assembly instruction to assign the stack_top field to the rsp register (stack pointer). Then we change the task state, run the function, and exit if the function returns. Note that setjmp() and longjmp() store and swap stack pointers, so this is the only time where we’ll need to use assembly to modify the stack pointer.
If the task has already been started, then the buf field should contain the context we need to longjmp() into to resume the task, so we just do that.
Next, let’s look at the helper function which selects the next task to run. This
is the heart of a scheduler, and like I said earlier, this is a round-robin
scheduler:
static struct task *scheduler_choose_task(void)
{
    struct task *task;

    sc_list_for_each_entry(task, &priv.task_list, task_list, struct task)
    {
        if (task->status == ST_RUNNING || task->status == ST_CREATED) {
            sc_list_remove(&task->task_list);
            sc_list_insert_end(&priv.task_list, &task->task_list);
            return task;
        }
    }

    return NULL;
}
If you’re unfamiliar with my linked list implementation (which is taken from the Linux kernel), that’s ok. The sc_list_for_each_entry() macro lets us iterate over each task in the task list. The first eligible (not waiting) task we find is removed from its current position and inserted at the end of the task list. This ensures that next time we run the scheduler, we’ll get a different task (if there is another). We return this first eligible task, or NULL if there were no tasks at all.
Finally, let’s get to the implementation of scheduler_relinquish() to see how a task can switch itself out:
void scheduler_relinquish(void)
{
    if (setjmp(priv.current->buf)) {
        return;
    } else {
        longjmp(priv.buf, SCHEDULE);
    }
}
This is the other use of the setjmp() function in our scheduler. As such, it can be slightly confusing. When a task calls this function, we use setjmp() to save our current context (which includes the current stack pointer). Then, we use longjmp() to enter the scheduler (back in scheduler_run()), and we pass the SCHEDULE value, asking it to schedule a new task.
When the task gets resumed, the setjmp() function will return non-zero, and we’ll return out to whatever the task was doing before!
Finally, here’s what happens when a task exits (either by explicitly calling the exit function, or by returning from its task function):
void scheduler_exit_current_task(void)
{
    struct task *task = priv.current;
    sc_list_remove(&task->task_list);
    longjmp(priv.buf, EXIT_TASK);
    /* NO RETURN */
}

static void scheduler_free_current_task(void)
{
    struct task *task = priv.current;
    priv.current = NULL;
    free(task->stack_bottom);
    free(task);
}
This process comes in two parts: the first function is called directly by the task. We remove the task’s entry from the task list, since it should no longer be scheduled. Then, we longjmp() back to the scheduler_run() function. This time, we use EXIT_TASK. This indicates to the scheduler that, before it schedules a new task, it should first call scheduler_free_current_task(). If you scroll back up to scheduler_run(), you’ll see this is exactly what scheduler_run() does.
We have to do this in two parts because, when scheduler_exit_current_task() is called, it is actively using the stack contained in the task struct. If you free the stack while it is still in use, there’s a chance that the function will still access the very stack memory we just freed! To ensure this doesn’t happen, we have to longjmp() back to the scheduler, which is using a separate stack. Then we can safely free the task’s data.
With that, we’ve covered the entire implementation of this scheduler. If you were to go ahead and compile this, along with my linked list implementation and the main program above, you would have a working scheduler! Instead of all that copying and pasting, I’d encourage you to check out the github repository which contains all this code.
If you’ve gotten this far, I assume I don’t need to convince you that this is interesting. However, it may not seem all that useful. After all, you can use “real” threads in C, which can run in parallel and don’t need to wait for each other to call scheduler_relinquish().
However, I see this as a jumping off point for a whole series of exciting implementations of useful features. For I/O heavy tasks, this could also be used to simply implement a single-threaded async application, the way that Python’s new async utilities work. This system could also implement generators and coroutines. Finally, with enough effort, this system could even be coupled with “real” operating system threads to provide more parallelism where necessary. Each of these ideas is a fun project which I’d encourage the reader to try before I get around to writing a new article about them!
Is any of this safe? I mean, probably not! It’s probably not safe to use inline assembly to modify the stack pointer. Don’t use it in your production code, but do use it to mess around and explore!
A safer implementation of this system could be built on the “ucontext” API (see man getcontext), which provides a way to swap between these types of userspace “threads” without needing to meddle with inline assembly. Unfortunately, the API is non-standard (it was removed from the latest POSIX spec). However, you can still use this API, as it is part of glibc.
As it is currently written, this scheduler only works if threads explicitly hand off control back to the scheduler. This is bad for a general purpose implementation like an operating system, because a poorly behaved thread could prevent all the others from running. (Of course, that didn’t stop MS-DOS from using cooperative multitasking!). I don’t think that makes cooperative multitasking bad, it’s just going to depend on the application.
If one used the non-standard “ucontext” API, then POSIX signals would actually store the context of the previously executing code. By setting a periodic timer signal, a userspace scheduler could actually get preemptive multitasking working! This is another really cool project that I hope to try out and write about soon.
If you’ve gotten this far, thanks for reading, and I hope you get the chance to try out a fun project based on this!