Kernel Segfaults for Fun (but no profit)

Stephen Brennan • 03 November 2016

In “episode 2” of my kernel development series, I’m going to talk about how I put Python into an uninterruptible sleep. This spooky story involves a rogue kernel module, segmentation faults, and reference counting (a topic already well established to be spooky). And only a few days late for Halloween!

Update: I did this: https://t.co/Ay4f2Y31NZ
— Stephen Brennan (@brenns10) September 28, 2016

For developers used to working in user-space—like me—the kernel can be a difficult adjustment, in lots of ways. One particular difficulty is that the kernel feels very opaque. There’s no GUI to watch, nor is there a simple way to attach a debugger and step through your code. You can write to the log like a normal program, but that’s pretty much the end of the similarities.

A nice trade-off is that the kernel provides several critical services to your computer, such as the filesystem and the network stack. So if you want to interact with the kernel, you can use these services in “special” ways. To go along with that, Linux has inherited the “everything is a file” mindset from Unix. As a result, a standard way to exchange information with user-space is via special files. For example, the entire /proc directory contains special kernel files, such as /proc/cpuinfo, which can give you information about your processor, or /proc/uptime, which gives you uptime info.
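To make the "special files" idea concrete, here is a minimal user-space sketch that reads /proc/uptime exactly like it would read any ordinary text file. The helper names (read_first_line, read_uptime_seconds) are my own inventions for illustration, and it assumes a Linux system with /proc mounted.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Read the first line of a file into buf, stripping the trailing newline.
 * Returns 0 on success, -1 on failure. (Hypothetical helper.) */
static int read_first_line(const char *path, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);

    FILE *f = fopen(path, "r");
    if (!f)
        return -1;

    if (!fgets(buf, (int)len, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);

    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}

/* The first field of /proc/uptime is the uptime in seconds.
 * Returns 0 on success, -1 on failure. */
static int read_uptime_seconds(double *uptime)
{
    char line[128];

    if (read_first_line("/proc/uptime", line, sizeof(line)) != 0)
        return -1;

    return sscanf(line, "%lf", uptime) == 1 ? 0 : -1;
}
```

Nothing here knows that the kernel is generating the file contents on the fly, which is the whole appeal of the interface.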

Custom Character Devices

There are plenty of ways to hook into the filesystem from the kernel, but the way described by the Linux Kernel Module Programming Guide is to create a new type of character device. Basically, you create a kernel module implementing some file operations and register them with the kernel. Then, from userspace you create a new character device file using mknod, and suddenly you can talk to your kernel module’s code very easily!

In the example they present, we create a kernel module which implements a device file that, when read, reports the number of times it has been opened. The basic idea is that you create a struct containing pointers to implementations for a few functions: read(), write(), open(), and close() (which the struct calls release) being the most important. This struct gets registered with the kernel.

static struct file_operations fops = {
  .read = device_read,
  .write = device_write,
  .open = device_open,
  .release = device_release
};
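If a struct full of function pointers seems like a strange way to plug into the kernel, a user-space analogy may help. This is only a toy sketch: struct toy_file_ops, toy_vfs_read and friends are hypothetical names, not kernel APIs. The point is that the "kernel" side only ever sees the table and calls through it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical user-space stand-in for struct file_operations. */
struct toy_file_ops {
    long (*read)(char *buf, size_t len);
    int (*open)(void);
    int (*release)(void);
};

static int toy_open_count = 0;

static int toy_open(void)
{
    toy_open_count++;
    return 0;
}

static int toy_release(void)
{
    return 0;
}

static long toy_read(char *buf, size_t len)
{
    const char *text = "hello from the toy driver\n";
    size_t n = strlen(text);

    if (n > len)
        n = len;
    memcpy(buf, text, n);
    return (long)n;
}

/* The "driver" fills in the table... */
static const struct toy_file_ops toy_fops = {
    .read = toy_read,
    .open = toy_open,
    .release = toy_release,
};

/* ...and the "kernel" dispatches through it without knowing the driver. */
static long toy_vfs_read(const struct toy_file_ops *ops, char *buf, size_t len)
{
    assert(ops != NULL);
    if (!ops->read)
        return -1; /* operation not implemented by this driver */
    return ops->read(buf, len);
}
```

A table with a missing entry simply reports an error, which is roughly the situation the write() function in this module puts callers in.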

The module maintains some static variables, most importantly a buffer for the actual text of the file, a “read pointer” for keeping track of the location within the file, as well as a flag for whether the file has been opened.

static int Device_Open = 0;
static char msg[BUF_LEN];
static char *msg_Ptr;

When a process tries to open the device, the following function is executed:

static int device_open(struct inode *inode, struct file *filp)
{
  static int counter = 0;

  if (Device_Open)
    return -EBUSY;

  Device_Open++;
  sprintf(msg, "I already told you %d times Hello world!\n", counter++);
  msg_Ptr = msg;
  try_module_get(THIS_MODULE);

  return SUCCESS;
}

First, we check to see whether or not the device is currently opened elsewhere—if so, we return an error¹. Then, we fill up the buffer with a message that we create based on the number of times the file has been opened. Finally, we use the try_module_get() function, which I’ll explain a lot more in just a little bit.

Next, when the process reads from the file, our read function copies the data into their buffer:

static ssize_t device_read(struct file *filp, /* see include/linux/fs.h   */
                           char *buffer,      /* buffer to fill with data */
                           size_t length,     /* length of the buffer     */
                           loff_t *offset)
{
  int bytes_read = 0;

  if (*msg_Ptr == 0)
    return 0;

  while (length && *msg_Ptr) {
    /*
     * The buffer is in the user data segment, not the kernel segment so "*"
     * assignment won't work. We have to use put_user which copies data from the
     * kernel data segment to the user data segment.
     */
    put_user(*(msg_Ptr++), buffer++);
    length--;
    bytes_read++;
  }

  return bytes_read;
}

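An interesting thing to note is the use of put_user(). The buffer pointer comes from user-space, so the kernel can't simply trust it and dereference it: it might be invalid, or point at memory the process has no business touching. put_user() checks the address and copies a value across the kernel/user boundary safely, returning an error instead of crashing if the access fails.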
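The read loop is also what makes cat terminate: each call hands back whatever is left of the message, and once the read pointer reaches the terminating NUL, the next call returns 0, which the reader treats as end-of-file. Here is a user-space simulation of that behavior. It's a sketch, not the module's code: sim_open and sim_read are hypothetical names, and a plain assignment stands in for put_user() because both buffers live in the same address space here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define BUF_LEN 80

static char msg[BUF_LEN];
static char *msg_ptr;

/* Stand-in for device_open(): format the message, rewind the read pointer. */
static void sim_open(int counter)
{
    snprintf(msg, sizeof(msg),
             "I already told you %d times Hello world!\n", counter);
    msg_ptr = msg;
}

/* Stand-in for device_read(): copy up to `length` bytes and advance the
 * shared read pointer. Returns 0 once the message is exhausted. */
static long sim_read(char *buffer, size_t length)
{
    long bytes_read = 0;

    assert(msg_ptr != NULL); /* sim_open() must have been called first */

    while (length && *msg_ptr) {
        *buffer++ = *msg_ptr++; /* put_user() in the real module */
        length--;
        bytes_read++;
    }
    return bytes_read;
}
```

Calling sim_read() repeatedly with a small length returns the message piece by piece, then 0 forever after, until the next sim_open() resets the pointer.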

Finally, when the device is closed we decrement our usage count.

static int device_release(struct inode *inode, struct file *filp)
{
  Device_Open--;
  module_put(THIS_MODULE);
  return SUCCESS;
}

Again, we see a strange module_put() call, but let’s disregard that for a moment longer.

The remainder of the module contains an init and exit function to register and de-register the character device. I won’t bother to put those here. There is also a write() function that always returns an error, because writing to this file doesn’t make sense. You can see the complete code in this gist, which also contains a Makefile.

To try it all out, follow the steps below:

$ make
$ sudo insmod chardev.ko
$ dmesg | tail
# Read the message printed, and use the provided command to create a device
# file.
# EG:
$ sudo mknod /dev/chardev c 242 0
$ cat /dev/chardev
I already told you 0 times Hello world!
$ cat /dev/chardev
I already told you 1 times Hello world!

# Don't forget to clean up.
$ sudo rm /dev/chardev
$ sudo rmmod chardev

As you can see, the file behaves mostly like a normal file. It can be opened with your normal Linux utilities. The only noticeable difference here is that the number in the file increments each time it is opened.
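If you lose the dmesg output, the major number can usually be recovered from /proc/devices, which lists registered character and block devices. Below is a small sketch of that lookup. The function name char_device_major is hypothetical, and the name your module shows up under depends on what it passed when registering (I'm assuming something like "chardev").

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return the major number registered for a character device name,
 * or -1 if it is not listed (or /proc/devices is unavailable). */
static int char_device_major(const char *name)
{
    assert(name != NULL);

    FILE *f = fopen("/proc/devices", "r");
    if (!f)
        return -1;

    char line[256];
    int in_char_section = 0;
    int found = -1;

    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "Character devices:", 18) == 0) {
            in_char_section = 1;
            continue;
        }
        if (strncmp(line, "Block devices:", 14) == 0)
            break; /* past the section we care about */

        int major;
        char dev[128];
        if (in_char_section &&
            sscanf(line, "%d %127s", &major, dev) == 2 &&
            strcmp(dev, name) == 0) {
            found = major;
            break;
        }
    }

    fclose(f);
    return found;
}
```

With the module loaded, char_device_major("chardev") would give the number to pass to mknod, assuming that is the registered name.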

Module Reference Counts

So what’s with try_module_get() and this module_put() thing? They maintain the module’s reference count! Linux’s kernel module system is very cool, allowing code to be loaded and unloaded from the kernel at runtime. This is nice, but what happens if the user tries to remove a kernel module while it is in the middle of some important operation? The removal fails, because a correctly written module calls try_module_get() to mark itself as in use (like when its device file is open) and module_put() when it is done.
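The bookkeeping is simple enough to model in user-space. This toy sketch is not how the kernel implements it (struct toy_module and the toy_* functions are hypothetical, there is no locking, and the error codes are just for illustration), but it shows the contract: removal is refused while anyone holds a reference.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a loadable module; not the kernel's struct module. */
struct toy_module {
    int refcount;
    bool loaded;
};

/* Like try_module_get(): fails if the module is already gone. */
static bool toy_module_get(struct toy_module *m)
{
    assert(m != NULL);
    if (!m->loaded)
        return false;
    m->refcount++;
    return true;
}

/* Like module_put(): drop a reference taken earlier. */
static void toy_module_put(struct toy_module *m)
{
    assert(m != NULL && m->refcount > 0);
    m->refcount--;
}

/* Like rmmod: refuse to unload while references are outstanding.
 * The error codes are illustrative, not what the real syscall returns. */
static int toy_rmmod(struct toy_module *m)
{
    assert(m != NULL);
    if (!m->loaded)
        return -ENOENT;
    if (m->refcount > 0)
        return -EBUSY;
    m->loaded = false;
    return 0;
}
```

In the chardev module, device_open() plays the role of the get and device_release() the role of the put, so the module is pinned for exactly as long as the file is open.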

Of course, this immediately raises an exciting question: what happens if you disregard these safeguards? What would happen if you opened a device file, removed the module implementing it, and then tried to use the file? We can try this very easily by simply removing our get and put calls!

A nice quick way to try this is with a Python shell. First, make sure that you recompile, load the module, and create the device file. Then, in a separate terminal, pop open a Python shell and open the file:

$ python
>>> f = open('/dev/chardev', 'r')

Now, go back to your original terminal and rmmod the character device. Without the reference counts, the kernel happily removes the module and its data. Of course, there is still an open file out there containing pointers to our read and write functions, which are now simply patches of freed memory (or worse, memory belonging to newly loaded code). So, when we try to use the file, things go very, very wrong:

>>> f.read()

...

...

The process is no longer responsive, even to Control-C or any signal you can send it! The reason can be quickly discovered by checking your dmesg output. You’ll see a whole bunch of debug information, along with a report that says something like:

BUG: unable to handle kernel paging request at ffffffffa15ec008

For an example of the entire stack trace produced, check this out.

During the system call, the kernel had a segfault! When this happens to a user-space process, the kernel just sends SIGSEGV to the offending process, which typically kills it (unless the process explicitly handles the signal). But you can’t just kill the whole kernel just because some crappy kernel module developer named Stephen Brennan caused a segfault. So the kernel decides that the safest way to handle it is to suspend the process. It marks the process with TASK_UNINTERRUPTIBLE and puts it to sleep, so that the process will never be scheduled to run, and no signals may be delivered to it.
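For contrast, here is what the ordinary user-space version of this failure looks like. The sketch forks a child that writes to a page it has mapped with no access permissions, and the parent reports which signal killed it. The function name signal_from_bad_write is my own, and this assumes Linux.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS under strict -std= modes */
#include <assert.h>
#include <signal.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that writes to an inaccessible page. Returns the signal
 * that killed the child, 0 if it exited normally, or -1 on setup failure. */
static int signal_from_bad_write(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        long page = sysconf(_SC_PAGESIZE);
        volatile char *p = mmap(NULL, (size_t)page, PROT_NONE,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            _exit(2);
        p[0] = 'x'; /* faults: the page allows no reads or writes */
        _exit(0);   /* not reached if the fault is delivered */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return -1;

    assert(WIFSIGNALED(status) || WIFEXITED(status));
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

The child dies, the parent carries on, and everything is cleaned up. The kernel has no such parent to fall back on when the fault happens inside its own code.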

You can even check on the process with ps to confirm this:

$ ps `pgrep python`
  PID TTY      STAT   TIME COMMAND
32580 pts/5    D+     0:00 [python]

The D tells us that the process is in uninterruptible sleep! This (nearly) zombie process will be bumbling around your computer until you reboot it. Thankfully it’s fairly harmless (unless you have lots of them dancing around your computer).
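ps gets that state letter from /proc/<pid>/stat, so you can check it yourself with a few lines of C. This is a sketch with a hypothetical function name (process_state). The state is the field right after the parenthesized command name, and since the command name can itself contain spaces or parentheses, the code searches for the last closing parenthesis.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Return the one-letter scheduler state of a process ('R', 'S', 'D', ...),
 * or '\0' if it cannot be determined. */
static char process_state(pid_t pid)
{
    assert(pid > 0);

    char path[64];
    char line[512];

    snprintf(path, sizeof(path), "/proc/%ld/stat", (long)pid);

    FILE *f = fopen(path, "r");
    if (!f)
        return '\0';

    char *ok = fgets(line, sizeof(line), f);
    fclose(f);
    if (!ok)
        return '\0';

    /* Format: "pid (comm) state ...". Find the end of comm first. */
    char *rparen = strrchr(line, ')');
    if (!rparen || rparen[1] != ' ' || rparen[2] == '\0')
        return '\0';

    return rparen[2];
}
```

Pointed at the stuck Python process, this should report the same D that ps shows. A process asking about itself is running at that moment, so it sees R.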

If you liked this article, check out my other kernel development articles: Episode 1, Episode 3

Footnotes:

¹ Note that this is spectacularly poor synchronization. Two processes could concurrently open the file, and both make it past the if statement before incrementing Device_Open. If we truly wanted mutual exclusion, we would need to use some sort of locking mechanism, like a spinlock or mutex. In this case, it doesn’t really matter which (though in general that’s not true).
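To show the shape of a fix, here is a user-space sketch that makes the check and the update a single atomic step. Note that this swaps in a C11 compare-and-swap rather than the spinlock or mutex mentioned above, and that a real module would use the kernel's own primitives instead of <stdatomic.h>. The function names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

/* 0 = closed, 1 = open. Replaces the plain int Device_Open. */
static atomic_int device_open_flag = 0;

/* Claim the device only if nobody else holds it. The compare and the
 * store happen as one atomic operation, so two callers cannot both win. */
static int exclusive_open(void)
{
    int expected = 0;

    if (!atomic_compare_exchange_strong(&device_open_flag, &expected, 1))
        return -EBUSY;
    return 0;
}

/* Give the device back. Only the current holder should call this. */
static void exclusive_release(void)
{
    int was = atomic_exchange(&device_open_flag, 0);

    assert(was == 1); /* releasing a device that was never opened is a bug */
    (void)was;        /* silence unused-variable warnings under NDEBUG */
}
```

The original code's problem was the gap between testing Device_Open and incrementing it. Here there is no gap for a second opener to slip through.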


Stephen Brennan's Blog is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License