Random notes as I wrote a PCI kernel module

This post was written by eli on March 20, 2011
Posted Under: Linux kernel

These are a bunch of things I jotted down as I wrote a Linux kernel module for a PCI Express peripheral I developed.

About kernel module Makefiles

A great guide here.

lspci and setpci

lspci is quite well-known. What is less known is that it can show the tree structure of bridges and endpoints with

$ lspci -tv

lspci can also be used to get a hex dump of any card’s configuration space using the -x or -xxxx flag, possibly along with -v. The little trick with -x is that the dump goes byte by byte in address order. Since PCI configuration space is little endian, multi-byte registers show up with their least significant byte first, so they look reversed compared with the register’s value.
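To read a 32-bit register out of a -x dump, take four consecutive bytes and treat the first one as the least significant. A minimal userspace sketch of that; the function name is made up for this example and is not part of pciutils:

```c
#include <stdint.h>

/* Assemble a 32-bit register value from four consecutive bytes of a
 * hex dump, first byte least significant (hypothetical helper). */
uint32_t dump_bytes_to_reg32(const uint8_t *bytes)
{
	return (uint32_t)bytes[0] |
	       ((uint32_t)bytes[1] << 8) |
	       ((uint32_t)bytes[2] << 16) |
	       ((uint32_t)bytes[3] << 24);
}
```

So the bytes 01 bc 00 00 in the dump stand for the register value 0000bc01.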

setpci is useful to look at specific registers in the card’s configuration space (and possibly alter them). For example, look at two registers of a PCI card picked by Vendor/Device IDs (no need to be root on my computer):

$ setpci -v -d 10b7:9300 BASE_ADDRESS_0 BASE_ADDRESS_1
0000:05:05.0 @10 = 0000bc01
0000:05:05.0 @14 = fbbfe000
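
The two values above are raw BAR contents, with flag bits at the low end. A minimal sketch of picking them apart, going by the standard BAR bit layout (bit 0 tells I/O from memory; for memory BARs, bits 2:1 give the type and bit 3 the prefetchable flag). The struct and function names are invented for this example:

```c
#include <stdbool.h>
#include <stdint.h>

/* Decoded view of a 32-bit Base Address Register value
 * (hypothetical struct, for this sketch only). */
struct bar_info {
	bool is_io;        /* bit 0: 1 = I/O space, 0 = memory space */
	bool is_64bit;     /* memory BARs only: bits 2:1 == 2 */
	bool prefetchable; /* memory BARs only: bit 3 */
	uint32_t base;     /* raw value with the flag bits masked off */
};

struct bar_info decode_bar(uint32_t raw)
{
	struct bar_info info = { 0 };

	info.is_io = raw & 0x1;
	if (info.is_io) {
		/* I/O BAR: the two lowest bits are not address bits */
		info.base = raw & ~UINT32_C(0x3);
	} else {
		/* Memory BAR: the four lowest bits are flags. For a
		 * 64-bit BAR this is only the lower half of the address. */
		info.is_64bit = ((raw >> 1) & 0x3) == 0x2;
		info.prefetchable = (raw >> 3) & 0x1;
		info.base = raw & ~UINT32_C(0xf);
	}
	return info;
}
```

With the values above, 0000bc01 decodes as an I/O BAR at 0xbc00, and fbbfe000 as a 32-bit, non-prefetchable memory BAR at 0xfbbfe000.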

To get a list of registers (such as BASE_ADDRESS_0) just go

$ setpci --dumpregs

__devexit_p ???

static struct pci_driver xillybus_driver = {
  .name = "xillybus",
  .id_table = xillyids,
  .probe = xilly_probe,
  .remove = __devexit_p(xilly_remove),
};

The __devexit_p is a macro turning into NULL when the module is compiled into the kernel itself (as opposed to a loadable module) and hotplugging isn’t configured. Since the driver can never be removed in that case, this apparently gives the build an opportunity to drop the remove function altogether, with the macro making sure that nothing points at it.
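
Roughly, the macro boils down to a conditional definition. The following is a userspace model of the idea, not the kernel’s own header: the SKETCH_* names are made up and stand in for the kernel’s MODULE and CONFIG_HOTPLUG symbols, and the struct is a toy version of struct pci_driver.

```c
#include <stddef.h>

/* Pretend we're building a loadable module. Remove this line (and
 * leave SKETCH_HOTPLUG undefined) to get the NULL variant. */
#define SKETCH_MODULE 1

#if defined(SKETCH_MODULE) || defined(SKETCH_HOTPLUG)
#define SKETCH_DEVEXIT_P(fn) (fn)   /* driver can go away: keep pointer */
#else
#define SKETCH_DEVEXIT_P(fn) NULL   /* driver never goes away: no pointer */
#endif

/* Toy stand-in for struct pci_driver */
struct sketch_driver {
	const char *name;
	void (*remove)(void);
};

void sketch_remove(void)
{
	/* cleanup would go here */
}

struct sketch_driver sketch_drv = {
	.name = "sketch",
	.remove = SKETCH_DEVEXIT_P(sketch_remove),
};
```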

Talking with your PCI device in 5 steps

(or: How to initialize a PCI device in the probe function)

  • Enable the device: pci_enable_device(pdev);
  • Check that it’s mapped as you expect it (optional): if (!(pci_resource_flags(pdev, the_bar_you_want) & IORESOURCE_MEM)) { fail here }
  • Declare yourself the exclusive owner of the device (give or take /dev/mem): pci_request_regions(pdev, "your-driver-name");
  • Get a handle for I/O operations. Don’t let the function’s name confuse you. It applies to memory BARs as well as I/O BARs: pointer = pci_iomap(pdev, the_bar_you_want, length);

And when this is successful, reading and writing is possible with iowrite32(i, pointer) or ioread32(pointer).

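“pointer” is just a pointer of any type (the kernel treats it as void __iomem *), and there are 16-bit and 8-bit accessors as well. Keep in mind that ioread32() and iowrite32() treat the device as little endian, which is the PCI convention, and convert to the CPU’s byte order (nothing to do on x86). So if the logic on the device considers the first byte of a 32-bit word as the most significant one, the number arrives twisted around.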
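
“Twisted around” means a full byte reversal of the 32-bit word. A minimal sketch of undoing it by hand; the function name is made up for illustration, and kernel code would normally rely on the kernel’s own byte-order helpers rather than something like this:

```c
#include <stdint.h>

/* Reverse the byte order of a 32-bit word (hypothetical helper). */
uint32_t reverse_bytes32(uint32_t v)
{
	return (v >> 24) |
	       ((v >> 8) & 0x0000ff00u) |
	       ((v << 8) & 0x00ff0000u) |
	       (v << 24);
}
```

For example, 0x12345678 written on one side and read byte-reversed on the other shows up as 0x78563412.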

It’s also common to allow the device to be master on the bus (necessary for DMA access and MSI interrupts): pci_set_master(pdev);

As for includes, I got away with only this set:

#include <linux/pci.h>
#include <linux/device.h>
#include <linux/io.h>
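
The order of the calls matters mostly when something fails halfway, because whatever was set up has to be undone in reverse. Below is a userspace model of that control flow, not driver code: step() is a stub that only logs the name of the kernel call it stands for, and fail_at is a made-up knob for faking a failure. The pci_release_regions() and pci_disable_device() calls in the error path are not from the list above; they are the usual counterparts of the request and enable calls. The real pci_iomap() returns a pointer (NULL on failure), which the model flattens into an int.

```c
#include <string.h>

/* Log of the "calls" made, and the step that should pretend to fail. */
char call_log[256];
const char *fail_at = "";

/* Stub standing in for a kernel call: logs its name, returns 0 on
 * success or -1 if this is the step chosen to fail. */
static int step(const char *name)
{
	strcat(call_log, name);
	strcat(call_log, " ");
	return strcmp(name, fail_at) == 0 ? -1 : 0;
}

/* Same order as the steps above; on failure, undo in reverse order. */
int probe_sketch(void)
{
	int rc;

	rc = step("pci_enable_device");
	if (rc)
		return rc;

	rc = step("pci_request_regions");
	if (rc)
		goto err_disable;

	rc = step("pci_iomap");
	if (rc)
		goto err_release;

	step("pci_set_master");
	return 0;

err_release:
	step("pci_release_regions");
err_disable:
	step("pci_disable_device");
	return rc;
}
```

In a real probe function the same goto ladder would wrap the actual calls, with pdev and the BAR number passed along.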

Can the probe function sleep?

That was important for me, because I went for a heavy setup process involving DMA transfers in the very beginning. The following encouraging sentence was found in Documentation/pci.txt, section 1:

The probe function always gets called from process context, so it can sleep.

When PCIe card isn’t responsive

If the PCIe interface logic does nothing about requests sent to a legal BAR address, it’s OK with iowrite32() operations (since they’re posted, so nobody bothers to check what happened with them), but ioread32() will make my computer freeze. It’s a complete kernel crash without even an oops message. It looks like the processor locks on waiting for the completion packet, which never arrives.

Conclusion: Messing with the FPGA while the host PC is on will most likely hang the entire system, if the host attempts to do something with that interface.

This is actually surprising, because section 2.8 of the PCI Express spec 1.1 is very clear about a mandatory timeout mechanism, and that “The Completion Timeout timer must expire if a Request is not completed in 50 ms.” Is this my G31 chipset not meeting spec?

DMA through PCI calls?

Several write-ups resembling the kernel’s own DMA-API-HOWTO.txt are out there. I go for this one, because it says a few words about using the PCI-specific functions, which are used extensively in the kernel code.

__get_free_page(s) vs. alloc_page(s)

That’s confusing. Both get you either one page of PAGE_SIZE of memory (4096 bytes in most cases), or several of them (the -s versions), but __get_free_pages returns the memory’s kernel virtual address, just like kmalloc does, while alloc_pages returns page information via a struct page pointer. For DMA, the former’s memory chunk is mapped with e.g. pci_map_single(), but the latter with dma_map_page(). Some more details here.

MSI interrupt handlers

Unlike classic IRQ handlers, which may be called because another device sharing the interrupt line fired, an MSI handler has no reason to check whether its interrupt is really pending. Yet another good reason to use MSI.

Volatile on wake-up conditions?

If I’m waking up on some condition like

wait_event_interruptible(privdata->waitq, (privdata->flag != 0));

should the “flag” entry be marked as “volatile” in the privdata structure? I mean, an interrupt will change its value, so “volatile” is the old way to do this. The answer is no, no, and again no. One should trust the kernel’s API to handle the volatility issue, since the kernel’s underlying opinion (which seems pretty justified) is that “volatile” is inherently bad.

Mutexes and spinlocks

Things I always heard people say about mutexes, and still worth emphasizing:

  • Use as much mutex granularity as possible, as long as it’s structured enough not to end up in a deadlock.
  • Set up a logical who’s-locked-first schema, and write it as a comment in the code. It goes something like mutex_rd -> mutex_wr -> spinlock_rd -> spinlock_wr and so on. So it’s absolutely clear that if mutex_wr is taken, mutex_rd must not be asked for, but spinlock_wr may. One is allowed to skip mutexes in the list, but not go backwards.
  • Whenever possible, release mutexes before going to sleep (waiting for events in particular, and even more in particular if the waiting has no timeout). For example, waiting for data in a read() method handler with the mutex taken may sound reasonable if the mutex is only related to read() operations, but what if the device is open()ed a second time during this sleep? All of a sudden the open() blocks for no apparent reason.
  • … but this makes it a bit tricky: We’re awakened because our condition was met (say, data arrived) but while we were trying to take the mutex for handling the event, the data was read() by some other process, and the state data structure was also completely changed. So after regaining the mutex, the overall situation must be reassessed from scratch.
  • And of course, no sleeping with spinlocks on.
  • Use the wait_event_* functions’ return values to tell why we were woken up, and don’t rely on the condition tested for that. This goes in particular if we sleep without the mutex, because the flag may be altered by another process from the moment the event was triggered until our process actually got the CPU.
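
The who’s-locked-first rule can be written down as a tiny check. This is a userspace sketch with made-up names, using the example ordering from the list above; it is not something the kernel provides in this form. Each lock gets a rank, and a lock may be taken only if its rank is higher than that of every lock already held.

```c
#include <stdbool.h>

/* Lock ranks, in the order the locks must be taken. */
enum lock_rank {
	MUTEX_RD,
	MUTEX_WR,
	SPINLOCK_RD,
	SPINLOCK_WR
};

/* held is a bitmask with bit N set if the lock of rank N is held.
 * Taking "wanted" is allowed only if no lock of the same or a higher
 * rank is held: skipping ranks is fine, going backwards is not. */
bool may_take(unsigned int held, enum lock_rank wanted)
{
	return (held >> wanted) == 0;
}
```

So with mutex_wr held, may_take() says yes to spinlock_wr and no to mutex_rd, which is exactly the rule in the comment one would put in the code.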
