Linux kernel programming: Do I need a lock?

This post was written by eli on April 11, 2020
Posted Under: Linux kernel

Introduction

Writing a device driver for Linux (or other kernel programming) always requires keeping parallel execution in mind. It’s often enough to follow common programming patterns, using spinlocks and mutexes in the same way that everyone else does. But somehow, I always find myself looking at my own code, and ask myself: Am I absolutely sure that this will work? Isn’t there any crazy race condition that will show up once in a while and make things fail horribly in the most elusive way?

Or even worse: Will this work properly if CPU 1 does this and CPU 2 does that? When can I just ignore the fact that one piece of code will run on one CPU and the other piece on another?

Maybe because my main expertise is logic design (FPGA), I have this thing about looking for by-definition guarantees for proper functionality. This is a good habit in programming in general, I believe, and in kernel programming in particular.

I’ve already written a post with a similar topic, but at the time the focus was on memory mapped I/O. This somewhat messy post it’s all about the programming model. Quite ironically, it’s difficult to find the right order of topics in a post discussing parallel execution.

Comments and corrections in particular are welcome in comments below. The main incentive to write this post is actually to validate by view on this topic.

The curse of awareness

One may do kernel programming happily until the concept of memory barriers come to mind. The awareness that both the compiler and the processor may reorder, optimize away or even add read and write operations to memory kind makes one wonder how any trivial piece of code even works. The notion that memory writes may not propagate as once might expect between CPUs in an SMP processor (they all are nowadays) undermines that basic, naïve, assurance that all variables have the values that were last assigned to them, even if that took place earlier by a different function. I mean, what if that function ran on a different processor? Did the changes all make it to the new one?

Memory barriers, by themselves, offer a disappointing level of assurance: An SMP write barrier (smp_wmb() ) just says that all writes until that barrier are guaranteed to have been propagated before the writes that came after it. The SMP read barrier (smp_rmb() ) likewise says that all reads before the barrier are guaranteed to appear to have been done before those that came after the barrier.

This isn’t very convincing, but at least is allows setting up a “valid” flag for some data. For example, one thread can fill a data structure, call smp_wmb() and then write to a flag in the data structure that says it’s OK for use. The other thread can busy-wait on that same flag, and once it’s set, call smp_rmb() and then access the data. Because of the use of both barriers, the data is guaranteed to be valid when it’s read by the second thread.

But who’s being thinking about memory barriers? What about all code I’ve written until now? Did it work just by chance?

As I’ve noted in that earlier post, the SMP memory barriers translate into nothing on an x86 platform. This is really bad news, because most of us develop on these platforms, so if the barrier jugglery is done incorrectly, it shows only when the code runs on a different platform. In other words, the complaint will come later and out of nowhere.

Read the kernel docs

Before diving deeper into the subtle issues, here’s a list of references. The related docs in the kernel’s sources (as of v5.3, these tend to move):

  • Documentation/memory-barriers.txt
  • tools/memory-model/Documentation/recipes.txt
  • tools/memory-model/Documentation/explanation.txt
  • Documentation/atomic_t.txt
  • Documentation/core-api/workqueue.rst

The simple rule

If you’re absolutely sure that a taking a spinlock or mutex will never wait for anything, don’t take it. Maybe consider a memory barrier.

Put otherwise: If there’s a code segment A, that is guaranteed to reach its end before code segment B begins its execution, it’s functionally guaranteed to behave as if they ran sequentially on the same processor — all data is updated at the beginning of segment B as at the end of segment A.

The keyword here is guaranteed: It means that some means of the kernel API is relied upon in this matter. There are several such means available, for example:

  • Kernel locks (spinlocks, semaphores, mutexes) or course. But that’s what we’re trying to avoid.
  • Mechanisms that ensure non-reentrance. For example, tasklets and work items are guaranteed to run on one CPU in the entire system at most.
  • Calls to synchronization functions: For example, after calling cancel_work_sync() or flush_workqueue(), the thread that called this functions has the same view as the last related work item (and possibly changes that occurred afterwards, of course).
  • Device drivers can rely on that disconnect() isn’t called along with probe(), and that the release fops method isn’t called before that last read or write method has returned.
  • … and there are tons of other, of course.

The rationale behind this rule is that if this data dependency rule wasn’t guaranteed, virtually nothing in the kernel would work. Synchronization of data is in no programmer’s mind until there’s an explicit possibility for parallel execution (including interrupts). And even when there’s parallel execution in mind, programmers use locks, and (correctly) assume that the ball is passed to the thread holding the lock, along with the entire memory view being synchronized as well.

In short: If you have the relevant lock, or are otherwise guaranteed by the kernel API that nobody’s currently fiddling with the exact same piece of memory, you have the right to ignore any consistency issues. Just because everyone else does, so the kernel API makes sure it actually works that way.

Waking up a waiting process

A special case is the wait queue: One thread calling one of the wait_event*() API functions to sleep until a condition occurs, and a second thread (often an interrupt) calls wake_up*(). If this relationship ensured exclusive execution (i.e. the thread that calls wait_event*() won’t run again until the second one is finished), then the second thread may assume that it’s synchronized with the first thread at the place that it made the wake_up call.

This is important in particular regarding the condition that is passed to the wait_event*() call. It’s however quite common to use a spinlock on the second thread when accessing the common data, since it can’t be guaranteed the the first thread won’t be invoked again (in particular if it’s an ISR).

Lock-this lock-that

So here’s the catch: It’s tempting to take the some kind of lock to feel safer. Something like:

spin_lock_irqsave(&x->spinlock, flags);
safe_value = x->value;
spin_unlock_irqrestore(&x->spinlock, flags);

and with this, have a comfy feeling about safe_value being correct somehow. After all, its value was obtained with protection gear on. But was that protection necessary?

A lock that is inserted today will be very difficult to remove in the future, be it because doing so requires recalling all considerations that are fresh in your memory today. And this unnecessary lock may cause difficulties, in particular for adding features that aren’t in sight today. So by all means, better safe than sorry is not an excuse for a lock. Think it through, for each and every one.

Access with no protection at all

So what happens if we read or write carelessly from a memory region that is possibly accessed by other threads? Well, the first thing to consider is that the compiler may play some nasty pranks with optimization. The classic example is this failing busy-wait loop

while (!x->ready);

which is supposed to wait until x->ready becomes true by virtue of some other thread changing it. Almost any optimizing compiler will interpret this as a single-thread program, hence x->ready won’t change by itself, and implement the equivalent of

if (!x->ready)
  while (1);

or in other words, an infinite loop. At the compiler level, this can be solved with the volatile attribute, however odds are that a patch containing that word will be rejected. It’s considered a red flags saying you’ve misunderstood something. Even though quite a few drivers in the kernel tree use volatile.

Rather, READ_ONCE() and WRITE_ONCE() should be used when an explicit read or write of memory should take place. These prevent the compiler’s optimization (typically by using the volatile keyword, actually) but also the compiler’s reordering of successive READ_ONCE() and WRITE_ONCE(). Note however that these don’t imply memory barriers — they tell the compiler what to do, but not (necessarily) the processor. In particular, the implementation of these for most processors leaves it free to reorder the accesses as long as the result is the same as a single-thread program. In other words: If another CPU fiddles with the same memory regions, this reordering can bite back.

So if memory barriers are used, are READ_ONCE() and WRITE_ONCE() needed? If memory barriers are set correctly along with plain C memory accesses to ensure correct ordering with respect to another CPU, and the compiler’s optimization stunts are neutralized anyhow, why bother? And if it’s for I/O, use readl() / writel() or similar functions.

recipes.txt explains that:

If there are multiple CPUs, accesses to shared variables should use READ_ONCE() and WRITE_ONCE() or stronger to prevent load/store tearing, load/store fusing, and invented loads and stores.

Say what? Load/store tearing? That means that, for example, a 32-bit processor executes a write of 32-bit word (an int) in a non-atomic manner? Splitting it into bytes? Huh?

This is discussed further in explanations.txt, in the section named THE READS-FROM RELATION, which also emphasizes:

It requires all accesses to be properly aligned and of the location’s actual size.

To make a long story short: No sane processor will split a natural, aligned access of a work into a non-atomic operation. But to be safe, use READ_ONCE and WRITE_ONCE when it’s important to nail down a memory access, and be sure that the address is aligned to the element’s size.

This discussion has another important implication: When using WRITE_ONCE() and READ_ONCE() with aligned addresses, one can rely on that the value that is read with READ_ONCE() is one of those that were written by some WRITE_ONCE() operation in the past, and never some kind of intermediate junk value, even when it’s read from another CPU. So even if we don’t know exactly when the value of WRITE_ONCE() will reach the other CPUs (not even if it’s before or after other WRITE_ONCE’s), we know it’s always seen with one of the sane values it was assigned with.

As a side note, it’s often tempting to ask oneself whether the WRITE_ONCE will be seen “immediately” by another thread. I just wrote the value, is it on the other CPU yet? And now? So well, it’s not a valid question. If the dependence between the two threads is important, a mechanism should be put in place to ensure synchronization. Locks, memory barriers, whatever. If it isn’t, the other thread should work perfectly fine reading the value before the WRITE_ONCE().

When to use atomic_t

Consider this:

spin_lock_irqsave(&x->spinlock, flags);
x->value++;
spin_unlock_irqrestore(&x->spinlock, flags);

Among others, this makes sure that no increment gets lost because two CPUs were doing it at the same time. But what if this is the only way x->value is changed? Is the spinlock necessary at all? Why not just this?

x->value++;

Like, find me a processor that doesn’t have an atomic increment opcode.

Well, the immediate answer is that the value after the increment may not have propagated to a neighboring processor, so if it also increments the same field more or less at the same time, both add 1 to the same value, and one increment is lost.

So the correct way for doing this is to define x->value of atomic_t type, and then

atomic_inc(&x->value);

possibly without any lock whatsoever. This guarantees that the increment is accounted for properly, even when two CPUs do this in parallel. Or as written in explanations.txt:

What does it mean to say that a read-modify-write (rmw) update, such as atomic_inc(&x), is atomic? It means that the memory location (x in this case) does not get altered between the read and the write events making up the atomic operation. In particular, if two CPUs perform atomic_inc(&x) concurrently, it must be guaranteed that the final value of x will be the initial value plus two.

As for ordering and memory barriers, some commands offer this and others don’t. Refer to atomic_t.txt for more information on that. For example, atomic_inc() gives no ordering protection whatsoever.

There are, of course, a whole range of other atomic operations supported. See atomic_t.txt.

As for plain reads and writes of atomic_c, there are dedicated functions: atomic_read() and atomic_set(), which are implemented with just READ_ONCE() and WRITE_ONCE(). Or as written in atomic_t.txt:

The non-RMW ops are (typically) regular LOADs and STOREs and are canonically implemented using READ_ONCE(), WRITE_ONCE(), smp_load_acquire() and smp_store_release() respectively. Therefore, if you find yourself only using
the Non-RMW operations of atomic_t, you do not in fact need atomic_t at all and are doing it wrong.

Simply put: atomic_t is for atomic modifications, like atomic_inc(). Don’t use it where READ_ONCE() and WRITE_ONCE() would do the job literally equally well.

Once again, it’s tempting to ask oneself when and how soon the other CPU sees the incremented value, and the answer remains the same: It should work just fine whether it sees the value before or after the increment, and if it doesn’t, you’re doing it wrong.

Can work items on the same workqueue run concurrently?

Or put the other way around, is sequential execution ensured? If work item A and B are queued on the same workqueue, do I need any locks to protect stuff that both access? Or can I rely on the kernel to prevent concurrent execution?

The truth is that I don’t have a definite answer on this one. Comments are welcome (below).

On one hand, the official kernel docs say

… the worker executes the functions associated with the work items one after the other.

The “one after the other” thing sounds quite reassuring. But does “after the other” relate to after executing the previous item, or after it has finished? That brings me to the other hand, which is reading the source. Namely process_one_work() in kernel/workqueue.c, which is called, among others, by process_scheduled_works() for the obvious purpose. Actually, it’s enough to read the comments associated with the former:

As long as context requirement is met, any worker can call this function to process a work.

and

A single work shouldn’t be executed concurrently by multiple workers on a single cpu. Check whether anyone is already processing the work. If so, defer the work to the currently executing one.

Actually, a single work shouldn’t be executed concurrently at all, as far as I know. But the point is that the test relates to a work item against itself, and it says nothing about other work items.

So my conclusion, for which comments are welcome, is as follows:

  • A lock is required to ensure synchronization between different work items, even if they are queued on the same workqueue…
  • … but things will most likely work perfectly fine without a lock in this scenario, because odds are that only a single worker thread will be allocated for the workqueue. In other words, the fact that it works without a lock doesn’t mean you’ve done it right.
  • Sleeping in a work item is likely to delay other work items in the same workqueue. The fact that a lock is needed to protect against concurrency doesn’t mean that the kernel will produce another work thread when you want it.

Bonus: A note on locks

I didn’t find a good place to push this in, but I still want this on this post.

From recipes.txt, on locking:

Any CPU that has acquired a given lock sees any changes previously seen or made by any CPU before it released that same lock.

Note that it says “seen or made” not just made. In other words, getting the lock gives the complete memory view of the CPU that released it, not just changes it made. Also note that it says “lock” — that means spinlocks as well as mutexes.

This is most likely how the kernel API ensures the synchronization between segments of code that are guaranteed not to run in parallel: There is always some kind of lock on some kind of data structure that says “X can now run, Y can’t”.

Add a Comment

required, use real name
required, will not be published
optional, your blog address

Previose Post: