Microsoft Windows: Atomic ops and memory barriers

This post was written by eli on May 13, 2021, and posted under Microsoft, Software, Windows device drivers.

Introduction

This post examines what Microsoft’s compiler does in response to a variety of special functions that implement atomic operations and memory barriers. If you program like a civilized human being, that is, with spinlocks and mutexes, these are things you should never need to care about.

I’ve written a similar post regarding Linux, and it’s recommended to read it before this one (even though it repeats some of the stuff here).

To make a long story short:

  • The Interlocked-something functions don’t just guarantee atomicity; they also function as a memory barrier for the compiler.
  • Memory fences are unnecessary for the sake of synchronizing between processors (on x86).
  • The conclusion is hence that those Interlocked functions also work as multiprocessor memory barriers.

Trying it out

The following code:

LONG atomic_thingy;

__declspec(noinline) LONG do_it(LONG *p) {
  LONG x = 0;
  LONG y;
  x += *p;
  y = InterlockedExchangeAdd(&atomic_thingy, 0x1234);
  x += *p;

  return x + y;
}

Compiling this as “free” (i.e. optimized) yields:

_do_it@4:
  00000000: 8B FF              mov         edi,edi
  00000002: 55                 push        ebp
  00000003: 8B EC              mov         ebp,esp
  00000005: 8B 45 08           mov         eax,dword ptr [ebp+8]
  00000008: 8B 10              mov         edx,dword ptr [eax]
  0000000A: 56                 push        esi
  0000000B: B9 34 12 00 00     mov         ecx,1234h
  00000010: BE 00 00 00 00     mov         esi,offset _atomic_thingy
  00000015: F0 0F C1 0E        lock xadd   dword ptr [esi],ecx
  00000019: 8B 00              mov         eax,dword ptr [eax]
  0000001B: 03 C1              add         eax,ecx
  0000001D: 03 C2              add         eax,edx
  0000001F: 5E                 pop         esi
  00000020: 5D                 pop         ebp
  00000021: C2 04 00           ret         4

The first thing to note is that InterlockedExchangeAdd() has been translated into a “LOCK XADD”, which fetches the previous value into ECX and stores the updated value into memory. That previous value in ECX is @y in the C code: InterlockedExchangeAdd() returns the previous value.

XADD by itself isn’t atomic, which is why the LOCK prefix is added. More about LOCK below.

What’s crucially important to note is that putting InterlockedExchangeAdd() between the two reads of *p prevents the compiler from optimizing these two reads into one. @p isn’t declared volatile, and yet it’s read twice (dword ptr [eax]).

Another variant, now trying InterlockedOr():

LONG atomic;

__declspec(noinline) LONG do_it(LONG *p) {
  LONG x = 0;
  LONG y;
  x += *p;
  y = InterlockedOr(&atomic, 0x1234);
  x += *p;

  return x + y;
}

Once again compiled as “free”, this turns into:

_do_it@4:
  00000000: 8B FF              mov         edi,edi
  00000002: 55                 push        ebp
  00000003: 8B EC              mov         ebp,esp
  00000005: 8B 4D 08           mov         ecx,dword ptr [ebp+8]
  00000008: 8B 11              mov         edx,dword ptr [ecx]
  0000000A: 53                 push        ebx
  0000000B: 56                 push        esi
  0000000C: 57                 push        edi
  0000000D: BE 34 12 00 00     mov         esi,1234h
  00000012: BF 00 00 00 00     mov         edi,offset _atomic
  00000017: 8B 07              mov         eax,dword ptr [edi]
  00000019: 8B D8              mov         ebx,eax
  0000001B: 0B DE              or          ebx,esi
  0000001D: F0 0F B1 1F        lock cmpxchg dword ptr [edi],ebx
  00000021: 75 F6              jne         00000019
  00000023: 8B F0              mov         esi,eax
  00000025: 8B 01              mov         eax,dword ptr [ecx]
  00000027: 5F                 pop         edi
  00000028: 03 C6              add         eax,esi
  0000002A: 5E                 pop         esi
  0000002B: 03 C2              add         eax,edx
  0000002D: 5B                 pop         ebx
  0000002E: 5D                 pop         ebp
  0000002F: C2 04 00           ret         4

This is quite amazing: The OR isn’t implemented as a single atomic instruction. Rather, it goes like this: The previous value of @atomic is fetched into EAX and then copied to EBX. EBX is ORed with the constant 0x1234, and then the cmpxchg instruction writes the result (in EBX) into @atomic, but only if the value there still equals EAX. If not, the current value is loaded into EAX instead, and the JNE loops back to try again.

I should mention that cmpxchg compares the destination with EAX, and loads the destination’s current value into EAX if the comparison fails, even though this register isn’t mentioned explicitly in the instruction. It’s just the way cmpxchg works: EAX is implicit. EBX holds the new value to be written if the comparison succeeds, which is why it’s the register that appears explicitly in the instruction. Confusing, yes.
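
In C terms, the loop that the compiler generated for InterlockedOr() is roughly equivalent to the following sketch (the function name is mine, for illustration only; the real assembly doesn’t re-read @atomic on a retry, but reuses the value that cmpxchg leaves in EAX):

LONG interlocked_or_sketch(LONG volatile *atomic, LONG mask) {
  LONG expected, desired;

  do {
    expected = *atomic;          // mov eax, dword ptr [edi]
    desired = expected | mask;   // mov ebx,eax / or ebx,esi
    // lock cmpxchg: store desired only if *atomic still holds expected.
    // Either way, the value that was found there is returned.
  } while (InterlockedCompareExchange(atomic, desired, expected) != expected);

  return expected;               // the previous value, just like InterlockedOr()
}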

Also here, *p is read twice.

It’s also worth noting that using InterlockedOr() with the value 0, a common way to grab the current value, yields basically the same code (only the constant is loaded with a “xor esi,esi” instead).

So if you want to use an Interlocked function just to read from a variable, InterlockedExchangeAdd() with zero is probably better, because it doesn’t create all that loop code around it.
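
Just to make this concrete, here’s a sketch of the two read-only idioms (my own code, not taken from any Microsoft sample):

LONG read_with_or(LONG volatile *p) {
  return InterlockedOr(p, 0);          // compiles into a cmpxchg loop on x86
}

LONG read_with_xadd(LONG volatile *p) {
  return InterlockedExchangeAdd(p, 0); // compiles into a single lock xadd
}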

Another function I’d like to look at is InterlockedExchange(), as it’s used a lot. Spoiler: No surprises are expected.

LONG atomic_thingy;

__declspec(noinline) LONG do_it(LONG *p) {
  LONG x = 0;
  LONG y;
  x += *p;
  y = InterlockedExchange(&atomic_thingy, 0);
  x += *p;

  return x + y;
}

and this is what it compiles into:

_do_it@4:
  00000000: 8B FF              mov         edi,edi
  00000002: 55                 push        ebp
  00000003: 8B EC              mov         ebp,esp
  00000005: 8B 45 08           mov         eax,dword ptr [ebp+8]
  00000008: 8B 10              mov         edx,dword ptr [eax]
  0000000A: 56                 push        esi
  0000000B: 33 C9              xor         ecx,ecx
  0000000D: BE 00 00 00 00     mov         esi,offset _atomic_thingy
  00000012: 87 0E              xchg        ecx,dword ptr [esi]
  00000014: 8B 00              mov         eax,dword ptr [eax]
  00000016: 03 C1              add         eax,ecx
  00000018: 03 C2              add         eax,edx
  0000001A: 5E                 pop         esi
  0000001B: 5D                 pop         ebp
  0000001C: C2 04 00           ret         4

And finally, what about writing twice to the same place?

LONG atomic_thingy;

__declspec(noinline) LONG do_it(LONG *p) {
  LONG y;
  *p = 0;
  y = InterlockedExchangeAdd(&atomic_thingy, 0);
  *p = 0;

  return y;
}

Writing the same constant value twice to a non-volatile variable. This calls for an optimization. But the Interlocked function prevents it, as expected:

_do_it@4:
  00000000: 8B FF              mov         edi,edi
  00000002: 55                 push        ebp
  00000003: 8B EC              mov         ebp,esp
  00000005: 8B 4D 08           mov         ecx,dword ptr [ebp+8]
  00000008: 83 21 00           and         dword ptr [ecx],0
  0000000B: 33 C0              xor         eax,eax
  0000000D: BA 00 00 00 00     mov         edx,offset _atomic_thingy
  00000012: F0 0F C1 02        lock xadd   dword ptr [edx],eax
  00000016: 83 21 00           and         dword ptr [ecx],0
  00000019: 5D                 pop         ebp
  0000001A: C2 04 00           ret         4

Writing a zero is implemented by ANDing with zero, which is a bit confusing. But it’s done twice, before and after the “lock xadd”. Needless to say, without the Interlocked call in the middle, these two writes are fused into one by the compiler (I’ve verified this with disassembly, 32 and 64 bit alike).

Volatile?

Microsoft’s definition of the InterlockedExchangeAdd() function (and all other similar ones) has the first operand as a pointer to a LONG volatile. Why volatile? Does the variable really need to be?

The answer is no, if all accesses to the variable are made with Interlocked-something functions: the compiler doesn’t optimize these calls away, and each call is also a compiler memory barrier.

And it’s a good habit to stick to these functions, because of this implicit compiler memory barrier: that’s usually what we want and need, even if we’re not fully aware of it. Accessing a shared variable almost always comes with some “before” and “after” reasoning around it. The volatile keyword, by itself, doesn’t protect against reordering optimizations by the compiler (or otherwise).

But if the variable is accessed without these functions in some places, the volatile keyword should definitely be used when defining that variable.
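
As an illustration (my own sketch, not taken from any real driver): a flag that is written with an Interlocked function but polled with plain reads elsewhere should be declared volatile, or the polling loop may be optimized into a single read:

volatile LONG stop_flag = 0;

void request_stop(void) {
  InterlockedExchange(&stop_flag, 1);  // atomic write + compiler memory barrier
}

void worker_loop(void) {
  while (!stop_flag) {                 // plain read: volatile keeps it an actual load
    // ... do some work ...
  }
}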

More about LOCK

It’s a prefix that is added to an instruction in order to ensure that the instruction is performed atomically. In some cases it’s superfluous, as the operation is atomic anyhow; xchg with a memory operand is the classic example, which is why the disassembly of InterlockedExchange() above has no lock prefix.

From Intel’s 64 and IA-32 Architectures Software Developer’s Manual, Volume 2 (instruction set reference) vol. 2A page 3-537, on the “LOCK” prefix for instructions:

Causes the processor’s LOCK# signal to be asserted during execution of the accompanying instruction (turns the instruction into an atomic instruction). In a multiprocessor environment, the LOCK# signal ensures that the processor has exclusive use of any shared memory while the signal is asserted.

The manual elaborates further on the LOCK prefix, but says nothing about memory barriers / fences. Those are implemented with the MFENCE, SFENCE and LFENCE instructions.

By the way, the LOCK prefix is encoded as an 0xF0 byte preceding the instruction in the binary code (visible as the leading F0 in the disassemblies above).

Linux counterparts

For x86 only, of course:

  • atomic_set() turns into a plain “mov”
  • atomic_add() turns into “lock add”
  • atomic_sub() turns into “lock sub”
I’m not sure that there are any Windows functions for exactly these.
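
If needed, roughly the same effects can be obtained with functions that definitely exist, for example (my own sketch; note that the generated code differs from the Linux versions: InterlockedExchange compiles into an xchg rather than a plain mov, and InterlockedExchangeAdd into a lock xadd rather than a lock add):

LONG var;

void rough_equivalents(void) {
  InterlockedExchange(&var, 17);     // more or less atomic_set()
  InterlockedExchangeAdd(&var, 5);   // more or less atomic_add()
  InterlockedExchangeAdd(&var, -5);  // more or less atomic_sub()
}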

Are memory barriers (fences) required?

Spoiler: Not in x86 kernel code, including 32 and 64 bits. Surprise. Unless you really yearn for a headache, this is the right place to stop reading this post.

The theoretical problem is that each processor core might reorder the storing or fetching of data between registers, cache and memory in any possible way to increase performance. So if one processor writes to X and then to Y, and it’s crucial that the other processor sees the change in Y only when it also sees X updated, a memory barrier is often used. In the Linux kernel, smp_wmb() and smp_rmb() are used in conjunction to ensure this.

For example, say X is some data buffer, and Y is a flag saying that the data is valid. One processor fills the buffer X and then sets Y to “valid”. The other processor reads Y first, and if it’s valid, it accesses the buffer X. But what if the other processor sees Y as valid before it sees the data in X correctly? A store memory barrier before writing to Y and a read memory barrier before reading from X prevent exactly this.
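
In Linux kernel terms, the pattern just described looks roughly like this (a sketch of the idiom only; struct my_dev and the surrounding code are made up for illustration):

#include <linux/types.h>
#include <linux/string.h>   /* memcpy() */
#include <asm/barrier.h>    /* smp_wmb(), smp_rmb() */

struct my_dev {
  char buf[64];     /* "X": the data */
  int data_valid;   /* "Y": the flag */
};

/* Runs on one processor: fill the buffer, then publish the flag */
static void publish(struct my_dev *dev, const char *data, size_t len) {
  memcpy(dev->buf, data, len);  /* writes to X */
  smp_wmb();                    /* writes to X become visible before the write to Y */
  dev->data_valid = 1;          /* write Y */
}

/* Runs on another processor: check the flag, then read the buffer */
static int consume(struct my_dev *dev, char *out, size_t len) {
  if (!dev->data_valid)         /* read Y */
    return 0;
  smp_rmb();                    /* don't read X before Y has been seen set */
  memcpy(out, dev->buf, len);   /* reads X */
  return 1;
}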

The thing is that the Linux kernel’s implementation of smp_wmb() and smp_rmb() for x86 emits no fence instruction at all; it boils down to a compiler barrier. Note that it’s only the smp_*() versions that are like this. The non-smp fences turn into opcodes that implement fences. Assuming that the Linux guys know what they’re doing (which is a pretty safe assumption in this respect), they’re telling us that the view of ordering is kept intact across processor cores. In other words, if I can assure that X is written before Y on one processor, then Intel promises me that another processor will never see Y updated without also seeing the updated X.

Looking at how Microsoft’s examples solve certain multithreading issues, it’s quite evident that they trust the processor to retain the ordering of operations.

Memory fences are hence necessary on x86 only for enforcing ordering on the processor that issues them, not for synchronizing with other processors. On other architectures (e.g. ARMv7), smp_wmb() and smp_rmb() actually do produce some code.

When are these fences really needed? From Intel’s 64 and IA-32 Architectures Software Developer’s Manual, Volume 2 (instruction set reference) vol. 2A page 4-22, on the “MFENCE” instruction:

Performs a serializing operation on all load-from-memory and store-to-memory instructions that were issued prior the MFENCE instruction. This serializing operation guarantees that every load and store instruction that precedes the MFENCE instruction in program order becomes globally visible before any load or store instruction that follows the MFENCE instruction. The MFENCE instruction is ordered with respect to all load and store instructions, other MFENCE instructions, any LFENCE and SFENCE instructions, and any serializing instructions (such as the CPUID instruction).

That didn’t answer much. I searched for fence instructions in the disassembly of a Linux kernel compiled for x86_64. A lot of fence instructions are used in the initial CPU bringup, in particular after setting CPU registers. Makes sense. Then there are several other fence instructions in drivers, which aren’t necessarily needed, but who has the guts to remove them?

Most interesting is where I didn’t find a single fence instruction: in any of the object files generated in kernel/locking/. In other words, Linux’s mutexes and spinlocks are all implemented without any fence. So this is most likely good proof that fences aren’t needed for anything but things related to the CPU state itself. I guess. For 64-bit x86, that is.

Going back to Microsoft, it’s interesting that their docs for the userspace Interlocked functions say “This function generates a full memory barrier (or fence) to ensure that memory operations are completed in order”, but the docs for the kernel-space versions don’t. Compare, for example, InterlockedOr() for applications vs. the same function for the kernel. Truth is, I didn’t bother to do the same disassembly test for application code.

Some barrier functions

(or: A collection of functions you probably don’t need, even if you think you do)

  • KeFlushWriteBuffer(): Undocumented and rarely mentioned, intended for internal kernel use. Probably just makes sure that the cache has been flushed (?).
  • KeMemoryBarrier(): Calls _KeMemoryBarrier(). But wdm.h also contains an inline implementation of this function, which calls FastFence() and LoadFence(). These are just macros for __faststorefence and _mm_lfence, looked at next.
  • _mm_lfence() : Turns into an lfence opcode. Same as rmb() in Linux.
  • _mm_sfence(): Turns into an sfence opcode. Same as wmb() in Linux.
  • _mm_mfence(): Turns into an mfence opcode.

I’ve verified that the _mm_*fence() builtins generate the said opcodes when compiled for x86 and amd64 alike. See some experiments on this matter below.

The deprecated _ReadBarrier(), _WriteBarrier() and _ReadWriteBarrier() produce no code at all (they are compiler-level barriers only). MemoryBarrier() ends up as a call to _MemoryBarrier().

Experimenting with fence instructions

(or: A major waste of time)

This is the code compiled:

__declspec(noinline) LONG do_it(LONG *p) {
  LONG x = 0;
  x += *p;
  _mm_lfence();
  x += *p;

  return x;
}

With a “checked” compilation this turns into:

_do_it@4:
  00000000: 8B FF              mov         edi,edi
  00000002: 55                 push        ebp
  00000003: 8B EC              mov         ebp,esp
  00000005: 51                 push        ecx
  00000006: C7 45 FC 00 00 00  mov         dword ptr [ebp-4],0
            00
  0000000D: 8B 45 08           mov         eax,dword ptr [ebp+8]
  00000010: 8B 4D FC           mov         ecx,dword ptr [ebp-4]
  00000013: 03 08              add         ecx,dword ptr [eax]
  00000015: 89 4D FC           mov         dword ptr [ebp-4],ecx
  00000018: 0F AE E8           lfence
  0000001B: 8B 55 08           mov         edx,dword ptr [ebp+8]
  0000001E: 8B 45 FC           mov         eax,dword ptr [ebp-4]
  00000021: 03 02              add         eax,dword ptr [edx]
  00000023: 89 45 FC           mov         dword ptr [ebp-4],eax
  00000026: 8B 45 FC           mov         eax,dword ptr [ebp-4]
  00000029: 8B E5              mov         esp,ebp
  0000002B: 5D                 pop         ebp
  0000002C: C2 04 00           ret         4

OK, this is too much. There is no optimization at all. So let’s look at the “free” compilation instead:

_do_it@4:
  00000000: 8B FF              mov         edi,edi
  00000002: 55                 push        ebp
  00000003: 8B EC              mov         ebp,esp
  00000005: 8B 45 08           mov         eax,dword ptr [ebp+8]
  00000008: 8B 08              mov         ecx,dword ptr [eax]
  0000000A: 0F AE E8           lfence
  0000000D: 8B 00              mov         eax,dword ptr [eax]
  0000000F: 03 C1              add         eax,ecx
  00000011: 5D                 pop         ebp
  00000012: C2 04 00           ret         4

So clearly, the fence intrinsic made the compiler read the value from memory twice, as opposed to optimizing the second read away. Note that there’s no volatile keyword involved, so if it weren’t for the fence, there would be no reason to read *p twice.

The exact same result is obtained with _mm_mfence().

Trying with _mm_sfence() yields an interesting result however:

_do_it@4:
  00000000: 8B FF              mov         edi,edi
  00000002: 55                 push        ebp
  00000003: 8B EC              mov         ebp,esp
  00000005: 8B 45 08           mov         eax,dword ptr [ebp+8]
  00000008: 8B 00              mov         eax,dword ptr [eax]
  0000000A: 0F AE F8           sfence
  0000000D: 03 C0              add         eax,eax
  0000000F: 5D                 pop         ebp
  00000010: C2 04 00           ret         4

*p is read into EAX once, then comes the fence, and then EAX is added to itself. As opposed to above, where *p was read into ECX before the fence, read again into EAX after it, and then the two registers were added together.

So the compiler felt free to optimize the two reads into one, because the store fence deals only with writes into memory, not reads. Given that there’s no volatile keyword used, it’s fine to optimize reads, which is exactly what it did.

The same optimization occurs if the fence command is removed completely, of course.

For the record, I’ve verified the equivalent behavior on the amd64 target (I’ll spare you the assembly code).
