Microsoft Windows: Atomic ops and memory barriers
Introduction
This post examines what Microsoft’s compiler does in response to a variety of special functions that implement atomic operations and memory barriers. If you program like a civilized human being, that is, with spinlocks and mutexes, most of this is stuff you should never need to care about.
I’ve written a similar post regarding Linux, and it’s recommended reading before this one (even though it repeats some of the stuff here).
To make a long story short:
- The Interlocked-something functions do not just guarantee atomicity, but also function as a memory barrier to the compiler
- Memory fences are unnecessary for the sake of synchronizing between processors
- The conclusion is hence that those Interlocked functions also work as multiprocessor memory barriers (a small sketch follows below).
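To make this concrete, here’s a minimal sketch of handing data from one processor to another without any explicit fence. The variable names and the fill_buffer()/use_buffer() helpers are made up for illustration; the point is that the Interlocked call on the flag is both the atomic access and the barrier that keeps the payload accesses from drifting past it.

LONG shared_data[16];         /* the payload, written by the producer */
LONG volatile data_ready;     /* the flag, accessed only with Interlocked functions */

void producer(void)
{
    fill_buffer(shared_data);              /* hypothetical helper that writes the payload */
    InterlockedExchange(&data_ready, 1);   /* atomic store + compiler barrier; no explicit fence */
}

void consumer(void)
{
    if (InterlockedExchangeAdd(&data_ready, 0))    /* atomic read of the flag */
        use_buffer(shared_data);                   /* hypothetical helper that reads the payload */
}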
Trying it out
The following code:
LONG atomic_thingy;

__declspec(noinline) LONG do_it(LONG *p)
{
    LONG x = 0;
    LONG y;

    x += *p;
    y = InterlockedExchangeAdd(&atomic_thingy, 0x1234);
    x += *p;

    return x + y;
}
When compiling this as “free” (i.e. optimized), this yields:
_do_it@4:
  00000000: 8B FF              mov         edi,edi
  00000002: 55                 push        ebp
  00000003: 8B EC              mov         ebp,esp
  00000005: 8B 45 08           mov         eax,dword ptr [ebp+8]
  00000008: 8B 10              mov         edx,dword ptr [eax]
  0000000A: 56                 push        esi
  0000000B: B9 34 12 00 00     mov         ecx,1234h
  00000010: BE 00 00 00 00     mov         esi,offset _atomic_thingy
  00000015: F0 0F C1 0E        lock xadd   dword ptr [esi],ecx
  00000019: 8B 00              mov         eax,dword ptr [eax]
  0000001B: 03 C1              add         eax,ecx
  0000001D: 03 C2              add         eax,edx
  0000001F: 5E                 pop         esi
  00000020: 5D                 pop         ebp
  00000021: C2 04 00           ret         4
The first thing to note is that InterlockedExchangeAdd() has been translated into a “LOCK XADD”, which stores the sum into memory and leaves the previous value in ECX. That previous value is @y in the C code: InterlockedExchangeAdd() returns the value the variable had before the addition.
XADD by itself isn’t atomic, which is why the LOCK prefix is added. More about LOCK below.
What is crucially important to note is that putting InterlockedExchangeAdd() between the two reads of *p prevents the compiler from merging them into a single read. @p isn’t declared volatile, and yet it’s read twice (mov from dword ptr [eax]).
Another variant, now trying InterlockedOr():
LONG atomic;

__declspec(noinline) LONG do_it(LONG *p)
{
    LONG x = 0;
    LONG y;

    x += *p;
    y = InterlockedOr(&atomic, 0x1234);
    x += *p;

    return x + y;
}
Once again, compiled as “free”, turns into this:
_do_it@4:
  00000000: 8B FF              mov         edi,edi
  00000002: 55                 push        ebp
  00000003: 8B EC              mov         ebp,esp
  00000005: 8B 4D 08           mov         ecx,dword ptr [ebp+8]
  00000008: 8B 11              mov         edx,dword ptr [ecx]
  0000000A: 53                 push        ebx
  0000000B: 56                 push        esi
  0000000C: 57                 push        edi
  0000000D: BE 34 12 00 00     mov         esi,1234h
  00000012: BF 00 00 00 00     mov         edi,offset _atomic
  00000017: 8B 07              mov         eax,dword ptr [edi]
  00000019: 8B D8              mov         ebx,eax
  0000001B: 0B DE              or          ebx,esi
  0000001D: F0 0F B1 1F        lock cmpxchg dword ptr [edi],ebx
  00000021: 75 F6              jne         00000019
  00000023: 8B F0              mov         esi,eax
  00000025: 8B 01              mov         eax,dword ptr [ecx]
  00000027: 5F                 pop         edi
  00000028: 03 C6              add         eax,esi
  0000002A: 5E                 pop         esi
  0000002B: 03 C2              add         eax,edx
  0000002D: 5B                 pop         ebx
  0000002E: 5D                 pop         ebp
  0000002F: C2 04 00           ret         4
This is quite amazing: The OR isn’t implemented with a single locked OR instruction, but rather goes like this: The previous value of @atomic is fetched into EAX and copied to EBX. EBX is ORed with the constant 0x1234, and then the cmpxchg instruction writes the result (in EBX) into @atomic, but only if its current value still equals EAX. If not, the current value is loaded into EAX instead, and the JNE loops back to try again. The reason for this dance is presumably that InterlockedOr() must return the previous value, which a plain “lock or” wouldn’t supply.
I should mention that cmpxchg always compares against EAX, and loads the value it found in memory into EAX if the comparison fails, even though this register isn’t mentioned explicitly in the instruction; cmpxchg is simply hardwired to work with EAX. EBX holds the new value to be written, and that’s why it appears explicitly in the instruction. Confusing, yes.
Also here, *p is read twice.
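In C terms, the generated loop is equivalent to something like the following sketch (my own illustration, not Microsoft’s source; the function name is made up), built around InterlockedCompareExchange(), which is what the lock cmpxchg implements:

LONG my_interlocked_or(LONG volatile *target, LONG value)
{
    LONG old, prev;

    do {
        old = *target;                                           /* snapshot the current value (EAX) */
        prev = InterlockedCompareExchange(target, old | value, old);
    } while (prev != old);                                       /* someone changed it meanwhile; retry */

    return old;                                                  /* like InterlockedOr(): the previous value */
}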
It’s also worth noting that using InterlockedOr() with the value 0, a common way to grab the current value, yields basically the same code (only the constant is generated with a “xor esi,esi” instead).
So if you want to use an Interlocked function just to read from a variable, InterlockedExchangeAdd() with zero is probably better, because it doesn’t create all that loop code around it.
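For example, a read-only access could simply be this one-liner (a sketch under that assumption):

LONG current = InterlockedExchangeAdd(&atomic_thingy, 0);   /* atomically fetch the current value */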
Another function I’d like to look at is InterlockedExchange(), as it’s used a lot. Spoiler: No surprises are expected.
LONG atomic_thingy;

__declspec(noinline) LONG do_it(LONG *p)
{
    LONG x = 0;
    LONG y;

    x += *p;
    y = InterlockedExchange(&atomic_thingy, 0);
    x += *p;

    return x + y;
}
and this is what it compiles into:
_do_it@4:
  00000000: 8B FF              mov         edi,edi
  00000002: 55                 push        ebp
  00000003: 8B EC              mov         ebp,esp
  00000005: 8B 45 08           mov         eax,dword ptr [ebp+8]
  00000008: 8B 10              mov         edx,dword ptr [eax]
  0000000A: 56                 push        esi
  0000000B: 33 C9              xor         ecx,ecx
  0000000D: BE 00 00 00 00     mov         esi,offset _atomic_thingy
  00000012: 87 0E              xchg        ecx,dword ptr [esi]
  00000014: 8B 00              mov         eax,dword ptr [eax]
  00000016: 03 C1              add         eax,ecx
  00000018: 03 C2              add         eax,edx
  0000001A: 5E                 pop         esi
  0000001B: 5D                 pop         ebp
  0000001C: C2 04 00           ret         4
And finally, what about writing twice to the same place?
LONG atomic_thingy;

__declspec(noinline) LONG do_it(LONG *p)
{
    LONG y;

    *p = 0;
    y = InterlockedExchangeAdd(&atomic_thingy, 0);
    *p = 0;

    return y;
}
Writing the same constant value twice to a non-volatile variable. This calls for an optimization. But the Interlocked function prevents it, as expected:
_do_it@4:
  00000000: 8B FF              mov         edi,edi
  00000002: 55                 push        ebp
  00000003: 8B EC              mov         ebp,esp
  00000005: 8B 4D 08           mov         ecx,dword ptr [ebp+8]
  00000008: 83 21 00           and         dword ptr [ecx],0
  0000000B: 33 C0              xor         eax,eax
  0000000D: BA 00 00 00 00     mov         edx,offset _atomic_thingy
  00000012: F0 0F C1 02        lock xadd   dword ptr [edx],eax
  00000016: 83 21 00           and         dword ptr [ecx],0
  00000019: 5D                 pop         ebp
  0000001A: C2 04 00           ret         4
Writing a zero is implemented by ANDing with zero, which is a bit confusing, but it’s clearly done twice, before and after the “lock xadd”. Needless to say, without the Interlocked call in the middle these two writes are fused into one by the compiler (I’ve verified this with disassembly, 32 and 64 bit alike).
Volatile?
Microsoft’s definition of the InterlockedExchangeAdd() function (and all other similar ones) declares the first operand as a pointer to a LONG volatile. Why volatile? Does the variable really need to be?
The answer is no, provided that all accesses to the variable are made with Interlocked-something functions. The compiler won’t optimize the call itself in any way, and the call also acts as a compiler memory barrier.
And it’s a good habit to stick to these functions, precisely because of this implicit compiler memory barrier: that’s usually what we want and need, even if we’re not fully aware of it. Accessing a shared variable almost always has a “before” and an “after” in our thinking. The volatile keyword, by contrast, doesn’t prevent the compiler (let alone the processor) from reordering the surrounding accesses.
But if the variable is accessed without these functions in some places, the volatile keyword should definitely be used when defining that variable.
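Here’s a minimal sketch of that mixed case (the names and the do_some_work() helper are made up): the flag is set with an Interlocked function, but polled with plain reads, so volatile is what keeps the compiler from caching it in a register.

LONG volatile stop_flag;

void request_stop(void)
{
    InterlockedExchange(&stop_flag, 1);   /* atomic write, plus a compiler barrier */
}

void worker_loop(void)
{
    while (!stop_flag)                    /* plain read; volatile forces a fresh load each time */
        do_some_work();                   /* hypothetical helper */
}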
More about LOCK
LOCK is a prefix that is added to an instruction in order to ensure that the instruction is performed atomically. In some cases it’s superfluous and effectively ignored, because the operation is atomic anyhow (XCHG with a memory operand, for example, is always locked, which is why no LOCK prefix appears in the InterlockedExchange() listing above).
From Intel’s 64 and IA-32 Architectures Software Developer’s Manual, Volume 2 (instruction set reference) vol. 2A page 3-537, on the “LOCK” prefix for instructions:
Causes the processor’s LOCK# signal to be asserted during execution of the accompanying instruction (turns the instruction into an atomic instruction). In a multiprocessor environment, the LOCK# signal ensures that the processor has exclusive use of any shared memory while the signal is asserted.
The manual elaborates further on the LOCK prefix, but says nothing about memory barriers / fences. Fencing is implemented by separate instructions: MFENCE, SFENCE and LFENCE.
The LOCK prefix is encoded as a 0xF0 byte preceding the instruction in the binary code, by the way.
Linux counterparts
For x86 only, of course (a small usage snippet follows the list):
- atomic_set() turns into a plain “mov”
- atomic_add() turns into “lock add”
- atomic_sub() turns into “lock sub”
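Just to put the list in context, here’s a minimal Linux-side sketch (the variable and function names are made up for illustration):

#include <linux/atomic.h>

static atomic_t counter;

static void example(void)
{
    atomic_set(&counter, 0);    /* plain mov on x86 */
    atomic_add(5, &counter);    /* lock add */
    atomic_sub(2, &counter);    /* lock sub */
}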
Are memory barriers (fences) required?
Spoiler: Not in x86 kernel code, including 32 and 64 bits. Surprise. Unless you really yearn for a headache, this is the right place to stop reading this post.
The theoretical problem is that each processor core may reorder the storing and fetching of data between registers, cache and memory in any possible way to increase performance. So if one processor writes to X and then to Y, and it’s crucial that another processor sees the change in Y only after it also sees X updated, a memory barrier is often used. In the Linux kernel, smp_wmb() and smp_rmb() are used in conjunction to ensure this.
For example, say X is a data buffer, and Y is a flag saying that the data is valid. One processor fills the buffer X and then sets Y to “valid”. The other processor reads Y first, and if it’s valid, it accesses the buffer X. But what if the other processor sees Y as valid before it sees the correct data in X? A store memory barrier before writing to Y, and a read memory barrier before reading from X, ensure that this can’t happen.
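In Linux kernel terms, that example would look roughly like the following sketch (buf, valid and the two helpers are made up; READ_ONCE()/WRITE_ONCE() merely keep the compiler from playing games with the flag accesses):

int buf[16];
int valid;

/* Runs on one processor */
void publish(void)
{
    fill_buffer(buf);          /* hypothetical helper: write X */
    smp_wmb();                 /* X must be visible before Y */
    WRITE_ONCE(valid, 1);      /* write Y */
}

/* Runs on another processor */
void consume(void)
{
    if (READ_ONCE(valid)) {    /* read Y */
        smp_rmb();             /* don't read X before Y was seen set */
        use_buffer(buf);       /* hypothetical helper: read X */
    }
}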
The thing is that the Linux kernel’s implementation of smp_wmb() and smp_rmb() for x86 is a NOP. Note that it’s only the smp_*() versions that are NOPs; the non-SMP fences turn into opcodes that implement actual fences. Assuming that the Linux guys know what they’re doing (which is a pretty safe assumption in this respect), they’re telling us that the view of ordering is kept intact across processor cores. In other words, if I can ensure that X is written before Y on one processor, then Intel promises me that another processor won’t see Y updated before it sees the updated value of X.
Looking at how Microsoft’s examples solve certain multithreading issues, it’s quite evident that they trust the processor to retain the ordering of operations.
On x86, memory fences are hence necessary only for enforcing ordering on the processor that executes them, not for synchronization between processors. On other architectures (e.g. ARMv7), smp_wmb() and smp_rmb() do produce actual code.
When are these fences really needed? From Intel’s 64 and IA-32 Architectures Software Developer’s Manual, Volume 2 (instruction set reference) vol. 2A page 4-22, on the “MFENCE” instruction:
Performs a serializing operation on all load-from-memory and store-to-memory instructions that were issued prior the MFENCE instruction. This serializing operation guarantees that every load and store instruction that precedes the MFENCE instruction in program order becomes globally visible before any load or store instruction that follows the MFENCE instruction. The MFENCE instruction is ordered with respect to all load and store instructions, other MFENCE instructions, any LFENCE and SFENCE instructions, and any serializing instructions (such as the CPUID instruction).
That didn’t answer much. I searched for fence instructions in the disassembly of a Linux kernel compiled for x86_64. A lot of fence instructions are used in the initial CPU bringup, in particular after setting CPU registers. Makes sense. Then there are several other fence instructions in drivers, which aren’t necessarily needed, but who has the guts to remove them?
Most interesting is where I didn’t find a single fence instruction: In any of the object files generated in kernel/locking/. In other words, Linux’ mutexes and spinlocks are all implemented without any fence. So this is most likely a good proof that they aren’t needed for anything else but things related to the CPU state itself. I guess. For a 64-bit x86, that is.
Going back to Microsoft, it’s interesting that their docs for user-space Interlocked functions say “This function generates a full memory barrier (or fence) to ensure that memory operations are completed in order”, but the kernel-space docs don’t. Compare, for example, InterlockedOr() for applications vs. the same function for the kernel. Truth is, I didn’t bother to repeat the disassembly test for application code.
Some barrier functions
(or: A collection of functions you probably don’t need, even if you think you do)
- KeFlushWriteBuffer(): Undocumented and rarely mentioned, intended for internal kernel use. Probably just makes sure that the cache has been flushed (?).
- KeMemoryBarrier(): Calls _KeMemoryBarrier(). But wdm.h also carries an inline implementation of this function, which calls FastFence() and LoadFence(); these are just macros for __faststorefence and _mm_lfence, looked at next.
- _mm_lfence() : Turns into an lfence opcode. Same as rmb() in Linux.
- _mm_sfence(): Turns into an sfence opcode. Same as wmb() in Linux.
- _mm_mfence(): Turns into an mfence opcode.
I’ve verified that the _mm_*fence() builtins generated the said opcodes when compiled for x86 and amd64 alike. See some experiments on this matter below.
The deprecated _ReadBarrier(), _WriteBarrier() and _ReadWriteBarrier() produce no code at all. MemoryBarrier() ends up as a call to _MemoryBarrier().
Experimenting with fence instructions
(or: A major waste of time)
This is the code compiled:
__declspec(noinline) LONG do_it(LONG *p)
{
    LONG x = 0;

    x += *p;
    _mm_lfence();
    x += *p;

    return x;
}
With a “checked” compilation (i.e. unoptimized), this turns into:
_do_it@4:
  00000000: 8B FF                 mov         edi,edi
  00000002: 55                    push        ebp
  00000003: 8B EC                 mov         ebp,esp
  00000005: 51                    push        ecx
  00000006: C7 45 FC 00 00 00 00  mov         dword ptr [ebp-4],0
  0000000D: 8B 45 08              mov         eax,dword ptr [ebp+8]
  00000010: 8B 4D FC              mov         ecx,dword ptr [ebp-4]
  00000013: 03 08                 add         ecx,dword ptr [eax]
  00000015: 89 4D FC              mov         dword ptr [ebp-4],ecx
  00000018: 0F AE E8              lfence
  0000001B: 8B 55 08              mov         edx,dword ptr [ebp+8]
  0000001E: 8B 45 FC              mov         eax,dword ptr [ebp-4]
  00000021: 03 02                 add         eax,dword ptr [edx]
  00000023: 89 45 FC              mov         dword ptr [ebp-4],eax
  00000026: 8B 45 FC              mov         eax,dword ptr [ebp-4]
  00000029: 8B E5                 mov         esp,ebp
  0000002B: 5D                    pop         ebp
  0000002C: C2 04 00              ret         4
OK, this is too much. There is no optimization at all. So let’s look at the “free” compilation instead:
_do_it@4:
  00000000: 8B FF              mov         edi,edi
  00000002: 55                 push        ebp
  00000003: 8B EC              mov         ebp,esp
  00000005: 8B 45 08           mov         eax,dword ptr [ebp+8]
  00000008: 8B 08              mov         ecx,dword ptr [eax]
  0000000A: 0F AE E8           lfence
  0000000D: 8B 00              mov         eax,dword ptr [eax]
  0000000F: 03 C1              add         eax,ecx
  00000011: 5D                 pop         ebp
  00000012: C2 04 00           ret         4
So clearly, the fence instruction made the compiler read the value from memory twice, instead of optimizing the second read away. Note that there’s no volatile keyword involved, so apart from the fence, there’s nothing to force reading *p twice.
The exact same result is obtained with _mm_mfence().
Trying with _mm_sfence() yields an interesting result however:
_do_it@4:
  00000000: 8B FF              mov         edi,edi
  00000002: 55                 push        ebp
  00000003: 8B EC              mov         ebp,esp
  00000005: 8B 45 08           mov         eax,dword ptr [ebp+8]
  00000008: 8B 00              mov         eax,dword ptr [eax]
  0000000A: 0F AE F8           sfence
  0000000D: 03 C0              add         eax,eax
  0000000F: 5D                 pop         ebp
  00000010: C2 04 00           ret         4
*p is read into eax once, then comes the fence, and then eax is added to itself. As opposed to above, where *p was read into ecx before the fence, read again into eax after it, and the two were then added together.
So the compiler felt free to optimize the two reads into one, because the store fence deals only with writes into memory, not reads. Given that there’s no volatile keyword used, it’s fine to optimize reads, which is exactly what it did.
The same optimization occurs if the fence command is removed completely, of course.
For the record, I’ve verified the equivalent behavior on the amd64 target (I’ll spare you the assembly code).