How I Found a Subtle Bug Hiding in a CMU 15-418/618 Lecture Slide


Lock-free programming has this magical aura around it. If you’ve ever dipped into the topic, you’ve probably seen those neat little Compare-And-Swap (CAS) loops that seem to solve everything.

I found a bug in a CAS loop that had been sitting quietly in the lecture slides of CMU’s 15-418/618 Parallel Computer Architecture and Programming course for years.


The “Simple” Example

Here’s what the example on the lecture slide looked like:

// atomic compare and swap
// (a hardware primitive: the body below describes its semantics,
// and the whole operation executes as a single atomic step)
int atomicCAS(int* addr, int compare, int val) {
    int old = *addr;
    *addr = (old == compare) ? val : old;
    return old;
}

// build atomic max using CAS
void atomic_max(int* addr, int x) {
    int old = *addr;
    int new = max(old, x);
    while (atomicCAS(addr, old, new) != old) {
        old = *addr;
        new = max(old, x);
    }
}

The idea is:

  • atomicCAS() is a hardware primitive that atomically checks if *addr still equals compare.
  • If it does, it replaces it with val.
  • If not, someone else modified it — so you retry.

atomic_max() uses this to atomically set a memory location to the max of its current value and x.


The Bug

When I first looked at the code during lecture, I thought it was fine. Then, after staring at it for two minutes, I pointed out that it could get stuck in an infinite loop (pure intuition at that point, without any detailed tracing), and the professor acknowledged that it was indeed buggy. So I dug a bit deeper after class:

Something is off about this line inside the while loop in atomic_max():

old = *addr;

It looks innocent, right? Just reload the current value before retrying.

But here’s the subtlety: that load isn’t atomic. It’s just a plain read from memory that could be changing right now because another thread is also in the middle of its own CAS operation.

So by the time you compute new = max(old, x) again, your old might already be stale. Worse, a plain read racing with another thread’s atomic write is a data race, which means the program no longer has any guaranteed behavior at all.
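To see the asymmetry at a glance, here’s the retry loop again with each memory access labeled (annotations mine):

while (atomicCAS(addr, old, new) != old) {  // atomic read-modify-write
    old = *addr;                            // plain, unsynchronized read
    new = max(old, x);                      // local computation only
}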


What Really Goes Wrong (Illustrated)

Let’s see why this matters with a concrete interleaving example.

Imagine two threads running atomic_max(&v, x) on the same variable v:

v = 10 initially
Thread A calls atomic_max(&v, 15)
Thread B calls atomic_max(&v, 12)

Here’s one possible execution:

| Step | Thread A | Thread B | *v |
|------|----------|----------|----|
| 1 | Reads old = 10 | Reads old = 10 | 10 |
| 2 | Computes new = max(10, 15) = 15 | Computes new = max(10, 12) = 12 | 10 |
| 3 | Executes atomicCAS(&v, 10, 15) → succeeds | | 15 |
| 4 | | Executes atomicCAS(&v, 10, 12) → fails (returns 15) | 15 |
| 5 | | Retries: old = *addr (a plain, non-atomic read that might see a stale value) | ? |
| 6 | | If old is stale (e.g., still 10), computes new = 12 again | ? |
| 7 | | Executes atomicCAS(&v, 10, 12); the comparison fails, and it loops again | ? |
| 8 | | On some architectures, if memory reads are reordered or cached, B might never see the updated 15, causing an infinite loop | |

This is a classic data race: one thread’s plain, unsynchronized read of *v runs concurrently with another thread’s atomic write. On x86, which has strong memory ordering, this appears to work. But under weaker memory models (e.g., ARM), and under the C/C++ memory model, where any data race is undefined behavior, this can cause non-termination, lost updates, or worse.
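If you want to see the race flagged mechanically, here’s a minimal sketch of my own (not from the slides) that re-creates the buggy pattern in standard C++ so ThreadSanitizer can catch it. It assumes a C++20 compiler, uses std::atomic_ref to stand in for the hardware atomicCAS, and builds with something like g++ -std=c++20 -fsanitize=thread -pthread:

#include <algorithm>
#include <atomic>
#include <cstdio>
#include <thread>

// Stand-in for the hardware primitive: an atomic CAS on a plain int
// (std::atomic_ref requires C++20).
int atomicCAS(int* addr, int compare, int val) {
    std::atomic_ref<int> a(*addr);
    a.compare_exchange_strong(compare, val);
    return compare;  // either way, 'compare' ends up holding the old value
}

// The buggy loop from the slide, racy reload intact
// ('new' renamed to new_val, since 'new' is a C++ keyword).
void buggy_atomic_max(int* addr, int x) {
    int old = *addr;
    int new_val = std::max(old, x);
    while (atomicCAS(addr, old, new_val) != old) {
        old = *addr;                 // plain read racing with the other thread's CAS
        new_val = std::max(old, x);
    }
}

int main() {
    int v = 10;
    std::thread a(buggy_atomic_max, &v, 15);
    std::thread b(buggy_atomic_max, &v, 12);
    a.join();
    b.join();
    std::printf("v = %d\n", v);      // ThreadSanitizer should report a data race on v
}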


The Correct Fix

The fix turns out to be tiny — just change which value you use after a failed CAS. Instead of reloading from memory, you reuse the return value of atomicCAS(), which always tells you the actual value that was in memory during your attempt.

Here’s the correct version:

void atomic_max(int* addr, int x) {
    int old = *addr;    // initial guess; a stale value here is harmless (see below)
    int assumed;
    do {
        assumed = old;                             // the value we believe is in memory
        int new_val = max(assumed, x);
        old = atomicCAS(addr, assumed, new_val);   // returns what was actually there
    } while (old != assumed);                      // if it differs, retry with the real value
}

That’s it, and yet it completely changes the semantics.

Now every iteration is based on a consistent snapshot — the exact value seen by the hardware during your last atomic attempt.

You might notice that the initial old = *addr isn’t an atomic load. That’s fine in this CAS loop because any stale value will be corrected on the first failed atomicCAS, which always returns the true current value.
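To make that concrete, here’s a hypothetical trace (values invented for illustration, viewed at the hardware level) of a run where that initial plain load comes back stale:

// Suppose *addr actually holds 20, x is 12, and the initial
// plain load returns a stale 7.
old = 7                            // stale initial read
assumed = 7;  new_val = max(7, 12) = 12
old = atomicCAS(addr, 7, 12)       // *addr is 20, so the CAS fails and returns 20
// loop check: old (20) != assumed (7) → retry
assumed = 20; new_val = max(20, 12) = 20
old = atomicCAS(addr, 20, 20)      // succeeds, returns 20
// loop check: old == assumed → done; *addr = 20, as it should be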


Illustrative Example (Fixed Version)

Let’s revisit the same scenario. Here’s what happens under the corrected version:

| Step | Thread A | Thread B | *v | What’s Happening |
|------|----------|----------|----|------------------|
| 1 | Loads old = 10, computes new_val = 15 | Loads old = 10, computes new_val = 12 | 10 | Both start with the same initial value |
| 2 | atomicCAS(addr, 10, 15) → succeeds, returns 10 | | 15 | A wins the race and updates the shared value |
| 3 | Loop check: old == assumed → true → A exits | | 15 | A completes successfully |
| 4 | | atomicCAS(addr, 10, 12) → fails, returns 15 | 15 | B’s CAS fails since *v is now 15 |
| 5 | | Loop sees old != assumed, retries with old = 15 | 15 | B continues with the latest value seen |
| 6 | | Computes new_val = max(15, 12) = 15, then atomicCAS(addr, 15, 15) → returns 15 | 15 | B “updates” the same value, effectively a no-op |
| 7 | | Loop check: old == assumed → true → B exits | 15 | B completes, no infinite loop |

Final state: *v = 15
Both threads complete successfully. A’s update “wins,” and B observes the updated value through the return value of atomicCAS() on its next iteration.


The Correct Fix (Standard C++ version)

To write this piece of code in standard C++ using std::atomic, we replace atomicCAS with the built-in compare-and-exchange operation, which provides the same atomic update behavior under the C++ memory model.

One caveat is that the initial load must now be atomic — hence the use of addr.load(std::memory_order_relaxed) instead of a plain *addr.

#include <atomic>
#include <algorithm>  // for std::max

void atomic_max(std::atomic<int>& addr, int val) {
    int expected = addr.load(std::memory_order_relaxed);
    int new_val;
    do {
        new_val = std::max(expected, val);
    } while (!addr.compare_exchange_weak(
        expected, new_val,
        std::memory_order_release,  // on success: publish the update
        std::memory_order_relaxed   // on failure: just retry, no ordering needed
    ));
}

A few details worth noting:

  • compare_exchange_weak may spuriously fail, so always use it in a loop.
  • std::memory_order_release ensures the successful write is visible to other threads that later acquire the same atomic.
  • std::memory_order_relaxed on failure avoids unnecessary synchronization overhead for retries.
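As a quick sanity check, here’s a small driver of my own (not from the lecture) that pairs with the atomic_max definition above; whatever the interleaving, the final value should be the overall maximum:

#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

// atomic_max(std::atomic<int>&, int) as defined above

int main() {
    std::atomic<int> v{10};
    std::vector<std::thread> workers;
    for (int x : {15, 12, 3, 42, 7})
        workers.emplace_back([&v, x] { atomic_max(v, x); });
    for (auto& t : workers) t.join();
    std::printf("v = %d\n", v.load());  // always prints v = 42
}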

Why Lock-free Programming Is So Hard

Catching that bug was one of those “oh wow” moments that really stuck with me.

The fact that this buggy code persisted in a CMU slide deck for years is a humbling reminder that lock-free programming is not just tricky; it’s deceptively tricky.

Because even when you know what atomicCAS() does, even when you’ve seen CAS loops before, your brain glosses over that line. It just feels right.

It takes a weird kind of obsessive rereading to realize that the problem isn’t in the logic — it’s in the assumptions about atomicity.

Concurrency is so hard because our brains aren’t wired to think in interleavings, caches, and atomic snapshots; we’re wired to think sequentially. We have to reason about what the hardware is doing at the level of cache coherence, memory ordering, and atomic visibility — all at once.