How I Found a Subtle Bug Hiding in a CMU 15-418/618 Lecture Slide


Lock-free programming has this magical aura around it. If you’ve ever dipped into the topic, you’ve probably seen those neat little Compare-And-Swap (CAS) loops that seem to solve everything.

I found a bug in a CAS loop that had been sitting quietly in the lecture slides of CMU’s 15-418/618 Parallel Computer Architecture and Programming course for years.


The “Simple” Example

Here’s what the example on the lecture slide looked like:

// atomic compare and swap
// (a hardware primitive: the body below describes its semantics,
// and the whole operation executes as a single atomic step)
int atomicCAS(int* addr, int compare, int val) {
    int old = *addr;
    *addr = (old == compare) ? val : old;
    return old;
}

// build atomic max using CAS
void atomic_max(int* addr, int x) {
    int old = *addr;
    int new = max(old, x);
    while (atomicCAS(addr, old, new) != old) {
        old = *addr;
        new = max(old, x);
    }
}

The idea is:

  • atomicCAS() is a hardware primitive that atomically checks if *addr still equals compare.
  • If it does, it replaces it with val.
  • If not, someone else modified it — so you retry.

atomic_max() uses this to atomically set a memory location to the max of its current value and x.


The Bug

When I first looked at the code during lecture, I thought it was fine. Then, after staring at it for two minutes, I pointed out that it could get stuck in an infinite loop (pure intuition at that point, without any detailed tracing), and the professor acknowledged that it was indeed buggy. So I dug a bit deeper after class:

Something is off about this line inside the while loop in atomic_max():

old = *addr;

It looks innocent, right? Just reload the current value before retrying.

But here’s the subtlety: that load isn’t atomic. It’s just a plain read from memory that could be changing right now because another thread is also in the middle of its own CAS operation.

So by the time you compute new = max(old, x) again, your old might already be stale. Worse, a plain read racing with another thread’s atomic write is a data race, which means the program no longer has any guaranteed behavior at all.
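To see the asymmetry at a glance, here’s the retry loop again with each memory access labeled (annotations mine):

while (atomicCAS(addr, old, new) != old) {  // atomic read-modify-write
    old = *addr;                            // plain, unsynchronized read
    new = max(old, x);                      // local computation only
}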


What Really Goes Wrong (Illustrated)

Let’s see why this matters with a concrete interleaving example.

Imagine two threads running atomic_max(&v, x) on the same variable v:

v = 10 initially
Thread A calls atomic_max(&v, 15)
Thread B calls atomic_max(&v, 12)

Here’s one possible execution:

| Step | Thread A | Thread B | *v |
|------|----------|----------|----|
| 1 | Reads old = 10 | Reads old = 10 | 10 |
| 2 | Computes new = max(10, 15) = 15 | Computes new = max(10, 12) = 12 | 10 |
| 3 | Executes atomicCAS(&v, 10, 15) → succeeds | | 15 |
| 4 | | Executes atomicCAS(&v, 10, 12) → fails (returns 15) | 15 |
| 5 | | Retries: old = *addr (a plain, non-atomic read that might see a stale value) | ? |
| 6 | | If old is stale (e.g., still 10), computes new = 12 again | ? |
| 7 | | Executes atomicCAS(&v, 10, 12); the comparison fails, and it loops again | ? |
| 8 | | On some architectures, if memory reads are reordered or cached, B might never see the updated 15, causing an infinite loop | |

This is a classic data race: one thread’s plain, unsynchronized read of *v runs concurrently with another thread’s atomic write. On x86, which has strong memory ordering, this appears to work. But under weaker memory models (e.g., ARM), and under the C/C++ memory model, where any data race is undefined behavior, this can cause non-termination, lost updates, or worse.
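If you want to see the race flagged mechanically, here’s a minimal sketch of my own (not from the slides) that re-creates the buggy pattern in standard C++ so ThreadSanitizer can catch it. It assumes a C++20 compiler, uses std::atomic_ref to stand in for the hardware atomicCAS, and builds with something like g++ -std=c++20 -fsanitize=thread -pthread:

#include <algorithm>
#include <atomic>
#include <cstdio>
#include <thread>

// Stand-in for the hardware primitive: an atomic CAS on a plain int
// (std::atomic_ref requires C++20).
int atomicCAS(int* addr, int compare, int val) {
    std::atomic_ref<int> a(*addr);
    a.compare_exchange_strong(compare, val);
    return compare;  // either way, 'compare' ends up holding the old value
}

// The buggy loop from the slide, racy reload intact
// ('new' renamed to new_val, since 'new' is a C++ keyword).
void buggy_atomic_max(int* addr, int x) {
    int old = *addr;
    int new_val = std::max(old, x);
    while (atomicCAS(addr, old, new_val) != old) {
        old = *addr;                 // plain read racing with the other thread's CAS
        new_val = std::max(old, x);
    }
}

int main() {
    int v = 10;
    std::thread a(buggy_atomic_max, &v, 15);
    std::thread b(buggy_atomic_max, &v, 12);
    a.join();
    b.join();
    std::printf("v = %d\n", v);      // ThreadSanitizer should report a data race on v
}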


The Correct Fix

The fix turns out to be tiny — just change which value you use after a failed CAS. Instead of reloading from memory, you reuse the return value of atomicCAS(), which always tells you the actual value that was in memory during your attempt.

Here’s the correct version:

void atomic_max(int* addr, int x) {
    int old = *addr;    // initial guess; a stale value here is harmless (see below)
    int assumed;
    do {
        assumed = old;                             // the value we believe is in memory
        int new_val = max(assumed, x);
        old = atomicCAS(addr, assumed, new_val);   // returns what was actually there
    } while (old != assumed);                      // if it differs, retry with the real value
}

That’s it, and yet it completely changes the semantics.

Now every iteration is based on a consistent snapshot — the exact value seen by the hardware during your last atomic attempt.

You might notice that the initial old = *addr isn’t an atomic load. That’s fine in this CAS loop because any stale value will be corrected on the first failed atomicCAS, which always returns the true current value.
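To make that concrete, here’s a hypothetical trace (values invented for illustration, viewed at the hardware level) of a run where that initial plain load comes back stale:

// Suppose *addr actually holds 20, x is 12, and the initial
// plain load returns a stale 7.
old = 7                            // stale initial read
assumed = 7;  new_val = max(7, 12) = 12
old = atomicCAS(addr, 7, 12)       // *addr is 20, so the CAS fails and returns 20
// loop check: old (20) != assumed (7) → retry
assumed = 20; new_val = max(20, 12) = 20
old = atomicCAS(addr, 20, 20)      // succeeds, returns 20
// loop check: old == assumed → done; *addr = 20, as it should be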


Illustrative Example (Fixed Version)

Let’s revisit the same scenario. Here’s what happens under the corrected version:

| Step | Thread A | Thread B | *v | What’s Happening |
|------|----------|----------|----|------------------|
| 1 | Loads old = 10, computes new_val = 15 | Loads old = 10, computes new_val = 12 | 10 | Both start with the same initial value |
| 2 | atomicCAS(addr, 10, 15) → succeeds, returns 10 | | 15 | A wins the race and updates the shared value |
| 3 | Loop check: old == assumed → true → A exits | | 15 | A completes successfully |
| 4 | | atomicCAS(addr, 10, 12) → fails, returns 15 | 15 | B’s CAS fails since *v is now 15 |
| 5 | | Loop sees old != assumed, retries with old = 15 | 15 | B continues with the latest value seen |
| 6 | | Computes new_val = max(15, 12) = 15, then atomicCAS(addr, 15, 15) → returns 15 | 15 | B “updates” the same value, effectively a no-op |
| 7 | | Loop check: old == assumed → true → B exits | 15 | B completes, no infinite loop |

Final state: *v = 15
Both threads complete successfully. A’s update “wins,” and B observes the updated value through the return value of atomicCAS() on its next iteration.


The Correct Fix (Standard C++ version)

To write this piece of code in standard C++ using std::atomic, we replace atomicCAS with the built-in compare-and-exchange operation, which provides the same atomic update behavior under the C++ memory model.

One caveat is that the initial load must now be atomic — hence the use of addr.load(std::memory_order_relaxed) instead of a plain *addr.

#include <atomic>
#include <algorithm>  // for std::max

void atomic_max(std::atomic<int>& addr, int val) {
    int expected = addr.load(std::memory_order_relaxed);
    int new_val;
    do {
        new_val = std::max(expected, val);
    } while (!addr.compare_exchange_weak(
        expected, new_val,
        std::memory_order_release,  // on success: publish the update
        std::memory_order_relaxed   // on failure: just retry, no ordering needed
    ));
}

A few details worth noting:

  • compare_exchange_weak may spuriously fail, so always use it in a loop.
  • std::memory_order_release ensures the successful write is visible to other threads that later acquire the same atomic.
  • std::memory_order_relaxed on failure avoids unnecessary synchronization overhead for retries.
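As a quick sanity check, here’s a small driver of my own (not from the lecture) that pairs with the atomic_max definition above; whatever the interleaving, the final value should be the overall maximum:

#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

// atomic_max(std::atomic<int>&, int) as defined above

int main() {
    std::atomic<int> v{10};
    std::vector<std::thread> workers;
    for (int x : {15, 12, 3, 42, 7})
        workers.emplace_back([&v, x] { atomic_max(v, x); });
    for (auto& t : workers) t.join();
    std::printf("v = %d\n", v.load());  // always prints v = 42
}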

Why Lock-free Programming Is So Hard

Catching that bug was one of those “oh wow” moments that really stuck with me.

The fact that this buggy code persisted in a CMU slide deck for years is a humbling reminder that lock-free programming is not just tricky; it’s deceptively tricky.

Because even when you know what atomicCAS() does, even when you’ve seen CAS loops before, your brain glosses over that line. It just feels right.

It takes a weird kind of obsessive rereading to realize that the problem isn’t in the logic — it’s in the assumptions about atomicity.

Concurrency is so hard because our brains aren’t wired to think in interleavings, caches, and atomic snapshots; we’re wired to think sequentially. We have to reason about what the hardware is doing at the level of cache coherence, memory ordering, and atomic visibility — all at once.