
The program has a bug. It's mixing atomic and non-atomic variables in the yield() checking loop. Non-atomic variables have no guarantee of cache consistency across threads. This can cause the loop to run forever.

    #include <atomic>
    #include <cstdint>
    #include <shared_mutex>
    #include <thread>

    struct ThreadTestData {
        int32_t numThreads = 0;
        std::shared_mutex sharedMutex = {};
        std::atomic<int32_t> readCounter = 0;
    };

    // child thread
    void DoStuff(ThreadTestData* data) {
        data->readCounter.fetch_add(1);
        while (data->readCounter.load() != data->numThreads) {
            std::this_thread::yield();
        }
    }
The numThreads field is not an atomic variable. It's initialized to 0 and set to 5 in the main thread, and its memory address is then passed to the child threads to be checked in the yield loop. Since it's non-atomic, there's no memory-barrier instruction to force its new value (5) to propagate to all the CPUs running the threads. A child thread might get the old value 0, and the yield loop that checks it would then never exit.

Since the main thread runs the code in an endless loop, the same numThreads memory allocated on the stack is set to 0 and 5 repeatedly. Some of the child threads can pick up the old value in one pass of the loop; hence the hang.
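If that diagnosis were right, the minimal fix would be to make numThreads atomic too (a sketch of the implied change, not the OP's actual code):

    struct ThreadTestData {
        std::atomic<int32_t> numThreads = 0;  // now atomic: the loads in DoStuff are synchronized
        std::shared_mutex sharedMutex = {};
        std::atomic<int32_t> readCounter = 0;
    };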



> there's no memory-barrier instruction to force its new value (5) to propagate to all the CPUs running the threads.

The equivalent of the memory barrier instructions is there, but it's hidden within the operating system code which creates and initializes a new thread. That is, the operating system ensures that the value in the current CPU (in this case, 5) is propagated to the CPU running the newly started thread, before the thread start routine (in this case, DoStuff) is called. The value is not modified while the child threads are running (it waits for the child threads to exit before clearing the value), so there's no chance of the child threads seeing the value being set back to zero.


The documentation for the Windows CreateThread API [1] says nothing about memory-synchronization guarantees. Does it provide them internally?

[1] https://learn.microsoft.com/en-us/windows/win32/api/processt...


That MSDN documentation is unfortunately silent on this, but the example in the documentation (at https://learn.microsoft.com/en-us/windows/win32/procthread/c...) only makes sense if the operating system guarantees the ordering.

The C++ standard (at least a draft of it I found on a quick web search) is more explicit: it says (https://eel.is/c++draft/thread.thread.constr) "The completion of the invocation of the constructor synchronizes with the beginning of the invocation of the copy of f." (see https://eel.is/c++draft/intro.races for more detail on that "synchronizes with"). Since the code in question is using std::thread, even if the operating system did not have the relevant guarantees, the C++ standard library would have the required memory barriers.
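As a minimal sketch of what that guarantee buys you (illustrative, not the OP's code): a plain, non-atomic write made before the std::thread constructor completes is visible inside the thread function without any atomics or fences in user code:

    #include <cassert>
    #include <thread>

    int main() {
        int value = 0;          // plain, non-atomic
        value = 5;              // this write happens-before the constructor completes...
        std::thread t([&] {
            assert(value == 5); // ...which synchronizes with the start of the lambda
        });
        t.join();
    }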


Any time somebody tells you how simple C is by comparison, point them at https://port70.net/~nsz/c/c11/n1570.html#6.2.4p5:

> An object whose identifier is declared with no linkage and without the storage-class specifier static has automatic storage duration, as do some compound literals. The result of attempting to indirectly access an object with automatic storage duration from a thread other than the one with which the object is associated is implementation-defined

(I bet in practice almost all implementations behave as an equivalent C++ would, as per your notes, so only the ordering is relevant. But people maintaining C implementations have on occasion shown themselves to be their users' enemies, so don't quote me on this!)


More generally, the OS is going to be doing some level of synchronization of its own during the thread-creation process, most likely via a spinlock. Those always include memory barriers, because otherwise the locks they implement wouldn't actually work on out-of-order systems.
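For illustration, here's the shape of such a lock (a toy sketch in C++ atomics, not the actual OS implementation); the acquire/release orderings are exactly the memory barriers in question:

    #include <atomic>

    class SpinLock {
        std::atomic_flag flag = ATOMIC_FLAG_INIT;
    public:
        // acquire: reads/writes after lock() can't be reordered before it
        void lock() { while (flag.test_and_set(std::memory_order_acquire)) { /* spin */ } }
        // release: reads/writes before unlock() can't be reordered after it
        void unlock() { flag.clear(std::memory_order_release); }
    };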


`numThreads` is written before the child threads that read it are started, so there is an explicit happens-before relationship and no data race. Before `numThreads` is reset, the child threads are joined.

There is no bug in the program; it is legal to use non-atomic variables across threads as long as the accesses are correctly sequenced.


See my last paragraph above.


> Its memory address is then passed to the child threads to be checked in the yield loop. Since it's non-atomic, there's no memory-barrier instruction to force its new value (5) to propagate to all the CPUs running the threads.

Each core would have to fetch the value from main memory, where it will undoubtedly be 5. There is no valid reordering (at least on x86) that would cause a thread to read 0.


The main thread runs the child-thread creation in an endless loop, repeatedly setting numThreads to 0 and to 5, back to 0 and to 5 again. Can the caches of the CPUs consistently keep up with the changes?


> repeatedly setting numThreads to 0 and to 5, back to 0 and to 5 again.

The reset to 0 and to 5 happens at the start of the loop. There's a happens-before relationship between it and the threads being created, and then again between the threads being joined and the loop cycling back. So there shouldn't be any data race here.
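In outline, the sequencing being described looks like this (a stripped-down sketch of the loop's structure, with the shared_mutex part omitted and names illustrative):

    #include <atomic>
    #include <cstdint>
    #include <thread>
    #include <vector>

    struct Data {
        int32_t numThreads = 0;               // plain on purpose
        std::atomic<int32_t> readCounter = 0;
    };

    void DoStuff(Data* data) {
        data->readCounter.fetch_add(1);
        while (data->readCounter.load() != data->numThreads)
            std::this_thread::yield();
    }

    int main() {
        Data data;
        for (;;) {
            data.readCounter = 0;
            data.numThreads = 5;                      // write...
            std::vector<std::thread> threads;
            for (int i = 0; i < data.numThreads; ++i)
                threads.emplace_back(DoStuff, &data); // ...happens-before each thread starts
            for (auto& t : threads)
                t.join();                             // thread exit happens-before join() returns
            data.numThreads = 0;                      // safe: every reader has been joined
        }
    }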


CPUs not running the main thread don't care about the execution order of the main thread's instructions. They only see their local caches of the same memory location change from 0 to 5, 5 to 0, and back to 5 repeatedly. When a new thread lands on a CPU with a stale cached 0, it will hang.


> CPUs not running the main thread don't care about the execution order of the main thread's instructions.

On x86, they do (the x86 family is unusual in having strong memory ordering), but that's not the issue here.

> They only see their local caches of the same memory location change from 0 to 5, 5 to 0, and back to 5 repeatedly.

Their local caches of that memory see only a 5, since at the moment they read that cache line, the value in memory is 5; the operating system ensures that the write of the 5 value by the main thread is flushed to memory[*] before the main thread starts the child thread, and also that the cache of the child thread does not have stale data from before that moment. That memory location is only set back to 0 after all the child threads have exited, so there's no instant where the child thread could read a 0 on that location from main memory into its cache.

> When a new thread lands on a CPU with a stale cached 0, it will hang.

When a new thread lands on a CPU core with an old 0 cache value for that memory location (which could happen if that CPU core had been running the main thread, and the main thread was migrated to another CPU core before it could set it back to 5), it will still see a 5 at that memory location, because the operating system invalidates the cache of a CPU core when necessary before starting a new thread on it.

[*] Actually, it only has to be flushed as far as the last level cache, or the "point of unification" in ARM terminology; I simplified a lot in this explanation.


The code runs in what's effectively a single-threaded context. CPUs running other threads only get involved when threads are launched and joined, and that should suffice as synchronization.


Your assessment is completely, totally, and verifiably wrong. It's obnoxious, but not surprising, how many commenters have chimed in with "program has bug BLAH" when they provably did not try to run the program themselves.

Because if you did attempt to run the program, you'd find that changing numThreads to a constexpr makes no difference.
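For reference, the tweak being described (a sketch; the point is that the hang reproduces even when the value is a compile-time constant that is never written at runtime):

    #include <atomic>
    #include <cstdint>
    #include <shared_mutex>

    struct ThreadTestData {
        static constexpr int32_t numThreads = 5;  // no runtime write at all
        std::shared_mutex sharedMutex = {};
        std::atomic<int32_t> readCounter = 0;
    };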


> Non-atomic variables have no guarantee of cache consistency across threads

Atomicity and cache coherence are different things. "Atomic" means that the access to the element is done as a single indivisible operation and can only observe the state resulting from other atomic accesses to the same value. For C-style variables, this mostly matters for multi-word accesses (atomic compare-and-set instructions also exist, but those don't appear as part of the language per se).

Cache coherence is actually guaranteed on almost all systems, you don't need to worry about it on anything big enough to be running Windows (in the embedded world we get to fight it though).

The other demon in this space is memory reordering, but atomicity by itself doesn't speak to that at all.
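To make the distinction concrete (an illustrative sketch): atomicity is about indivisible access, and says nothing about when other cores see the value:

    #include <atomic>
    #include <cstdint>

    struct Pair { uint32_t lo; uint32_t hi; };  // a two-word value

    Pair plain;               // a concurrent plain read can observe a half-written ("torn") Pair
    std::atomic<Pair> whole;  // loads and stores are indivisible: never a torn Pair

Whether a stored value then becomes visible to other cores is the cache-coherence question, which the hardware already answers on mainstream systems.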


You seem to have missed the part where an actual MS employee confirmed it was a bug in their API.


He just read the OP's code, the C++ standard, and the Microsoft Learn document and drew that conclusion. That's a rather hasty determination. Unless he read the actual Windows lock implementation and found a bug there, I don't think his conclusion is correct.


> That's a rather hasty determination.

I hope you realize how deeply ironic this statement is. If you read the comments you'll find he even produced a slightly reduced repro. And in other comments he tried minor tweaks like the one you suggested.

You have a thesis that the program has a bug. (It doesn't.) Go ahead and test your thesis and report back.



