Correctly implementing a spinlock in Modern C++ (rigtorp.se)
115 points by symisc_devel on June 25, 2021 | hide | past | favorite | 97 comments


Spinlocks are a great way of telling the kernel that you believe in small government and no regulations. I mean really, why should the kernel meddle in your user business?


Because we live in a society, and not every program can live its life without affecting others.


I write embedded code that lives in the woods like a hermit.


Favor local variables over global variables where able.


RAM is a global variable


I never leave L1.


I did once write a small program for the PDP-10 that ran entirely in the registers (which happened also to be addressable as the first 18 words of memory)


What did it do?


Something dull like compute Fibonacci numbers. Though there was a legit reason to do this: IIRC this procedure was how you bootstrapped the machine.


I never leave the trace cache. FEEB.


I can't help but think your comment resembles something like Godwin's law: any technical discussion eventually drifts into comparisons with politics.


We speak of state and State, programs and Programs, rules and Laws.

Technical discussions are more deterministic, I suppose, but at a high enough level of abstraction, it's all so much information management.


You literally called memory safety an authoritarian dystopia in another comment.



To me the benefit of the sorlock is that you can quicken Eldritch Blast. I have an 8th level sorlock (3 W, 5 S) who can do 4 EBs per round if he really has to.


One neat thing is, you don't actually need a spinlock if you're using a ringbuffer. You can use atomic increment on a counter to claim a slot, then write to that slot.

It can give between two and three orders of magnitude higher throughput. Equally important, lower latency.

(Throughput tends to be a consequence of low latency, but not always.)

I'm not saying this should be the norm, though. You probably don't need this design. But when you do, e.g. processing millions of stock market messages, there's no substitute.

EDIT: I love hearing about the designs and questions, but it was probably a mistake for me not to be explicit. Sorry! The thing I'm referring to is LMAX Disruptor pattern: https://lmax-exchange.github.io/disruptor/

I learned about it in 2011-ish, and it deeply changed my perspective on high speed designs.
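

The core trick can be sketched in a few lines of C++ (a simplified single-producer variant, not the full Disruptor; all names and sizes here are mine):

    #include <array>
    #include <atomic>
    #include <cstddef>
    #include <cstdint>

    // Sketch: claim a slot with an atomic increment, write it, then publish
    // by advancing a cursor. Single producer, any number of readers; slot
    // reuse/wraparound checks are omitted for brevity.
    struct Ring {
        static constexpr std::size_t N = 1024;      // power of two
        std::array<int, N> slots{};
        std::atomic<std::uint64_t> next{0};         // next slot to claim
        std::atomic<std::uint64_t> published{0};    // slots visible to readers

        void push(int v) {
            // Claim a slot; with a single producer this is trivially in order.
            std::uint64_t seq = next.fetch_add(1, std::memory_order_relaxed);
            slots[seq % N] = v;                     // write the claimed slot
            published.store(seq + 1, std::memory_order_release); // publish
        }

        bool try_pop(std::uint64_t& cursor, int& out) {
            if (cursor >= published.load(std::memory_order_acquire)) return false;
            out = slots[cursor % N];
            ++cursor;
            return true;
        }
    };

With multiple producers the publish step gets harder (you must publish in claim order), which is exactly the part the Disruptor paper spends its time on.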


I spent several years developing lockfree algorithms and very often CAS loops are the empirically fastest solution. But as often, you can perform as good or better with atomic increments. The devil is in the details of what sort of tasks you’re managing and on which specific hardware.


The problem is that CAS loops in locking scenarios aren't an option if you're subject to preemption. Yes, true lock-free data structures are allowed, but for many of them there are situations where you could practically encounter live locks (i.e., they aren't just a theoretical possibility for the access algorithm).

Atomic increment is still fairly cheap/efficient and resolves many cases where you'd risk live lock by being wait-free in some aspects.


There are many scenarios where you're subject to preemption and CAS is empirically faster and completely viable. This includes hardware with over 200 hardware threads of concurrency. In that same realm of hardware, atomic increments can be as fast or faster.

Certain devilish details I’m sure are exceptions to those scenarios, just sharing the ones I’ve come across.


Yeah, we've been bitten by horrible degradation as soon as anything else is running on the system next to the main application, like a cron job or unknowingly running CI on a VM where the VM's vCPUs are subject to preemption.

As long as you aren't oversubscribing hardware threads, you'll be fine, but at the cost of pathological behavior once you add just a single additional thread.

Also don't use `sched_yield(2)` for spin locks, unless you're running priority-based RT Linux. A scheduler that is good for that scenario tends to be quite bad at most real-world scenarios that don't try to do their own spinlocks in userspace due to NIH syndrome.


I have not found this to be the case in my situation but there are so many variations of this that I could be missing something.

So from my experiments with this, in order to avoid a lock you need to have a ringbuffer of atomic pointers (or atomic values if you're literally just buffering ints or similar). You claim a slot which is a pointer, and then swap out the pointer to the stale value with the pointer to the updated value.

But this now requires a multithreaded lock free memory pool. Single threaded this is no problem, but multithreaded this is incredibly difficult without high latency.

From my own efforts, while it's possible to implement a lock free ring buffer + suitable memory allocator, you end up with high latency. This high latency can only be justified if the size of the data you're buffering is greater than a certain bound because you get increased bandwidth.

For data less than a certain bound, my experience is you get much lower latency and lower bandwidth (but still pretty good bandwidth) by tagging each element with a lock-bit.

I suspect we work in the same area (finance), so if you have insight into good lock free memory allocators or resources I'd be very happy to better inform myself.


It's easy to fall into traps! I edited my original comment with a link to https://lmax-exchange.github.io/disruptor/ which has all the details on avoiding the pitfalls you mention.

As to your specific concern, I think you should be able to preallocate a large buffer of objects that you want to share. In other words, the allocations only need to happen infrequently.

The conversation going from "ringbuffer" to "multithreaded lock-free memory pool" is throwing up warning signals. It's true that you do need to be allocating memory, but the memory can be allocated by a thread (lock free), and then the slot is claimed and pointer written (still lock free). But there's nothing special about this process -- just allocate some memory, and stick it into the slot.

The ringbuffer is the thing that handles the coordination, enabling you to allocate memory on whatever thread you want.

Is there a reason more complexity is justified? (More complexity might entirely be justified, and I just haven't had that experience.)


>the memory can be allocated by a thread (lock free)

If there is one thread A allocating memory to push a value onto the ring buffer, and another thread B popping a value off of the ring buffer, how does B free that memory back to the memory allocator without introducing a data race? Certainly there must be some kind of synchronization so that A can allocate memory and B can free that memory.

>The conversation going from "ringbuffer" to "multithreaded lock-free memory pool" is throwing up warning signals.

Yes, because in many cases when I see lock free data structures and get excited about it, what is really presented is a data structure that is putting all of the locking pressure on the memory allocator so that the system as a whole has no net gain.

And this comes down to the crux of the issue, you can write a lock free ring buffer if you stuff all your locking into your memory allocator and I suppose you could claim that the ring buffer is lock free... but the system of ring buffer + memory allocator is then no longer lock free.

That said, I'm not saying this is bad; lock-free doesn't mean good and fast, while using a lock means slow and bad. It's just that there are a lot of subtle details that make a proper analysis of this much more difficult than it first appears, and to the best of my knowledge there is no lock-free ring buffer that gives a clear performance benefit.

But as I said, this is such a tricky subject with so many different possible configurations that I would love to see different approaches.


Memory allocation can be avoided by using preallocated buffers for storage. Yeah, that limits the maximum number of elements and could waste memory like hell, but you have to live with that and get performance in return. The race can be eliminated by using separate dirty and free lists (circular buffers, probably implemented as arrays, with the same number of elements as the data storage has). Like declare your storage as a global array and add all of its indexes to the free list. When adding data to the storage, pop an index from the front of the free list, fill the given slot of the array, push the index to the back of the dirty list. When processing data, pop an index from the front of the dirty list, process the slot, push the index to the back of the free list.
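

A minimal sketch of that scheme, assuming one producer and one consumer (the index ring below is a plain SPSC circular buffer; all names are mine):

    #include <array>
    #include <atomic>
    #include <cstddef>
    #include <optional>

    // SPSC ring of slot indexes. Two of these (a free list and a dirty
    // list) coordinate access to a preallocated storage array, so no
    // allocation happens on the hot path.
    template <std::size_t N>
    struct IndexRing {
        std::array<std::size_t, N + 1> buf{};   // one extra slot: full/empty test
        std::atomic<std::size_t> head{0}, tail{0};

        bool push(std::size_t v) {              // producer side
            auto t = tail.load(std::memory_order_relaxed);
            auto nt = (t + 1) % (N + 1);
            if (nt == head.load(std::memory_order_acquire)) return false; // full
            buf[t] = v;
            tail.store(nt, std::memory_order_release);
            return true;
        }

        std::optional<std::size_t> pop() {      // consumer side
            auto h = head.load(std::memory_order_relaxed);
            if (h == tail.load(std::memory_order_acquire)) return std::nullopt;
            std::size_t v = buf[h];
            head.store((h + 1) % (N + 1), std::memory_order_release);
            return v;
        }
    };

The producer pops from the free ring, fills `storage[idx]`, and pushes `idx` onto the dirty ring; the consumer does the reverse.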


> If there is one thread A allocating memory to push a value onto the ring buffer, and another thread B popping a value off of the ring buffer, how does B free that memory back to the memory allocator without introducing a data race? Certainly there must be some kind of synchronization so that A can allocate memory and B can free that memory.

I'm tempted to say "No need to free it; next time A needs one, let it have that one that you were going to free."

In other words, claiming a slot also claims an already-allocated object.

You're right; this isn't a trivial design consideration. And I'm second-guessing myself as to whether my answer here is wrong. If you see a problem with it, definitely call it out.

(Cheers for the interesting conversation, by the way... Didn't expect it.)


Yeah for sure, at any rate I went over the LMAX Disruptor link you provided, and while it doesn't use "locks", it is not a lock-free data structure. The confusion is that their use of the word "lock" means yielding to the operating system, which they don't do; but if a thread is performing a write, all other threads will spin in a tight loop (which is basically a spin lock) until the write is committed. This is not a lock-free data structure in the typical sense of the word (guaranteeing that at least one thread makes progress), since if the JVM preempts the thread currently in the process of writing, all other writing threads are starved.

You can read more about it here in Section 4.3 on page 6 where they have the following busy waiting:

    long expectedSequence = claimedSequence - 1;
    while (cursor != expectedSequence) {
      // busy spin
    }
    cursor = claimedSequence;

https://lmax-exchange.github.io/disruptor/files/Disruptor-1....

Some other details of the blocking are here:

http://mechanitis.blogspot.com/2011/07/dissecting-disruptor-...


You have many options, here are two:

A) Your ring buffer is your allocator: copy (or in-place construct) your messages directly in your ring buffer. This works very well for short messages like continuations.

B) Each producer thread has a freed-memory lock-free queue so that consumers can return memory. The producer can pop the whole queue in one go when it has exhausted its local cache, but pushing an object onto the queue can be expensive if the producer was talking with multiple threads. If you want to get fancy you can have NxN SPSC queues, which works fine with the one-thread-per-core model, but obviously won't scale to tens of thousands of threads.
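

Option B's return queue can be sketched as an intrusive MPSC free list: consumers push freed nodes with a CAS loop, and the owning producer reclaims the whole list in one exchange (`Node` here is a hypothetical message type, not anything from a real library):

    #include <atomic>

    struct Node {
        Node* next{nullptr};
        int payload{0};
    };

    struct FreeList {
        std::atomic<Node*> head{nullptr};

        // Called by any consumer returning memory to this producer.
        void push(Node* n) {
            Node* old = head.load(std::memory_order_relaxed);
            do {
                n->next = old;
            } while (!head.compare_exchange_weak(
                old, n, std::memory_order_release, std::memory_order_relaxed));
        }

        // Called by the owning producer: grab the whole list in one go.
        Node* pop_all() {
            return head.exchange(nullptr, std::memory_order_acquire);
        }
    };

The single `exchange` in `pop_all` is what makes "pop the whole queue in one go" cheap for the producer, while the CAS loop cost is paid by the returning consumers.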


> Your ring buffer is your allocator: copy (or in-place construct) your messages directly in your ring buffer.

Then you will need a lock. There's no way to atomically copy data in place without locking.

>each producer thread has a freed-memory lock free queue so that consumers can return memory.

Variations of this is how lock free allocators work and the latency penalty for this strategy is very significant, anywhere from 3-5x the latency penalty of a blocking allocator. Certainly lock free allocators have their use cases and if you have a system that needs bandwidth over latency you go for it, but the point is that unless you have hard real time needs for your system, then you're usually better off going for a blocking data structure.


You do not really need a lock. In an SPSC queue, the producer will publish the message (by bumping the write pointer or setting the next pointer on the previous message) only after it is done constructing it. In the MPSC case the producer also needs to reserve the space first by atomically increasing the shared write pointer.

This is similar to the disruptor model, except that write positions are not pointer sized but arbitrary sized. Similarly to the disruptor model, the mpsc case is technically not lock-free (not even obstruction-free), but writers and the consumer never need to block on an actual lock.


There are multiple lock-free allocators now that are suitable for production, such as mimalloc.


I was specific in saying that it's hard to implement a low latency lock free memory allocator. mimalloc has very high latency, almost triple that of jemalloc (which is not lock-free).

Every single lock-free allocator I've seen is intended to satisfy high throughput at the expense of incurring high latency, and mimalloc is no different in this respect. At any rate you don't normally use a general purpose memory allocator when you need a high performance data structure, instead opting for a data structure specific allocator/memory pool.


Do you have an example of this in action? I'd be interested to know what steps are taken to avoid false sharing of cache lines in the ring buffer itself.


You pad each counter so that it's on a separate cache line. https://lmax-exchange.github.io/disruptor/ goes into detail on the false sharing problem.

(A+ and full marks on recognizing that that's a central problem.)


I think reitzensteinm was referring to the actual data inside the ring. There is a false sharing problem there particularly for the MPMC type ring buffer like disruptor. The solution is to pad each data slot in addition to the read and write indices. I do this in my MPMC ring buffer queue: https://github.com/rigtorp/MPMCQueue/blob/master/include/rig...


You can align and pad each ring buffer slot to the cache line size. Example https://github.com/rigtorp/MPMCQueue/blob/master/include/rig...
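

In code, the padding looks roughly like this (a sketch in the spirit of the linked MPMCQueue code, not copied from it):

    #include <atomic>
    #include <cstddef>
    #include <new>  // std::hardware_destructive_interference_size

    #ifdef __cpp_lib_hardware_interference_size
    inline constexpr std::size_t kCacheLine =
        std::hardware_destructive_interference_size;
    #else
    inline constexpr std::size_t kCacheLine = 64;  // common on x86-64
    #endif

    // Each slot is aligned (and therefore sized) to a full cache line, so
    // threads writing adjacent slots never contend on the same line.
    template <typename T>
    struct alignas(kCacheLine) PaddedSlot {
        std::atomic<std::size_t> turn{0};  // sequence number for MPMC handoff
        T value{};
    };

    static_assert(alignof(PaddedSlot<int>) >= kCacheLine);
    static_assert(sizeof(PaddedSlot<int>) % kCacheLine == 0);

The same alignment trick is applied to the read and write indices themselves, which otherwise sit next to each other and ping-pong between cores.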


Ringbuffers and thread-per-core architecture is great for low latency transaction processing systems, like exchanges and trading systems.

Sometimes you don't have time to do things perfectly and might use a spinlock when initializing some cache on first use... Only light contention, not great, not terrible :).


Last time I looked at Disruptor it still required a WaitStrategy, one of which was a Spinlock


Not for the single producer multi consumer case, which is approximately the most useful case. (For me, anyway.)

If you have multiple producers and consumers though, then yes. But it's very cool to me that you can do without the spinlock in any useful case.


Are you sure? It's been a while since I ported it to C++ and so maybe the code has changed, but doesn't the producer have to wait for all consumers to consume, and don't consumers have to wait for the producer to produce?

The BlockingWaitStrategy was mutex based IIRC.


The original whitepaper goes into detail about where the lock isn't necessary: https://lmax-exchange.github.io/disruptor/files/Disruptor-1....

I did this in 2011 or so, so it's been a decade. But my feelings are, "I'm quite certain that if I were to re-read the paper from start to finish, I'd end up feeling convinced of the position I just tried to convince you of."

If that's not true, then I'm simply a fool, and will happily concede. :) But! For now, I have to go look at a house that we're thinking of renting.


Long live the LMAX?


All hail!

It was truly a cool design, for its time.

I feel strongly like it's one of those designs you should study just to even be aware that such a thing exists. Otherwise I probably would've been like "Oh, a spinlock! Yes, always."

(It's the disruptor pattern: https://lmax-exchange.github.io/disruptor/)


Notice that this code probably has a bug. [Edited to add: spacechild1 says it doesn't, and I now believe them]

It keeps using exchange(), which swaps the old value in memory for your value and gives back the old value, but it sets std::memory_order_acquire, with the author apparently thinking that since this wants to acquire a lock, this is enough.

But it isn't. The exchange() call is two memory operations, it's a load and a store, and so what you wanted here was Acquire and Release semantics ie. memory_order::acq_rel

What has been written is effectively Relaxed semantics for the store, and C++ doesn't do a great job of explaining that to programmers.
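
For reference, the construct under discussion is a test-and-set spinlock along these lines (a sketch from memory of the article's pattern, not its exact code):

    #include <atomic>

    struct Spinlock {
        std::atomic<bool> locked{false};

        void lock() noexcept {
            for (;;) {
                // Try to take the lock; acquire ordering on success.
                if (!locked.exchange(true, std::memory_order_acquire)) return;
                // Wait with plain relaxed loads so the spin loop stays
                // read-only and doesn't bounce the cache line between cores.
                while (locked.load(std::memory_order_relaxed)) { /* spin */ }
            }
        }

        void unlock() noexcept {
            locked.store(false, std::memory_order_release);
        }
    };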


Nope, std::memory_order_acquire is perfectly fine here. Just because an atomic operation involves a store doesn't necessarily mean you always need/want to synchronize with that store. In this case, we only need to make sure that subsequent code is not reordered before the atomic load, hence the acquire semantics.


Thanks, I guess?

I spent a whole lot of time staring at this, and I was eventually able to convince myself that you're correct, there isn't any way for this to actually go wrong.

In the case where you're storing a different value than was already there, you just took the lock and even in relaxed ordering that is in fact an atomic operation, nobody else has the lock. In every other case who cares if you release, you didn't change anything anyway. The unlock() will release, and so anybody who subsequently acquires the lock will see the changes made under the lock.

Even after convincing myself this analysis is correct, I'm still scared that it's wrong, anyway. After all my previous analysis was wrong :/


This stuff is so subtle.

It's not actually clear to me right now what part of the memory model would prevent the release store of an earlier unlock from being moved past the acquire load of a later lock.

Except that you wouldn't move it past a single acquire load, you'd potentially move it past an infinite number of acquire loads, and maybe then you run into formal language about forward progress?

There are similarly fun questions around hoisting an acquire load out of an otherwise empty loop...


> what part of the memory model would prevent the release store of an earlier unlock from being moved past the acquire load of a later lock.

Are you talking about different locks? In that case, yes, reordering can happen.

If it's about the same lock, then this can't happen because it would change the semantics of the code (you can't aquire the lock without releasing it first).


That reordering would be bad even with different locks because it can introduce deadlocks, as somebody has pointed out.

There must be some formal semantics reason which forbids it, and I think the reason boils down to the lock not being a single acquire load but instead a loop of acquire loads.

Unlock followed by lock is a store that is followed by a potentially infinite loop. If you move the store past the loop, and the loop happens to be or becomes infinite, then the effect of the store will never become visible, and that is an incorrect program transform because it'd be as-if you simply erased the store.

So you can probably move the store past any finite number of loads, but not past a potentially infinite loop.


Yes this re-ordering can cause a deadlock. But it's not an allowed optimization to change a non-deadlocking program into a potentially deadlocking program, so this reordering can not happen. (http://eel.is/c++draft/intro.multithread#intro.progress-7)

You can still get this deadlock if you actually write the code like that: lock1.lock(); lock2.unlock();


"aquire" means that the compiler can't move any reads/writes up before the load. "release" means that it can't move any reads/writes down after the store.

The compiler can move a release store of variable A past an aquire load of variable B because it violates neither constraints.

The opposite is not true: it can't move an aquire load of variable A past a release store of variable B because it violates both contraints.

For a lock, aquire/release semantics are sufficient because its only job is to protect some shared piece of data, it doesn't care about other locks.

> reordering would be bad even with different locks because it can introduce deadlocks, as somebody has pointed out.

Do you have an example of such a deadlock? Maybe it's best to work with some actual code.


Example:

Thread 1 does A.lock, A.unlock, B.lock, B.unlock

Thread 2 does B.lock, B.unlock, A.lock, A.unlock

That's fine, but swap the middle pair of unlock followed by lock on both threads and you have a standard deadlock based on different lock nestings.

And to reiterate from my previous comment, I do think that transform is or should be forbidden. That's not for synchronization reasons per se, but because the transform can change the set of externally visible side effects of the code. This argument requires that a store with release semantics is considered to be an externally visible side effect. That makes intuitive sense to me, though I don't remember what e.g. the C++ standard actually says about that.


Although pure acquire-release semantics would allow such reordering, it didn't feel quite right to me either. Fortunately, someone asked the exact same question on Stack Overflow and got a good answer: https://stackoverflow.com/a/61300584

Looks like you've been on the right track with regard to possible infinite loops.


Isn't this case discussed at the end of the article?

> C++ committee member Tony Van Eerd commented on reddit that with std::memory_order_acquire and two locks a and b, calls to a.unlock(); b.lock() and lock() for different locks could be reordered to b.lock(); a.unlock(); and introduce a potential deadlock. I don't think that's true. Reading section 6.9.2.1 Data races (paragraph 9) of the C++ standard, no loads or stores can be moved into or out from between a load-acquire and store-release pair.

> http://eel.is/c++draft/intro.races#9


It's true that the compiler can't move reads/write out from between a load-acquire and store-release pair, but in my understanding it can certainly move reads/writes into the pair. Or can someone point me to the exact section of the standard where this is prohibited?

I think my stack overflow link above gives a more satisfying explanation on why the example can't deadlock.


I now think it's actually this part of the standard that prevents it: http://eel.is/c++draft/intro.multithread#intro.progress-7

Basically a compiler can not optimize a non-deadlocking program into a potentially deadlocking one.


Yes, it's the same example, but the answer appears to be subtly different. Acquire and release are only "one-way barriers". Consider the link in the sibling comment by spacechild1.


Your reasoning is correct. Also, if it makes you sleep better :-), here's the boost implementation of a spinlock:

https://github.com/boostorg/sync/blob/dfdf317f63d26fb40c703b...

It has been written by an expert in lockfree programming (Tim Blechmann)



The title "Corrected memory order in some cases" is a bit misleading because the original code was correct. The commit just relaxes some unnecessarily strict memory orderings (to improve performance on platforms with weak memory models).

BTW, the spinlock implementation is not affected.


The debate about whether further reordering is possible is an interesting one. I think it's evidence that it's basically not possible for humans to reason about the memory model. The best way to resolve it is to use an automated checking tool based on a formal model. Good choices are CDSChecker[1] or the newer CppMem[2].

[1]: http://plrg.ics.uci.edu/software_page/42-2/

[2]: http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/help.html


Multithreading sure is complicated.

I'm currently playing with multithreading right now. I'm implementing snapshot isolation multiversion concurrency control.

In theory you can avoid locks (except for data structure locks) by creating a copy of the data you want to write and detect conflicts at read and commit time. Set the read timestamp of a piece of data to the transaction timestamp (timestamps are just monotonically increasing numbers) that reads it. If someone with a higher transaction timestamp comes along, they abort and restart because someone got in before them.

At the moment I have something that mostly works but occasionally executes a duplicate. I'm trying to eradicate the last source of bugs but as with anything parallel, it's complicated due to the interleavings.

My test case is to spin up 100 threads, with each thread trying to increment a number. The end numbers should be 101 and 102. If there was a data race, then the numbers will be lower.

https://github.com/samsquire/multiversion-concurrency-contro...


Have you heard of TLA+? If not, I suggest checking it out.

It's kinda made to handle investigating and preventing such rare race conditions.


That assumes that the problem is in their model, not their implementation.


In something like described, it sounds like the design part is where they’re at. If they were optimizing a working algorithm (eg selecting the best memory ordering for some operation), then I’d say they’re in the implementation phase.

What’s not clear to me is if TLA+ can properly model the memory model. If I recall correctly, it doesn’t. You’re just testing your logic. That does still leave for some heavy lifting to be done.


AFAIK it can model the interleaving possibilities at hand, and helps to find somewhat short execution traces that violate some logic/assumptions.


Very cool! I’ve spent the last few months implementing a production MVCC system that’s completely lock-free. It’s really quite simple, but there were initially bugs in all the concurrent algorithms I designed. I wish I had time to learn TLA+ and properly model-check the algorithms, but for now I have to gain confidence just from stress testing and thinking hard about interleavings. The main thing I’ve learned is not to over-optimize. Keeping the algorithms as simple as possible is the only way I can reason about them.


Does your commit function run in parallel too?


For a real battle-tested combination of spinlock and futex, see https://github.com/abseil/abseil-cpp/blob/master/absl/synchr...


Just pointing out: while this implementation is absolutely non-trivial to study, every time I take a look at the abseil source, I am delighted with the quality of the inline comments.


Well isn't that just a normal lock/mutex of the "lightweight" type (only enter kernel on contention)?

You cannot use that in non-preemptible context.


Correctly using a spinlock: Never when preemption is enabled


Therefore, also never when hold times and/or contention are non-trivial (i.e. obviating the PAUSE addition).


That one isn't actually as strict. Just because you used a spinlock doesn't mean you cause pathological degradation.

Spinlocks in systems with preemption are only allowed if the scheduler knows about them, though. Otherwise a throughput-optimizing scheduler may cause hold times in the range of seconds, by keeping an acquiring thread active while the holding thread sleeps. Such a scheduler wouldn't be suitable for interactive workloads, but for non-networked batch tasks it should be quite efficient, due to a combination of calling the scheduling logic less often (thus wasting less time on it) and less cache contention with the application.

In interrupt handlers and such you may need to use spin locks, as you can't sleep either way. The rule about preemption still applies though, and aside from fairness issues, will take care of preventing pathological hold times.


Yes that's right!


Isn't it nonsense to do this in user-space? Your thread can get preempted at any moment.


It absolutely is, which is why the benchmarks posted should not be taken seriously.

Some environments don't have a scheduler or a "user-space", and so this implementation can suffice, but the problem with this post is that all of the testing and benchmarking are done in user-space, and hence the metrics are not very meaningful.


No. You can configure the kernel to run with a different scheduling policy and pretty much prevent preemption. You can even move IO interrupt handlers to other cores to really ensure you have the core.


OK. And who does that, when and why? Can you point to some practical REAL (= existing) examples?


In general, multi-threaded realtime audio software. Example: https://github.com/supercollider/supercollider/tree/develop/...


I've seen things like that in industrial control systems. I think some high-performance networking setups also do things like that, doing almost everything in user-space, actively polling for packets from the hardware.


Another example are OS-bypassing workloads which are tail-latency sensitive. They will prefer to avoid any OS involvement and take charge of everything in user space.


It's nonsense to do this when you can be preempted. But you can run one thread per core and avoid preemption. You can tune the linux kernel to avoid almost all preemption, due to TLB shootdowns etc: https://rigtorp.se/low-latency-guide/


It depends. In a general purpose OS without scheduler trickery one would need at least some backoff & yielding syscall. And it's hugely workload dependent.

But more importantly, even with kernel backed waiting (e.g. using futexes) some userspace spinning can be important for performance. And then pretty much all the concerns from the spinlock apply there as well.


Can happen at any moment != will happen most of the time


Preemption sure is non-deterministic from POV of the thread. But what's the point of it anyway? It's not like you handle HW interrupts in user-space and need to wait for some value in memory mapped register.


> But what's the point of it anyway?

Some people empirically find it improves their applications' performances in some situations.


It's also not rare to end up needing spinlocks within more elaborate lock implementations or within other low level primitives.


I think it may become even easier with C++20 and the addition of wait/notify to std::atomic

https://en.cppreference.com/w/cpp/atomic/atomic/wait
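

A sketch of what that could look like (requires C++20; note this is a lightweight blocking lock, not a pure spinlock):

    #include <atomic>

    struct WaitLock {
        std::atomic<bool> locked{false};

        void lock() noexcept {
            while (locked.exchange(true, std::memory_order_acquire)) {
                // Block until the value changes from true, instead of
                // burning CPU in a spin loop.
                locked.wait(true, std::memory_order_relaxed);
            }
        }

        void unlock() noexcept {
            locked.store(false, std::memory_order_release);
            locked.notify_one();  // wake one waiter, if any
        }
    };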


That is a great way to implement a lock, but it won't be a spin lock (which is a good thing by the way). The wait function is a blocking call.


Regarding moving loads and stores into acq/rel critical sections: this is possible, well studied, and documented; it has been known as "roach motel" reordering since the Java MM standardisation, and is generally considered safe.

I never thought about whether it could lead to deadlocks or not, and it is an interesting question. I assume there must be some formal proof against it, but I would have to think about it (my first hunch is that if the reordering could cause a deadlock, then the code wasn't safe in any case, like not acquiring locks in a consistent order).


What is your favourite resource for studying multithreaded programming?


I quite liked the book "C++ Concurrency in Action: Practical Multithreading". It has been a few years since I've read it, but it gave me a better understanding of both C++ and concurrency in general.


I'm surprised nobody has mentioned Linus's rant about spinlocks in userspace: https://www.realworldtech.com/forum/?threadid=189711&curpost...


His rant only applies to preemptible threads. If you don't have preemptible threads, spinlocks work great. The Linux kernel uses them in non-preemptible contexts.


There should be site like linus-rants.com


How would you test this?


Deploy to production :).

You can use a model checker that understands C++11 memory model.



