Atomic RMW is all you need. Imagine 3 state futex lock of taken, sleeping, and unlocked. Waking thread just needs to see old value but writes unlocked unconditionally.
I don't think this is more expensive than release certainly not because of the line where the write goes in most cache coherency protocols.
Granted this lock does wake up too many people but you did say usually uncontended.
So you want uncontended unlocks to make a syscall to wake the threads that aren’t there?
That syscall will be worse than a CAS just because it’s a syscall. Not to mention it’ll have to lock some kernel data structure just to figure out that there aren’t any threads to wake.
I think I'm not doing a great job of explaining it.
There are three states of the mutex: locked=1, unlocked=0, and sleeping=2. A thread comes along to a locked mutex, and atomically moves it to sleeping and sleeps via futex. The freeing thread unconditionally writes 0 and atomically sees the old value: if 2 wake all, if 1 no need to wake anybody.
Maybe that is equally expensive to a CAS, but I don't think it ultimately is.
I don't think this is more expensive than release certainly not because of the line where the write goes in most cache coherency protocols.
Granted this lock does wake up too many people but you did say usually uncontended.