If it's so good, why haven't all C libraries adopted the same tricks? My betting...

aseipp · on Oct 2, 2024

Those projects often have dozens of other priorities beyond just one specific API, and obsessing over individual APIs isn't a good way to spend the limited time they have. In any case, as a concrete example to disprove your claim, you can look at malloc and string routines in your average libc on Linux.

glibc's malloc is tolerable but fails handily to more modern alternatives in overall speed and scalability (it fragments badly and degrades over time, not to mention it has a dozen knobs that can deeply impact real life workloads like MALLOC_ARENA_MAX). musl malloc is completely awful in terms of performance at every level; in a multithreaded program, using musl's allocator will destroy your performance so badly that it nearly qualifies as malpractice, in my experience.

musl doesn't even have things like SIMD optimized string comparison routines. You would be shocked at how many CPU cycles in a non-trivial program are spent on those tasks, and yes it absolutely shows up in non-trivial profiles, and yes improving this improves all programs nearly universally. glibc's optimized routines are good, but they can always, it seems, become faster.

These specific things aren't "oh, they're hyper specific optimizations for one architecture that don't generalize". These two things in particular -- we're talking 2-5x wall clock reduction, and drastically improved long-term working set utilization, in nearly all workloads for any given program. These are well explored and understood spaces with good known approaches. So why didn't they take them? Because, as always, they probably had other things to do (or conflicting priorities like musl prioritizing simplicity over peak performance, even when that philosophy is actively detrimental to users.)

I'm not blaming these projects or anything. Nobody sets out and says "My program is slow as shit and does nothing right, and I designed it that way and I'm proud of it." But the idea that the people working on them have made only the perfect pareto frontier of design choices just isn't realistic in the slightest and doesn't capture the actual dynamics of how most of these projects are run.

jart · on Oct 2, 2024

> musl doesn't even have things like SIMD optimized string comparison routines. You would be shocked at how many CPU cycles in a non-trivial program are spent on those tasks

Building GNU Make with Cosmo or glibc makes cold startup go 2x faster for me on large repos compared to building it with Musl, due to vectorized strlen() alone (since SIMD is 2x faster than SWAR). I sent Rich a patch last decade adding sse to strlen(), since I love Musl, and Cosmo is based on it. But alas he didn't want it. Even though he seems perfectly comfortable using ARM's strlen() assembly.

> glibc's malloc is tolerable but fails handily to more modern alternatives in overall speed and scalability

The focus and attention I put into cosmo mutexes isn't unique. I put that care into everything else too, and malloc() is no exception. Cosmo does very well at multi-threaded memory allocation. I can pick benchmark parameters where it outperforms glibc and jemalloc by 100x. I can also pick params where jemalloc wins by 100x. But I'm reasonably certain cosmo can go faster than glibc and musl in most cases while using less memory too. You have Doug Lea to thank for that.

Every day is a good day working on cosmo, because I can always find an opportunity to dive into another rabbit hole. Even ones as seemingly unimportant as clocks: https://github.com/jart/cosmopolitan/commit/dd8544c3bd7899ad...

senderista · on Oct 3, 2024

> You have Doug Lea to thank for that.

Wait, you don't mean your allocator is based on dlmalloc, do you?

jart · on Oct 4, 2024

Yes. Is there something wrong with that? If your program links pthread_create() then cosmo creates a dlmalloc arena for each core on your computer and uses sched_getcpu() to index them combined with raw nsync locks.

senderista · on Oct 5, 2024

I thought that’s pretty much what ptmalloc did (glibc malloc is forked ptmalloc3)?

wrsh07 · on Oct 3, 2024

Another example is hash maps: all the large companies built better maps in cpp (folly, absl), but the number of apps that are performance sensitive and still use std::unordered_map will be astounding forever.

(Why not upstream? ABI compatibility which is apparently a sufficient veto reason for anything in cpp)

menaerus · on Oct 3, 2024

No, that's rubbish that became a meme. There is no universally the best design of a hash-map. Each design always comes with its own trade-offs and so is the case with open-addressing vs separate chaining in conflict resolution. Iterator instability and memory-bandwidth intensive rehashing just to name a few.

Design of your hash-map, or any other algorithm or data-structure, is always and only dictated by your workload. Microbenchmark suite from xyz company/person will hardly represent that reality.

So the more generous answer would be it's not because of "ABI" meme but because std::unordered_map is just fine for the most cases out there. And this is what general purpose standard library is ought to be able to support.

wrsh07 · on Oct 3, 2024

Lol this is so incorrect it's almost funny. Google saved something like single digit % of CPU and memory fleet wide by switching to this map. That's not a micro benchmark

menaerus · on Oct 3, 2024

Learn to read with understanding and learn to have some respect as well. Never have I said that it cannot make a difference but that it's not universal and that it depends on the workload.

wrsh07 · on Oct 5, 2024

Ok let me try to unpack what I'm saying so that we're on the same page:

1. Until c++ [toolchains] do an abi break, you will find better performance in libraries like boost/folly/absl than in the STL. Titus wrote about this in "abi now or never" ~4 years ago. His thesis was "the language is leaving performance on the table, and by default if you care about performance you should not use the STL." Afaict this is all still true. If you think this is a meme, you're wrong, it's existential for companies like Google.

2. You mention a bunch of things that one might consider when picking a hashmap. "There might be a workload where I need one map vs another." Yes, that's true. However, my original statement was saying: if you care about performance you should not use unordered map. This is sort of like saying "if you care about performance you should not use bubble sort." Of course there are lots of considerations when choosing a sorting algorithm. Does it need to be cache tuned/oblivious? Do you have a lot of runs in your data (eg is it mostly sorted?)? HOWEVER, I will happily tell you to prefer a gently optimized quick sort over bubble sort for ~all workloads, and if you need further performance then go consider those things.

The performance of the absl/folly maps is not a joke. It's the basis for the rust hash map. These are good, general purpose hash maps. That's not a meme. It's also not a meme to say that the cpp working groups are generally unwilling to make ABI breaks for small improvements (indeed, in the past decade they haven't made an ABI break to bundle large improvements)

wrsh07 · on Oct 5, 2024

Microbenchmark suite is very different from fleet wide profiling

The whole point of switching every single use of unordered map to node/flat hash map is because it is always better. It always uses less memory (much less memory per node).

Edit: what did I say that was disrespectful? I called you out for being wrong because you're very wrong

dist-epoch · on Oct 2, 2024

Politics, not-invented-here syndrome, old maintainers.

It takes forever to change something in glibc, or the C++ equivalent.

There are many kinds of synchronization primitives. pthreads only supports a subset. If you are limiting yourself to them, you are most likely leaving performance on the table, but you gain portability.

jitl · on Oct 2, 2024

> I can't imagine it's that the various libc authors aren't keeping up in state-of-the-art research on OS primitives...

is this sarcasm?

(I don't know any libc maintainers, but as a maintainer of a few thingies myself, I do not try to implement state-of-the-art research, I try to keep my thingies stable and ensure the performance is acceptable; implementing research is out of my budget for "maintenance")

amiga386 · on Oct 2, 2024

But if you maintain a few thingies, you'd probably know about rival thingies that do a similar thing, right?

If the rival thingies got a speed boost recently, and they were open source, you'd want to have a look at how they did it and maybe get a similar speed boost for yourself.

tialaramex · on Oct 2, 2024

This is nowhere near as common as you seem to think, and mostly only happens for the narrow cases where somebody is obsessed with a particular problem so that they'd actually want to read other people's solutions. Most of the implementers do not have that sort of relationship to a problem they solved in the past.

If in December you make a general purpose stable sort that's 25% faster than his, Orson Peters is probably going to read your code and try to apply ideas from it. But sorting is the thing Peters really cares about, the people who wrote the stable sort in say Microsoft's STL (their C++ standard library implementation) even if they still work there won't care enough to go back and change that unless told to do so.

jitl · on Oct 2, 2024

It depends on the calculus about (time) budget and stability. Maybe I consider performance already acceptable, and don't have time budget to investigate beyond that. Maybe I look at "nsync", see its mutex (may) change the fairness semantics, and then decide not to adopt it because this may break my callers; or its enough that it may change the fairness semantics, and I don't have the budget to test nsync or a new implementation based on the nsync algorithm to determine if the semantics differ.

leni536 · on Oct 3, 2024

Are there ABI considerations for changing a pthread mutex implementation?

pnt12 · on Oct 3, 2024

> If it's so good, why haven't all C libraries adopted the same tricks?

A man and a statiscian are walking down the street when they see a 50€ bill. The statistician keeps walking but the man stops and says "hey, look at this cash on the floor?". But the statistician, uninpressed, says: "must be fake, or someone would have already picked it up". And continues walking. The other man grabs the cash and takes it.

Const-me · on Oct 2, 2024

My guess is, because what’s in these current standard libraries and OSes are good enough.

Synchronizing multiple CPU cores together is fundamentally slow, there’s no ways around it. They are far apart on the chip, and sometimes even on different chips with some link between. When measuring time with CPU cycles that latency is rather slow.

Possible to avoid with good old software engineering, and over time people who wanted to extract performance from their multi-core CPUs became good at it.

When you’re computing something parallel which takes minutes, you’ll do great if you update the progress bar at a laughable 5 Hz. Synchronizing cores 5 times each second costs nothing regardless of how efficient is the mutex.

When you’re computing something interactive like a videogame, it’s often enough to synchronize cores once per rendered frame, which often happens at 60Hz.

Another notable example is multimedia frameworks. These handle realtime data coming at high frequencies like 48 kHz for audio, and they do non-trivial compute in these effect transforms and codecs so they need multiple cores. But they can tolerate a bit of latency so they’re just batching these samples. This dramatically saves IPC costs because you only need to lock these mutexes at 100Hz when batching 480 samples = 10ms of audio.

monocasa · on Oct 3, 2024

> When you’re computing something interactive like a videogame, it’s often enough to synchronize cores once per rendered frame, which often happens at 60Hz.

That's not really the right way to do it. If you're actually using multiple cores in your code, you want to build a data dependency graph and let the cores walk that graph. Max node size ends up being something loosely like 10k items to be processed. You'll typically see hundreds of synchronization points per frame.

This is the base architecture of stuff like Thread Building Blocks and rust's rayon library.

astrange · on Oct 2, 2024

> They are far apart on the chip, and sometimes even on different chips with some link between.

They aren't always. On a NUMA machine some are closer than others; M-series is an example. The cores are in clusters where some of the cache is shared, so atomics are cheap as long as it doesn't leave the cluster.