
> As an observation, performance optimized code is almost always effectively single-threaded these days, even when using all the cores on a CPU to very efficiently process workloads.

Why?

Edit: Thanks for all the replies. It seems this applies to data-parallel workloads only. I'd use a GPU for this. An RTX 3090 has ~10,000 CUDA cores (10,000 simultaneous operations) vs. ~10 cores for a typical CPU.



Data locality is everything for computational throughput. Having all data private to a single core is extraordinarily efficient compared to sharing data, and particularly mutable data, across cores.
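To illustrate the "all data private to a single core" structure, here's a minimal Python sketch (worker count and chunking are arbitrary choices; CPython's GIL serializes bytecode, so this shows the shape of the design rather than the speedup a compiled language would see):

```python
from concurrent.futures import ThreadPoolExecutor

def sharded_sum(data, n_workers=4):
    """Partition data so each worker touches only its private slice.

    No shared mutable state: each worker computes a private partial
    result, and results are combined once at the end -- the only
    point of cross-worker "sharing".
    """
    chunk = (len(data) + n_workers - 1) // n_workers
    slices = [data[i * chunk:(i + 1) * chunk] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(sum, slices))  # private per-worker work
    return sum(partials)  # single combine step
```

The key property is that during the parallel phase no worker reads or writes another worker's data, so no cache lines bounce between cores.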

This creates a new problem: how do you balance load across cores? What if the workload is not evenly distributed across the data held by each core? Real workloads are like this! Fortunately, over the last decade, architectures and techniques for dynamic cross-core load balancing have become smooth and efficient while introducing negligible additional inter-core coordination. At this point, it is a mature way of designing extremely high-throughput software.


Can you point to any references/resources summarizing the latest dynamic cross-core load-balancing techniques? Are they old techniques just now being applied in practice, or has something new been proposed?


I don't know about the latest, but take a look at

Scheduling Parallel Programs by Work Stealing with Private Deques https://hal.inria.fr/file/index/docid/863028/filename/full.p...
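For intuition, the core idea in that line of work — each worker owns a deque, pops locally, and steals from others only when idle — can be sketched like this (the coarse locking, fixed worker count, and in-order victim scan are simplifications; real implementations use lock-free deques and randomized stealing):

```python
import collections
import threading

class WorkerDeque:
    """Per-worker deque: the owner pushes/pops at the back (LIFO),
    idle workers steal from the front, so contention is confined to
    the rare steal path."""
    def __init__(self):
        self.deque = collections.deque()
        self.lock = threading.Lock()

    def push(self, task):
        with self.lock:
            self.deque.append(task)

    def pop(self):  # owner's fast path, back of the deque
        with self.lock:
            return self.deque.pop() if self.deque else None

    def steal(self):  # thieves take from the opposite end
        with self.lock:
            return self.deque.popleft() if self.deque else None

def run(tasks, n_workers=4):
    """Run zero-argument tasks with work stealing; tasks here do not
    spawn new tasks, which keeps termination detection trivial."""
    deques = [WorkerDeque() for _ in range(n_workers)]
    for i, t in enumerate(tasks):  # initial (possibly uneven) split
        deques[i % n_workers].push(t)
    results, res_lock = [], threading.Lock()

    def worker(me):
        while True:
            task = deques[me].pop()
            if task is None:  # local deque empty: try to steal
                for victim in range(n_workers):
                    if victim != me and (task := deques[victim].steal()):
                        break
            if task is None:
                return  # nothing anywhere, done
            with res_lock:
                results.append(task())

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in threads: t.start()
    for t in threads: t.join()
    return results
```

Because each worker mostly touches only its own deque, the common case adds essentially no inter-core coordination — matching the parent comment's point.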


Sometimes your job has few or no inter-task dependencies and so there's no need to share between threads, but there's a heck of a lot of work that needs to be completed.
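A minimal sketch of that shape of workload — many independent tasks, nothing shared between them (hashing is just a stand-in for "a heck of a lot of work"; note CPython's hashlib releases the GIL only for large buffers, so actual speedup depends on the workload):

```python
from concurrent.futures import ThreadPoolExecutor
import hashlib

def hash_all(items, n_workers=8):
    """Each item is hashed independently: no inter-task dependencies,
    so workers never communicate or share mutable state."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(lambda b: hashlib.sha256(b).hexdigest(), items))
```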


Essentially any significant task can be made multithreaded for any number of cores; it's just a lot of coding work.


There's a practical question of whether this is true for realistically encountered problems, with a sufficient threshold on both the size and utility of a "significant task" and with realistic numbers of cores.

Without a requirement of utility it's easy to come up with counterexamples from math, e.g. "does the Collatz sequence starting from Graham's number reach one?" Once you've exhausted the cores that can help with the actual arithmetic, you are still gated by the decision-making at each step, so cores cannot work too far "ahead" of each other. There may well be much smarter things we can do than brute force, but that's not "just coding work" at that point.

Theoretically, it doesn't hold - at some point you have split apart everything that can be split, and you are left with some essential chains of data dependency that cannot be further parallelized.
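To make the Collatz point concrete: each step consumes the previous value, so the iteration is exactly such an essential chain of data dependencies. A sketch for small n (for Graham's-number-sized inputs, extra cores can help with the big-integer arithmetic inside each step, but the step-to-step chain stays serial):

```python
def collatz_steps(n):
    """Count steps to reach 1. Each iteration reads the value the
    previous iteration produced, so no two iterations can run in
    parallel, no matter how many cores are available."""
    steps = 0
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps
```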


I think the issue is that memory is the bottleneck in many applications (i.e. 2 loads per cycle, despite many more functional units), and those workloads tend to be very non-embarrassingly parallel.
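One standard way to quantify "memory is the bottleneck" is the roofline model's arithmetic intensity, FLOPs per byte moved. A tiny worked example, assuming 8-byte doubles (the daxpy numbers below are the textbook figures, not from this thread):

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs per byte moved to/from memory (roofline model). Kernels
    whose intensity sits below the machine's compute-to-bandwidth
    ratio are memory-bound: extra functional units idle on loads."""
    return flops / bytes_moved

# daxpy, y[i] = a*x[i] + y[i]: 2 flops per element (mul + add),
# 24 bytes moved per element (load x[i], load y[i], store y[i]).
daxpy = arithmetic_intensity(2, 24)  # ~0.083 flop/byte: memory-bound
```

At roughly 1/12 flop per byte, no amount of extra functional units helps daxpy; only more bandwidth (or better locality) does.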



