
> As an observation, performance optimized code is almost always effectively single-threaded these days, even when using all the cores on a CPU to very efficiently process workloads.

Why?

Edit: Thanks for all the replies. It seems this applies to data-parallel workloads only. I'd use a GPU for this. An RTX 3090 has ~10,000 CUDA cores (10,000 simultaneous operations) vs. ~10 cores for a typical CPU.



Data locality is everything for computational throughput. Having all data private to a single core is extraordinarily efficient compared to sharing data, and particularly mutable data, across cores.
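To illustrate the "all data private to a single core" structure, here's a minimal Python sketch (worker count and chunking are arbitrary choices; CPython's GIL serializes bytecode, so this shows the shape of the design rather than the speedup a compiled language would see):

```python
from concurrent.futures import ThreadPoolExecutor

def sharded_sum(data, n_workers=4):
    """Partition data so each worker touches only its private slice.

    No shared mutable state: each worker computes a private partial
    result, and results are combined once at the end -- the only
    point of cross-worker "sharing".
    """
    chunk = (len(data) + n_workers - 1) // n_workers
    slices = [data[i * chunk:(i + 1) * chunk] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(sum, slices))  # private per-worker work
    return sum(partials)  # single combine step
```

The key property is that during the parallel phase no worker reads or writes another worker's data, so no cache lines bounce between cores.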

This creates a new problem: how do you balance load across cores? What if the workload is not evenly distributed across the data held by each core? Real workloads are like this! Fortunately, over the last decade, architectures and techniques for dynamic cross-core load balancing have become smooth and efficient while introducing negligible additional inter-core coordination. At this point, it is a mature way of designing extremely high-throughput software.


Can you point to any references/resources summarizing the latest dynamic cross-core load-balancing techniques? Are they old techniques just now being applied in practice, or has something new been proposed?


I don't know about the latest, but take a look at

Scheduling Parallel Programs by Work Stealing with Private Deques https://hal.inria.fr/file/index/docid/863028/filename/full.p...
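For intuition, the core idea in that line of work — each worker owns a deque, pops locally, and steals from others only when idle — can be sketched like this (the coarse locking, fixed worker count, and in-order victim scan are simplifications; real implementations use lock-free deques and randomized stealing):

```python
import collections
import threading

class WorkerDeque:
    """Per-worker deque: the owner pushes/pops at the back (LIFO),
    idle workers steal from the front, so contention is confined to
    the rare steal path."""
    def __init__(self):
        self.deque = collections.deque()
        self.lock = threading.Lock()

    def push(self, task):
        with self.lock:
            self.deque.append(task)

    def pop(self):  # owner's fast path, back of the deque
        with self.lock:
            return self.deque.pop() if self.deque else None

    def steal(self):  # thieves take from the opposite end
        with self.lock:
            return self.deque.popleft() if self.deque else None

def run(tasks, n_workers=4):
    """Run zero-argument tasks with work stealing; tasks here do not
    spawn new tasks, which keeps termination detection trivial."""
    deques = [WorkerDeque() for _ in range(n_workers)]
    for i, t in enumerate(tasks):  # initial (possibly uneven) split
        deques[i % n_workers].push(t)
    results, res_lock = [], threading.Lock()

    def worker(me):
        while True:
            task = deques[me].pop()
            if task is None:  # local deque empty: try to steal
                for victim in range(n_workers):
                    if victim != me and (task := deques[victim].steal()):
                        break
            if task is None:
                return  # nothing anywhere, done
            with res_lock:
                results.append(task())

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in threads: t.start()
    for t in threads: t.join()
    return results
```

Because each worker mostly touches only its own deque, the common case adds essentially no inter-core coordination — matching the parent comment's point.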


Sometimes your job has few or no inter-task dependencies and so there's no need to share between threads, but there's a heck of a lot of work that needs to be completed.
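A minimal sketch of that shape of workload — many independent tasks, nothing shared between them (hashing is just a stand-in for "a heck of a lot of work"; note CPython's hashlib releases the GIL only for large buffers, so actual speedup depends on the workload):

```python
from concurrent.futures import ThreadPoolExecutor
import hashlib

def hash_all(items, n_workers=8):
    """Each item is hashed independently: no inter-task dependencies,
    so workers never communicate or share mutable state."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(lambda b: hashlib.sha256(b).hexdigest(), items))
```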


Essentially any significant task can be made multithreaded for any number of cores; it's just a lot of coding work.


There's a practical question of whether this is true for realistically encountered problems, with a sufficient threshold on both the size and utility of a "significant task" and with realistic numbers of cores.

Without a requirement of utility it's easy to come up with counterexamples from math, e.g. "does the Collatz sequence starting from Graham's number reach one?" Once you've exhausted the cores that can help with the actual arithmetic, you are still gated by the decision-making at each step, so cores cannot work too far "ahead" of each other. There may well be much smarter things we can do than brute force, but that's not "just coding work" at that point.

Theoretically, it doesn't hold - at some point you have split apart everything that can be split, and you are left with some essential chains of data dependency that cannot be further parallelized.
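To make the Collatz point concrete: each step consumes the previous value, so the iteration is exactly such an essential chain of data dependencies. A sketch for small n (for Graham's-number-sized inputs, extra cores can help with the big-integer arithmetic inside each step, but the step-to-step chain stays serial):

```python
def collatz_steps(n):
    """Count steps to reach 1. Each iteration reads the value the
    previous iteration produced, so no two iterations can run in
    parallel, no matter how many cores are available."""
    steps = 0
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps
```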


I think the issue is that memory is the bottleneck in many applications (i.e. 2 loads per cycle, despite many more functional units), and those workloads tend to be very non-embarrassingly parallel.
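One standard way to quantify "memory is the bottleneck" is the roofline model's arithmetic intensity, FLOPs per byte moved. A tiny worked example, assuming 8-byte doubles (the daxpy numbers below are the textbook figures, not from this thread):

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs per byte moved to/from memory (roofline model). Kernels
    whose intensity sits below the machine's compute-to-bandwidth
    ratio are memory-bound: extra functional units idle on loads."""
    return flops / bytes_moved

# daxpy, y[i] = a*x[i] + y[i]: 2 flops per element (mul + add),
# 24 bytes moved per element (load x[i], load y[i], store y[i]).
daxpy = arithmetic_intensity(2, 24)  # ~0.083 flop/byte: memory-bound
```

At roughly 1/12 flop per byte, no amount of extra functional units helps daxpy; only more bandwidth (or better locality) does.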



