I've recently experienced what I believe was likely branch predictor failure. I rewrote essentially the same algorithm, with the same complexity, in a completely new way, to reduce constant factors. The new code was beautiful, and I was expecting perhaps 15-20% performance improvement. (40% of the runtime in the old version was in one function, the new version should, and did, call that function with less data to grind on average). Memory access patterns should be identical. Grand.
So what happened with my standard benchmark on Haswell (GCC 5.3, -march=native -O2)?
Old version, without PGO: ~8 minutes (baseline)
Old version, with PGO on: ~9 minutes (yes, PGO made it slower)
New version, without PGO: 11 minutes (60% slower)
New version, with PGO on: 7 minutes (~15% faster)
All of these benchmarks were replicated with a common set of functions tagged noinline, and I inspected the assembly to see if there was a difference with regard to loop unrolling. Nope, no unrolling. Afaict it appeared to be entirely to do with the way GCC had laid out code with respect to branches. The differences were minor in key areas and seemed like arbitrary compiler choices.
I eventually narrowed this down with callgrinds branch-predictor simulator to a few hot spots. I added a redundant if-statement to check for and short-circuit a common case, even though it shouldn't have mattered, and a few __builtin_expects, and suddenly I was getting the 15% speedup without PGO.
So what happened with my standard benchmark on Haswell (GCC 5.3, -march=native -O2)?
Old version, without PGO: ~8 minutes (baseline)
Old version, with PGO on: ~9 minutes (yes, PGO made it slower)
New version, without PGO: 11 minutes (60% slower)
New version, with PGO on: 7 minutes (~15% faster)
All of these benchmarks were replicated with a common set of functions tagged noinline, and I inspected the assembly to see if there was a difference with regard to loop unrolling. Nope, no unrolling. Afaict it appeared to be entirely to do with the way GCC had laid out code with respect to branches. The differences were minor in key areas and seemed like arbitrary compiler choices.
I eventually narrowed this down with callgrinds branch-predictor simulator to a few hot spots. I added a redundant if-statement to check for and short-circuit a common case, even though it shouldn't have mattered, and a few __builtin_expects, and suddenly I was getting the 15% speedup without PGO.