For a "reasonably small" number of variables to test, branches are likely to be faster, especially if zero values are either rare or very common, so the branches end up predictable.
SSE4.1 has packed integer min/max instructions which could well be optimal for testing if a big array contains any zeroes.
also: How to check if any word is non-zero? Simply OR them together!
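the OR trick, sketched in Rust (the function name is mine): all words are zero iff the OR of all of them is zero, so one reduction plus one test answers the question.

```rust
// OR all the words together; the result is non-zero iff at least one
// word is non-zero.
fn any_nonzero(words: &[u64]) -> bool {
    words.iter().fold(0, |acc, &w| acc | w) != 0
}
```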
I would expect a small number of variables to fit in cache or registers, so the small extra OOO/data dependency penalty should be much smaller than the cost of a mispredict, no?
It still has to compare every pair of variables, then do the CMOV (which can't be predicted). Then do the same with the result of the first step, computing the minimum for each pair of pairs. Not sure how many of these the CPU can do in parallel!
I don't mean the cmov code, I mean the normal naive testl + sete code. The CPU can do all testl's in parallel and all sete's in parallel. I think this should be faster than branches if you ever have a 0.
I'm assuming your branch code looks something like:
if (a == 0) {
    return true;
}
if (b == 0) {
    return true;
}
// ...
return false;
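the branchless alternative (the testl + sete codegen described above) could be sketched like this in Rust; the bitwise `|` on bools evaluates both sides, so the compiler is free to emit independent test/sete pairs with no branches between them (function name is mine):

```rust
// Non-short-circuiting | forces every comparison to be evaluated,
// which typically compiles to parallel test/sete instructions.
fn any_zero(a: u32, b: u32, c: u32, d: u32) -> bool {
    (a == 0) | (b == 0) | (c == 0) | (d == 0)
}
```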
similar: you can do SIMD "within" a single integer type with some fun bit hacks.
ex: for an integer (u32, u64, u128), you can test "all at once" whether any byte == 0, or any nibble == 0, etc... like: [0]
fn u32_any_byte_eqz(x: u32) -> bool {
    // subtracting 1 from every byte lane wraps the lanes that were 0
    let t = x.wrapping_sub(0x0101_0101);
    // high bit of every lane whose high bit was clear in x
    let u = !x & 0x8080_8080;
    // nonzero iff some byte of x was 0
    let v = t & u;
    v != 0
}
which generates pretty compact assembly:
example::u32_any_byte_eqz:
lea eax, [rdi - 16843009]
andn eax, edi, eax
test eax, -2139062144
setne al
ret
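the nibble variant mentioned above is the same trick with the constants narrowed to 4-bit lanes (a sketch; the borrow argument is the same as for bytes, and the name is mine):

```rust
// Subtracting 1 from every nibble lane wraps the lanes that were 0;
// the result is non-zero iff some nibble of x is 0.
fn u32_any_nibble_eqz(x: u32) -> bool {
    let t = x.wrapping_sub(0x1111_1111);
    let u = !x & 0x8888_8888;
    (t & u) != 0
}
```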
even more powerful, you can test whether any byte `b` is in the range `m < b < n` "all at once", where `0 <= m <= 127` and `0 <= n <= 128`. [1]
fn u32_has_byte_between(x: u32, m: u32, n: u32) -> bool {
    let mask = 0x0101_0101;
    let s = mask * 127; // 0x7f7f_7f7f
    let y = mask * 128; // 0x8080_8080
    let w = mask * (127 + n);
    let t = mask * (127 - m);
    let u = x & s; // x with each lane's high bit cleared
    // a lane's high bit survives in z iff m < byte < n for that lane
    let z = (w - u) & (!x) & (u + t) & y;
    z != 0
}
by changing the mask you can also select which bytes/nibbles you care about.
i've used these to pretty great effect when writing a Monte Carlo tree search. you just need some care to tightly bit-pack the core structs, then you're golden.
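one safe way to realize the "select which bytes you care about" idea (my own sketch, not necessarily how the parent does it): force the ignored lanes to a non-zero value before running the test, so they can never trigger it.

```rust
// The zero-byte test from above.
fn u32_any_byte_eqz(x: u32) -> bool {
    let t = x.wrapping_sub(0x0101_0101);
    (t & !x & 0x8080_8080) != 0
}

// ignore_mask has 0xFF in every byte lane we want to skip; OR-ing it in
// makes those lanes non-zero, so only the remaining lanes can match.
fn u32_selected_byte_eqz(x: u32, ignore_mask: u32) -> bool {
    u32_any_byte_eqz(x | ignore_mask)
}
```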
I expected the given example to produce assembly that respects short-circuiting, but apparently that's not a requirement. Unoptimized code will have jmps between the conditions, but any optimization at all removes them.
Always assumed this was a requirement of the language, not "up to the compiler". Perhaps it's only a requirement if the conditions can have side-effects.
> I expected the given example to produce assembly that respects short-circuiting, [...].
> Always assumed this was a requirement of the language, not "up to the compiler".
But the assembly does respect short-circuiting in C++!
The language specification tells you how to figure out the meaning of a program and it tells you which behaviors you are guaranteed to observe or not observe from the program [0]. This is completely abstract – we could ourselves carry out a C++ program in our heads, or with paper and pencil.
A compiler takes a program as input and writes code for some particular machine as output. If the guaranteed behaviors of the program occur when the machine runs the code, the compiler has conformed to the language specification [1]. The language specification says almost nothing about what this code has to look like.
Even if the language specification states that `b == 0` is not evaluated in `a == 0 || b == 0` when `a` is zero (and that this is required, regardless of the right-hand side), the corresponding assembly might always involve executing a `cmp b, 0` instruction, because what it means to "evaluate b == 0" has to do with the meaning of the C++ program, not the assembly.
[1] See https://en.cppreference.com/w/cpp/language/as_if. This actually detracts somewhat from my point: talking about allowable transformations implies that there's some preferred way of initially compiling a program, some basic version that's then optimized. Even if compilers happen to work this way, the language specification doesn't.
You're right about it being up to the compiler if the conditions don't have side-effects.
The C++ standard's "as-if" rule[1] allows these kinds of transformations if they do not affect the "observable behavior" of the program. Interestingly, if the values are marked volatile, the compiler outputs the "correct" short-circuiting code, even with optimizations[2].
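a Rust analogue of the volatile case (my own sketch, mirroring the C++ behavior described above): volatile reads count as observable, so they may not be elided or reordered, and the compiler has to preserve the short-circuit.

```rust
// With read_volatile, b is only read when *a != 0; the compiler cannot
// hoist or merge the second read the way it can with plain loads.
unsafe fn any_zero_volatile(a: *const u32, b: *const u32) -> bool {
    std::ptr::read_volatile(a) == 0 || std::ptr::read_volatile(b) == 0
}
```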
The operation being done with the result matters: returning 0 or 1 can be done branchlessly, so it's worth making the whole operation branchless. However, if the operation requires a branch one way or another, both gcc and clang emit branches for some of the intermediate checks.
A well predicted branch costs 0. This is the best case and why branching code can sometimes be faster.
An instruction that needs the result of a previous instruction suffers some pipeline stalls. This is slower than the above and is why branchless code can be slower.
A mispredicted branch costs a full pipeline flush: the worst case. This is why we avoid branches, but we shouldn't do so blindly, since mispredicts might not be common in a given program.
So it depends. I discourage removing branches in general unless you have benchmarks to back it up.
llvm-mca prefers the "fold" over the "min". Intuitively it's more pipelinable. The codegen for "min" has data dependencies that are not there for the "fold" version.
Besides reduction with unsigned min, one can compute a product, which will of course be zero if any element is zero. An OR reduction of the leading/trailing zero counts of the elements would also work, but would be unlikely to be fastest.
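those two reductions could be sketched like this (names are mine; note that the product version needs wrapping arithmetic, and wrapping multiplication can also reach zero with no zero element, so it admits false positives):

```rust
// Min reduction: the unsigned minimum is 0 iff some element is 0 (exact).
fn any_zero_min(v: &[u32]) -> bool {
    v.iter().min() == Some(&0)
}

// Product reduction: zero whenever some element is zero, but wrapping
// multiplication can hit zero without one (e.g. 0x10000 * 0x10000).
fn maybe_any_zero_product(v: &[u32]) -> bool {
    v.iter().fold(1u32, |acc, &x| acc.wrapping_mul(x)) == 0
}
```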