How to check if any word is zero (2021)

rep_lodsb · on Oct 20, 2022

For "reasonably small" number of variables to test, branches are likely to be faster. Especially if zero values are either rare or very common, so the branches end up predictable.

SSE4.1 has packed integer min/max instructions which could well be optimal for testing if a big array contains any zeroes.

also: How to check if any word is non-zero? Simply OR them together!

tylerhou · on Oct 20, 2022

I would expect a small number of variables to fit in cache or registers, so the small extra OOO/data dependency penalty should be much smaller than the cost of a mispredict, no?

rep_lodsb · on Oct 20, 2022

It still has to compare every pair of variables, then do the CMOV (which can't be predicted). Then do the same with the result of the first step, computing the minimum for each pair of pairs. Not sure how many of these the CPU can do in parallel!

tylerhou · on Oct 20, 2022

I don't mean the cmov code, I mean the normal naive testl + sete code. The CPU can do all testl's in parallel and all sete's in parallel. I think this should be faster than branches if you ever have a 0.

I'm assuming your branch code looks something like:

  if (a == 0) {
    return true;
  }
  if (b == 0) {
    return true;
  }
  // ...
  return false;

drekipus · on Oct 20, 2022

You mean AND them together.

OR will give you non-zero If one of the inputs is non-zero.

pklausler · on Oct 21, 2022

Neither of these are correct.

drekipus · on Oct 21, 2022

oh gosh you're right. Why am I getting confused?

I was thinking he wants to check if any of them ARE zero.

OR'ing them together:

    0 || 1 || 0 || 0 = 1
    0 || 0 || 0 || 0 = 0

AND'ing them together:

    1 || 0 || 1 || 1 = 0
    1 || 1 || 1 || 1 = 1

phlip9 · on Oct 21, 2022

similar: you can do SIMD "within" a single integer type with some fun bit hacks.

ex: for an integer (u32, u64, u128), you can test "all at once" whether any byte == 0, or any nibble == 0, etc... like: [0]

    fn u32_any_byte_eqz(x: u32) -> bool {
        let t = x.wrapping_sub(0x0101_0101);
        let u = !x & 0x8080_8080;
        let v = t & u;
        v != 0
    }

which generates pretty compact assembly:

    example::u32_any_byte_eqz:
            lea     eax, [rdi - 16843009]
            andn    eax, edi, eax
            test    eax, -2139062144
            setne   al
            ret

even more powerful, you can test whether any byte `b` is in the range `m < b < n` "all at once", where `0 <= m <= 127` and `0 <= n <= 128`. [1]

    fn u32_has_byte_between(x: u32, m: u32, n: u32) -> bool {
     let mask = 0x0101_0101;
     let s = mask * 127;
     let y = mask * 128;
     let w = mask * (127 + n);
     let t = mask * (127 - m);
    
     let u = x & s;
     let z = (w - u) & (!x) & (u + t) & y;
    
     z != 0
    }

by changing the mask you can also select which bytes/nibbles you care about.

i've used these to pretty great effect when writing a montecarlo tree search. you just need some care to tightly bit-pack the core structs, then you're golden.

[0]: https://phlip9.com/notes/performance/bit%20hacks/#any-byte-0 [1]: https://phlip9.com/notes/performance/bit%20hacks/#any-byte-b...

hoten · on Oct 20, 2022

I expected the given example to produce assembly that respects short-circuiting, but apparently that's not a requirement. Unoptimized code will have jmps between the conditions, but any optimization at all removes them.

https://cpp.godbolt.org/z/h6sEY8TMz

Always assumed this was a requirement of the language, not "up to the compiler". Perhaps it's only a requirement if the conditions can have side-effects.

Interestingly, every answer in this SO post gets this wrong. https://stackoverflow.com/questions/1799072/c-short-circuiti...

agalunar · on Oct 20, 2022

> I expected the given example to produce assembly that respects short-circuiting, [...].

> Always assumed this was a requirement of the language, not "up to the compiler".

But the assembly does respect short-circuiting in C++!

The language specification tells you how to figure out the meaning of a program and it tells you which behaviors you are guaranteed to observe or not observe from the program [0]. This is completely abstract – we could ourselves carry out a C++ program in our heads, or with paper and pencil.

A compiler takes a program as input and writes code for some particular machine as output. If the guaranteed behaviors of the program occur when the machine runs the code, the compiler has conformed to the language specification [1]. The language specification says almost nothing about what this code has to look like.

Even if the language specification states that `b == 0` is not evaluated in `a == 0 || b == 0` when `a` is zero (and that this is required, regardless of the right-hand side), the corresponding assembly might always involve executing a `cmp b, 0` instruction, because what it means to "evaluate b == 0" has to do with the meaning of the C++ program, not the assembly.

[0] See the definition of "observable behavior" quoted in https://stackoverflow.com/a/15718279.

[1] See https://en.cppreference.com/w/cpp/language/as_if. This actually detracts somewhat from my point: talking about allowable transformations implies that there's some preferred way of initially compiling a program, some basic version that's then optimized. Even if compilers happen to work this way, the language specification doesn't.

mnurzia · on Oct 20, 2022

You're right about it being up to the compiler if the conditions don't have side-effects.

The C++ standard's "as-if" rule[1] allows these kinds of transformations if they do not effect the "observable behavior" of the program. Interestingly, if the values are marked volatile the compiler outputs the "correct" short-circuiting code, even with optimizations[2]

[1] https://en.cppreference.com/w/cpp/language/as_if

[2] https://cpp.godbolt.org/z/dzTh1vPW9

dzaima · on Oct 20, 2022

The operation being done based on the result matters - returning 0 or 1 can be done branchlessly, so it's worth making the whole operation branchless. However, if the operation requires a branch one way or another, both gcc and clang do branches for some of the intermediate checks:

https://cpp.godbolt.org/z/7j4qj1aPY

jcbages · on Oct 20, 2022

Interesting! It'd be nice to benchmark to see if there are any practical difference in terms of speed.

AnotherGoodName · on Oct 20, 2022

Yes. In order:

A well predicted branch costs 0. This is the best case and why branching code can sometimes be faster.

An instruction that needs a previous instruction to compute has some pipeline stalls. This is slower than the above and why branchless code can be slower.

A mispredicted branch costs a full pipeline flush. The worst case. This is why we avoid branches but we shouldn't do so blindly since it might not be common in a given program.

So it depends. I discourage removing branches in general unless you have benchmarks to back it up.

thestoicattack · on Oct 20, 2022

GCC turns a fold over a parameter pack into test/sete and turns std::min == 0 into cmp/cmovg: https://godbolt.org/z/TjK7Gz74M

leni536 · on Oct 20, 2022

llvm-mca prefers the "fold" over the "min". Intuitively it's more pipelinable. The codegen for "min" has data dependencies that are not there for the "fold" version.

https://godbolt.org/z/1zjKP6z8q

edit:

A mixture of the two approaches possibly outperforms both:

https://godbolt.org/z/WxGWPvYTs

edit2:

Doh, of course to be be correct for the min approach, you need to use unsigned. Does not change the performance though.

poorlyknit · on Oct 20, 2022

Edit: See comments, got something mixed up. This doesn't work at all.

Original post:

In a similar vein you could turn

  a == 0 || b == 0 || c == 0

into

  !(a && b && c).

If we don't need short-circuit evaluation of b == 0 and c == 0 (which the article assumes), we can turn this into

  !(a & b & c)

which yields similar assembly (n + 1 instructions without branches):

  R0 <- and Ra Rb
  R1 <- and R0 Rc
  R2 <- not R1

Would make an interesting benchmark.

leni536 · on Oct 20, 2022

!(a & b & c) is not equivalent to !(a && b && c) for int values of a b and c.

pklausler · on Oct 20, 2022

Good lord, no, that doesn't work. Try it with a=1, b=2, c=4.

shaggie76 · on Oct 20, 2022

I think a = 1, b = 2 would mean !(a & b) = true implying one of them is zero?

ForOldHack · on Oct 21, 2022

https://web.archive.org/web/20221021000544/http://0x80.pl/no...

pklausler · on Oct 21, 2022

Besides reduction with unsigned min, one can compute a product, which will of course be zero if any element is zero. An OR reduction of the leading/trailing zero counts of the elements would also work, but would be unlikely to be fastest.