Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
How to check if any word is zero (2021) (0x80.pl)
27 points by luu on Oct 20, 2022 | hide | past | favorite | 22 comments


For "reasonably small" number of variables to test, branches are likely to be faster. Especially if zero values are either rare or very common, so the branches end up predictable.

SSE4.1 has packed integer min/max instructions which could well be optimal for testing if a big array contains any zeroes.

also: How to check if any word is non-zero? Simply OR them together!


I would expect a small number of variables to fit in cache or registers, so the small extra OOO/data dependency penalty should be much smaller than the cost of a mispredict, no?


It still has to compare every pair of variables, then do the CMOV (which can't be predicted). Then do the same with the result of the first step, computing the minimum for each pair of pairs. Not sure how many of these the CPU can do in parallel!


I don't mean the cmov code, I mean the normal naive testl + sete code. The CPU can do all testl's in parallel and all sete's in parallel. I think this should be faster than branches if you ever have a 0.

I'm assuming your branch code looks something like:

  if (a == 0) {
    return true;
  }
  if (b == 0) {
    return true;
  }
  // ...
  return false;


You mean AND them together.

OR will give you non-zero If one of the inputs is non-zero.


Neither of these are correct.


oh gosh you're right. Why am I getting confused?

I was thinking he wants to check if any of them ARE zero.

OR'ing them together:

    0 || 1 || 0 || 0 = 1
    0 || 0 || 0 || 0 = 0
AND'ing them together:

    1 || 0 || 1 || 1 = 0
    1 || 1 || 1 || 1 = 1


similar: you can do SIMD "within" a single integer type with some fun bit hacks.

ex: for an integer (u32, u64, u128), you can test "all at once" whether any byte == 0, or any nibble == 0, etc... like: [0]

    fn u32_any_byte_eqz(x: u32) -> bool {
        let t = x.wrapping_sub(0x0101_0101);
        let u = !x & 0x8080_8080;
        let v = t & u;
        v != 0
    }
which generates pretty compact assembly:

    example::u32_any_byte_eqz:
            lea     eax, [rdi - 16843009]
            andn    eax, edi, eax
            test    eax, -2139062144
            setne   al
            ret
even more powerful, you can test whether any byte `b` is in the range `m < b < n` "all at once", where `0 <= m <= 127` and `0 <= n <= 128`. [1]

    fn u32_has_byte_between(x: u32, m: u32, n: u32) -> bool {
     let mask = 0x0101_0101;
     let s = mask * 127;
     let y = mask * 128;
     let w = mask * (127 + n);
     let t = mask * (127 - m);
    
     let u = x & s;
     let z = (w - u) & (!x) & (u + t) & y;
    
     z != 0
    }
by changing the mask you can also select which bytes/nibbles you care about.

i've used these to pretty great effect when writing a montecarlo tree search. you just need some care to tightly bit-pack the core structs, then you're golden.

[0]: https://phlip9.com/notes/performance/bit%20hacks/#any-byte-0 [1]: https://phlip9.com/notes/performance/bit%20hacks/#any-byte-b...


I expected the given example to produce assembly that respects short-circuiting, but apparently that's not a requirement. Unoptimized code will have jmps between the conditions, but any optimization at all removes them.

https://cpp.godbolt.org/z/h6sEY8TMz

Always assumed this was a requirement of the language, not "up to the compiler". Perhaps it's only a requirement if the conditions can have side-effects.

Interestingly, every answer in this SO post gets this wrong. https://stackoverflow.com/questions/1799072/c-short-circuiti...


> I expected the given example to produce assembly that respects short-circuiting, [...].

> Always assumed this was a requirement of the language, not "up to the compiler".

But the assembly does respect short-circuiting in C++!

The language specification tells you how to figure out the meaning of a program and it tells you which behaviors you are guaranteed to observe or not observe from the program [0]. This is completely abstract – we could ourselves carry out a C++ program in our heads, or with paper and pencil.

A compiler takes a program as input and writes code for some particular machine as output. If the guaranteed behaviors of the program occur when the machine runs the code, the compiler has conformed to the language specification [1]. The language specification says almost nothing about what this code has to look like.

Even if the language specification states that `b == 0` is not evaluated in `a == 0 || b == 0` when `a` is zero (and that this is required, regardless of the right-hand side), the corresponding assembly might always involve executing a `cmp b, 0` instruction, because what it means to "evaluate b == 0" has to do with the meaning of the C++ program, not the assembly.

[0] See the definition of "observable behavior" quoted in https://stackoverflow.com/a/15718279.

[1] See https://en.cppreference.com/w/cpp/language/as_if. This actually detracts somewhat from my point: talking about allowable transformations implies that there's some preferred way of initially compiling a program, some basic version that's then optimized. Even if compilers happen to work this way, the language specification doesn't.


You're right about it being up to the compiler if the conditions don't have side-effects.

The C++ standard's "as-if" rule[1] allows these kinds of transformations if they do not effect the "observable behavior" of the program. Interestingly, if the values are marked volatile the compiler outputs the "correct" short-circuiting code, even with optimizations[2]

[1] https://en.cppreference.com/w/cpp/language/as_if

[2] https://cpp.godbolt.org/z/dzTh1vPW9


The operation being done based on the result matters - returning 0 or 1 can be done branchlessly, so it's worth making the whole operation branchless. However, if the operation requires a branch one way or another, both gcc and clang do branches for some of the intermediate checks:

https://cpp.godbolt.org/z/7j4qj1aPY


Interesting! It'd be nice to benchmark to see if there are any practical difference in terms of speed.


Yes. In order:

A well predicted branch costs 0. This is the best case and why branching code can sometimes be faster.

An instruction that needs a previous instruction to compute has some pipeline stalls. This is slower than the above and why branchless code can be slower.

A mispredicted branch costs a full pipeline flush. The worst case. This is why we avoid branches but we shouldn't do so blindly since it might not be common in a given program.

So it depends. I discourage removing branches in general unless you have benchmarks to back it up.


GCC turns a fold over a parameter pack into test/sete and turns std::min == 0 into cmp/cmovg: https://godbolt.org/z/TjK7Gz74M


llvm-mca prefers the "fold" over the "min". Intuitively it's more pipelinable. The codegen for "min" has data dependencies that are not there for the "fold" version.

https://godbolt.org/z/1zjKP6z8q

edit:

A mixture of the two approaches possibly outperforms both:

https://godbolt.org/z/WxGWPvYTs

edit2:

Doh, of course to be be correct for the min approach, you need to use unsigned. Does not change the performance though.


Edit: See comments, got something mixed up. This doesn't work at all.

Original post:

In a similar vein you could turn

  a == 0 || b == 0 || c == 0
into

  !(a && b && c).
If we don't need short-circuit evaluation of b == 0 and c == 0 (which the article assumes), we can turn this into

  !(a & b & c)
which yields similar assembly (n + 1 instructions without branches):

  R0 <- and Ra Rb
  R1 <- and R0 Rc
  R2 <- not R1
Would make an interesting benchmark.


!(a & b & c) is not equivalent to !(a && b && c) for int values of a b and c.


Good lord, no, that doesn't work. Try it with a=1, b=2, c=4.


I think a = 1, b = 2 would mean !(a & b) = true implying one of them is zero?



Besides reduction with unsigned min, one can compute a product, which will of course be zero if any element is zero. An OR reduction of the leading/trailing zero counts of the elements would also work, but would be unlikely to be fastest.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: