Hacker News

The things Rust wants more than other languages are support for fast array bounds checks and fast integer overflow checks.

Unfortunately none of the current popular architectures seem to have either.



Actually, I'd say that support for fast range checks is less important in Rust than in some other languages, because the Rust compiler is quite good at removing the checks when a value can be proven to be in the right range. In general, Rust seems to cope quite well with current CPU architectures, although I do think there's an open question as to how to support viable parallelism that is not quite as fine-grained as current ILP approaches (pipelining, superscalar), nor so coarse that it can be effectively exploited via multiple threads. ISTM that this is one portion of the design space that Rust might address quite nicely.
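
As a sketch of that check-elimination point (function names here are made up for illustration): when the loop bound is derived from the slice's own length, or when an iterator is used, the compiler can usually prove the index is always valid and drop the per-element check entirely.

```rust
// Two equivalent sums. In the indexed version the range comes from
// data.len(), so rustc/LLVM can typically prove i < data.len() and
// elide the bounds check; the iterator version never has an index at all.
fn sum_indexed(data: &[u64]) -> u64 {
    let mut total = 0;
    for i in 0..data.len() {
        total += data[i]; // check provably redundant, usually removed
    }
    total
}

fn sum_iter(data: &[u64]) -> u64 {
    data.iter().sum() // no index, so no check to remove
}
```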


x86 seems reasonable to me:

Fast array bounds checking is straightforward, with the lower bound handled for free by unsigned integer math:

  cmp index,bound
  ja crash_handler

Overflow handling is mostly

   jo crash_handler
That's one or two instructions, and the jump is predicted essentially perfectly. I'd assume the cost is negligible compared with branch mispredicts and cache misses.
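
On the Rust side, this jo pattern is roughly what the standard checked arithmetic methods boil down to (the wrapper function below is a hypothetical sketch, not anything from the thread): the overflow flag set by the add is exactly what the conditional branch tests.

```rust
// checked_add returns None on overflow; expect() plays the role of the
// crash_handler above. On x86 this typically lowers to an add followed
// by a single conditional jump.
fn add_or_panic(a: i32, b: i32) -> i32 {
    a.checked_add(b).expect("integer overflow")
}
```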

I read somewhere that the main problem is LLVM not being quite capable of optimizing these decently: it has to handle all these extra jumps, which create lots of extra basic blocks.


Yeah, part of the unexplored optimization potential that Rust has is that LLVM was originally written with C in mind. Rust achieves great results with it, but sometimes it can't use the additional info that the stronger type system provides, either because nobody has written the code in LLVM to exploit it, or because LLVM has bugs that force the Rust devs to turn that additional info off.


A jump hinted as not-taken that doesn't populate the branch predictor might be useful; otherwise you can end up with a death-by-a-thousand-cuts situation. Superfluous branches thrash your branch predictor tables much as cache pressure thrashes the caches.


I seem to recall that JO, at least, has bad pipelining implications because of the dependency on the flags register. No idea about the bounds check (there might be a better way).


I don't see why JO would be any different from other Jcc, especially since JL is SF ≠ OF. Maybe you were thinking of the problem with INC/DEC and CF, but even then that's a problem of INC/DEC, not of Jcc.


It's not usually a big difference, but on modern Intel there is some difference in performance between the different CMP/Jcc options. The more common ones will "fuse" with the CMP and be executed as a single µop, but the rarer ones (like JO and JS) do not fuse with CMP, and thus can add a cycle of delay (plus the overhead of executing another µop). The optimization is called "macro-op fusion". Details here https://en.wikichip.org/wiki/macro-operation_fusion and here https://www.agner.org/optimize/microarchitecture.pdf (pages 108 and 125).


Fixing that wouldn't seem to require any architectural changes in x86 though, it's just that Intel hasn't cared enough about JO and friends to optimize them this way.


Unfortunately, fast array bounds checks and fast integer overflow checks are anathema to speculative execution so aren't happening any time soon.

I'm becoming convinced that we really need to just go back to the Alpha 21164 architecture and stamp out multiple copies with really fast interconnect.


Could you explain your reasoning? I would think that bounds and overflowing checking should have fantastic performance on modern processors. The almost-always-false test gets correctly predicted as false, speculative execution chooses the correct branch without delay, and a few cycles later (but without visible delay) some simple math confirms that the correct branch was chosen. In my mind, the hardware support is already there and almost perfect; it's the languages and compilers that are lagging. But maybe I'm missing something obvious. Where do you think the issue is?


> The almost-always-false test gets correctly predicted as false, speculative execution chooses the correct branch without delay, and a few cycles later (but without visible delay) some simple math confirms that the correct branch was chosen.

So, why don't compilers do this already? Nothing stops them. The answer is that they lose a ton of performance.

There is a lot of housekeeping needed to keep track of speculative execution, and it happens on every single arithmetic operation in your regime. And you pay that cost continually for an event that, as you point out, happens "very rarely".

Overflow/underflow probably isn't so bad, but array bounds checks are very bad. You have at least one bound that is variable, which causes an extra "compare to X" where X is a non-zero value that needs to be loaded. So your tight loops all look like "increment i; compare i to 0; compare i to X; do real work". That's two extra instructions per loop iteration that the speculative execution system has to keep track of, and it eats up speculative execution resources very quickly.

An in-order chip like the 21164 doesn't execute the loops any faster, but it doesn't pay the chip area for speculative execution. Given that everything is moving to heavily multi-core designs, it would be better to let a stuck core give way to another quickly than to speculate ever deeper--especially as SSDs continue to increase in speed--and especially because, in a battery-powered world, you would often rather shut down while waiting.

In addition, if you trap on these, it will break an enormous amount of software in existence. Look at how many people went absolutely ape over gcc suddenly doing aggressive optimizations on undefined behavior.

(And, by the way, every early microprocessor flagged overflow/underflow on arithmetic instructions (8080, Z80, 6800, etc.), so you need to ask why that went away.)


I had a bit of a hard time following your argument, but nkurz is right here in the sense that things like overflow and bounds checks are practically the poster child for things that perform well on wide OoO processors. They are leaves in the dependency graph and hence don't add to dependency chains, and on wide designs the ALU op(s) will often execute almost for free.

It is in-order designs that take a slightly larger relative penalty, since in some sense "all instructions matter" there, so a bounds check costs about as much as the surrounding operations that do real work.

You make a lot of valid observations about the cost of actually implementing a modern, deep and wide out-of-order architecture: it is much harder and uses more silicon than an in-order design. But once you have it (and that's largely what we have today in general-purpose CPUs), the kind of branches involved in overflow and bounds checks are not costly.


> ... So your tight loops all look like "increment I; compare I to 0; compare I to X; do real work" ...

Compilers are generally smart enough to hoist these comparisons out of the loop, at least in a static, AOT-compiled language like Rust.
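
As a sketch of that hoisting (the function below is a made-up example, not from the thread): a common idiom is to truncate both slices to a common length up front, so one explicit check replaces n per-element checks and the loop body can be proven in-bounds.

```rust
// One up-front length computation lets the compiler prove every in-loop
// index is valid, so the per-iteration bounds checks can be elided.
fn scale(dst: &mut [f32], src: &[f32], k: f32) {
    let n = src.len().min(dst.len());
    let (dst, src) = (&mut dst[..n], &src[..n]); // single checked slice op
    for i in 0..n {
        dst[i] = src[i] * k; // provably in bounds for both slices
    }
}
```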

You're right, though, that in-order chips are a lot more power-efficient, and the in-order approach is the one taken in most RISC-V implementations (an architecture highly comparable to the Alpha you mention).


In Rust you could probably get away with not speculating, and thus having "imprecise exceptions" for overflow and maybe array reads (if you don't care about side channels), since a panic essentially destroys all accessible non-Mutex-protected memory anyway, and Mutex-protected memory is made inaccessible via poisoning (as long as values don't get moved through things like library calls and atomic ops).

It would require a minor compatibility break, significant checking work and changes to unsafe code to make it work, but it should be doable.

Array writes do need precise checking though, since otherwise you can write to arbitrary memory.


Could the concept of "ownership" be optimized for somehow in hardware?


I'm not a Rust expert, but I think ownership and borrowing go away after compile-time.


Of course, but what pre-compile architectural decisions could be made to accommodate such a higher level construct?


The point of the borrow checker is that it's run at compile time rather than runtime. Putting something similar into the architecture necessarily implies a runtime check, which is similar to the idea of a smart pointer, which exists in Rust as well as other languages.


That’s correct.


> fast integer overflow checks

Rust's implicit overflow checks are only in debug mode I think. Overflow is defined to wrap in release mode.


Yes, due to the hit in performance. We'd love to turn them on unconditionally, and the language is spec'd so that we can someday, but since overflow is not a memory safety issue on its own, and there is a performance hit when checking, this is the current compromise.
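
For code that shouldn't depend on that debug/release default for plain `+`, the standard library lets you pick the overflow semantics explicitly per operation (the wrapper function here is just an illustrative sketch):

```rust
// Each method pins down one overflow behavior, independent of build mode.
fn u8_add_all_ways(x: u8, y: u8) -> (u8, Option<u8>, (u8, bool), u8) {
    (
        x.wrapping_add(y),    // two's-complement wrap, never panics
        x.checked_add(y),     // None on overflow
        x.overflowing_add(y), // wrapped value plus an overflow flag
        x.saturating_add(y),  // clamps at u8::MAX
    )
}
```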


I suspect Ada would like those too, if available.



