
The JVM never loses access to the bytecode, but Rust compiles down to machine code. The garbage collector and heap-oriented allocation without borrowing mean you don't have nearly as much "stack" to worry about, and copying references is always OK.

For a statically compiled language that produces machine code, prefers stack allocation, and places strict constraints on copying references, it is not so easy.



Bytecode is irrelevant, but yeah, as I wrote, internal pointers into the stack do make moving stacks around difficult.


Reusing the stack frame layout for a continuation state block is a really neat approach!

Unfortunately, internal pointers into the stack aren't really the reason Rust can't take this approach either. Async frames are "pinned" once they start running anyway, so the ecosystem is designed around not needing to move them. Async callee frames are, in a sense, contained in their callers' frames; this makes async recursion tricky, but in exchange each static call graph uses a single allocation.

The real issue is basically naasking's #1: not that Rust loses its IR or that it's platform-dependent, but that LLVM simply doesn't have the infrastructure to work with stack frames directly. A sufficiently motivated and funded team could add it, but Rust doesn't have that luxury.

In fact, this is also the reason that C++20 coroutines individually heap-allocate their frames: no compiler had the infrastructure, or was able to put in the work in time. There are also some thorny phase-ordering problems here: stack frame layout typically isn't determined until very, very late (to enable optimizations), but program-accessible types in both C++ and Rust need to know their sizes very, very early. Java can sidestep this by being slightly higher level.


Java sidesteps this because we generate concrete machine instructions and have control over the low-level platform-specific representation. In other words, because our backend is lower level, not higher level. But yeah, if you don't have access to the low level representation -- like Rust and Kotlin -- you don't really have a choice except to wait for your backend to give you that access. Sometimes waiting is a good approach, though. I'm not even sure you need control over the stack representation; it's sufficient to know what it is, which you could maybe do even on top of LLVM (but I don't know LLVM well enough to say for sure).


Yeah, exactly. I believe you're correct that you just need to know what the stack looks like, not really change it much if at all. Any tweaks you may need to make are probably already available for things like split stacks or green threads.

The problem is that LLVM doesn't even let you see the stack layout until very late, and then only as debug info. You could probably hack something on top, but it would likely wind up looking a lot like C++20 coroutines, with a lot of extra allocation, or like Rust's async, with all the stack-frame layout optimization duplicated in the frontend.


It isn't irrelevant for point #2 above yours, which is that the JVM could rewrite and run/JIT two versions of every function. The JVM monomorphizes plenty already, does it not?


It could compile two versions of every function on demand, but there's no need to (although it could be an implementation choice). AFAIK, every compiler in the world already compiles functions into state machines that are suitable for delimited continuations. What would the other version be for?


I'm not convinced they are suitable as-is in all compilers. Continuation capture can only be done at safe points, similar to GC. The compiler and runtime need tight integration, which isn't a property of every compiler in the world. LLVM took a long time to gain intrinsics for supporting GC, for example.


"Physical" (i.e. non-inlined) subroutine calls are normally such safepoints (although I probably exaggerated when I said "every compiler in the world"). They need to be, because a subroutine is a continuation: upon return, it must be able to restore its state, otherwise it wouldn't be a subroutine. If you're doing any IO, there must be such a physical call (and a safepoint) precisely where you need it -- at or before the actual IO.


It depends what you mean by "continuation capture".

If you just want to suspend some code, subroutine calls will do – but so will any other instruction, as long as you save the values of all registers. That's what the kernel does, after all. Using the subroutine call boundary lets you save a bit of work by saving only the callee-save registers, but it doesn't make that much difference. The problem is that you still need a stack, which prevents coroutines from being lightweight enough.

If you want to move the stack, like Go does... then just like with a GC, you need to track which registers and which stack offsets contain pointers at a given point in the program. That does require some sort of safepoint, which could be a subroutine call. But that's all irrelevant from Rust's perspective, because Rust's low-level pointer model is fundamentally incompatible with moving the stack. In Go, pointers into the stack are themselves only stored on the stack and in registers – never on the heap – so the runtime can easily find them all and adjust them. If pointers into the stack could be stored on the heap, it might still be possible to adjust them, albeit at higher cost, if you had a GC-managed heap where the runtime knows where all the pointers are. But Rust doesn't even have that. Tracking all the pointers in the heap is impossible for multiple reasons:

- Rust programs don't have to use any particular runtime-provided heap; they can ask the OS for memory directly and store pointers there.

- Pointers aren't necessarily stored in an easy-to-maintain way: e.g. Rust supports tagged unions, where the interpretation of one word in memory as either a pointer or integer depends on the value of another word.

Oh, and Rust lets you convert pointers to integers (useful for hashing purposes), so moving pointers behind a program's back would have visible side effects in any case.

So moving the stack is out. That leaves two possibilities:

- Segmented stacks;

- Calculating the required stack space statically.

LLVM can generate code with segmented stack support [1], and Rust used to enable that option. It's not that bad, but it does require every function to begin by checking whether there's enough space left for its stack, and call a runtime function (__morestack) if not. That adds a measurable overhead to function calls. It also requires using a custom thread-local storage slot to track the current stack base, which means you can only call the function on threads where the runtime set up that slot, not any random C thread. That's a problem if you're writing a library meant to be used by C code, which evolved into a major use case for Rust. For these reasons, Rust nixed segmented stack support sometime before 1.0. Re-enabling it is one hypothetical reason why you might want two versions of each function.

As for calculating the required stack space statically, that wouldn't require an alternate code generation mode. But as you were discussing in another subthread, the design of most compiler pipelines makes it very difficult to expose that information to the frontend. Even if that weren't a problem, merely making it theoretically possible to statically calculate stack usage requires imposing severe limits on the program. You can't make dynamic calls where you don't know the callee, and you can't use recursion at all (because then you might need infinite space). And if only a subset of code is async-compatible... then you end up with the equivalent of colored functions anyway. So even if LLVM's pipeline could be wrangled into doing what's necessary, it would only be a small improvement over the higher-level async/await implementation that Rust currently has. (As an alternative, you might be able to create a sort of hybrid mode where only dynamic and recursive calls use segmented stacks, but that would still require an alternate code generation mode.)

The nail in the coffin is that Rust also targets WebAssembly, where you can't do any of the nice low-level things that you can on real machines. This has also been a blocker for guaranteed tail call elimination. I hate it.

[1] https://llvm.org/docs/SegmentedStacks.html


> The problem is that you still need a stack, which prevents coroutines from being lightweight enough.

Async/await also needs the same logical stack, it just captures it frame by frame. Continuations can do the same. But as you say, Rust has some particular constraints that make this hard.

> then just like with a GC, you need to track which registers and which stack offsets contain pointers at a given point in the program

Well, you don't need to track all the pointers, just pointers into the stack.

> ... The nail in the coffin is that Rust also targets WebAssembly ...

But yeah, I mentioned some reasons why it could be hard for some languages -- internal stack pointers and lack of control over the backend are two of them.

But I find it problematic that backend problems force feature design today. If a language wants to be popular 20 years from now, it needs the patience not to bend to its backends' limitations, particularly as it has few competitors in its niche. So if a language's designers are absolutely certain that async/await is the right design for them for the next few decades, that's great; but if they decide to design a language for the next few decades around the limitations of WebAssembly in 2019, that's something I like less. Sometimes there's no choice and a language/feature has to ship, but this is a feature that the world can and has lived without.



