Did anyone see the limitations of this implementation?
Usually you only have a limited number of registers available, i.e. a limited number of local vars. The limit comes from the one-byte oparg, whose largest value is 255. So once you need more local vars than a single byte can index, you need a fallback to either an extended opcode or a stack opcode.
The compiler can optimize this by limiting the practical number of regs to match the CPU, but you still need a fallback, and I didn't see that mentioned in the paper.
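
To make the fallback concrete, here is a minimal sketch of the extended-opcode variant, assuming a two-byte instruction format (opcode, oparg). The opcode numbers and the names `LOAD_REG`, `EXTENDED_ARG`, `encode` and `decode` are made up for illustration and are not taken from the paper.

```python
# Hypothetical opcode numbers, chosen only for this illustration.
LOAD_REG = 0x01
EXTENDED_ARG = 0x90


def encode(opcode, reg):
    """Encode `opcode reg`, emitting one EXTENDED_ARG prefix per extra index byte."""
    if reg < 0:
        raise ValueError("register index must be non-negative")
    # Split the register index into bytes, most significant first.
    parts = reg.to_bytes(max(1, (reg.bit_length() + 7) // 8), "big")
    out = bytearray()
    for high in parts[:-1]:
        out += bytes([EXTENDED_ARG, high])
    out += bytes([opcode, parts[-1]])
    return bytes(out)


def decode(code):
    """Decode a byte string back into (opcode, register index) pairs."""
    result = []
    ext = 0  # high bits accumulated from preceding EXTENDED_ARG prefixes
    for i in range(0, len(code), 2):
        op, arg = code[i], code[i + 1]
        if op == EXTENDED_ARG:
            ext = (ext << 8) | arg
        else:
            result.append((op, (ext << 8) | arg))
            ext = 0
    return result
```

With this scheme `encode(LOAD_REG, 7)` stays a single two-byte instruction, while `encode(LOAD_REG, 300)` grows to four bytes, so the common case pays nothing and only functions with very many locals pay for the prefix.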