If you focus on just the matmuls, no CUDA, no architectures, no infinibands, eve...

If you focus on just the matmuls, no CUDA, no architectures, no infinibands, everything-on-a-chip - put input tokens in input registers, get output tokens from output registers from a model that's baked into gates - you should be able to save some power. Not sure if 10x or 2x or 100x, but certainly there are gains to be had.