Similar or greater inference wins are already achieved with speculative decoding, which is widely used. So while this is really interesting (and was tried before with less success, AFAIK), it's not yet clear how impactful it would be.
I don’t see where similar wins have ever been achieved.
Speculative decoding can reduce latency, but at the cost of using a lot more compute. The amazing thing here is that both latency and global throughput improvements would be realized, because of an increase in efficiency.
From what I understand, speculative decoding can also make it harder to maintain overall output quality.
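For anyone unfamiliar with the mechanism being discussed: a toy sketch of the draft-then-verify loop at the heart of speculative decoding, using a simplified greedy-verification variant. The `draft_next` and `target_next` functions below are hypothetical stand-ins for a small draft model and a large target model; real implementations verify the drafted tokens probabilistically and in a single batched forward pass, which is where the extra compute comes from.

```python
def draft_next(ctx):
    # Hypothetical cheap "draft model": a deterministic toy rule.
    return (sum(ctx) * 3 + 1) % 10

def target_next(ctx):
    # Hypothetical expensive "target model": usually agrees with the draft,
    # occasionally disagrees (forcing a rejection).
    t = (sum(ctx) * 3 + 1) % 10
    return t if sum(ctx) % 4 else (t + 1) % 10

def speculative_decode(ctx, steps, k=4):
    """Generate `steps` tokens; with greedy verification the output is
    identical to decoding with the target model alone."""
    out = list(ctx)
    while len(out) - len(ctx) < steps:
        # 1) Draft k tokens cheaply, one at a time.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(out + proposal))
        # 2) Verify them against the target (notionally one parallel pass);
        #    keep the longest prefix the target agrees with.
        accepted = 0
        for i in range(k):
            if target_next(out + proposal[:i]) == proposal[i]:
                accepted += 1
            else:
                break
        out += proposal[:accepted]
        # 3) The target emits one token itself (the correction on a
        #    mismatch), so every target pass yields at least one token.
        out.append(target_next(out))
    return out[len(ctx):len(ctx) + steps]

print(speculative_decode([1, 2], 8))
```

The latency win comes from step 2: when the draft model guesses well, one target-model pass validates several tokens at once. The compute cost mentioned above is also visible here: rejected draft tokens (and the draft model's own work) are thrown away.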