Yes, this seems like a hugely important result for LLM inference performance.
I'm not aware of any other paper that has claimed to improve LLM inference performance to this degree. Has there ever been one before?
At least while also:
- Maintaining output quality. The benchmarks used were somewhat narrow but so far so good.
- Improving not just query latency but also global throughput
- Not requiring more compute
- Having a relatively practical implementation, without adding major complexity or new engineering challenges
You could argue the insight is incremental, since it builds on prior work on parallel/Jacobi decoding. Those earlier results were necessary and important, but this may be the one that finally extracts real-world value from the promise of parallel decoding.
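For anyone unfamiliar with what parallel/Jacobi decoding means, here's a rough, self-contained sketch of the fixed-point idea this line of work builds on. It uses a toy stand-in for the model; all names and the toy rule are hypothetical illustrations, not the paper's actual method or code:

```python
# Minimal sketch of Jacobi (fixed-point) parallel decoding with a toy model.

def toy_next_token(prefix):
    # Stand-in for greedy argmax over a real LLM: previous token + 1, capped at 9.
    # A real implementation would batch these into one forward pass per iteration.
    return min(prefix[-1] + 1, 9)

def greedy_decode(prompt, n):
    # Baseline: standard sequential autoregressive decoding (n model calls).
    seq = list(prompt)
    for _ in range(n):
        seq.append(toy_next_token(seq))
    return seq[len(prompt):]

def jacobi_decode(prompt, n, init_token=9):
    # Guess a whole block of n tokens, then refine all positions in parallel:
    # each pass recomputes token i from the previous pass's guesses at positions < i.
    # A fixed point of this update is exactly the greedy output, and it is
    # reached in at most n passes (position i is guaranteed correct after pass i + 1).
    guess = [init_token] * n
    passes = 0
    while True:
        new_guess = [toy_next_token(list(prompt) + guess[:i]) for i in range(n)]
        passes += 1
        if new_guess == guess or passes >= n:
            return new_guess, passes
        guess = new_guess

if __name__ == "__main__":
    prompt, n = [3, 7], 8
    out, passes = jacobi_decode(prompt, n)
    assert out == greedy_decode(prompt, n)
    print(f"{passes} parallel passes instead of {n} sequential steps -> {out}")
```

In this toy run the block converges in 2 parallel passes instead of 8 sequential steps, because most of the guessed tokens turn out to already be right. That's the promise: same output as greedy decoding, fewer sequential model calls when the continuation is easy to predict.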