I spent a couple of weeks trying out local inference solutions for a project and wrote up my thoughts, with some performance benchmarks, in a blog post.
TLDR -- What these frameworks can do on off-the-shelf laptops is astounding. However, it is very difficult to find and deploy a task-specific model, and the models themselves (even with quantization) are so large that downloading them would kill the UX for most applications.
There are ways to improve the performance of local LLMs with inference-time techniques. You can try optillm (https://github.com/codelion/optillm); by doing more work at inference time, it is possible to match the performance of larger models on narrow tasks.
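For anyone curious what that looks like in practice, here is a minimal sketch of calling a local model through an optillm-style proxy with the standard OpenAI Python client. The localhost URL, the placeholder API key, and the "moa-" technique prefix are assumptions on my part -- check the repo's README for the actual defaults and supported prefixes.

```python
# Minimal sketch: optillm exposes an OpenAI-compatible endpoint, so the
# standard OpenAI client can be pointed at it. The port and the "moa-"
# (mixture-of-agents) model prefix below are assumptions, not confirmed
# defaults -- see the optillm README for the real values.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed proxy address
    api_key="optillm",                    # placeholder; the proxy forwards to your backend
)

response = client.chat.completions.create(
    # Prefixing the model name is how a technique would be selected here;
    # "moa-" is used as an illustrative example of an inference-time method.
    model="moa-gpt-4o-mini",
    messages=[
        {"role": "user", "content": "Classify the sentiment of: 'The battery life is disappointing.'"}
    ],
)
print(response.choices[0].message.content)
```

The point of the proxy approach is that the extra inference-time work (sampling multiple candidates, aggregating, reflecting) happens behind the same API your app already uses, so a small local model can be pushed harder on a narrow task without changing application code.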