It might be 5 to 10 times slower than a hosted provider, but that doesn't really matter when the output is still faster than a person can read. Context-wise, for troubleshooting I have never needed over 16k, and on the rare occasion when I need to summarise a very large document I can swap to a smaller model and get a huge context window. I have never needed more than 32k though.
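For anyone curious, here's roughly what that tradeoff looks like as a minimal sketch, assuming llama-cpp-python (the stack, model files, and quant choices here are my guesses, not what the commenter actually runs):

```python
# Minimal sketch, assuming llama-cpp-python; model paths are hypothetical.
from llama_cpp import Llama

# Day-to-day troubleshooting: a bigger model with a modest 16k context.
llm = Llama(
    model_path="models/qwen2.5-32b-instruct-q4_k_m.gguf",  # hypothetical file
    n_ctx=16384,
)

# Rare big-document summarisation: reload a smaller model instead, which
# leaves enough memory free for a much larger context window (32k here).
llm = Llama(
    model_path="models/mistral-7b-instruct-q4_k_m.gguf",  # hypothetical file
    n_ctx=32768,
)
```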
Dude, he's running locally, and I think this setup is the best bang for the buck if you wanna run locally. We're not comparing to data centers; you gotta keep it in perspective. Those are very impressive results for running local. Thanks for the numbers, you saved me a ChatGPT search :)
CPU-only is really terrible bang for your buck, and I wish people would stop pushing these impractical builds on people genuinely curious about local AI.
The KV cache won't soften the blow the first time they paste a code sample into a chat and end up waiting 10 minutes, with absolutely no interactivity, before they even get the first token.
You'll get an infinitely more useful build out of a single 3090 and sticking to models like Gemma 27B than you will out of trying to run Deepseek on a CPU-only build. Even a GH200 struggles to run Deepseek at realistic speeds with bs=1, and that has an entire H100 attached to the CPU: there just isn't a magic way to get "affordable, fast, effective" AI out of a CPU-offloaded model right now.
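To make the 3090 suggestion concrete, here's a rough sketch, again assuming llama-cpp-python (the model filename and context size are my assumptions): a ~4-bit quant of Gemma 27B fits in the 3090's 24 GB with every layer on the GPU, which is what makes it interactive.

```python
# Sketch of the suggested alternative, assuming llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="models/gemma-2-27b-it-q4_k_m.gguf",  # hypothetical file
    n_gpu_layers=-1,  # offload all layers to the GPU; no CPU offload
    n_ctx=8192,
)

reply = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Why is my systemd unit failing?"}],
)
print(reply["choices"][0]["message"]["content"])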
And that's fine, but the average person asking is already willing to give up some raw intelligence by going local, and would not expect the kind of abysmal performance you're likely getting after you've described it as "fast".
I set up Deepseek at bs=1 on a $41,000 GH200 and got double-digit prompt-processing speeds (~50 tk/s): you're definitely getting worse performance than the GH200 was, and even that is already unacceptable for most users.
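Back-of-envelope, using the ~50 tk/s figure I measured and an assumed ~10 tk/s for a CPU-only build (my guess, not a measured number), here's what time-to-first-token looks like once a prompt has real context in it:

```python
# Time-to-first-token is roughly prompt_tokens / prompt_processing_speed.
prompt_tokens = 6_000  # e.g. a pasted code sample plus chat history

for label, pp_speed in [("GH200 (~50 tk/s, measured)", 50),
                        ("CPU-only (~10 tk/s, assumed)", 10)]:
    ttft_s = prompt_tokens / pp_speed  # seconds before the first output token
    print(f"{label}: {ttft_s / 60:.1f} min to first token")

# GH200 (~50 tk/s, measured): 2.0 min to first token
# CPU-only (~10 tk/s, assumed): 10.0 min to first token
```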
They'd be much better served spending less money than you had to and getting a genuinely interactive experience, instead of sending off prompts and waiting several minutes for a reply the moment the query involves any real context.