With the recent release of Qwen3-Omni I've decided to put together my first local machine. As much as I'd like to just pick up a Beelink and flash it with Omarchy, I think I want a bit more horsepower.
However, the internet seems littered with "clever" local AI monstrosities that gang together 4-6 ancient NVIDIA GPUs (priced today like overpriced e-waste) only to get lackluster performance out of piles of M60s and P100s. In 2025 it seems like a waste, or just bad advice, to build on hardware that old.
Curious whether this find is a good source of info on staying away from Intel and AMD GPUs for local inference? I might do some training eventually, but right now I'm more interested in light RAG and maybe some local coding.
Hoping to build something before the holiday season to keep my office warm with GPUs :).
It's a transparent proxy that automatically launches your selected model with your preferred inference server, so you don't need to manually start/stop the server every time you want to switch models.
So, say I've configured Roo Code to use qwen3 30ba3b as the orchestrator and glm4.5 air as the coder: Roo Code calls the proxy with model "qwen3" in orchestrator mode, and when it switches to coder mode the proxy kills the llama.cpp instance serving qwen3 and restarts it with "glm4.5air".
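For anyone curious what that looks like under the hood, here's a rough Python sketch of the idea. The model names, GGUF paths, port, and sleep-based startup wait are all placeholders I made up for illustration; real swap proxies add health checks, streaming, and proper timeouts.

```python
# Minimal sketch of a model-swapping proxy (placeholder model names/paths/ports).
# On each request it checks the "model" field, restarts llama.cpp's llama-server
# with the matching weights if needed, and forwards the request upstream.
import json
import subprocess
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Map the model name clients send to the command that serves it (hypothetical files).
MODELS = {
    "qwen3":     ["llama-server", "-m", "qwen3-30b-a3b.gguf", "--port", "9000"],
    "glm4.5air": ["llama-server", "-m", "glm-4.5-air.gguf",   "--port", "9000"],
}
UPSTREAM = "http://127.0.0.1:9000"

current = {"name": None, "proc": None}

def ensure_model(name: str) -> None:
    """Kill the running llama-server (if any) and start the requested model."""
    if current["name"] == name:
        return
    if current["proc"] is not None:
        current["proc"].terminate()
        current["proc"].wait()
    current["proc"] = subprocess.Popen(MODELS[name])
    current["name"] = name
    time.sleep(10)  # crude: wait for weights to load; real proxies poll /health

class Proxy(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        ensure_model(json.loads(body).get("model", ""))
        # Forward the original request body to whichever backend is now running.
        req = urllib.request.Request(UPSTREAM + self.path, data=body,
                                     headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            self.send_response(resp.status)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(resp.read())

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), Proxy).serve_forever()
```

The point is just that the proxy keys off the "model" field in each request, so the client (Roo Code in this case) never needs to know which backend is actually loaded at any given moment.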
Hi, I'm looking to transition from renting GPUs from RunPod to hosting some models locally - specifically Qwen2.5 and some lightweight VLMs like Moondream. The RTX 3060 12GB looks like a relatively good option, but I don't have much experience with PC hardware, let alone used hardware.
Curious if anyone here runs a similar config of 1-4 RTX 3060s? Trying to decide whether picking up a few of these is good value, or if I should just keep renting cloud GPUs.
Have you set up hardware/software to run models locally before? Not that it's incredibly hard, but the cloud GPU providers usually simplify a lot of the steps.
Going from one card running a local model to multiple cards is a big jump that I'd wait to tackle until I was very comfortable with one. That might mean starting with something that has more VRAM than the 3060.
Starting with one card won't suit everyone's workload, but you're already biting off a lot compared to renting a cloud GPU, and getting a local setup going for the first time can take a lot of time.
I've been toying with the idea of building a dedicated machine to continue working on my FPV flight control model.
So far I've just been renting GPU time, but it's getting expensive. I also occasionally use qwen3-coder in my homelab, but my headless "inference" machine is quite limited with an RTX 3080.
Curious whether this build from llamabuilds.ai looks good, GPU-wise and otherwise? I may buy more GPUs in the future - hard to tell if stacking RTX 6000 Adas is worthwhile!
I think this is going to change really fast - and not in favor of AI companies or investors.
Initially, ChatGPT was kind of a cool toy that felt like something new but not really a threat. Now, even people who aren't engineers see how this threatens to destroy the dignity and financial stability of millions of people.
I like AI - but at the same time I hope our country picks people over corporate profits... unfortunately I don't see that as likely.
Thanks!