
I run my desktop environment on the iGPU and the AI stuff on the dGPUs.


That's a real good point!

Unfortunately, my CPU (5900X) doesn't have an iGPU.

Over the last five years iGPUs fell a bit out of fashion. Now maybe they make a lot of sense again, since there's a clear use case that keeps a dedicated GPU permanently busy and isn't gaming (and gaming is different, because you don't often multi-task while gaming).

I do expect to see a surge in iGPU popularity, or maybe a software improvement to allow having a model always available without constantly hogging the VRAM.


PS: I thought Ollama had a way to use RAM instead of VRAM (?) to keep the model active when not in use, but in my experience that didn't solve the problem.


I built a similar system, but I've since sold one of the RTX 3090s. Local inference is fun and feels liberating, but it's also slow, and once I was used to the immense power of the giant hosted models, the fun quickly disappeared.

I've kept a single GPU to still be able to play a bit with light local models, but not anymore for serious use.


I have a similar setup as the author with 2x 3090s.

The issue is not that it's slow. 20-30 tk/s is perfectly acceptable to me.

The issue is that the quality of the models that I'm able to self-host pales in comparison to that of SOTA hosted models. They hallucinate more, don't follow prompts as well, and simply generate overall worse quality content. These are issues that plague all "AI" models, but they are particularly evident on open weights ones. Maybe this is less noticeable on behemoth 100B+ parameter models, but to run those I would need to invest much more into this hobby than I'm willing to do.

I still run inference locally for simple one-off tasks. But for anything more sophisticated, hosted models are unfortunately required.


On my 2x 3090s I am running GLM 4.5 Air Q1 at ~300 tk/s prompt processing and 20-30 tk/s generation. It works pretty well with Roo Code in VS Code, rarely misses tool calls, and produces decent quality code.

I also tried to use it with Claude Code via claude code router, and it's pretty fast. Roo Code uses bigger contexts, so it's generally slower than Claude Code, but I like the workflow better.

this is my snippet for llama-swap

```
models:
  "glm45-air":
    healthCheckTimeout: 300
    cmd: |
      llama.cpp/build/bin/llama-server
      -hf unsloth/GLM-4.5-Air-GGUF:IQ1_M
      --split-mode layer --tensor-split 0.48,0.52
      --flash-attn on -c 82000 --ubatch-size 512
      --cache-type-k q4_1 --cache-type-v q4_1
      -ngl 99 --threads -1 --port ${PORT} --host 0.0.0.0 --no-mmap
      -hfd mradermacher/GLM-4.5-DRAFT-0.6B-v3.0-i1-GGUF:Q6_K
      -ngld 99 --kv-unified
```


Thanks, but I find it hard to believe that a Q1 model would produce decent results.

I see that the Q2 version is around 42GB, which might be doable on 2x 3090s, even if some of it spills over to CPU/RAM. Have you tried Q2?


well, I tried it and it works for me. llm output is hard to properly evaluate without actually using it.

I read a lot of good comments on r/localllama, with most people suggesting qwen3 coder 30ba3b, but I never got it to work as well as GLM 4.5 air Q1.

As for Q2, it will fit in VRAM only with a very small context; otherwise it spills over to RAM, with quite an impact on speed depending on your setup. I have slow DDR4 RAM, so Q1 has been a good compromise for me, but YMMV.


What is llama-swap?

Been looking for more details about software configs on https://llamabuilds.ai


https://github.com/mostlygeek/llama-swap

it's a transparent proxy that automatically launches your selected model with your preferred inference server so that you don't need to manually start/stop the server when you want to switch model

so, let's say I have configured roo code to use qwen3 30ba3b as the orchestrator and glm4.5 air as the coder: roo code calls the proxy with model "qwen3" in orchestrator mode, and when it switches to coder mode llama-swap kills the qwen3 llama.cpp instance and restarts it with "glm4.5air"
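
A minimal two-model config would look roughly like this (repos and flags are just examples, trimmed down from the snippet above; tune quants and context for your own cards):

    # sketch: two models behind one proxy, llama-swap swaps them on demand
    models:
      "qwen3":
        cmd: |
          llama.cpp/build/bin/llama-server
          -hf unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:Q4_K_M
          -ngl 99 --port ${PORT}
      "glm45-air":
        cmd: |
          llama.cpp/build/bin/llama-server
          -hf unsloth/GLM-4.5-Air-GGUF:IQ1_M
          -ngl 99 --port ${PORT}

The swap kills the running llama-server and starts the other one, so switching models costs a model load each time.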


> behemoth 100B+ parameter models, but to run those I would need to invest much more into this hobby than I'm willing to do.

Have you tried newer MoE models with llama.cpp's recent '--n-cpu-moe' option to offload MoE layers to the CPU? I can run gpt-oss-120b (5.1B active) on my 4080 and get a usable ~20 tk/s. Had to upgrade my system RAM, but that's easier. https://github.com/ggml-org/llama.cpp/discussions/15396 has a bit on getting that running
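
The invocation ends up looking roughly like this (the repo name and layer count are placeholders; the linked discussion has tuned values for different cards):

    # rough example, not a drop-in command: repo and --n-cpu-moe value are
    # placeholders; --n-cpu-moe N keeps the MoE expert weights of the first
    # N layers in system RAM while -ngl 99 puts everything else on the GPU
    llama-server \
      -hf ggml-org/gpt-oss-120b-GGUF \
      -ngl 99 \
      --n-cpu-moe 30 \
      -c 16384

The idea is that attention and the shared weights stay on the GPU while most of the expert weights live in system RAM, which is why it stays usable.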


I use Ollama, which offloads to the CPU automatically IIRC. IME the performance drops dramatically when that happens, and it hogs the CPU, making the system unresponsive for other tasks, so I try to avoid it.


I don't believe that's the same thing. That should be the generic offloading that Ollama will do for any model that's too big, while this feature requires MoE models. https://github.com/ollama/ollama/issues/11772 is the feature request for similar on ollama.

One comment in that thread mentions getting almost 30tk/s from gpt-oss-120b on a 3090 with llama.cpp compared to 8tk/s with ollama.

This feature is limited to MoE models, but those seem to be gaining traction with gpt-oss, glm-4.5, and qwen3


Ah, I was not aware of that, thanks. I'll give it a try.


> 20-30 tk/s

or ~2.2M tk/day (25 tk/s × 86,400 s/day ≈ 2.16M tokens). This is how we should be thinking about it imho.


Is it? If you're the only user then you care about latency more than throughput.


Not if you have a queue of work that isn't a high priority, like edge compute to review changes in security cam footage or prepare my next day's tasks (calendar, commitments, needs, etc)


If you have a 24 GB 3090, try out qwen:30b-a3b-instruct-2507-q4_K_M (Ollama).

It's pretty good.


tbf I also run that on a 16 GB 5070 Ti at 25 tk/s; it's amazing how fast it runs on consumer-grade hardware. I think you could push up to a bigger model, but I don't know enough about local llama.


Don't need a 3090, it runs really fast on an RTX 2080 too.


Graphics cards are so expensive (at list price) that they're effectively cheap to own (no depreciation, liquid resale market).


Did you really claim GPUs have zero depreciation? That’s obviously false.


In case someone's too lazy to enter the address in Google maps, here you go: https://maps.app.goo.gl/oZ5c8aqH1uJ35VaD8


That house is in Belgravia, which is one of the wealthiest and most exclusive districts in London: some of the most expensive real estate in the world, even at that time.


I had only heard of Belgravia from watching the Sherlock Holmes TV episode, https://en.wikipedia.org/wiki/A_Scandal_in_Belgravia It looked like a fancy area from that.


That was the vibe I was getting when visiting the site; they seem to understand fear pretty well. Stay away for your own mental health :-)


Brave also exhibits this behavior, turning off shields fixes the issue.



I have more issues with self-control on my computer than on my smartphone. Anyone got any tips for that one?


Are you using a desktop or laptop? I used to carry my laptop around with me to use whenever wherever as designed, and found myself always using it. Setting aside a dedicated desk space where I limit the use of the laptop made it much easier to just think of it as desktop and less mobile. Now, using it feels like I'm at work, and it is much easier to walk away from it.

Finding other things to do when bored instead of opening a browser is key. You're going to fill the time with something, so you have to find the something else.


There was K9 Web Protection, but Symantec discontinued it in 2019. It was perfect because after setting up a password you had to wait one week to unblock things again :)

You can try LeechBlock. It works as a plugin in all browsers.

The first thirty seconds are the worst for willpower :)

So it is better to ask a relative/friend/parent/spouse to set up a password for you - then you cannot unblock the sites back again without them.


I tried LeechBlock and others, but even when I enable the most extreme options, I just right-click and remove the extension, bypassing everything.


What worked for me: working in person with others.

I found that it's much harder for me to procrastinate on my laptop when I am working with peers. The repeated focus time on the laptop during work hours 'conditioned' me to use it for work more.


But how much time are you losing to a commute?

I kind of agree with you in a way as I ultimately think that working remote is a bit harder on social health and maybe even physical health of getting out of the house, but in another way I just don't know if I can go back to all the negatives of the office.

I mean, my toilet at home washes my ass with gentle warm water. The work toilet randomly decides to splash toilet water on me with the violent "automatic" flusher after I'm done wiping myself with transparent sandpaper.


You don’t need the other person to be in the same room - a video call works just fine. In fact, it can be even better for productivity since there's less chit-chat.


It is my anecdotal experience that a whole bunch of my current friends are from a pre-pandemic in-office job and I’ve made zero lasting friendships at my remote jobs.


Write your own hosts file and block all websites you don’t want to visit.
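
Something like this (on Linux/macOS it's /etc/hosts; pick your own list of offenders):

    # send distracting sites nowhere (needs admin/root to edit)
    0.0.0.0 reddit.com
    0.0.0.0 www.reddit.com
    0.0.0.0 twitter.com
    0.0.0.0 www.youtube.com

Needing elevated rights to edit the file also adds a little friction to undoing it.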


Switch to a desktop!


pomodoro


That green checkmark ... what application is this?


Migadu. The tooltip hovering over it shows:

    dkim=pass header.d=smtp.mailtrap.live header.s=rwmt1 header.b=Wrv0sR0r


Check marks in email clients usually mean DKIM or other domain verification passed. The attacker genuinely owns npmjs.help, so a checkmark is appropriate.


If a lot of very smart people didn’t find a single example in all the years knot theory has existed, it obviously is not that obvious.


That is not necessarily true. Knot theory is quite niche; maybe nobody had tried brute-forcing counterexamples before.


We have huge amounts of data about knots from protein folding. Given that the proof is a counterexample, if it were easy, I feel it should already have been observed in that data.


It's not necessarily true. But it's pretty likely. It's worth considering as a possibility.


Google is a corporation maximizing shareholder value. That this goal is not aligned with serving the greater good and freedom should come as no surprise.

