Hacker News | bick_nyers's comments

The general rule of thumb when assessing MoE <-> dense model intelligence is SQRT(Total_Params * Active_Params). For Deepseek, you end up with ~158B params. The economics of batch inferencing a dense ~158B model at scale are different from those of an MoE like Deepseek (the dense model needs ~4x the FLOPS per token, after all), particularly if users care about latency.
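For concreteness, a minimal sketch of that rule of thumb plugged into Deepseek's published parameter counts (671B total, 37B active per token):

    import math

    total_params = 671e9   # DeepSeek V3/R1 total parameters
    active_params = 37e9   # parameters activated per token

    # Rule-of-thumb dense-equivalent size for an MoE: sqrt(total * active)
    dense_equiv = math.sqrt(total_params * active_params)
    print(f"Dense equivalent: ~{dense_equiv / 1e9:.0f}B params")            # ~158B
    print(f"FLOPS vs. the MoE: ~{dense_equiv / active_params:.1f}x/token")  # ~4.3x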


There's still a lot of opportunity for software optimizations here. The trouble is that really only two classes of systems get optimizations for Deepseek: a single small GPU plus a lot of system RAM (ktransformers), and the system that has all the VRAM in the world.

A system with, say, 192GB of VRAM and the rest standard memory (DGX Station, 2x RTX Pro 6000, 4x B60 Dual, etc.) could still in theory run Deepseek at 4-bit quite quickly because of the power-law-like usage of the experts.

If you aren't prompting Deepseek in Chinese, a lot of the experts don't activate.

This would be an easier job for pruning, but I still think enthusiast systems are going to trend over the next couple of years in a direction that makes these kinds of software optimizations useful on a much larger scale.
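A rough sketch of how you might decide which experts deserve VRAM residency (or survive pruning); the routing_log format and the 95% coverage cutoff here are hypothetical, not taken from any existing tool:

    from collections import Counter

    def hot_experts(routing_log, coverage=0.95):
        """routing_log: list of (layer, expert) pairs logged from the MoE
        router over a representative (e.g. English-only) prompt set."""
        counts = Counter(routing_log)
        total = sum(counts.values())
        hot, cumulative = [], 0.0
        for key, n in counts.most_common():   # most-used experts first
            hot.append(key)
            cumulative += n / total
            if cumulative >= coverage:        # experts covering 95% of routing decisions
                break
        return hot   # keep these in VRAM; the long tail goes to RAM or gets pruned/merged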

There's a user on Reddit with a 16x 3090 system (PCIe 3.0 x4 interconnect, which doesn't seem to be saturated during tensor parallelism) that gets 7 tokens/s in llama.cpp. A single 3090 has enough VRAM bandwidth to scan over its entire 24GB of memory 39 times per second, so something else is limiting performance.
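Back-of-envelope, the aggregate memory bandwidth of that setup sits far above what 7 tokens/s implies; a sketch that ignores interconnect latency, compute, and expert load imbalance:

    bandwidth_gb_s = 936          # single RTX 3090 VRAM bandwidth
    print(f"Full scans of 24GB per second: {bandwidth_gb_s / 24:.0f}")   # ~39

    active_params = 37e9          # DeepSeek activated params per token
    bytes_per_param = 0.5         # ~4-bit quantization
    bytes_per_token = active_params * bytes_per_param

    aggregate = 16 * bandwidth_gb_s * 1e9   # ideal scaling across 16 GPUs
    print(f"Bandwidth-only ceiling: ~{aggregate / bytes_per_token:.0f} tokens/s")  # vs. the observed 7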


> 16x 3090 system

That's about 5 kW of power.

> that gets 7 token/s in llama.cpp

Just looking at the electricity bill, it's cheaper to use the API of any major provider.

> If you aren't prompting Deepseek in Chinese, a lot of the experts don't activate.

That's interesting. It means the model could be cut down, with those tokens routed to the next-closest expert in case they ever do come up.


Or merge the bottom 1/8 (or whatever) experts together and (optionally) do some minimal training with all other weights frozen. Would need to modify the MoE routers slightly to map old -> new expert indices so you don't need to retrain the routers.
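A minimal PyTorch sketch of that index-remapping idea; old_to_new and the merged-expert handling are hypothetical illustrations, not Deepseek's actual router code:

    import torch

    # Hypothetical table built during the merge: old expert id -> surviving id.
    # Here expert 7 has been merged into expert 6.
    old_to_new = torch.tensor([0, 1, 2, 3, 4, 5, 6, 6])

    def route(router_logits, k=2):
        scores, idx = torch.topk(router_logits, k, dim=-1)   # usual top-k routing
        return scores, old_to_new[idx]                       # translate to post-merge ids

    # Caveat: if two old ids in the top-k map to the same merged expert,
    # their gate weights should be summed (or the duplicate dropped).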


A single MI300x has 192GB of vram.


The sad reality is that the MI300X isn't a monolithic die, so the chiplets have internal bandwidth limitations (of course less severe than going over PCIe/NVLink).

In AMD's own parlance, the "Modular Chiplet Platform" presents itself either in the single, I-don't-care-about-speed-or-latency "Single Partition X-celerator" mode or in the multiple, I-actually-totally-do-care-about-speed-and-latency, NUMA-like "Core Partitioned X-celerator" mode.

So you kinda still need to care what-loads-where.


I have never heard of a GPU where a deep understanding of how memory is managed was not critical towards getting the best performance.


I've been using PyCharm for the debugger (and everything else) and VSCode + RooCode + Local LLM lately.

I've heard decent things about the Windsurf extension in PyCharm, but not being able to use a local LLM is an absolute non-starter for me.


MoE inference wouldn't be terrible. That being said, there's not a good MoE model in the 70-160B range as far as I'm aware.


About $12k when Project Digits comes out.


Apple is shipping today. No future promises.


That will only have 128GB of unified memory


128GB for $3k; per the announcement, their ConnectX networking allows two Project Digits devices to be plugged into each other and work together as one device, giving you 256GB for $6k. AFAIK, existing frameworks can split models across devices as well, hence, presumably, the upthread suggestion that Project Digits would provide 512GB for $12k, though arguably that last step is cheating.


The reason Nvidia only talks about two machines over the network is, I think, that each machine only has one network port, so you need to add the cost of a switch.


It clearly has two ports. Just look at the right side of the picture:

https://www.storagereview.com/wp-content/uploads/2025/01/Sto...

You will however get half of the bandwidth and a lot more latency if you have to go through multiple systems.


If you want to split tensorwise, yes. Layerwise splits could go over Ethernet.

I would be interested to see how feasible hybrid approaches would be, e.g. connect each pair up directly via ConnectX and then connect the sets together via Ethernet.
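As a rough sketch of why layerwise (pipeline) splits tolerate Ethernet: only the hidden state crosses a stage boundary per token, not weights or per-layer all-reduces. The hidden size below is from Deepseek V3's published config; the link speed is an assumption, and latency rather than bandwidth would be the real concern:

    hidden_size = 7168            # DeepSeek V3 hidden dimension
    bytes_per_value = 2           # bf16 activations
    per_token = hidden_size * bytes_per_value          # ~14 KB per boundary crossing

    link_bytes_s = 10e9 / 8       # assume a 10GbE link
    print(f"{per_token / 1024:.0f} KB/token -> link alone could carry "
          f"~{link_bytes_s / per_token:,.0f} tokens/s")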


Just to add onto this point: you expect different experts to be activated for every token, so not having all of the weights in fast memory can still be quite slow, since you may need to load/unload expert weights on every token.
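Rough numbers for what that per-token loading costs if the routed experts aren't resident; the per-expert size and layer count are approximations derived from Deepseek V3's published config, and the PCIe figure is an assumption:

    experts_per_token = 8          # routed experts per token per MoE layer
    expert_params = 44e6           # ~3 * 7168 * 2048 weights per routed expert
    moe_layers = 58                # 61 layers, first 3 dense
    bytes_per_param = 0.5          # 4-bit weights

    worst_case = experts_per_token * expert_params * bytes_per_param * moe_layers
    pcie_bytes_s = 32e9            # roughly PCIe 4.0 x16
    print(f"~{worst_case / 1e9:.0f} GB per token -> "
          f"~{worst_case / pcie_bytes_s * 1e3:.0f} ms of PCIe traffic alone")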


Probably better to be moving things from fast memory to faster memory than from slow disk to fast memory.


It's not really possible to say what's "best" because the criteria are super subjective.

I personally like the Spline family, and I default to Spline36 for both upscaling and downscaling in ffmpeg. Most people can't tell the difference between Spline36 and Lanczos3. If you want more sharpness, go for Spline64, for less sharpness, try Spline16.

Edit: As far as I'm aware, though, OpenCV doesn't have spline as an option for resizing.
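For reference, a hedged OpenCV sketch: with no spline resampler available, INTER_LANCZOS4 is the closest built-in for sharp results, and INTER_AREA is the usual pick for heavy downscaling (filenames are placeholders):

    import cv2

    img = cv2.imread("input.png")

    # No spline option in OpenCV; Lanczos is the nearest equivalent in sharpness.
    up   = cv2.resize(img, None, fx=2.0, fy=2.0, interpolation=cv2.INTER_LANCZOS4)
    down = cv2.resize(img, None, fx=0.5, fy=0.5, interpolation=cv2.INTER_AREA)

    cv2.imwrite("up.png", up)
    cv2.imwrite("down.png", down)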


It would not be that slow, as it is an MoE model with 37b activated parameters.

Still, 8x 3090 gives you ~2.25 bits per weight, which is not a healthy quantization. Doing PCIe bifurcation to get up to 16x 3090 would be necessary for lightning-fast inference with 4-bit quants.
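The arithmetic behind those numbers, as a sketch that ignores KV cache and runtime overhead:

    total_params = 671e9                  # DeepSeek total parameters
    for gpus in (8, 16):
        vram_bits = gpus * 24e9 * 8       # 24GB per 3090
        print(f"{gpus}x3090: ~{vram_bits / total_params:.1f} bits/weight")
    # 8x  -> ~2.3 bits/weight (less after overhead)
    # 16x -> ~4.6 bits/weight, enough headroom for 4-bit quants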

At that point, though, it becomes very hard to build a system due to PCIe lanes, signal integrity, the volume of space required, the heat generated, and the power requirements.

This is the advantage of moving up to Quadro-class cards: half the power for 2-4x the VRAM (the top-end Blackwell Quadro is expected to be 96GB).


Yeah, there is a clear bottleneck somewhere in llama.cpp. Even high-end hardware struggles to get good numbers. The theoretical limit is much higher, but the software isn't there yet.

Benchmarks: https://github.com/ggerganov/llama.cpp/issues/11474#issuecom...


It will be slower for a dense 70b model, since Deepseek is an MoE that only activates 37b at a time. That's what makes CPU inference remotely feasible here.


A genuinely hardcore, technical AI "psychology" program would actually be really cool. It could be good onboarding for prompt engineering (if that still exists in 5 years).

