The general rule of thumb when assessing MoE <-> Dense model intelligence is SQRT(Total_Params * Active_Params). For Deepseek, that works out to ~158B params. The economics of batch inferencing a dense ~158B model at scale are different from serving Deepseek itself (the dense model needs ~4x the FLOPS per token, after all), particularly if users care about latency.
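For the curious, the back-of-the-envelope math (assuming Deepseek's ~671B total / ~37B active parameters):

    import math

    total_params = 671e9   # Deepseek total parameter count
    active_params = 37e9   # parameters activated per token

    # Rule-of-thumb "dense equivalent" size: geometric mean of total and active
    dense_equiv = math.sqrt(total_params * active_params)
    print(f"{dense_equiv / 1e9:.0f}B")                # ~158B

    # FLOPS ratio per token: dense ~158B model vs. 37B active in the MoE
    print(f"{dense_equiv / active_params:.1f}x")      # ~4.3x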
There's still a lot of opportunity for software optimizations here. The trouble is that really only two classes of systems get Deepseek-specific optimizations: one small GPU plus a lot of system RAM (ktransformers), and the system that has all the VRAM in the world.
A system with, say, 192GB of VRAM and the rest in standard system memory (DGX Station, 2x RTX Pro 6000, 4x B60 Dual, etc.) could still in theory run Deepseek @4bit quite quickly because of the power-law-like usage of the experts.
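Rough footprint math for that split (a sketch assuming ~671B total parameters at 4 bits per weight, ignoring KV cache and activations):

    total_params = 671e9
    bits_per_weight = 4

    total_gb = total_params * bits_per_weight / 8 / 1e9   # ~336 GB for the full model
    vram_gb = 192

    print(f"model @4bit: ~{total_gb:.0f} GB")
    print(f"resident in VRAM: ~{vram_gb / total_gb:.0%}")  # ~57% of the weights
    # If the hot experts plus the attention/shared layers land in that ~57%,
    # most tokens never have to touch the weights left in system RAM.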
If you aren't prompting Deepseek in Chinese, a lot of the experts don't activate.
This would be an easier job for pruning, but I still think enthusiast systems are going to trend, over the next couple of years, in a direction that makes these kinds of software optimizations useful at a much larger scale.
There's a user on Reddit with a 16x 3090 system (PCIe 3.0 x4 interconnect, which doesn't seem to be running at full bandwidth during tensor parallelism) that gets 7 tokens/s in llama.cpp. A single 3090 has enough VRAM bandwidth to scan over its 24GB of memory 39 times per second, so something else is limiting performance.
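Back-of-the-envelope on the bandwidth side (a sketch assuming ~37B active parameters at 4 bits per weight and the 3090's ~936 GB/s of VRAM bandwidth):

    bandwidth_gbps = 936        # single RTX 3090 memory bandwidth, GB/s
    active_params = 37e9        # Deepseek active parameters per token
    bytes_per_weight = 0.5      # 4-bit quantization

    bytes_per_token = active_params * bytes_per_weight            # ~18.5 GB read per token
    single_gpu_limit = bandwidth_gbps / (bytes_per_token / 1e9)   # if one card could hold it all
    print(f"~{single_gpu_limit:.0f} tokens/s (bandwidth-only bound)")
    # Sharded across 16 cards, each card scans far less than that per token,
    # so 7 tokens/s points at interconnect or software overhead, not VRAM bandwidth.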
Or merge the bottom 1/8 (or whatever) experts together and (optionally) do some minimal training with all other weights frozen. Would need to modify the MoE routers slightly to map old -> new expert indices so you don't need to retrain the routers.
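A rough sketch of the index-remap part (hypothetical sizes and mapping; assumes a Mixtral-style top-k router where only a lookup table is bolted on after routing, router weights stay frozen):

    import torch

    # Hypothetical: 256 original experts, with the least-used ones merged into
    # survivors. old_to_new[i] is the surviving expert that now stands in for
    # old expert i (identity for experts that were kept).
    num_old_experts = 256
    old_to_new = torch.arange(num_old_experts)
    old_to_new[251], old_to_new[252] = 17, 63  # e.g. 251 merged into 17, 252 into 63

    def route(router_logits: torch.Tensor, k: int = 8):
        # Top-k over the *original* router logits, so the router itself is untouched;
        # only the index table is new.
        gate_logits, old_idx = torch.topk(router_logits, k, dim=-1)
        new_idx = old_to_new[old_idx]  # remap old -> merged expert ids
        # Note: if two selected old experts map to the same merged expert, their
        # gate weights should ideally be summed; omitted here for brevity.
        return torch.softmax(gate_logits, dim=-1), new_idx

    # Usage: routing a batch of 4 token representations
    logits = torch.randn(4, num_old_experts)
    gate_weights, expert_ids = route(logits)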
Sad reality is that the MI300X isn't a monolithic die, so the chiplets have internal bandwidth limitations (of course less severe than going over PCIe/NVLink).
In AMD's own parlance, the "Modular Chiplet Platform" presents itself either in single-I-don't-care-about-speed-or-latency "Single Partition X-celerator" mode or in multiple-I-actually-totally-do-care-about-speed-and-latency, NUMA-like "Core Partitioned X-celerator" mode.
128GB for $3K; per the announcement, their ConnectX networking allows two Project Digits devices to be plugged into each other and work together as one device, giving you 256GB for $6K. AFAIK, existing frameworks can split models across devices as well, hence, presumably, the upthread suggestion that Project Digits would provide 512GB for $12K, though arguably that last step is cheating.
If you want to split tensor-wise, yes. Layer-wise splits could go over Ethernet.
I would be interested to see how feasible hybrid approaches would be, e.g. connect each pair up directly via ConnectX and then connect the sets together via Ethernet.
Just to add to this point: you expect a different set of experts to be activated for every token, so not having all of the weights in fast memory can still be quite slow, since you end up loading/unloading weights every token.
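To put a rough number on that, a back-of-the-envelope sketch (the ~32 GB/s PCIe 4.0 x16 figure and the per-token spill sizes are assumptions, not measurements):

    # Upper bound on tokens/s if some expert weights must be fetched from system RAM
    pcie_gbps = 32                # ~PCIe 4.0 x16 one-way bandwidth, GB/s (assumed link)
    for spill_gb in (1, 4, 8):    # GB of expert weights per token missing VRAM (hypothetical)
        print(f"{spill_gb} GB/token spilled -> at most "
              f"{pcie_gbps / spill_gb:.0f} tokens/s from transfers alone")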
It's not really possible to say what's "best" because the criteria are super subjective.
I personally like the Spline family, and I default to Spline36 for both upscaling and downscaling in ffmpeg. Most people can't tell the difference between Spline36 and Lanczos3. If you want more sharpness, go for Spline64; for less sharpness, try Spline16.
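For reference, something like this (a sketch; assumes an ffmpeg build with libzimg, which provides the zscale filter and its spline36 option, and placeholder file names):

    import subprocess

    # Downscale to 1920x1080 with the Spline36 resizer via the zscale filter.
    # spline16 and lanczos are other accepted values for the filter option.
    subprocess.run([
        "ffmpeg", "-i", "input.mp4",
        "-vf", "zscale=w=1920:h=1080:filter=spline36",
        "-c:a", "copy",
        "output.mp4",
    ], check=True)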
Edit: As far as I'm aware, though, OpenCV doesn't have Spline as an option for resizing.
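Since Spline36 and Lanczos3 are hard to tell apart anyway, OpenCV's Lanczos interpolation is probably the closest built-in substitute (a small sketch; the flags are standard OpenCV constants, file names are placeholders):

    import cv2

    img = cv2.imread("input.png")

    # cv2.resize has no spline kernels; INTER_LANCZOS4 (8x8 Lanczos) is the
    # sharpest built-in option, while INTER_AREA is usually preferred for downscaling.
    up = cv2.resize(img, (1920, 1080), interpolation=cv2.INTER_LANCZOS4)
    down = cv2.resize(img, (640, 360), interpolation=cv2.INTER_AREA)

    cv2.imwrite("up.png", up)
    cv2.imwrite("down.png", down)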
It would not be that slow, since it's an MoE model with only 37B activated parameters.
Still, 8x 3090 gives you only ~2.25 bits per weight, which is not a healthy quantization. Bifurcating PCIe lanes to get up to 16x 3090 would be necessary for lightning-fast inference with 4-bit quants.
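The arithmetic behind those numbers (a sketch assuming ~671B parameters and ignoring KV cache / activation overhead):

    params = 671e9            # Deepseek total parameters
    vram_per_3090_gb = 24

    for n_gpus in (8, 16):
        total_bytes = n_gpus * vram_per_3090_gb * 1e9
        bits_per_weight = total_bytes * 8 / params
        print(f"{n_gpus}x 3090: {bits_per_weight:.2f} bits/weight")
    # 8x 3090:  ~2.3 bits/weight -> below a comfortable quantization level
    # 16x 3090: ~4.6 bits/weight -> room for a 4-bit quant plus KV cache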
At that point, though, it becomes very hard to build a system due to PCIe lanes, signal integrity, the physical volume required, the heat generated, and the power requirements.
This is the advantage of moving up to Quadro-class workstation cards: half the power for 2-4x the VRAM (the top-end Blackwell workstation card is expected to have 96GB).
Yeah, there is a clear bottleneck somewhere in llama.cpp. Even high-end hardware is struggling to get good numbers; the theoretical limit is well above what people are actually seeing.
A hardcore, technical AI "psychology" program would actually be really cool. It could be good onboarding for prompt engineering (if that still exists in 5 years).