You use at least half of this stack for desktop setups. You need the copying daemons, the ecosystem support (docker-nvidia, etc.), and some of the libraries even when you're on a single system.
If you're doing inference on a server, MIG comes into play. If you're doing inference in a larger cloud, GPU-Direct storage comes into play.
It's possible you're underestimating the open source community.
If there's a competing platform that hobbyists can tinker with, its ecosystem can improve quite rapidly, especially when the incumbent platform is completely closed and hobbyists are basically locked out of it.
> It's possible you're underestimating the open source community.
On the contrary. You really don't know how much I love and prefer open source and a more level playing field.
> If there's a competing platform that hobbyists can tinker with...
AMD's cards are better from a hardware and software architecture standpoint, but the performance is not there yet. Plus, ROCm libraries are not that mature, though they're getting there. Developing high performance, high quality code is deceptively expensive, because it's very heavy in theory and you fly very close to the metal. I did that in my Ph.D., so I know what it entails. It requires more than a couple (hundred) hobbyists to pull off (see the development of the Eigen linear algebra library, or any high end math library).
Some big guns are pouring money into AMD to implement good ROCm libraries, and it's starting to pay off (Debian has a ton of ROCm packages now, too). However, you need to be able to pull it off in the datacenter before you can pull it off on the desktop.
AMD also needs to be able to enable ROCm on desktop properly, so people can start hacking it at home.
> especially when the incumbent platform is completely closed...
NVIDIA gives a lot of support to universities, researchers and institutions who play with their cards. Big cards may not be free, but know-how, support and first steps are always within reach. Plus, their researchers dogfood their own cards, and write papers with them.
So, as long as papers get published, researchers do their research, and something gets invented, many people don't care how open source the ecosystem is. This upsets me a ton. Closed source AI companies, and researchers who forget to add crucial details to their papers so that what they did can't be reproduced, don't care about open source, because they think like NVIDIA: "My research, my secrets, my fame, my money".
It's not about sharing. It's about winning, and it's ugly in some aspects.
That said, for hobbyist inference on large pretrained models, I think there is an interesting set of possibilities here: maybe a number of operations aren't optimized, and it takes 10x as long to load the model into memory... but all that might not matter if AMD were to be the first to market for 128GB+ VRAM cards that are the only things that can run next-generation open-weight models in a desktop environment, particularly those generating video and images. The hobbyists don't need to optimize all the linear algebra operations that researchers need to be able to experiment with when training; they just need to implement the ones used by the open-weight models.
But of course this is all just wishful thinking, because as others have pointed out, any developments in this direction would require a level of foresight that AMD simply hasn't historically shown.
IDK, I found a post that's 2 years old with links to running llama and SD on an Arc [0] (although it might be Linux only). I feel like a cheap, huge-RAM card would create a 'critical mass' as far as being able to start optimizing, and then, longer term, Intel could promise and deliver on 'scale up' improvements.
It would be a huge shift for them: to go from chasing some (sometimes not quite reached) metric to, perhaps rightly, playing the 'reformed underdog'. Commoditize big-memory, ML-capable GPUs, even if they aren't quite as competitive as the top players at first.
Will the other players respond? Yes. But it would ruin their margins. I know that sounds cutthroat[1], but hey, I'm trying to hypothetically sell this to whoever is taking the reins after Pat G.
> NVIDIA gives a lot of support to universities, researchers and institutions who play with their cards. Big cards may not be free, but know-how, support and first steps are always within reach. Plus, their researchers dogfood their own cards, and write papers with them.
Ideally they need to do that too. Ideally they have some 'high powered' prototypes (e.g. let's say they decide a 2-GPU-per-card design with an interlink is feasible for some reason) to share as well. This may not be entirely ethical[1] in this example of how a corp could play it out; again, it's a thought experiment, since Intel has NOT announced or hinted at a larger-memory card anyway.
> AMD also needs to be able to enable ROCm on desktop properly, so people can start hacking it at home
AMD's driver story has always been a hot mess. My desktop won't behave with both my onboard video and 4060 enabled, and every AMD card I've had winds up with some weird firmware quirk one way or another... I guess I'm saying their general level of driver quality doesn't give much hope that they'll fix the dev tools any time soon...
ROCm doesn't really matter when the hardware is almost the same as Nvidia's cards. AMD is not selling a "cheaper" card with a lot of RAM, which is what the original poster was asking for (and a reason why people who like to tinker with large models are using Macs).
You're writing as if AMD cares about open source. If they would only actually open source their driver, the community would have made their cards better than Nvidia's long ago.
I'm one of those academics. You've got it all wrong. So many people care about open source. So many people carefully release their code and make everything reproducible.
We desperately just want AMD to open up. They just refuse. There's nothing secret going on and there's no conspiracy. There's just a company that for some inexplicable reason doesn't want to make boatloads of money for free.
AMD is the worst possible situation. They're hostile to us and they refuse to invest to make their stuff work.
> If they would only actually open source their driver, the community would have made their cards better than Nvidia's long ago.
Software wise, maybe. But you can't change AMD's hardware with a magic wand, and that's where a lot of CUDA's optimizations come from. AMD's GPU architecture is optimized for raster compute, and it's been that way for decades.
I can assure you that AMD does not have a magic button to press that would make their systems competitive for AI. If that was possible it would have been done years ago, with or without their consent. The problem is deeper and extends to design decisions and disagreement over the complexity of GPU designs. If you compare AMD's cards to Nvidia on "fair ground" (eg. no CUDA, only OpenCL) the GPGPU performance still leans in Nvidia's favor.
That would require competently produced documentation. Intel can't do that for any of their side projects because their MBAs don't get a bonus if the tech writers are treated as a valuable asset.
No. I've been reading up. I'm planning to run Flux 12b on my AMD 5700G with 64GB RAM. The CPU will take 5-10 minutes per image, which will be fine for me tinkering while writing code. Maybe I'll be able to get the GPU going on it too.
The point of the OP is that this is entirely possible even with an iGPU, if only we have the RAM. Nvidia should be irrelevant for local inference.
Copying daemons (gdrcopy) are about pumping data in and out of a single card. docker-nvidia and the rest of the stack are enablement for using the cards.
GPU-Direct is about pumping data from storage devices to cards, especially from high speed storage systems across networks.
MIG actually shares a single card across multiple instances, so many processes or VMs can use a single card for smaller tasks.
Nothing I have written in my previous comment is related to inter-card or inter-server communication; all of it is about disk-GPU, CPU-GPU or RAM-CPU communication.
Edit: I mean, I know it's not OK to talk about downvoting, and downvote as you like, but I install and enable these cards for researchers. I know what I'm installing and what it does. C'mon now. :D
I've run inference on Intel Arc and it works just fine, so I'm not sure what you're talking about. I certainly didn't need Docker! I've never tried to do anything on AMD yet.
I had the 16GB Arc, and it was able to run inference at the speed I expected, but with twice as many per batch as my 8GB card, which I think is about what you'd expect.
Once the model is on the card, there's no "disk" anymore, so having more VRAM to load the model, the tokenizer, and whatever else onto means the disk stops mattering. Realistically, when I'm running loads on my 24GB 3090, the CPU is maybe 4% over idle usage. My bottleneck for running large models is VRAM, not anything else.
If I needed to train (from scratch or whatever) I'd just rent time somewhere, even with a 128GB card locally, because obviously more tensors is better.
And you're getting downvoted because there's literally LM Studio and llama.cpp and sd-webui that run just fine for inference on our non-datacenter, non-NVLink, 1/15th-the-cost GPUs.
As a preface, precompute_input_logits() is really just a generalized version of the forward() function that can operate on multiple input tokens at a time to do faster input processing, although it can be used in place of the forward() function for output generation just by passing only a single token at a time.
Also, my apologies for the code being a bit messy. matrix_multiply() and batched_matrix_multiply() are wrappers for GEMM, which I ended up having to use directly anyway when I needed to do strided access. Then matmul() is a wrapper for GEMV, which is really just a special case of GEMM. This is a work in progress personal R&D project that is based on prior work others did (as it spared me from having to do the legwork to implement the less interesting parts of inferencing), so it is not meant to be pretty.
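To give a rough idea of what those wrappers boil down to, here is a minimal sketch against the standard CBLAS interface. Row-major storage is assumed, and the bodies are illustrative rather than the exact code from the link:

    #include <cblas.h>

    /* Illustrative GEMM wrapper: C = A (m x k) * B (k x n), row-major. */
    static void matrix_multiply(const float *A, const float *B, float *C,
                                int m, int n, int k)
    {
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    m, n, k,
                    1.0f, A, k,   /* lda = k (row-major, no transpose) */
                    B, n,         /* ldb = n */
                    0.0f, C, n);  /* ldc = n */
    }

    /* Illustrative GEMV wrapper: out = W (rows x cols) * x, i.e. the
     * single-token special case of GEMM. */
    static void matmul(float *out, const float *x, const float *W,
                       int rows, int cols)
    {
        cblas_sgemv(CblasRowMajor, CblasNoTrans,
                    rows, cols,
                    1.0f, W, cols, x, 1,
                    0.0f, out, 1);
    }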
Anyway, my purpose in providing that link is to show what is needed to do inferencing (on llama 3). You have a bunch of matrix weights, plus a lookup table for vectors that represent tokens, in memory. Then your operations are:
* memcpy()
* memset()
* GEMM (GEMV is a special case of GEMM)
* sinf()
* cosf()
* expf()
* sqrtf()
* rmsnorm (see the C function for the definition)
* softmax (see the C function for the definition)
* Addition, subtraction, multiplication and division.
I specify rmsnorm and softmax for completeness, but they can be implemented in terms of the other operations.
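For reference, minimal C versions of those two, following the standard definitions (the epsilon in rmsnorm is an assumption; implementations vary):

    #include <math.h>

    /* RMSNorm: out[i] = weight[i] * x[i] / sqrt(mean(x^2) + eps).
     * eps = 1e-5f is an assumption; implementations differ on the value. */
    static void rmsnorm(float *out, const float *x, const float *weight, int n)
    {
        float ss = 0.0f;
        for (int i = 0; i < n; i++)
            ss += x[i] * x[i];
        ss = 1.0f / sqrtf(ss / n + 1e-5f);
        for (int i = 0; i < n; i++)
            out[i] = weight[i] * (x[i] * ss);
    }

    /* Softmax in place, with max subtraction for numerical stability. */
    static void softmax(float *x, int n)
    {
        float max = x[0];
        float sum = 0.0f;
        for (int i = 1; i < n; i++)
            if (x[i] > max)
                max = x[i];
        for (int i = 0; i < n; i++) {
            x[i] = expf(x[i] - max);
            sum += x[i];
        }
        for (int i = 0; i < n; i++)
            x[i] /= sum;
    }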
If you can do those, you can do inferencing. You don’t really need very specialized things. Over 95% of time will be spent in GEMM too.
My next steps likely will be to figure out how to implement fast GEMM kernels on my CPU. While my own SGEMV code outperforms the Intel MKL SGEMV code on my CPU (Ryzen 7 5800X where 1 core can use all memory bandwidth), my initial attempts at implementing SGEMM have not fared quite so well, but I will likely figure it out eventually. After I do, I can try adapting this to FP16 and then memory usage will finally be low enough that I can port it to a GPU with 24GB of VRAM. That would enable me to do what I say is possible rather than just saying it as I do here.
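In case it helps anyone following along, the usual starting point for a hand-rolled SGEMM is loop reordering plus cache blocking. A minimal sketch of that idea looks like this; it is not my kernel, it is nowhere near MKL speed, and the block size is a guess that needs tuning:

    /* Cache-blocked SGEMM sketch: C += A * B, row-major, A is m x k,
     * B is k x n, C is m x n. Assumes C is zeroed by the caller.
     * BLOCK is a tuning guess; a real kernel adds register tiling,
     * SIMD intrinsics and threading on top of this. */
    #define BLOCK 64

    static void sgemm_blocked(const float *A, const float *B, float *C,
                              int m, int n, int k)
    {
        for (int i0 = 0; i0 < m; i0 += BLOCK)
            for (int p0 = 0; p0 < k; p0 += BLOCK)
                for (int j0 = 0; j0 < n; j0 += BLOCK)
                    for (int i = i0; i < i0 + BLOCK && i < m; i++)
                        for (int p = p0; p < p0 + BLOCK && p < k; p++) {
                            float a = A[i * k + p];
                            for (int j = j0; j < j0 + BLOCK && j < n; j++)
                                C[i * n + j] += a * B[p * n + j];
                        }
    }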
By the way, the llama.cpp project has already figured all of this out and has things running on both GPUs and CPUs using just about every major quantization. I am rolling my own to teach myself how things work. By happy coincidence, I am somehow outperforming llama.cpp in prompt processing on my CPU but sadly, the secrets of how I am doing it are in Intel’s proprietary cblas_sgemm_batch() function. However, since I know it is possible for the hardware to perform like that, I can keep trying ideas for my own implementation until I get something that performs at the same level or better.
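For anyone who wants a non-proprietary baseline to compare against: the obvious (and much less clever) stand-in for a batched GEMM is a threaded loop of ordinary cblas_sgemm() calls, something like the sketch below. The function name and the assumption that every batch entry has the same shape are mine:

    #include <cblas.h>

    /* Fallback "batched" SGEMM: one plain GEMM per batch entry, assuming all
     * entries share the same row-major shapes (A[b]: m x k, B[b]: k x n,
     * C[b]: m x n). Compile with -fopenmp to thread the loop; even then it
     * will not match a fused kernel like MKL's cblas_sgemm_batch(). */
    static void sgemm_batched_fallback(float **A, float **B, float **C,
                                       int m, int n, int k, int batch)
    {
        #pragma omp parallel for
        for (int b = 0; b < batch; b++)
            cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                        m, n, k,
                        1.0f, A[b], k, B[b], n,
                        0.0f, C[b], n);
    }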
I am favoriting this comment for reference later when I start poking around in the base level stuff. I find it pretty funny how simple this stuff can get. Have you messed with ternary computing inference yet? I imagine that shrinks the list even further - or at least reduces the compute requirements in favor of brute force addition. https://arxiv.org/html/2410.00907
No. I still have a long list of things to try doing with llama 3 (and possibly later llama 3.1) on more normal formats like fp32 and in the future, fp16. When I get things running on a GPU, I plan to try using bf16 and maybe fp8 if I get hardware that supports it. Low bit quantizations hurt model quality, so I am not very interested in them. Maybe that will change if good quality models trained to use them become available.
My plan is to make more attempts at rolling my own. Reverse engineering things is not something that I do, since that would prevent me from publishing the result as OSS.
> If you're doing inference on a server, MIG comes into play. If you're doing inference in a larger cloud, GPU-Direct storage comes into play.
It's all modular.