See the precompute_input_logits() and forward() functions here:
https://github.com/ryao/llama3.c/blob/master/run.c#L520
As a preface, precompute_input_logits() is really just a generalized version of forward() that operates on multiple input tokens at a time for faster input processing, although it can be used in place of forward() for output generation simply by passing it a single token at a time.
Also, my apologies for the code being a bit messy. matrix_multiply() and batched_matrix_multiply() are wrappers for GEMM, which I ended up having to call directly anyway when I needed strided access. Then matmul() is a wrapper for GEMV, which is really just a special case of GEMM. This is a work-in-progress personal R&D project based on prior work that others did (which spared me the legwork of implementing the less interesting parts of inferencing), so it is not meant to be pretty.
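To make that last point concrete, here is a minimal sketch of what a GEMV wrapper like matmul() amounts to, plus the same operation expressed through GEMM; the function names and layout here are illustrative rather than copied from run.c:

```c
#include <cblas.h>

/* Illustration only, not the exact wrapper in run.c: xout = W @ x,
 * with W stored row-major as (d x n), x of length n, xout of length d. */
void matmul_as_gemv(float *xout, const float *x, const float *W, int n, int d) {
    cblas_sgemv(CblasRowMajor, CblasNoTrans, d, n,
                1.0f, W, n, x, 1, 0.0f, xout, 1);
}

/* The same operation written as GEMM with a single output column,
 * which is all it means for GEMV to be a special case of GEMM. */
void matmul_as_gemm(float *xout, const float *x, const float *W, int n, int d) {
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                d, 1, n,
                1.0f, W, n, x, 1, 0.0f, xout, 1);
}
```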
Anyway, my purpose in providing that link is to show what is needed to do inferencing (on llama 3). You have a bunch of weight matrices, plus a lookup table of the vectors that represent tokens, in memory. Then your operations are:
* memcpy()
* memset()
* GEMM (GEMV is a special case of GEMM)
* sinf()
* cosf()
* expf()
* sqrtf()
* rmsnorm (see the C function for the definition)
* softmax (see the C function for the definition)
* Addition, subtraction, multiplication and division.
I specify rmsnorm and softmax for completeness, but they can be implemented in terms of the other operations.
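For reference, they look roughly like this in C (following the llama2.c-style implementations this project is based on; the exact epsilon and buffer handling in my run.c may differ slightly):

```c
#include <math.h>

/* RMS-normalize x into o, scaling elementwise by weight. */
void rmsnorm(float *o, const float *x, const float *weight, int size) {
    float ss = 0.0f;
    for (int j = 0; j < size; j++) {
        ss += x[j] * x[j];
    }
    ss = 1.0f / sqrtf(ss / size + 1e-5f);
    for (int j = 0; j < size; j++) {
        o[j] = weight[j] * (ss * x[j]);
    }
}

/* In-place softmax over x[0..size-1], shifted by the max for stability. */
void softmax(float *x, int size) {
    float max_val = x[0];
    for (int i = 1; i < size; i++) {
        if (x[i] > max_val) max_val = x[i];
    }
    float sum = 0.0f;
    for (int i = 0; i < size; i++) {
        x[i] = expf(x[i] - max_val);
        sum += x[i];
    }
    for (int i = 0; i < size; i++) {
        x[i] /= sum;
    }
}
```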
If you can do those, you can do inferencing. You don’t really need very specialized things. Over 95% of the time will be spent in GEMM, too.
My next step will likely be figuring out how to implement fast GEMM kernels on my CPU. My own SGEMV code outperforms Intel MKL’s SGEMV on my CPU (a Ryzen 7 5800X, where a single core can use all of the memory bandwidth), but my initial attempts at SGEMM have not fared as well; I will likely figure it out eventually. After I do, I can try adapting this to FP16, and then memory usage will finally be low enough to port it to a GPU with 24GB of VRAM. That would let me actually demonstrate what I claim is possible, rather than just asserting it as I do here.
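To give an idea of the starting point, a cache-blocked loop nest like the sketch below is where one usually begins (illustrative only, not my actual kernel); the hard part is packing the operands, vectorizing the inner loop with AVX2/FMA, and threading it well:

```c
/* Hypothetical baseline: cache-blocked SGEMM, C = A * B, all row-major,
 * with A (M x K), B (K x N), C (M x N). */
#define BLOCK 64

void sgemm_blocked(int M, int N, int K,
                   const float *A, const float *B, float *C)
{
    for (int i = 0; i < M * N; i++) C[i] = 0.0f;

    /* Block over all three dimensions so the working set fits in cache. */
    for (int ii = 0; ii < M; ii += BLOCK)
    for (int kk = 0; kk < K; kk += BLOCK)
    for (int jj = 0; jj < N; jj += BLOCK) {
        int i_end = ii + BLOCK < M ? ii + BLOCK : M;
        int k_end = kk + BLOCK < K ? kk + BLOCK : K;
        int j_end = jj + BLOCK < N ? jj + BLOCK : N;
        for (int i = ii; i < i_end; i++) {
            for (int k = kk; k < k_end; k++) {
                float a = A[i * K + k];
                for (int j = jj; j < j_end; j++) {
                    C[i * N + j] += a * B[k * N + j];
                }
            }
        }
    }
}
```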
By the way, the llama.cpp project has already figured all of this out and has things running on both GPUs and CPUs using just about every major quantization. I am rolling my own to teach myself how things work. By happy coincidence, I am somehow outperforming llama.cpp in prompt processing on my CPU, but sadly, the secrets of how I am doing it are inside Intel’s proprietary cblas_sgemm_batch() function. However, since I know the hardware is capable of that level of performance, I can keep trying ideas for my own implementation until I get something that performs as well or better.
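Conceptually, what a batched GEMM call buys you during prompt processing is the ability to hand the library many small, independent GEMMs at once (for example, one per attention head) so it can schedule them together. The sketch below is the equivalent plain-CBLAS loop that such a call fuses; the shapes and names are illustrative and not taken from run.c:

```c
#include <cblas.h>

/* Per-head attention scores during prompt processing: for each head h,
 * scores[h] = scale * Q[h] * K[h]^T, where Q[h] and K[h] are row-major
 * (seq_len x head_dim) and scores[h] is (seq_len x seq_len). A batched
 * GEMM interface runs all of these independent multiplications in one
 * call instead of this loop. */
void attention_scores_per_head(int n_heads, int seq_len, int head_dim,
                               const float **Q, const float **K,
                               float **scores, float scale)
{
    for (int h = 0; h < n_heads; h++) {
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                    seq_len, seq_len, head_dim,
                    scale, Q[h], head_dim,
                           K[h], head_dim,
                    0.0f,  scores[h], seq_len);
    }
}
```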
I am favoriting this comment for reference later when I start poking around in the base level stuff. I find it pretty funny how simple this stuff can get. Have you messed with ternary computing inference yet? I imagine that shrinks the list even further - or at least reduces the compute requirements in favor of brute force addition. https://arxiv.org/html/2410.00907
No. I still have a long list of things to try doing with llama 3 (and possibly later llama 3.1) on more normal formats like fp32 and in the future, fp16. When I get things running on a GPU, I plan to try using bf16 and maybe fp8 if I get hardware that supports it. Low bit quantizations hurt model quality, so I am not very interested in them. Maybe that will change if good quality models trained to use them become available.
My plan is to make more attempts at rolling my own. Reverse engineering things is not something that I do, since that would prevent me from publishing the result as OSS.