See the precompute_input_logits() and forward() functions here:
https://github.com/ryao/llama3.c/blob/master/run.c#L520
As a preface, precompute_input_logits() is really just a generalized version of forward() that operates on multiple input tokens at a time for faster input processing, although it can be used in place of forward() for output generation simply by passing it a single token at a time.
Also, my apologies for the code being a bit messy. matrix_multiply() and batched_matrix_multiply() are wrappers for GEMM, which I ended up having to call directly anyway when I needed strided access. Then matmul() is a wrapper for GEMV, which is really just a special case of GEMM. This is a work-in-progress personal R&D project based on prior work that others did (which spared me the legwork of implementing the less interesting parts of inferencing), so it is not meant to be pretty.
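To make that last point concrete, here is a minimal sketch of what a GEMV wrapper like matmul() amounts to, plus the same operation expressed through GEMM; the function names and layout here are illustrative rather than copied from run.c:

```c
#include <cblas.h>

/* Illustration only, not the exact wrapper in run.c: xout = W @ x,
 * with W stored row-major as (d x n), x of length n, xout of length d. */
void matmul_as_gemv(float *xout, const float *x, const float *W, int n, int d) {
    cblas_sgemv(CblasRowMajor, CblasNoTrans, d, n,
                1.0f, W, n, x, 1, 0.0f, xout, 1);
}

/* The same operation written as GEMM with a single output column,
 * which is all it means for GEMV to be a special case of GEMM. */
void matmul_as_gemm(float *xout, const float *x, const float *W, int n, int d) {
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                d, 1, n,
                1.0f, W, n, x, 1, 0.0f, xout, 1);
}
```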
Anyway, my purpose in providing that link is to show what is needed to do inferencing (on llama 3). You have a bunch of weight matrices, plus a lookup table of the vectors that represent tokens, in memory. Then your operations are:
* memcpy()
* memset()
* GEMM (GEMV is a special case of GEMM)
* sinf()
* cosf()
* expf()
* sqrtf()
* rmsnorm (see the C function for the definition)
* softmax (see the C function for the definition)
* Addition, subtraction, multiplication and division.
I specify rmsnorm and softmax for completeness, but they can be implemented in terms of the other operations.
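For reference, they look roughly like this in C (following the llama2.c-style implementations this project is based on; the exact epsilon and buffer handling in my run.c may differ slightly):

```c
#include <math.h>

/* RMS-normalize x into o, scaling elementwise by weight. */
void rmsnorm(float *o, const float *x, const float *weight, int size) {
    float ss = 0.0f;
    for (int j = 0; j < size; j++) {
        ss += x[j] * x[j];
    }
    ss = 1.0f / sqrtf(ss / size + 1e-5f);
    for (int j = 0; j < size; j++) {
        o[j] = weight[j] * (ss * x[j]);
    }
}

/* In-place softmax over x[0..size-1], shifted by the max for stability. */
void softmax(float *x, int size) {
    float max_val = x[0];
    for (int i = 1; i < size; i++) {
        if (x[i] > max_val) max_val = x[i];
    }
    float sum = 0.0f;
    for (int i = 0; i < size; i++) {
        x[i] = expf(x[i] - max_val);
        sum += x[i];
    }
    for (int i = 0; i < size; i++) {
        x[i] /= sum;
    }
}
```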
If you can do those, you can do inferencing. You don’t really need very specialized things. Over 95% of the time will be spent in GEMM, too.
My next step will likely be figuring out how to implement fast GEMM kernels on my CPU. My own SGEMV code outperforms Intel MKL’s SGEMV on my CPU (a Ryzen 7 5800X, where a single core can use all of the memory bandwidth), but my initial attempts at SGEMM have not fared as well; I will likely figure it out eventually. After I do, I can try adapting this to FP16, and then memory usage will finally be low enough to port it to a GPU with 24GB of VRAM. That would let me actually demonstrate what I claim is possible, rather than just asserting it as I do here.
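To give an idea of the starting point, a cache-blocked loop nest like the sketch below is where one usually begins (illustrative only, not my actual kernel); the hard part is packing the operands, vectorizing the inner loop with AVX2/FMA, and threading it well:

```c
/* Hypothetical baseline: cache-blocked SGEMM, C = A * B, all row-major,
 * with A (M x K), B (K x N), C (M x N). */
#define BLOCK 64

void sgemm_blocked(int M, int N, int K,
                   const float *A, const float *B, float *C)
{
    for (int i = 0; i < M * N; i++) C[i] = 0.0f;

    /* Block over all three dimensions so the working set fits in cache. */
    for (int ii = 0; ii < M; ii += BLOCK)
    for (int kk = 0; kk < K; kk += BLOCK)
    for (int jj = 0; jj < N; jj += BLOCK) {
        int i_end = ii + BLOCK < M ? ii + BLOCK : M;
        int k_end = kk + BLOCK < K ? kk + BLOCK : K;
        int j_end = jj + BLOCK < N ? jj + BLOCK : N;
        for (int i = ii; i < i_end; i++) {
            for (int k = kk; k < k_end; k++) {
                float a = A[i * K + k];
                for (int j = jj; j < j_end; j++) {
                    C[i * N + j] += a * B[k * N + j];
                }
            }
        }
    }
}
```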
By the way, the llama.cpp project has already figured all of this out and has things running on both GPUs and CPUs using just about every major quantization. I am rolling my own to teach myself how things work. By happy coincidence, I am somehow outperforming llama.cpp in prompt processing on my CPU, but sadly, the secrets of how I am doing it are inside Intel’s proprietary cblas_sgemm_batch() function. However, since I know the hardware is capable of that level of performance, I can keep trying ideas for my own implementation until I get something that performs as well or better.
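Conceptually, what a batched GEMM call buys you during prompt processing is the ability to hand the library many small, independent GEMMs at once (for example, one per attention head) so it can schedule them together. The sketch below is the equivalent plain-CBLAS loop that such a call fuses; the shapes and names are illustrative and not taken from run.c:

```c
#include <cblas.h>

/* Per-head attention scores during prompt processing: for each head h,
 * scores[h] = scale * Q[h] * K[h]^T, where Q[h] and K[h] are row-major
 * (seq_len x head_dim) and scores[h] is (seq_len x seq_len). A batched
 * GEMM interface runs all of these independent multiplications in one
 * call instead of this loop. */
void attention_scores_per_head(int n_heads, int seq_len, int head_dim,
                               const float **Q, const float **K,
                               float **scores, float scale)
{
    for (int h = 0; h < n_heads; h++) {
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                    seq_len, seq_len, head_dim,
                    scale, Q[h], head_dim,
                           K[h], head_dim,
                    0.0f,  scores[h], seq_len);
    }
}
```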
I am favoriting this comment for reference later when I start poking around in the base level stuff. I find it pretty funny how simple this stuff can get. Have you messed with ternary computing inference yet? I imagine that shrinks the list even further - or at least reduces the compute requirements in favor of brute force addition. https://arxiv.org/html/2410.00907
No. I still have a long list of things to try doing with llama 3 (and possibly later llama 3.1) on more normal formats like fp32 and in the future, fp16. When I get things running on a GPU, I plan to try using bf16 and maybe fp8 if I get hardware that supports it. Low bit quantizations hurt model quality, so I am not very interested in them. Maybe that will change if good quality models trained to use them become available.
My plan is to make more attempts at rolling my own. Reverse engineering things is not something that I do, since that would prevent me from publishing the result as OSS.