
Realistically you probably just want to look at the file size on huggingface and add ~2 GB for the OS/Firefox tabs, plus a bit for context (depends, but let's say 1-2 GB).

The direct parameter-count conversion math (bytes per parameter times parameter count) tends to be much less reliable than one would expect once quants are involved, since quant formats mix bit widths across tensors and add per-block metadata.

e.g.

7B @ Q8 = 7.1 GB [0]

30B @ Q8 = 34.6 GB [1]
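
A minimal sketch of that rule of thumb (the overhead numbers are my rough assumptions from above, not anything a tool reports):

    # Estimate total RAM needed from the GGUF file size on huggingface.
    def est_ram_gb(gguf_file_gb, os_overhead_gb=2.0, context_gb=1.5):
        # file size + OS/browser overhead + room for the context/KV cache
        return gguf_file_gb + os_overhead_gb + context_gb

    print(est_ram_gb(7.1))   # 7B @ Q8  -> ~10.6 GB
    print(est_ram_gb(34.6))  # 30B @ Q8 -> ~38.1 GB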

Btw, you can also roughly estimate expected output speed if you know the device memory throughput: a dense model has to stream every weight from memory for each generated token, so tokens/sec is bounded by bandwidth divided by model size. Note that this doesn't work for MoEs, which only activate a subset of weights per token.
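
A sketch of that bound, assuming a dense model (the bandwidth figure is an illustrative assumption, roughly a base Apple M2):

    # Decode-speed upper bound for a dense model:
    # each token reads the full set of weights from memory once,
    # so tokens/sec <= memory bandwidth / model size.
    model_size_gb = 7.1        # 7B @ Q8 file size from above
    bandwidth_gb_per_s = 100   # assumed device memory throughput
    print(f"~{bandwidth_gb_per_s / model_size_gb:.0f} tok/s upper bound")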

Also, I recently discovered that in CPU mode llama.cpp memory-maps the model file, so pages are loaded lazily. For some models it keeps less than a quarter of the file resident in memory.
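
You can see the lazy-loading effect with plain mmap; a minimal sketch (the filename is hypothetical, any GGUF will do):

    import mmap

    # Map the file without reading it; pages fault in only when touched,
    # so resident memory stays far below the file size.
    with open("llama-2-7b.Q8_0.gguf", "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        print(mm[:4])  # b'GGUF' magic bytes; faults in a single page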

[0] https://huggingface.co/TheBloke/Llama-2-7B-GGUF/tree/main

[1] https://huggingface.co/TheBloke/LLaMA-30b-GGUF/tree/main


