
why is 7B parameters seemingly a magic number?


It matches LLaMA 7B, and it's "cheap" to train for a demo.

If they actually wanted to finetune/train for commodity hardware use, 13B-40B would be a better target.


I guess going with a parameter count that matches existing models makes it easier to compare benchmarks. Perhaps there is a more concrete reason, like memory requirements, but momentum is probably also a significant factor.


7B params would take 14 GB of GPU RAM at FP16 precision (2 bytes per parameter), so it could run on a 16 GB GPU with about 2 GB to spare for other small things.
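
Rough arithmetic behind that figure (a quick sketch, weights only; the KV cache and activations need extra headroom on top):

    # Weight memory only: params * bytes-per-param, using 1 GB = 1e9 bytes.
    def weight_vram_gb(params_billion: float, bytes_per_param: float) -> float:
        return params_billion * 1e9 * bytes_per_param / 1e9

    for precision, bytes_per_param in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
        print(f"7B at {precision}: {weight_vram_gb(7, bytes_per_param):.1f} GB")
    # fp16 -> 14.0 GB, int8 -> 7.0 GB, int4 -> 3.5 GB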


But in practice, no one is running inference at FP16; int8 is more like the bare minimum.
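
For example, loading a 7B model with int8 weights can look roughly like this (a sketch assuming the bitsandbytes integration in Hugging Face transformers; the model id is just a placeholder for whichever 7B checkpoint you use):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "huggyllama/llama-7b"  # placeholder, substitute your own 7B checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        load_in_8bit=True,   # int8 weights: roughly 7 GB instead of ~14 GB at fp16
        device_map="auto",   # let accelerate place layers on the available GPU(s)
    )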


I have an 8GB card and I am considering two more 8GB cards, or should I get a single 16GB? The 8GB card was donated, and we need some pipelining... I have 10~15 2GB Quadro cards... apparently useless.


I mean... It depends?

Are you just trying to host a llama server?

Matching the VRAM doesn't necessarily matter; get the most you can afford on a single card. Splitting beyond two cards doesn't work well at the moment.

Getting a non-Nvidia card is a problem for certain backends (like ExLlama) but should be fine for llama.cpp in the near future.

AFAIK most backends are not pipelined; the load jumps sequentially from one GPU to the next.
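
A toy PyTorch sketch of what that looks like (not any specific backend's code): each half of the model lives on a different card, and a single request only keeps one card busy at a time:

    import torch
    import torch.nn as nn

    class NaiveSplit(nn.Module):
        def __init__(self):
            super().__init__()
            # Hypothetical split of a transformer-sized stack across two GPUs.
            self.first_half = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(16)]).to("cuda:0")
            self.second_half = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(16)]).to("cuda:1")

        def forward(self, x):
            x = self.first_half(x.to("cuda:0"))      # GPU 0 busy, GPU 1 idle
            return self.second_half(x.to("cuda:1"))  # GPU 1 busy, GPU 0 idle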


just easier to run on smaller hardware




