
I hope someone will soon post a quantized version that I can run on my MacBook Pro.


Ollama has released the quantized version.

https://ollama.ai/library/codellama:70b https://x.com/ollama/status/1752034686615048367?s=20

Just need to run `ollama run codellama:70b` - pretty fast on a MacBook.
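Once the model is pulled, Ollama also serves a local HTTP API (default port 11434), so you can script against it instead of using the interactive prompt. A minimal sketch in Python, standard library only - the prompt string here is just a placeholder:

    import json
    import urllib.request

    # Ask the locally running Ollama server for a single completion.
    # Assumes `ollama run codellama:70b` (or `ollama pull codellama:70b`)
    # has already downloaded the model.
    body = json.dumps({
        "model": "codellama:70b",
        "prompt": "Write a Python function that reverses a string.",
        "stream": False,  # return one JSON object instead of a token stream
    }).encode()

    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["response"])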


Really? What kind of MacBook Pro do you need to run it fast? Will an M1 with 16GB RAM work, or do we need something super beefy like an M2 fully decked out with 96GB RAM to make it run?


An M1 with 16GB RAM will barely run codellama:13b.


I'm trying to understand how this works... does it actually run the model on the MacBook Pro? Sorry, I am totally new to this.


Yes, it runs a quantized [1] version of the model locally. Quantization uses low-precision data types to represent the weights and activations (e.g., 8-bit integers instead of 32-bit floats). The specific model published by Ollama uses 4-bit quantization [2], which is why it is able to run on a MacBook Pro. (A toy sketch of the idea follows the links below.)

If you want to try it out, this blog post[3] shows how to do it step by step - pretty straightforward.

[1] https://huggingface.co/docs/optimum/concept_guides/quantizat...

[2] https://ollama.ai/library/codellama:70b

[3] https://annjose.com/post/run-code-llama-70B-locally/
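To make "low-precision representation" concrete, here is a toy sketch of symmetric 4-bit round-to-nearest quantization in Python with numpy. The real llama.cpp/Ollama formats (Q4_0, Q4_K, etc.) quantize weights in small blocks with per-block scales, so this only illustrates the idea, not the actual scheme:

    import numpy as np

    # Toy symmetric 4-bit quantization of a small weight matrix.
    w = np.random.randn(4, 8).astype(np.float32)      # "full precision" weights

    scale = np.abs(w).max() / 7                       # signed 4-bit range is -8..7
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)   # quantized integers
    w_hat = q.astype(np.float32) * scale              # dequantized approximation

    print("max abs error:", float(np.abs(w - w_hat).max()))
    print("fp32 bytes:", w.nbytes, "-> ~4-bit bytes:", q.size // 2 + 4)  # packed nibbles + one scale

Each weight shrinks from 4 bytes to roughly half a byte plus a shared scale, at the cost of a small rounding error per weight - which is why a 70B model can fit on a laptop at all.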


Thanks, I got all but the 70b model to work. It slows to a crawl on a Mac with 36 GB RAM.


Cool. I am running this on an M2 Max with 64GB. Here is how it looks in my terminal [1]. Btw, the very first run after downloading the model is slightly slow, but subsequent runs are fine.

[1] https://asciinema.org/a/fFbOEfeTxRShBGbqslwQMfJS4 (Note: this recording is at real-time speed, not sped up.)


I am returning the M3 Max 36 GB and picking this model instead. Saves me a grand and it seems to be much more capable.


What do you mean you are returning it? It has been used already.


    ollama run codellama:70b
    pulling manifest
    pulling 1436d66b6   1.1 GB / 38 GB   24 MB/s   25m21s


Do you know how much VRAM is required?


If you are asking about Apple silicon Macs, they have integrated GPUs and no dedicated graphics memory - the CPU and GPU share unified memory (UMA).

To run the 4-bit quantized model with 70B parameters, you will need around 35 GB of RAM just to load it into memory. So I would say a Mac with at least 48 GB of memory - that is an M3 Max. (Rough math below.)
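The 35 GB figure is just parameter-count arithmetic, weights only - the KV cache, context, and runtime overhead come on top of it, which is why the Ollama download above shows ~38 GB:

    # Rough memory estimate for a 4-bit quantized 70B model (weights only).
    params = 70e9              # 70 billion parameters
    bits_per_weight = 4        # 4-bit quantization

    bytes_total = params * bits_per_weight / 8
    print(f"{bytes_total / 1e9:.0f} GB  (~{bytes_total / 2**30:.1f} GiB)")
    # -> 35 GB (~32.6 GiB), before KV cache and other runtime overhead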



