Groq is serving an LLM from (100s of chips' worth of) SRAM, so the effective bandwidth, and thus token generation speed, is an order of magnitude higher than HBM. This would 3.5x their speed as well; it's orthogonal.
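To make the bandwidth-bound argument concrete, here's a rough back-of-envelope sketch. The model size and bandwidth figures are illustrative placeholders I picked, not Groq's or any GPU's actual specs; the point is only that batch-1 decoding has to stream roughly the full weights per token, so tokens/sec is bounded by bandwidth divided by model bytes.

```python
# Back-of-envelope: batch-1 decode is roughly memory-bandwidth-bound, since
# every generated token streams (most of) the weights once.
# All numbers below are illustrative placeholders, not real hardware specs.

model_bytes = 70e9 * 2   # e.g. a 70B-parameter model in 16-bit weights
hbm_bw      = 3.3e12     # ~3.3 TB/s, ballpark for a single modern HBM-backed GPU
sram_bw     = 80e12      # hypothetical aggregate on-chip SRAM bandwidth across many chips

def tokens_per_sec(bandwidth_bytes_per_sec, model_bytes):
    """Upper bound on decode speed if weight streaming is the bottleneck."""
    return bandwidth_bytes_per_sec / model_bytes

print(f"HBM-bound:  ~{tokens_per_sec(hbm_bw, model_bytes):.0f} tok/s")
print(f"SRAM-bound: ~{tokens_per_sec(sram_bw, model_bytes):.0f} tok/s")
```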
My understanding is that theirs is a pure hardware solution. The hardware is flexible enough to model any current NN architecture.
(Incidentally, there are black-box optimization algorithms, so a system as good as Groq at inference might be useful for training even if it can't support gradient descent.)
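As a minimal sketch of the kind of black-box method that parenthetical is gesturing at: a simple evolution-strategies loop needs only many fast forward evaluations (i.e. inference), no backward pass. The toy objective and all parameter choices below are hypothetical, just to show the shape of the loop.

```python
import numpy as np

# Evolution strategies: optimize parameters using only forward evaluations
# (loss queries), no gradients -- the kind of black-box training loop an
# inference-only accelerator could in principle serve. Toy objective below.

def loss(params):
    # Stand-in for "run the model forward and score it"; here a toy quadratic.
    return float(np.sum((params - 3.0) ** 2))

def evolution_strategies(dim=10, pop=50, sigma=0.1, lr=0.02, steps=200):
    theta = np.zeros(dim)
    for _ in range(steps):
        noise = np.random.randn(pop, dim)            # candidate perturbations
        scores = np.array([loss(theta + sigma * n) for n in noise])
        scores = (scores - scores.mean()) / (scores.std() + 1e-8)
        # Lower loss is better, so step against the score-weighted noise.
        theta -= lr / (pop * sigma) * noise.T @ scores
    return theta

print(evolution_strategies())  # drifts toward ~3.0 in each coordinate
```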
According to someone I talked to at a Groq event I was invited to (I did not sign an NDA), they are putting ~8 racks of hardware per LLM. Of course, coordinating those racks to have exact timings between them to pull tokens through is definitely "part of the hard part".