Groq is serving an LLM from (100s of chips' worth of) SRAM, so the effective bandwidth, and thus token generation speed, is an order of magnitude higher than HBM. This would 3.5x their speed as well; it's orthogonal.
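To make the bandwidth-bound argument concrete, here's a rough back-of-envelope sketch. The model size and bandwidth figures are illustrative placeholders I picked, not Groq's or any GPU's actual specs; the point is only that batch-1 decoding has to stream roughly the full weights per token, so tokens/sec is bounded by bandwidth divided by model bytes.

```python
# Back-of-envelope: batch-1 decode is roughly memory-bandwidth-bound, since
# every generated token streams (most of) the weights once.
# All numbers below are illustrative placeholders, not real hardware specs.

model_bytes = 70e9 * 2   # e.g. a 70B-parameter model in 16-bit weights
hbm_bw      = 3.3e12     # ~3.3 TB/s, ballpark for a single modern HBM-backed GPU
sram_bw     = 80e12      # hypothetical aggregate on-chip SRAM bandwidth across many chips

def tokens_per_sec(bandwidth_bytes_per_sec, model_bytes):
    """Upper bound on decode speed if weight streaming is the bottleneck."""
    return bandwidth_bytes_per_sec / model_bytes

print(f"HBM-bound:  ~{tokens_per_sec(hbm_bw, model_bytes):.0f} tok/s")
print(f"SRAM-bound: ~{tokens_per_sec(sram_bw, model_bytes):.0f} tok/s")
```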
My understanding is that theirs is a pure hardware solution. The hardware is flexible enough to model any current NN architecture.
(Incidentally, there are black-box optimization algorithms, so a system as good as Groq at inference might be useful for training even if it can't support gradient descent.)
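As a minimal sketch of the kind of black-box method that parenthetical is gesturing at: a simple evolution-strategies loop needs only many fast forward evaluations (i.e. inference), no backward pass. The toy objective and all parameter choices below are hypothetical, just to show the shape of the loop.

```python
import numpy as np

# Evolution strategies: optimize parameters using only forward evaluations
# (loss queries), no gradients -- the kind of black-box training loop an
# inference-only accelerator could in principle serve. Toy objective below.

def loss(params):
    # Stand-in for "run the model forward and score it"; here a toy quadratic.
    return float(np.sum((params - 3.0) ** 2))

def evolution_strategies(dim=10, pop=50, sigma=0.1, lr=0.02, steps=200):
    theta = np.zeros(dim)
    for _ in range(steps):
        noise = np.random.randn(pop, dim)            # candidate perturbations
        scores = np.array([loss(theta + sigma * n) for n in noise])
        scores = (scores - scores.mean()) / (scores.std() + 1e-8)
        # Lower loss is better, so step against the score-weighted noise.
        theta -= lr / (pop * sigma) * noise.T @ scores
    return theta

print(evolution_strategies())  # drifts toward ~3.0 in each coordinate
```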
According to someone I talked to at a Groq event I was invited to (I did not sign an NDA), they are putting ~8 racks of hardware per LLM. Of course, coordinating those racks to have exact timings between them to pull tokens through is definitely "part of the hard part".