Many of us grew up in the PLD era, K-maps and all. Woz pushed early HW to its limits, with SW APIs that delivered real value, and he made astute design trade-offs based on full-stack knowledge his peers lacked. The world has moved on to the GPU era (low-precision, accelerated parallel compute), but the Woz viewpoint still holds. You can see it in the AI kernel optimizations and rematerialization methods that push GPU HW to its new limits, where trade-offs still need to be made. GPU HW for 4-bit QAT, or even 2-bit, will dramatically affect the SW (AI) of this era. What trade-offs do you make?
I saw Woz on Northbound 280 “driving” his cherry red Model S, using FSD. He was looking down at the screen the whole time I watched him. Swear he had ssh’d into it.
[Edit: Gabe responded.] See this Cloud Run spending-cap recommendation [0], which is to disable billing; that can irreversibly delete resources, but it does cap spend!
Not sure what “official” means, but I’d direct you to the GCP MaxText [0] framework. It isn’t what this GDM paper is referring to, but the repo contains various attention implementations in MaxText/layers/attentions.py.
Please don’t do this, Zach. We need to encourage more investment in the overall EDA market, not less. Garry’s pitch is meant for the dreamers; we should all be supportive. It’s a big boat.
Would appreciate the collective energy being spent instead on adding to or refining Garry’s request.
## How can we fit so much more compute on the silicon?
The NVIDIA H200 has 989 TFLOPS of FP16/BF16 compute without sparsity. This is state-of-the-art (more than even Google’s new Trillium chip), and the GB200 launching in 2025 has only 25% more compute (1,250 TFLOPS per die).
Since the vast majority of a GPU’s area is devoted to programmability, specializing on transformers lets you fit far more compute. You can prove this to yourself from first principles:
It takes 10,000 transistors to build a single FP16/BF16/FP8 multiply-add circuit, the building block for all matrix math. The H100 SXM has 528 tensor cores, and each has $4 \times 8 \times 16$ FMA circuits. Multiplying tells us the H100 has 2.7 billion transistors dedicated to tensor cores.
*But an H100 has 80 billion transistors! This means only 3.3% of the transistors on an H100 GPU are used for matrix multiplication!*
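As a sanity check, here is that arithmetic written out as a small Python snippet. The per-FMA transistor count and per-core FMA count are the figures quoted above, not independently verified:

```python
# Rough transistor budget of the H100's tensor cores, using the numbers quoted above.
TRANSISTORS_PER_FMA = 10_000        # quoted estimate for one FP16/BF16/FP8 FMA circuit
TENSOR_CORES = 528                  # H100 SXM
FMA_PER_CORE = 4 * 8 * 16           # 512 FMA circuits per tensor core

tensor_core_transistors = TENSOR_CORES * FMA_PER_CORE * TRANSISTORS_PER_FMA
fraction = tensor_core_transistors / 80e9          # H100 has ~80B transistors total

print(f"{tensor_core_transistors / 1e9:.1f}B transistors in FMA circuits")  # ~2.7B
print(f"{fraction:.1%} of the chip")                                        # ~3.3-3.4%
```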
This is a deliberate design decision by NVIDIA and the makers of other flexible AI chips. If you want to support all kinds of models (CNNs, LSTMs, SSMs, and others), you can’t do much better than this.
By only running transformers, we can fit way more FLOPS on our chip, without resorting to lower precisions or sparsity.
## Isn’t memory bandwidth the bottleneck on inference?
Honestly, all this math sounds a bit fishy to me. An H200 has about 5 TB/s of memory bandwidth. If we assume a pure matrix-multiply workload where each multiply-add has to fetch 2 FP16 values (4 bytes), we are capped at about 1.25 T multiply-adds per second. Even in the best case, where one of the operands is cached and the other is an FP8, we are only at roughly 5 T ops/s, which is way less than what the H200 can do.
I don't get how throwing more ALUs at the problem would make things better; it's very much bandwidth-constrained.
That’s why Groq exists; it has a ton of SRAM on chip.
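To make that estimate concrete, here is a minimal sketch of the zero-reuse roofline described above, counting each multiply-add as one op the way the comment does (the reply below is about why real matmuls get far more operand reuse than this):

```python
# Bandwidth-only cap on matmul throughput if every multiply-add streams its
# operands from HBM with no reuse (the assumption in the comment above).
HBM_BANDWIDTH = 5e12                 # ~5 TB/s on an H200

def mac_rate(bytes_per_mac: float) -> float:
    """Multiply-adds per second when each MAC must move this many bytes from HBM."""
    return HBM_BANDWIDTH / bytes_per_mac

both_fp16 = mac_rate(2 + 2)          # two fresh FP16 operands  -> ~1.25e12 MAC/s
one_fp8   = mac_rate(1)              # one operand cached, the other FP8 -> ~5e12 MAC/s

print(f"{both_fp16 / 1e12:.2f} T MAC/s with both FP16 operands streamed")
print(f"{one_fp8 / 1e12:.2f} T MAC/s with one cached operand and one FP8")
```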
For [M, K] @ [K, N], reads are O(MK + NK) and compute is O(MNK).
A quick estimate for compute/bandwidth is min(M, N, K). M is the batch size, so they can just blow that up to get nice-looking numbers. On Llama 70B, min(N, K) is 3584 and 7168 for matmuls 1 and 2.
Groq needs a ton of SRAM because they optimized for batch size 1 latency, so M is very small.
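A minimal sketch of that compute-to-bytes estimate. The 8192-wide square weight matrix here is a hypothetical example, not the actual Llama 70B shapes from the comment above, and the output write is included, which the O(MK + NK) read count leaves out:

```python
# FLOPs per byte moved for C = A @ B with A:[M, K], B:[K, N] in FP16,
# assuming each matrix touches HBM exactly once (perfect on-chip reuse).
def arithmetic_intensity(M: int, K: int, N: int, bytes_per_elem: int = 2) -> float:
    flops = 2 * M * N * K                                    # one multiply + one add per MAC
    bytes_moved = bytes_per_elem * (M * K + K * N + M * N)   # read A, read B, write C
    return flops / bytes_moved

# The ratio roughly tracks min(M, N, K): tiny batches (M) sit on the bandwidth
# roof, large batches push the same weights toward the compute roof.
for M in (1, 16, 256, 4096):
    print(f"M={M:5d}: {arithmetic_intensity(M, K=8192, N=8192):7.1f} FLOPs/byte")
```

At M=1 this comes out to about 1 FLOP/byte (decode is firmly bandwidth-bound, which is where Groq's SRAM helps), while M in the thousands gives thousands of FLOPs/byte, well past the roughly 989/5 ≈ 200 FLOPs-per-byte crossover implied by the H200 numbers above.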
This is a datacenter chip. HVAC requirements are more interesting IMO; they seem to be targeting air-cooled edge deployments with that card. They’ll probably wind up with a baseboard design similar to the early v4i TPUs.