
If this is the full fp16 quant, you'd need 2TB of memory to use with the full 131k context.

With 44GB of SRAM per Cerebras chip, you'd need 45 chips chained together. $3m per chip. $135m total to run this.

For comparison, you can buy a DGX B200 with 8x B200 Blackwell chips and 1.4TB of memory for around $500k. Two systems would give you 2.8TB memory which is enough for this. So $1m vs $135m to run this model.
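
Back-of-envelope version of that math, purely as a sketch (reusing the memory total and prices quoted above; none of these are authoritative figures):

  # Rough memory/cost math using the numbers above (not authoritative)
  total_mem_gb   = 2000         # ~2TB for fp16 weights plus 131k context
  cerebras_sram  = 44           # GB of on-chip SRAM per wafer
  cerebras_price = 3_000_000    # $ per system, per the (disputed) figure above
  dgx_mem_gb     = 1400         # 8x B200 -> ~1.4TB per DGX B200
  dgx_price      = 500_000

  n_wafers = -(-total_mem_gb // cerebras_sram)   # ceil -> 46, i.e. ~45 give or take rounding
  n_dgx    = -(-total_mem_gb // dgx_mem_gb)      # ceil -> 2
  print(n_wafers * cerebras_price, n_dgx * dgx_price)   # roughly $135-140M vs $1M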

It's not very scalable unless you have some ultra-high-value task that needs super fast inference speed. Maybe hedge funds or some sort of financial markets?

PS. The reason why I think we're only in the beginning of the AI boom is because I can't imagine what we can build if we can run models as good as Claude Opus 4 (or even better) at 1500 tokens/s for a very cheap price and tens of millions of context tokens. We're still a few generations of hardware away I'm guessing.



> With 44GB of SRAM per Cerebras chip, you'd need 45 chips chained together. $3m per chip. $135m total to run this.

That's not how you would do it with Cerebras. 44GB is SRAM, i.e. on-chip memory, not the HBM where you would store most of the params. For reference, one GB200 has only 126MB of SRAM; if you tried to estimate how many GB200s you would need for a 2TB model just by looking at the L2 cache size, you would get ~16k GB200s, aka ~$600M, obviously way off.

Cerebras uses a different architecture than Nvidia, where the HBM is not directly packaged with the chips; it is handled by a different system, so you can scale memory and compute separately. Specifically, you can use something like MemoryX to act as your HBM, connected via high-speed interconnect to the chips' SRAM, see [1]. I'm not at all an expert in Cerebras, but IIRC you can connect up to something like 2PB of memory to a single Cerebras chip, so almost 1000x the FP16 model.

[1]: https://www.cerebras.ai/blog/announcing-the-cerebras-archite...


> That's not how you would do it with Cerebras. 44GB is SRAM, so on chip memory, not HBM memory where you would store most of the params.
Yes but Cerebras achieves its speed by using SRAM.


There is no way not to use SRAM on a GPU/Cerebras/most accelerators. This is where the cores fetch the data.

But that doesn't mean you are only using SRAM; that would be impractical, just like using a CPU by only storing stuff in the L3 cache and never going to RAM. Unless I am missing something from the original link, I don't know how you got to the conclusion that they only used SRAM.


> Just like using a CPU just by storing stuff in the L3 cache and never going to the RAM. Unless I am missing something from the original link, I don’t know how you got to the conclusion that they only used SRAM.

That's exactly how Graphcore's current chips work, and I wouldn't be surprised if that's how Cerebras's wafer works. It's probably even harder for Cerebras to use DRAM because each chip in the wafer is "landlocked" and doesn't have an easy way to access the outside world. You could go up or down, but down is used for power input and up is used for cooling.

You're right it's not a good way to do things for memory-hungry models like LLMs, but all of these chips were designed before it became obvious that LLMs are where the money is. Graphcore's next chip (if they are even still working on it) can access a mountain of DRAM with very high bandwidth. I imagine Cerebras will be working on that too. I wouldn't be surprised if they abandon WSI entirely due to needing to use DRAM.


I know Groq chips load the entire model into SRAM. That's why it can be so fast.

So if Cerebras uses HBM to store the model but streams weights into SRAM, I really don't see the long-term advantage over smaller chips like GB200, since both architectures use HBM.

The whole point of having a wafer chip is that you limit the need to reach out to external parts for memory since that's the slow part.


> I really don't see the advantage long term over smaller chips like GB200 since both architectures use HBM.

I don't think you can look at those things binarily. 44GB of SRAM is still a massive amount. You don't need infinite SRAM to get better performance. There is a reason Nvidia keeps increasing the L2 cache size with every generation rather than just sticking with 32MB; it wouldn't if having a bit more really changed nothing. The more SRAM you have, the more you are able to mask communication behind computation. You can imagine with 44GB being able to load the weights of layer N+1 into SRAM while computing layer N, thereby entirely negating the penalty of going to HBM (same idea as FSDP).
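
A minimal sketch of that prefetch idea, with hypothetical fetch_weights/compute helpers (nothing Cerebras-specific): start copying layer N+1's weights into fast memory while layer N is computing, so the HBM transfer hides behind the matmuls.

  from concurrent.futures import ThreadPoolExecutor

  def forward(layers, fetch_weights, compute, x):
      # fetch_weights: copies one layer's weights HBM -> SRAM (hypothetical helper)
      # compute: runs one layer given its resident weights and the activations
      with ThreadPoolExecutor(max_workers=1) as copier:
          pending = copier.submit(fetch_weights, layers[0])       # preload layer 0
          for i, layer in enumerate(layers):
              w = pending.result()                                # this layer's weights are resident
              if i + 1 < len(layers):                             # overlap the next copy
                  pending = copier.submit(fetch_weights, layers[i + 1])
              x = compute(layer, w, x)                            # copy hides behind compute
      return x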


> You can imagine with 44GB being able to load the weights of layer N+1 into SRAM while computing layer N, thereby entirely negating the penalty of going to HBM (same idea as FSDP).

You would have to have an insanely fast bus to prevent I/O stalls with this. With a 235B fp16 model you'd be streaming 470GiB of data every graph execution. To do that at 1000 tok/s, you'd need a bus that can deliver a sustained ~500 TiB/s. Even if you do a 32-wide MoE model, that's still about 15 TiB/s of bandwidth you'd need from the HBM to avoid stalls at 1000 tok/s.
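
Spelling out that arithmetic (assumes dense fp16 weights streamed once per token at batch size 1, and that a "32-wide" MoE touches roughly 1/32 of the weights):

  params, bytes_per_param, tok_per_s = 235e9, 2, 1000
  bytes_per_token = params * bytes_per_param          # ~470 GB of weights per token
  dense_tib_s = bytes_per_token * tok_per_s / 2**40   # ~430 TiB/s sustained, same ballpark as the ~500 above
  moe_tib_s   = dense_tib_s / 32                      # ~13 TiB/s if only 1/32 of the experts are active
  print(f"dense: ~{dense_tib_s:.0f} TiB/s, 32-wide MoE: ~{moe_tib_s:.0f} TiB/s")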

It would seem like this either isn’t fp16 or this is indeed likely running completely out of SRAM.

Of course, Cerebras doesn't use a dense representation, so these memory numbers could be way off, and maybe it is an SRAM+DRAM combo.


> I don’t know how you got to the conclusion that they only used SRAM.

Because they are doing 1,500 tokens per second.


What are the bandwidth/latency of MemoryX? Those are the key parameters for inference.


Well, comparing MemoryX to H100 HBM3, the key details are that MemoryX has lower latency, but also far lower bandwidth. However, the memory on Cerebras scales a lot further than on Nvidia: you need a cluster of H100s just to fit a model, as that's the only way to scale the memory, so Cerebras is more suited to that aspect. Nvidia does its scaling in tooling, while Cerebras does theirs in design, via their silicon approach.

That's my take on it all; there aren't many apples-to-apples comparisons to work from for these two systems, which aren't even rolling down the same slope.


No way an off-chip HBM has the same or better bandwidth than on-chip.


> MemoryX has lower latency, but also far lower bandwidth


Yeah sure, but if you do that you are heavily dropping the tokens/s for a single user. The only way to recover from that is continuous batching. This could still be interesting if the KV caches of all users fit in SRAM, though.


> but if you do that you are heavily dropping the token/s for a single user.

I don't follow what you are saying and what "that" is specifically. Assuming it's referencing using HBM and not just SRAM: this is not optional on a GPU, SRAM is many orders of magnitude too small. Data is constantly flowing between HBM and SRAM by design, and to get data in/out of your GPU you have to go through HBM first; you can't skip that.

And while it is quite massive on a Cerebras system it is also still too small for very large models.


> With 44GB of SRAM per Cerebras chip, you'd need 45 chips chained together. $3m per chip. $135m total to run this.

That on-chip SRAM is purely temporary working memory and does not need to hold the entire model weights. The Cerebras chip works on a sparse weight representation, streams the non-zero weights from their external memory server, and the cores work in a transport-triggered dataflow manner.


There is no reason to run models for inference at static fp16. Modern quantisation formats dynamically assign precision to the layers that need it; an average of 6bpw is practically indistinguishable from full precision, 8bpw if you really want to squeeze every tiny last drop out of it (although it's unlikely to be detectable). That is a huge memory saving.
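
Weights-only math for a 235B-parameter model (ignoring KV cache and activations):

  params = 235e9
  for bpw in (16, 8, 6):
      print(f"{bpw} bpw: ~{params * bpw / 8 / 2**30:.0f} GiB")
  # 16 bpw: ~438 GiB, 8 bpw: ~219 GiB, 6 bpw: ~164 GiB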


> dynamically assign precision to the layers that need them

Well now I'm curious; how is a layer judged on its relative need for precision? I guess I still have a lot of learning to do w.r.t. how quantization is done. I was under the impression it was done once, statically, and produced a new giant GGUF blob or whatever format your weights are in. Does that assumption still hold true for the approach you're describing?


Last I checked, they ran some sort of evals before and after quantisation and measured the effect. E.g. ExLlamaV2 measures the loss while reciting Wikipedia articles.


Within the GGUF (and some other formats) you'll see each layer gets its own quantisation; for example, embedding layers are usually more sensitive to quantisation and as such are often kept at Q8 or FP16. If you run gguf-dump or click on the GGUF icon on a model on Hugging Face you'll see it.
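
If you have a GGUF file locally, a rough sketch of listing the per-tensor types with the gguf Python package that ships with llama.cpp (attribute names as I remember them, worth double-checking):

  from gguf import GGUFReader

  reader = GGUFReader("model.gguf")          # path to any local GGUF file
  for t in reader.tensors:
      # e.g. "token_embd.weight Q8_0" next to "blk.0.ffn_down.weight Q4_K"
      print(t.name, t.tensor_type.name)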



What quantization formats are these? All the OSS ones from GGML apply a uniform quantization


GGML hasn't been a thing for some time, and GGUF (its successor) has features such as "importance matrix" quantization that is all about quantizing adaptively. Then there's all the stuff that Unsloth does, e.g.: https://unsloth.ai/blog/dynamic-v2



No they don't. GGML is non-uniform. Each layer gets its own level of quantisation.


Our chips don't cost $3M. I'm not sure where you got that number but it's wildly incorrect.


So how much does it cost? A Google search returns $3m. Here's your chance to tell us your real price if you disagree.


He also didn't argue about the rest of the math so it's likely correct that the whole model needs to be in SRAM :)


Agree. The OP is picking from dated and not exactly applicable data. I estimate you could be down to 20% of that by now if you were optimizing for costs. A real issue for you guys is software-stack tractability, i.e. the ability of your team to bring models on board in a timely manner. Maybe it's because all models are optimized for GPUs, but it's something I would get on top of if it's fixable. Obviously, you must be taking these issues, and competitive performance, into account in future iterations of your chips as well.


Is it actually $4M?


Do you distinguish between "chips" and the wafer-scale system? Is the wafer-scale system significantly less than 3MM?

EDIT: Online it seems TSMC prices are about 25K-30K per wafer. So even 10Xing that, a wafer-scale system should be about 300K.


Are you the CEO of Cerebras? (Guessing from the handle)


I wonder why he (Andrew Feldman) didn't respond to the incorrect SRAM vs HBM assumption made by the OP comment; maybe he was so busy that he couldn't even cite the sibling comment? That's a bigger wrong assumption than being off by maybe 30-50% at most on Cerebras's single server price (it definitely doesn't cost less than $1.5-2M).


I have followed Cerebras for some time. Several comments:

1. Yes, I think that is Feldman. I have seen him intervene on Hacker News before, though I don't remember the handle specifically.

2. Yes, the OP's technical assumption is generally correct. Cerebras loads the model onto the wafer to get the speed; it's the whole point of their architecture to minimize the distance between memory and compute. They can do otherwise in a "low cost" mode (they announced something like that in a partnership with Qualcomm that AFAIK has never been implemented), but it would not be a high-speed mode.

3. The OP is also incorrect on the costs. They pick these costs up from dated customer quotations seen online (in which Cerebras has an incentive to jack the price up), but this is not how anything works commercially, and at that time Cerebras was at a much smaller scale. You wouldn't expect Feldman to tell you what their actual costs are; that would be nuts. My thinking is the number could be off by up to 80% by now, assuming Cerebras has been making progress on its cost curve and the original number had very high margins (which it must have).


Probably because they are loading the entire model into SRAM. That's how they can achieve 1.5k tokens/s.


Congrats on the Qwen3 launch, and also ty for the exploration tier. Makes our life a lot easier.

Any plan/ETA for launching its big brother (Qwen3-code)?


In that case, mind providing a more appropriate ballpark?


So, does that mean that in general, for the most modern high-end LLM tools, to generate ~1500 tokens per second you need around $500k in hardware?

Checking: Anthropic charges $70 per 1 million output tokens. @1500 tokens per second that would be around 10 cents per second, or around $8k per day.

The $500k sounds about right then, unless I’m mistaken.
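
Sanity check, assuming 100% utilization and the prices as stated (just a sketch; it ignores batching, power, margins, etc.):

  usd_per_mtok, tok_per_s, hw_cost = 70, 1500, 500_000
  usd_per_s   = usd_per_mtok * tok_per_s / 1e6     # ~$0.105/s
  usd_per_day = usd_per_s * 86_400                 # ~$9k/day
  print(f"${usd_per_day:,.0f}/day, ~{hw_cost / usd_per_day:.0f} days to cover the hardware")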


62 days to break even, that would be a great investment


I think you're missing an important aspect: how many users do you want to support?

> For comparison, you can buy a DGX B200 with 8x B200 Blackwell chips and 1.4TB of memory for around $500k. Two systems would give you 2.8TB memory which is enough for this.

That would be enough to support a single user. If you want to host a service that provides this to 10k users in parallel, your cost per user scales linearly with the GPU costs you posted. But we don't know how many users a comparable wafer-scale deployment can scale to (aside from the fact that the costs you posted for that are disputed by users down the thread as well), so your comparison is kind of meaningless in that way; you're missing data.


> That would be enough to support a single user. If you want to host a service that provides this to 10k users in parallel your cost per user scales linearly with the GPU costs you posted.

No. The magic of batching allows you to handle multiple user requests in parallel using the same weights, with little VRAM overhead per user.
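
Rough illustration: the weights are shared across the whole batch and only the KV cache is per user (made-up GQA config below, not Qwen3's actual one):

  n_layers, n_kv_heads, head_dim, kv_bytes = 90, 8, 128, 2      # hypothetical config, fp16 cache
  per_token = 2 * n_layers * n_kv_heads * head_dim * kv_bytes   # K and V
  per_user  = per_token * 32_000 / 2**30                        # at a 32k context
  print(f"~{per_user:.1f} GiB of KV cache per user vs ~470 GB of shared weights")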


Almost everyone runs LLM inference at fp8 - for all of the open models anyway. You only see performance drop off below fp8.


Isn't it usually mixed? I understood that Apple even uses fp1 or fp2 on the hardware-embedded models they ship on their phones, but as far as I know it's typically a whole bunch of different precisions.


Small bit of pedantry: While there are 1 and 2-bit quantized types used in some aggressive schemes, they aren't floating point so it's inaccurate to preface them with FP. They are int types.

The smallest real floating point type is FP4.

EDIT: Who knew that correctness is controversial. What a weird place HN has become.


I wonder if the fact that we use "floating point" is itself a bottleneck that can be improved.

Remembering my CS classes, storing an FP value requires the mantissa and the exponent; that's a design decision. Also remembering some assembler classes, int arithmetic is way faster than FP.

Could there be a better "representation" for the numbers needed in NNs that would provide the accuracy of floating point but allow faster operations? (Maybe even allow performing the required operations as bitwise ops, kind of like left/right shifting to double/halve ints.)


Sure, we have integers of many sizes, fixed point, and floating point, all of which are used in neural networks. Floating point is ideal when the scale of a value can vary tremendously, which is of obvious importance for gradient descent, and afterwards we can quantize to some fixed size.

A modern processor can do something similar to an integer bit shift about as quickly with a floating-point value, courtesy of FSCALE instructions and the like. Indeed, modern processors are extremely performant at floating-point math.


A common FP4 layout is 1 sign bit, 2 exponent bits, and 1 mantissa bit (E2M1). There's just not that much difference in complexity between that and a 4-bit integer: the ALU can just be a simple table lookup, for both FP and integer.
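
As a toy illustration of the table-lookup point, all 16 values of an E2M1 FP4 code can be enumerated up front (a sketch, not any particular vendor's tables):

  def fp4_e2m1(code: int) -> float:
      sign = -1.0 if code & 0b1000 else 1.0
      exp  = (code >> 1) & 0b11
      man  = code & 0b1
      if exp == 0:                      # subnormals: 0 and 0.5
          return sign * man * 0.5
      return sign * (1.0 + 0.5 * man) * 2.0 ** (exp - 1)

  FP4_TABLE = [fp4_e2m1(c) for c in range(16)]
  # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0, -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]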


> Could there be a better "representation" for the numbers needed in NN

Yes. Look up “block floating point”.
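
A toy version of block floating point: one shared exponent per block plus small integer mantissas (made-up block contents and mantissa width; real formats like the MX ones differ in the details):

  import math

  def bfp_quantize(block, mantissa_bits=8):
      max_mag = max(abs(x) for x in block) or 1.0
      shared_exp = math.ceil(math.log2(max_mag))          # one exponent for the whole block
      scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
      mants = [round(x / scale) for x in block]           # small ints -> cheap integer math
      return shared_exp, mants, scale

  exp, mants, scale = bfp_quantize([0.12, -3.5, 0.007, 1.25])
  approx = [m * scale for m in mants]                     # [0.125, -3.5, 0.0, 1.25]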


Shit I'd love to do R&D on this.


You're assuming that the whole model has to be in SRAM.


The metric of run/not-run is too simplistic. You have to divide out the total throughput the system gives to all concurrent users (which we don't know). A golf cart can get you from New York to LA the same as a train, but the unit economics of the train are a lot more favorable, despite its increased cost. The minimum deployment scale is not irrelevant; it may make an on-prem solution infeasible for most customers, for example, but if you are selling tokens via a big cloud API it doesn't really matter.


I agree there will be some breakthrough (maybe by Nvidia or maybe someone else) that allows these models to run insanely cheap and even locally on a laptop. I could see a hardware company coming out with some sort of specialized card that is just for consumer-grade inference for common queries. That way the cloud can be used for server-side inference and training.


1500 tokens/s is 5.4 million per hour. According to the document it costs $1.20 x 5.4 = $6.48 per hour.

Which is not enough to even pay the interest on one $3m chip.

What am I missing here ?


Indeed, and even if the cost per wafer was 300K, since about 20-50 wafers are needed, it's still 6MM to 15MM for the system. So it would appear this is likely VC-subsidized.


Exactly what I was thinking.

What sort of latency do you think one would get with 8x B200 Blackwell chips? Do you think 1500 tokens/sec would be achievable in that setup?


>Maybe hedge funds or some sort of financial markets?

I'd think that HFT is already mature and doesn't really benefit from this type of model.


True, but if the hardware could be “misused” for HFT, it’d be awesome.


>Maybe hedge funds or some sort of financial markets?

Definitely not hedge funds / quant funds.

You'd just buy a dgx


> We're still a few generations of hardware away I'm guessing.

I don't know; I think we could be running models "as good as" Claude Opus 4, a few years down the line, with a lot less hardware — perhaps even going backwards, with "better" later models fitting on smaller, older — maybe even consumer-level — GPUs.

Why do I say this? Because I get the distinct impression that "throwing more parameters at the problem" is the current batch of AI companies' version of "setting money on fire to scale." These companies are likely leaving huge amounts of (almost-lossless) optimization on the table, in the name of having a model now that can be sold at huge expense to those few customers who really want it and are willing to pay (think: intelligence agencies automating real-time continuous analysis of the conversations of people-of-interest). Having these "sloppy but powerful" models, also enables the startups themselves to make use of them in expensive one-time batch-processing passes, to e.g. clean and pluck outliers from their training datasets with ever-better accuracy. (Think of this as the AI version of "ETL data migration logic doesn't need to be particularly optimized; what's the difference between it running for 6 vs 8 hours, if we're only ever going to run it once? May as well code it in a high-level scripting language.")

But there are only so many of these high-value customers to compete over, and only so intelligent these models need to get before achieving perfect accuracy on training-set data-cleaning tasks can be reduced to "mere" context engineering / agentic cross-validation. At some point, an inflection point will be passed where the marginal revenue to be earned from cost-reduced volume sales outweighs the marginal revenue to be earned from enterprise sales.

And at that point, we'll likely start to see a huge shift in in-industry research in how these models are being architected and optimized.

No longer would AI companies set their goal in a new model generation first as purely optimizing for intelligence on various leaderboards (ala the 1980s HPC race, motivated by serving many of the same enterprise customers!), and then, leaderboard score in hand, go back and re-optimize to make the intelligent model spit tokens faster when run on distributed backplanes (metric: tokens per watt-second).

But instead, AI companies would likely move to a combined optimization goal of training models from scratch to retain high-fidelity intelligent inference capabilities on lower-cost substrates — while minimizing work done [because that's what OEMs running local versions of their models want] and therefore minimizing "useless motion" of semantically-meaningless tokens. (Implied metric: bits of Shannon informational content generated per (byte-of-ram x GPU FLOP x second)).



