
They should reintroduce the 3D V-Cache [0] variants (X) in EPYC, with a higher cache/core ratio, which were present in EPYC4 (e.g. the 9684X [1]) but for some reason aren't available in EPYC5.

Makes a massive difference at high density and utilisation; with the standard cache/core ratio, performance can really degrade under load.

[0] https://www.amd.com/en/products/processors/technologies/3d-v...

[1] https://www.amd.com/en/products/processors/server/epyc/4th-g...


It's fundamentally because of verifier's law [0].

Current AI, and RL-based approaches in particular, has already achieved, or will soon achieve, superhuman performance on problems that can be verified and measured quickly.

So maths, algorithms, etc., and well-defined bugs fall into that category.

However, architectural decisions, design and long-term planning, where there is little data, no model allowing synthetic data generation, and long iteration cycles, are much less amenable to it.

[0] https://www.jasonwei.net/blog/asymmetry-of-verification-and-...


> std::hardware_destructive_interference_size Exists so you don't have to guess, although in practice it'll basically always be 64.

Unfortunately that's not quite true, due to e.g. spatial prefetching [0]. See e.g. Folly's definition [1].
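To illustrate (just a sketch, not Folly's API; kDestructiveInterference is a made-up constant): on recent Intel parts the L2 spatial prefetcher pulls cache lines in 128-byte-aligned pairs, so padding to the 64 reported by std::hardware_destructive_interference_size can still leave two "independent" fields ping-ponging between cores. Folly hardcodes a larger value for exactly this reason.

    #include <atomic>
    #include <cstddef>
    #include <cstdint>

    // Made-up constant for the sketch: pad to 128 bytes rather than trusting
    // std::hardware_destructive_interference_size (usually 64), to stay clear
    // of adjacent-line/spatial prefetching.
    constexpr std::size_t kDestructiveInterference = 128;

    struct Counters {
      alignas(kDestructiveInterference) std::atomic<std::uint64_t> produced{0};
      alignas(kDestructiveInterference) std::atomic<std::uint64_t> consumed{0};
    };

    static_assert(sizeof(Counters) >= 2 * kDestructiveInterference);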

[0] https://community.intel.com/t5/Intel-Moderncode-for-Parallel...

[1] https://github.com/facebook/folly/blob/d2e6fe65dfd6b30a9d504...


> There's a good reason so much research is done on Nvidia clusters and not TPU clusters.

You are aware that Gemini was trained on TPUs, and that most research at DeepMind is done on TPUs?


I can relate.

I had tinnitus for over 10 years. My tinnitus was not the usual ringing type, it was some sort of humming, low frequency noise. The frequency was not constant, it could vary. It could sometimes stop for 5-10 minutes, e.g. after a hot bath.

Went to see many specialists, tried everything, to no avail.

One day I started experiencing recurrent tension and light pain in my neck and shoulder blades, so I started doing some neck and shoulder blades stretches several times a day.

After a few weeks, the pain was gone, and I realised the tinnitus had stopped. This was maybe 2 years ago (I am still doing those exercises multiple times a day).


A few questions if the authors are around!

> Is hardware agnostic and uses TCP/IP to communicate.

So no RDMA? It's very hard to make effective use of modern NVMe drives' bandwidth over TCP/IP.

> A logical shard is further split into five physical instances, one leader and four followers, in a typical distributed consensus setup. The distributed consensus engine is provided by a purpose-built Raft-like implementation, which we call LogsDB

Raft-like, so not Raft, a custom algorithm? Implementing distributed consensus correctly from scratch is very hard - why not use some battle-tested implementations?

> Read/write access to the block service is provided using a simple TCP API currently implemented by a Go process. This process is hardware agnostic and uses the Go standard library to read and write blocks to a conventional local file system. We originally planned to rewrite the Go process in C++, and possibly write to block devices directly, but the idiomatic Go implementation has proven performant enough for our needs so far.

The document mentions it's designed to reach TB/s though. Which means that for an IO-intensive workload, one would end up wasting a lot of drive bandwidth and requiring a huge number of nodes.

Modern parallel filesystems can reach 80-90GB/s per node, using RDMA, DPDK etc.

> This is in contrast to protocols like NFS, whereby each connection is very stateful, holding resources such as open files, locks, and so on.

This is not true for NFSv3 and older, which tend to be stateless (no notion of an open file).

No mention of the way this was developed and tested - does it use some formal methods, simulator, chaos engineering etc?


> So no RDMA?

We can saturate the network interfaces of our flash boxes with our very simple Go block server, because it uses sendfile under the hood. It would be easy to switch to RDMA (it’s just a transport-layer change), but so far we haven’t needed to. We’ve had to make some difficult prioritisation decisions here.
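For anyone wondering what "uses sendfile under the hood" buys you: copying from a file to a TCP socket via sendfile(2) lets the kernel push pages from the page cache straight into the socket, so the block server never touches the bytes in user space (Go’s io.Copy does this automatically when copying from an *os.File to a *net.TCPConn). A rough sketch of that path, with a made-up function name, not our actual block-server code:

    #include <fcntl.h>
    #include <sys/sendfile.h>
    #include <sys/types.h>
    #include <unistd.h>

    // Serve `len` bytes of a block file starting at `offset` to a connected
    // socket, without copying the data through user space.
    bool serve_block(int client_fd, const char* path, off_t offset, size_t len) {
      int fd = open(path, O_RDONLY);
      if (fd < 0) return false;
      while (len > 0) {
        ssize_t n = sendfile(client_fd, fd, &offset, len);  // page cache -> socket
        if (n <= 0) break;
        len -= static_cast<size_t>(n);
      }
      close(fd);
      return len == 0;
    }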

PRs welcome!

> Implementing distributed consensus correctly from scratch is very hard - why not use some battle-tested implementations?

We’re used to building things like this; trading systems are giant distributed systems with shared state operating at millions of updates per second. We also cheated: right now there is no automatic failover enabled. Failures are rare and we will only enable that post-Jepsen.

If we used somebody else’s implementation we would never be able to do the multi-master stuff that we need to equalise latency for non-primary regions.

> This is not true for NFSv3 and older, it tends to be stateless (no notion of open file).

Even NFSv3 needs a duplicate request cache because requests are not idempotent. Idempotency of all requests is hard to achieve but rewarding.
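(For anyone unfamiliar, a duplicate request cache is roughly this - just a sketch, not NFS code or ours: the server remembers the last reply per request key, so a retransmitted non-idempotent request gets the cached reply replayed instead of being executed twice.)

    #include <cstdint>
    #include <functional>
    #include <string>
    #include <unordered_map>

    // Toy duplicate request cache; in NFS the key would be derived from
    // (client, xid). Real caches also bound their size and evict entries.
    struct DupCache {
      std::unordered_map<std::uint64_t, std::string> replies;

      std::string handle(std::uint64_t key,
                         const std::function<std::string()>& execute) {
        auto it = replies.find(key);
        if (it != replies.end()) return it->second;  // retransmit: replay cached reply
        std::string reply = execute();               // first time: run for real
        replies.emplace(key, reply);
        return reply;
      }
    };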


Not to mention you simply want a large distributed system that has been deployed across multiple clouds / on-prem setups / use cases, with battle-tested procedures for node failure, replacement, expansion, contraction, backup/restore and repair/verification, plus install guides and an error "zoo".

Not to mention a Jepsen test suite, detailed CAP tradeoff explanation, etc.

There's a reason those big DFSes at the FAANGs aren't really implemented anywhere else: they NEED the original authors, with a big, deeply experienced infrastructure team in house.


The DeepSeek team, which is also an HFT shop, implemented their own DFS: https://github.com/deepseek-ai/3FS


My memory is a bit sketchy, but isn't CAP worked around by the eventual consistency of Paxos/Raft/...?


The protocols you mentioned are always consistent; when they cannot guarantee consistency, they simply stop making progress. Yes there's a short delay where some nodes haven't learned about the new thing yet and only know that they're about to learn it, but that's not what's meant by "eventual consistency", which is when inconsistent things may happen and become consistent at some time later. In Paxos or Raft, nodes that know new consistent data is about to arrive can wait for it and present the illusion of a completely consistent system (as long as the network isn't partitioned, so the data eventually arrives). These protocols are slow, so they're usually only used for the most important coordination, like knowing which servers are online.

CAP cannot be worked around. In the event of a partition, your system is either C or A, no buts. Either the losing side of the partition refuses to process writes, and usually reads as well (ensuring consistency and ensuring unavailability), or it does not refuse (ensuring availability and ensuring data divergence). There are no third options.

Well, some people say the third option is to just make sure the network is 100% reliable and a partition never occurs. That's laughable.


Yes, so when I see a distributed system that does not tell me explicitly whether or not it is sacrificing consistency or availability, I get suspicious.

Or it has mechanisms for tuning, on a per-request basis, what you want to prioritize: consistency or availability; and those depend on specific mechanisms for reads and writes.

If I don't see a distributed system that explains such things, then I'm assuming that they made a lot of bad assumptions.


> Yes there's a short delay where some nodes haven't learned about the new thing yet, but that's not what's meant by "eventual consistency", which is when inconsistent things may happen and become consistent at some time later.

Thanks, I haven't looked at these problems in a while.

> In the event of a partition, your system is either C or A, no buts.

Fair enough. Raft and Paxos provide well-understood tradeoffs but not a workaround.


Out of curiosity, since you seem knowledgeable here: is it possible to do NVMe over RDMA in a public cloud (e.g., on AWS)? I was recently looking into this and my conclusion was no, but I'd love to be wrong :)


Amazon FSx for Lustre is the product. They do have information on DIY with the underlying tech: https://aws.amazon.com/blogs/hpc/scaling-a-read-intensive-lo...


Thanks for the link! I had seen this, but it wasn't clear to me how to configure the host as an nvme-of target, nor whether it would actually bypass the host CPU. The article (admittedly now 4 years old) cites single-digit GB/second, while I was really hoping for something closer to the full NVMe bandwidth. Maybe that's just a reflection of the time though; drives have gotten a lot faster since then.

Edit: this is more like what I was hoping for: https://aws.amazon.com/blogs/aws/amazon-fsx-for-lustre-unloc... although I wasn't looking for a file system product. Ideally I was hoping for a tutorial like "create a couple of VMs, store a file on one, do XYZ, and then read it from another with this API", or at least some first-party documentation on how to use these things together.



So... the issue is that I'm not using Lustre. As far as I can tell, NVMe over Fabrics (NVMe-oF) for RDMA is implemented by the kernel modules nvmet-rdma and nvme-rdma (the first being for the target). These kernel modules support InfiniBand and I think Fibre Channel, but _not_ EFA, and EFA itself is not an implementation of InfiniBand. There are user-space libraries that paper over these differences when using them just for network transport (e.g., libfabric), and EFA sorta pretends to be IB, but afaict this is just meant to ease integration at the user-space level. Unfortunately, since there's no support for EFA in the NVMe-oF kernel modules, it doesn't seem possible to use without Lustre. I don't know exactly how they're doing it for Lustre clients. There seems to be a Lustre client kernel module though, so my guess is that it's in there? The Lustre networking module, lnet, does have an EFA integration, but it seems to only be as a network transit. I don't see anything in Lustre about NVMe-oF though, so I'm not sure.

Maybe there's something I'm missing though, and it'll just work if I give it a try :)


Yeah, Lustre supports EFA as a network transit between a Lustre client and a Lustre server. It's lnet/klnds/kefalnd/ in the Lustre tree. But Lustre doesn't support NVMeoF directly. It uses a custom protocol. And neither does EFA. Someone would have to modify the NVMeoF RDMA target/host drivers to support it. EFA already supports in-kernel IB clients (that's how Lustre uses EFA today). So it's not an impossible task. It's just that no one has done it.


Hey, thanks for the comment! Also, I'm amused by the specificity of your account haha, do you have something set to monitor HN for mentions of Lustre?

> "But Lustre doesn't support NVMeoF directly. It uses a custom protocol."

Could you link me to this? I searched the lustre repo for nvme and didn't see anything that looked promising, but would be curious to read how this works.

> "And neither does EFA. Someone would have to modify the NVMeoF RDMA target/host drivers to support it."

To confirm, you're saying there'd need to be something like an EFA equivalent to https://kernel.googlesource.com/pub/scm/linux/kernel/git/tor... (and corresponding initiator code)?

> "EFA already supports in-kernel IB clients (that's how Lustre uses EFA today). So it's not an impossible task. It's just that no one has done it."

I think you're saying there's already in-kernel code for interfacing with EFA, because this is how lnet uses EFA? Is that https://kernel.googlesource.com/pub/scm/linux/kernel/git/tyc...? I found this, but I wasn't sure if it was actually the dataplane (for lack of a better word) part of things; from what I read, it sounded like most of the dataplane was implemented in userspace as a part of libfabric, but it sounds like I might be wrong.

Does this mean you can generally just pretend that EFA is a normal IB interface and have things work out? If that's the case, why doesn't NVME-of just support it naturally? Just trying to figure out how these things fit together, I appreciate your time!

In case you're curious, I have a stateful service that has an NVME backed cache over object storage and I've been wondering what it would take to make it so that we could run some proxy services that can directly read from that cache to scale out the read throughput from an instance.


> do you have something set to monitor HN for mentions of Lustre?

Nothing, besides browsing Hacker News a bit too much.

> "But Lustre doesn't support NVMeoF directly. It uses a custom protocol."

To be specific, Lustre is a parallel filesystem. Think of it like a bigger version of NFS. You format the NVMe drives as ext4 or ZFS and mount them as Lustre servers. Once you have an MGS, MDS, and OSS, you can mount the servers as a filesystem. Lustre won't export the NVMe to the client as a block device. But you could mount individual Lustre files as a block device, if you want.

> To confirm, you're saying there'd need to be something like an EFA equivalent to https://kernel.googlesource.com/pub/scm/linux/kernel/git/tor... (and corresponding initiator code)?

Essentially, yeah.

> I think you're saying there's already in-kernel code for interfacing with EFA, because this is how lnet uses EFA?

Yes. EFA implements kernel verbs support. Normal user-space applications use user verbs, i.e. https://www.kernel.org/doc/html/latest/infiniband/user_verbs.... Kernel verbs support allows kernel-space applications to also use EFA. This is currently implemented in the out-of-tree version of the EFA driver https://github.com/amzn/amzn-drivers/tree/master/kernel/linu.... Lustre interfaces with that via the driver in lnet/klnds/efalnd/. NVMeoF would need some similar glue code.

> Does this mean you can generally just pretend that EFA is a normal IB interface and have things work out? If that's the case, why doesn't NVME-of just support it naturally? Just trying to figure out how these things fit together, I appreciate your time!

There are some similarities (the EFA driver is implemented in the IB subsystem, after all). But the semantics for adding/removing ports/interfaces would be different - so it wouldn't "just work" without some changes. I don't know the full scope of the changes (I haven't dived into it too much). Although, I suspect support would look fairly similar to drivers/nvme/target/rdma.c.

> In case you're curious, I have a stateful service that has an NVME backed cache over object storage and I've been wondering what it would take to make it so that we could run some proxy services that can directly read from that cache to scale out the read throughput from an instance.

If you're looking for a scale-out cache in front of S3, that's essentially the Lustre/S3 integration: https://docs.aws.amazon.com/fsx/latest/LustreGuide/create-dr.... It's a filesystem, so I guess it depends on how your service expects to access objects.


Interesting that neither the article nor the comments mention the CALM theorem [0], which gives a framework to explain when coordination-free consistency is possible, and is arguably the big idea behind SEC.

[0] https://arxiv.org/abs/1901.01930


> We cannot add more compute to a given compute budget C without increasing data D to maintain the relationship.

> We must either (1) discover new architectures with different scaling laws, and/or (2) compute new synthetic data that can contribute to learning (akin to dreams).

Of course we can; this is a non-issue.

See e.g. AlphaZero [0] that's 8 years old at this point, and any modern RL training using synthetic data, e.g. DeepSeek-R1-Zero [1].

[0] https://en.m.wikipedia.org/wiki/AlphaZero

[1] https://arxiv.org/abs/2501.12948


AlphaZero trained itself through chess games that it played against itself. Chess positions have something very close to an objective truth about their evaluation; the rules are clear and bounded. Winning is measurable. How do you achieve this for a language model?

Yes, distillation is a thing but that is more about compression and filtering. Distillation does not produce new data in the same way that chess games produce new positions.


You can have a look at the DeepSeek paper, in particular section "2.2 DeepSeek-R1-Zero: Reinforcement Learning on the Base Model".

But generally the idea is that you need some notion of reward, verifiers, etc.

Works really well for maths, algorithms, and many things actually.

See also this very short essay/introduction: https://www.jasonwei.net/blog/asymmetry-of-verification-and-...

That's why we have IMO gold level models now, and I'm pretty confident we'll have superhuman models for mathematics, algorithms, etc. before long.

Now domains which are very hard to verify - think e.g. theoretical physics etc - that's another story.


> But generally the idea is that you need some notion of reward, verifiers, etc.

I don't think you're getting the point he's making.


Synthetic data is already widely used to do training in the programming and mathematics domains where automated verification is possible. Here is an example of an open source verified reasoning synthetic dataset https://www.primeintellect.ai/blog/synthetic-1


Are they actually producing new data though? This is the sort of thing I called "compression and filtering", because it seems that no new information content is being produced; LLMs are just used to distill the information we already have. We need more raw information.


Yes this is new synthetic data which did not exist before. I encourage you to read the link.


I think we're talking past each other, so I'll try once more. Suppose you train an LLM on a very small corpus of data, such as all the content of the Library of Congress. Then you have that LLM author new works. Then you train a new LLM on the original corpus plus this new material. Do you really think you've addressed the core issue in the SP? Can more parameters be meaningfully trained even if you add more GPUs?

To me, the answer is clearly no. There is no new information content in the generated data. It's just a remix of what already exists.


When it comes to logical reasoning, the difficulty isn't about having enough new information, but about ensuring the LLMs capture the right information. The problem LLMs have with learning logical reasoning from standard training is that they learn spurious relationships between the context and the next token, undermining its ability to learn fully general logical reasoning. Synthetic data helps because spurious associations are undermined by the randomness inherent in the synthetic data, forcing the model to find the right generic reasoning steps.


I agree! DeepSeek has shown this is incredibly powerful. I think their Qwen 8B model may be as good as GPT4’s flagship. And I can run it on my laptop if it’s not on my lap. But the amount of synthetic data you can generate is bounded by the raw information, so I don’t think it’s an answer to the SP.


Yes if you have some way to verify the quality of the new works and you only include the high quality works in the new LLM's training set.


But you don't have a way to do that at scale, other than feeding it to another LLM that is trained on that exact same limited corpus. There is no new information being added into the system in loops like that. New information means new measurements, new proofs, new signal or media streams from cameras, new curation/rating data, new books or papers, etc.


Simple, you just need to turn language into a game.

You make models talk to each other, create puzzles for each other to solve, ask each other to make cases and evaluate how well they were made.

Will some of it look like the ramblings of pre-scientific philosophers? (Or modern ones, because philosophy never progressed after science left it in the dust.)

Sure! But human culture was once there too. And we pulled ourselves out of this nonsense by our bootstraps. We didn't need to be exposed to 3 alien internets with higher truth.

It's really a miracle that AIs got as much as they did from the purely human-generated, mostly garbage content we cared to write down.


I feel like you’re glossing over some very thorny details that it’s not obvious we can solve. For example, if you just get two LLMs setting each other puzzles and scoring the other’s solutions, how do you stop this just collapsing into nonsense? I.e. where does the source of actual truth for the puzzles come from?


> I feel like you’re glossing over some very thorny details that it’s not obvious we can solve.

Yeah. I tried to be funny. It's not that easy. However, AI labs have already started doing it, and perhaps most of the AI gains of the last year come from this approach.

> For example, if you just get two LLMs setting each other puzzles and scoring the others solutions how do you stop this just collapsing into nonsense?

That's the trillion-dollar question. I wonder how people are doing it. Maybe through the economy? You ultimately need to sell your ramblings to somebody to sustain yourself. If you can't, you starve.

Maybe that's enough for AI as well? Companies with AIs that descended into nonsense won't have any more money to train them further. Maybe companies will need to set up their internal ecosystems of competing AI training organizations and split the budget based on how useful they are becoming?

Phrasing this in the terminology of "truth" is probably counterproductive, because there's no truth. There's only what sells. If you have customers in manufacturing, the things that sell will probably coincide with some physical truths, but this is emergent, not the goal or even part of the process of acquiring capabilities.


>And we pulled ourselves out of this nonsense by the bootstraps.

Human progress was promoted by having to interact with a physical world that anchored our ramblings and gave us a reward function for coherence and cooperation. LLMs would need some analogous anchoring to progress beyond incoherent babble.


True, but LLMs got anchored to reality because we are using them in real-world tasks, and this connection will only grow richer, wider and faster.


> Of course we can, ... synthetic data ...

That's option (2) in the parent comment: synthetic data.


Yes, 450GB/s is the per GPU bandwidth in the nvlink domain. 3.2Tbps is the per-host bandwidth in the scale out IB/Ethernet domain.


I believe this is correct. For an H100 node, the 4 NVLink switches each have 64 ports supporting 25GB/s each, and each GPU uses a total of 18 ports. This gives us 450GB/s of bandwidth per GPU within the node. But once you start trying to leave the node, you're limited by the per-node InfiniBand cabling, which only gives you 400GB/s out of the entire node (50GB/s per GPU).
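Spelling the arithmetic out with the units in the names (GB/s = gigabytes per second, Gb/s = gigabits per second; the figures are just the ones from this thread):

    // Scale-up (NVLink, per GPU):
    constexpr double kNvlinkPortGBps      = 25.0;                            // per port, per direction
    constexpr int    kPortsPerGpu         = 18;
    constexpr double kScaleUpPerGpuGBps   = kNvlinkPortGBps * kPortsPerGpu;  // = 450 GB/s

    // Scale-out (InfiniBand/Ethernet, per host of 8 GPUs):
    constexpr double kScaleOutPerHostGbps = 3200.0;                          // 3.2 Tb/s (bits!)
    constexpr double kScaleOutPerHostGBps = kScaleOutPerHostGbps / 8.0;      // = 400 GB/s
    constexpr double kScaleOutPerGpuGBps  = kScaleOutPerHostGBps / 8.0;      // = 50 GB/s per GPU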


Is it GBps (gigabytes per second) or Gbps (gigabits per second)? I see mixed usage in this comment thread, so I'm left wondering what it actually is.

The article is consistent and uses gigabytes.


GBps


Could you check the value of your kernel's net.ipv4.tcp_slow_start_after_idle sysctl, and if it's non-zero, set it to 0?
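Something like this (just a sketch; if you have a shell, `sysctl net.ipv4.tcp_slow_start_after_idle` and `sysctl -w net.ipv4.tcp_slow_start_after_idle=0` do the same thing):

    #include <fstream>
    #include <iostream>

    // Read the current value from procfs and, if it's non-zero, set it to 0.
    // Writing requires root (and, in a container, depends on whether the
    // sysctl is namespaced/allowed - assumption, check your setup).
    int main() {
      const char* path = "/proc/sys/net/ipv4/tcp_slow_start_after_idle";
      int value = -1;
      std::ifstream(path) >> value;
      std::cout << "tcp_slow_start_after_idle = " << value << "\n";
      if (value != 0) {
        std::ofstream(path) << 0;
      }
      return 0;
    }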


That seems to work, thank you!

Now latency is just RTT + server time + payload size / bandwidth, not a multiple of RTT: https://github.com/grpc/grpc-go/issues/8436#issuecomment-311...

I was not aware of this setting. It's pretty unfortunate that this is a system-level setting that can't be overridden at the application layer, and that the idle timeout can't be changed either. I will have to figure out how to safely make this change on the k8s service this is affecting...


As you can imagine, when a TCP connection is first established, it has no knowledge of the conditions on the network. Thus we have slow start. At the same time, when a TCP connection goes idle, its information about the conditions on the network becomes increasingly stale. Thus we have slow start after idle. In the Linux stack at least, being idle longer than the RTT (perhaps the computed RTO) is interpreted as meaning the TCP connection's idea of network conditions is no longer valid.

An application won't know anything about the specifics of the network to which the system it is running on is attached. A system administrator might. In that sense at least, it is reasonable that this is a system tunable rather than a per-connection setsockopt().


This sounds exactly like the culprit. I didn't know there was a slow start after idle, and that it is set to 1 (active) by default.

I wonder if I should change this to 0 on my default desktop machines for all connections.

