Some notes on high speed networking on PCs (nanog.org)
108 points by fanf2 on June 9, 2018 | 70 comments


> the kernel community is working on AF_XDP

See recent Intel presentation on Time Sensitive Networking, https://schd.ws/hosted_files/elciotna18/b6/ELC-2018-USA-TSNo...


TSN is interesting for special use cases like the one presented there. My impression is that the two are solving different problems: TSN focuses on deterministic, precise packet timing and low latency while sacrificing throughput, whereas AF_XDP focuses mainly on throughput.


Do you think we'll ever see 10G networking "just work" at home? I mean, I can plug in a couple of PCs into a relatively cheap switch with relatively cheap cabling and get somewhere approximating 1 gigabit right now.


Well, I didn't see any trouble with it. A Chinese TP-Link switch starts at $200 (new) and has 4 SFP+ ports, plus a bunch of GbE copper. Network cards are $20-30 a pop on eBay (used Mellanox), twinax DACs are $10, and for longer distances transceivers ($10) and cables ($2 per connector + $1/meter) are abundant too.

Never had any trouble with this cheap setup. The only bottleneck is SSD speed in the NAS.

Anything above 10G might be trouble, but I haven't tried it.


I know what you mean, but it has gotten a lot cheaper. There are sub-$1K 10G switches and consumer-ish onboard-10G motherboards.

Probably, something popular needs building that demands 10G-in-the-home, in order to get volume up so pricing can come down a bit. VR, if it actually does hit this time, is a candidate.

As far as fragility, copper is the way to go at home, unless you're entirely pet and child-free or have your own cage in the garage.


> Probably, something popular needs building that demands 10G-in-the-home, in order to get volume up so pricing can come down a bit.

HDMI 2.0 for 4K is 14.4 Gbit/s

HDMI 2.1 for 8K is 42.7 Gbit/s


...how cheap is HDMI capture?

Receiving non-visual data from the GPU has the interesting property that the shaders and video RAM are generally very close to the HDMI port, so I could potentially be "sending" the result of running a shader on data in video RAM.

Something something using VRAM for caching...?

I've been meaning to figure out how to capture HDMI, at line speed, for a while. My current thinking is to use a PCIe FPGA capable of bus-master DMA, a horribly-hacked-apart kernel that doesn't touch a few GB of RAM, and a kernel driver to "chase" after the gigantic ring buffer the FPGA copies into.
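A rough, untested sketch of that reservation idea, using the stock memmap= boot parameter instead of a hacked kernel (e.g. memmap=4G$0x100000000 hides 4 GiB at physical 0x1_0000_0000 from Linux); the address, size and names below are just made up for illustration:

    /* Untested sketch: with memmap=4G$0x100000000 on the kernel command line,
     * Linux leaves 4 GiB at physical 0x1_0000_0000 alone; map it from a module.
     * RING_PHYS/RING_BYTES match that example boot line, nothing more. */
    #include <linux/module.h>
    #include <linux/io.h>

    #define RING_PHYS  0x100000000ULL
    #define RING_BYTES (4ULL << 30)

    static void *ring;

    static int __init ring_map_init(void)
    {
        ring = memremap(RING_PHYS, RING_BYTES, MEMREMAP_WB);
        return ring ? 0 : -ENOMEM;
    }

    static void __exit ring_map_exit(void)
    {
        if (ring)
            memunmap(ring);
    }

    module_init(ring_map_init);
    module_exit(ring_map_exit);
    MODULE_LICENSE("GPL");

The FPGA would then be told to DMA into that physical window, and the module hands the mapping to whatever does the processing.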

Hmm, I probably wouldn't be able to copy at HDMI 2.1 line rate.

Unless there are very very scary FPGA contraptions that use TWO PCI-e slots? :D:D [EDIT: I now realize this would need two PCI-e root complexes. Woops]


HDMI capture devices are still in the >$100 range, it seems. That usually runs it straight into an MPEG encoder though.

I believe any FPGA with a decent PCIe implementation should be able to bus-master, and if you set up the drivers properly you don't have to do anything with the kernel, it'll just allocate a block of transfer RAM for you.

You can use multiple PCIe lanes easily? If you need more than an x16 slot I'd be very surprised.


One PCIe3 x16 slot can handle about 120 Gbit/s full duplex.
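Back-of-the-envelope check on where that figure comes from (8 GT/s per lane, 128b/130b encoding, 16 lanes, per direction):

    #include <stdio.h>

    int main(void)
    {
        const double line_rate = 8e9;          /* PCIe 3.0: transfers/s per lane */
        const double encoding  = 128.0 / 130.0;
        const int lanes = 16;

        printf("x%d PCIe 3.0: %.0f Gbit/s per direction before protocol overhead\n",
               lanes, line_rate * encoding * lanes / 1e9);
        return 0;
    }

That prints ~126 Gbit/s; TLP/DLLP protocol overhead eats a bit more, which is presumably where the ~120 figure comes from.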


> HDMI capture devices are still in the >$100 range, it seems. That usually runs it straight into an MPEG encoder though.

Right. I wouldn't be surprised if that encoder is a single chip designed to ingest TMDS so the 10-40Gb/s of raw data only travels a few mm.

And MPEG encoding is the only reasonable way to handle data of this size, considering that it compresses 10Gbps down to <80Mbps and typically <10Mbps.

> I believe any FPGA with a decent PCIe implementation should be able to bus-master, and if you set up the drivers properly you don't have to do anything with the kernel, it'll just allocate a block of transfer RAM for you.

As for the first part, cool.

With the 2nd part, you're seeing my complete lack of hardware knowledge :)

I was envisaging having the host system "ignore" several GB of RAM - having Linux's memory manager simply not touch it - and then turning it into a multi-gigabyte circular buffer to copy into.

Then, the processing/do-whatever code is written as a kernel module that "chases" after wherever the "head" is in the circular buffer (presumably this would be a pointer written to a fixed memory location). Because the buffer is gigabytes wide the kernel driver can stall (for whatever reason) for multiples of seconds and have wild swings in "chase performance" before there are any real issues.
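The chase itself is simple; here's a toy userspace model of it (a pthread producer stands in for the FPGA, sizes shrunk to megabytes, all names invented):

    /* Toy model of "chasing" a head pointer around a ring buffer.  The
     * producer thread plays the FPGA writing frames; the consumer advances
     * its tail over whatever it hasn't processed yet. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define RING_BYTES  (8UL << 20)   /* toy: 8 MiB instead of gigabytes */
    #define FRAME_BYTES (64UL << 10)  /* toy: 64 KiB "frames" */
    #define NFRAMES     256

    static uint8_t ring[RING_BYTES];
    static atomic_ulong head;         /* total bytes the "FPGA" has written */

    static void *producer(void *arg)
    {
        (void)arg;
        for (int i = 0; i < NFRAMES; i++) {
            unsigned long h = atomic_load(&head);
            memset(ring + (h % RING_BYTES), i & 0xff, FRAME_BYTES);
            atomic_store(&head, h + FRAME_BYTES);
            usleep(1000);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t p;
        unsigned long tail = 0;

        pthread_create(&p, NULL, producer, NULL);
        while (tail < (unsigned long)NFRAMES * FRAME_BYTES) {
            unsigned long h = atomic_load(&head);
            while (tail < h) {
                /* real code would diff/compress/flush ring[tail % RING_BYTES] here */
                tail += FRAME_BYTES;
            }
            usleep(5000);             /* consumer can stall; the buffer absorbs it */
        }
        pthread_join(p, NULL);
        printf("consumed %lu bytes\n", tail);
        return 0;
    }

In the real thing "head" would be whatever the FPGA DMA-writes to a known location, and the consumer would be the kernel thread doing the actual diff/flush work.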

This design enables an important goal which I wanted to implement: pressing PrtScr or hitting a hardware button flushes the last few seconds of whatever was on the screen to disk - ie, you can achieve hardware-level, pixel-perfect, video capture. Frame glitch? Saved. "LOOK, the GPU displayed a single frame wrong again, agh, you missed it"? Saved. Weird timing glitch in graphics stack you can't reproduce? Saved. Weird timing/race-condition-related graphics issues that only happen when your test code is removed and are too fast to diagnose? Saved. Suddenly you can just spray debug data into the corner of the screen and analyze the captures later.

Ideally, some simple compare-while-copying code running in the FPGA could store simple frame deltas and distil whole-screen updates into a list of changed rects. Maybe you could even do that on-CPU, as a kernel task.

But here's the thing. In the worst case, with no compression/reduction code present, you'd be able to store 1 minute of 60fps 1080p in 20.85GB of RAM, and 6 minutes in 125.15GB. 1 minute of 3840x2160 uses 83.42GB, 3 minutes uses 250.28GB. 1 minute of 8K takes 333GB :D

Of course, flushing to disk is a major operation; it would likely require multiple PCI NVMe devices, in either RAID or driven via a threaded storage engine, to flush the memory to storage fast enough - in the 8K example, you need to copy 333GB of RAM to disk in 1 minute, so your solution needs to receive 5GB (44.4Gb) per second, sustained, no performance spikes/dips. :P
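The arithmetic behind those figures, if anyone wants to poke at it (assumes 24-bit RGB, sizes in GiB like the numbers above):

    #include <stdio.h>

    /* Uncompressed capture size and the flush rate needed to empty the
     * buffer within the same time window. */
    static void calc(const char *name, long w, long h, int fps, int minutes)
    {
        double bytes = (double)w * h * 3 * fps * 60.0 * minutes;  /* 24-bit RGB */
        double gib = bytes / (1024.0 * 1024.0 * 1024.0);
        printf("%-6s %d min @ %d fps: %7.2f GiB (flush rate %.2f GiB/s)\n",
               name, minutes, fps, gib, gib / (60.0 * minutes));
    }

    int main(void)
    {
        calc("1080p", 1920, 1080, 60, 1);
        calc("4K",    3840, 2160, 60, 1);
        calc("8K",    7680, 4320, 60, 1);
        return 0;
    }

Which gives 20.86, 83.43 and 333.71 GiB respectively, and a sustained ~5.6 GiB/s flush rate for the 8K case.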

(This is where the gigantic circular buffer comes in to play; the idea is that the wraparound hits RAM with new data just as the old data got saved.)

What's potentially interesting with this are situations like desktop-type work, small-scale tests, etc., where not all of the screen is going to be updated, and the reduction system will diff the updates very effectively. In these situations, if you're careful, you may go from minutes to tens of minutes or even hours of recording time on a 128GB or 256GB RAM system. And that's only spending a few hundred $. (For the RAM, that is.)

If you're doing game development - every screen effectively new data - and you need pixel-perfect HDMI capture, well, 2TB RAM workstations are only low/mid-5 figures now...

> You can use multiple PCIe lanes easily? If you need more than an x16 slot I'd be very surprised.

The GP mentioned that HDMI 2.1 is 42.7Gbps. Elsewhere in this thread it's mentioned that PCI-e's limit is 50Gbps. [EDIT: Just noticed the other comment about PCI-e speed. This makes things a little easier!]


Have a look at https://www.kernel.org/doc/Documentation/DMA-API-HOWTO.txt - "write into host buffer(s)" is a very common feature, and I don't see why you can't just define a giant buffer.
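e.g. the coherent-allocation path from that document looks roughly like this from a driver's point of view (untested sketch; buffer size and names are mine, and a genuinely multi-gigabyte buffer would need CMA or scatter-gather lists instead):

    #include <linux/module.h>
    #include <linux/pci.h>
    #include <linux/dma-mapping.h>

    #define CAPTURE_BUF_BYTES (64UL << 20)  /* 64 MiB for the sketch */

    static void *capture_vaddr;
    static dma_addr_t capture_bus_addr;

    static int capture_alloc(struct pci_dev *pdev)
    {
        if (dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64)))
            return -EIO;

        capture_vaddr = dma_alloc_coherent(&pdev->dev, CAPTURE_BUF_BYTES,
                                           &capture_bus_addr, GFP_KERNEL);
        if (!capture_vaddr)
            return -ENOMEM;

        /* program the FPGA with capture_bus_addr + size here; it then
         * bus-masters frame data into the buffer on its own */
        return 0;
    }

    static void capture_free(struct pci_dev *pdev)
    {
        dma_free_coherent(&pdev->dev, CAPTURE_BUF_BYTES,
                          capture_vaddr, capture_bus_addr);
    }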


Can you expand on that last sentence? I don’t understand it but it sounds interesting.


Fiber optic cables are too fragile for home use. If bent to any extreme the cable will be ruined.

Hence it's only useful in homes with protected installations like server enclosures.


Which is silly, as most people don't have houses that need 200m runs, much less the 50m that is possible with 10GBASE-T over Cat5e.

Why spend money on SFP's and optical cables, when you can just pick up cheap 10GbaseT hardware?


10GBASE-T doesn't support Cat5e cabling. It supports distances of up to 55m on Cat6 and 100m on Cat6a.

10GBASE-T is showing up in more and more small-business-targeted equipment, but I think it's more likely that 2.5GBASE-T and 5GBASE-T will make it into consumer home equipment instead of 10GBASE-T.

2.5GBASE-T supports a distance of up to 100m on CAT5e. 5GBASE-T supports up to 100m on CAT6 and may operate on some Cat5e infrastructure. Additionally they both support PoE which 10GBASE-T does not.


The 10GBASE-T switch I have (Netgear) actually says it supports Cat5e for 10G speeds up to 45m or some such (too lazy to dig up the manual for it). I was on the fence and prepared to upgrade some of my cabling when I purchased it, but decided to give the Cat5e a try, and haven't regretted it. Given that it has link-quality and mangled-packet counters, if there were actually any issues with the runs in my house I would be able to see it in the link stats.

The cable modem OTOH... (a few bad packets a day).

Edit: Also, for most purposes 'modern' Cat5e (which tends to be more of an ad-hoc standard, since people exceed the base requirements) is basically as good as Cat6; it's only with Cat6a that you get the longer runs, which are a result of stricter termination rules, larger gauge cable, better cross-pair (foil) shielding, and a bunch of other rules.

As far as 2.5/5G, I was really annoyed that it wasn't just an optional part of the 10G standard - the idea being that vendors sell 10G switches and, if the cables aren't up to the task, degrade to something slower. Initially the big vendors pushing it seemed to be trying to market-segment 2.5/5G and 10G, but more recently sanity seems to have prevailed and newer switches do exactly that: you get 10G if the switch thinks the cable run is good, otherwise it drops to 1/2.5/5G.


10GBase-T certainly can work on cat5e, I run it over 10-20m at home.


Yes, it can work, but it isn't a configuration supported by the standard. The original question was about when 10GBASE-T equipment is going to be generally available for the home market. Manufacturers are not going to market consumer equipment they know will likely not work on a large percentage of home networks. They will instead more likely market 2.5GBASE-T and 5GBASE-T, which, based on the standards, should work on most people's home networks.

You can certainly buy 10GBASE-T equipment meant for small business and give it a go on your network. If you are lucky you may have success.


Works != supports


Can you explain why you need that much official blessing from the manufacturer to use it? If it doesn't work, you send it back?


Well, if you bought something that says on the box that it doesn't do X, you generally don't get to send it back because it doesn't do X. Some consumer outlets may let you do this as a courtesy, but it's not ubiquitous and certainly not if you're buying from trade suppliers.

Also, for a 10-20m run, the price difference is literally in the single figures. The effort alone to send something back is (should be) worth more to you than that.


I live in an apartment that came pre-wired. I’ve done this in 3 apartment buildings, up to 30m or so. I tested one with a Fluke 10GBase-T tester and it passed.


> Some consumer outlets may let you do this as a courtesy, but it's not ubiquitous

That depends on the country. E.g. in the EU you can return goods bought online within 14 days (for any reason).


Unused, in original complete packaging. And you have to pay for shipping.


No, but with no more wear than would be accepted when examining the item in a retail store. And shipping within e.g. Germany is like 5 EUR (about $5 if you compare pre-tax USD prices with post-tax EUR prices).


For the lengths in use in home installations (n < 1km), much more flexible plastic fiber can be used, since its greater signal attenuation won't be a factor.


Are you saying there are plastic-based alternatives to optical fiber cables? Wow.

What are these actually called, and what do I look for?


Polymer Optical Fiber (POF). Very common and essentially the standard in non-telco/non-IT use of fiber data networks (process control, vehicles, etc). I’ve seen studies of using them for gigabit or better computer networking, but I don’t think it is common yet.

What people mostly mean here is multimode vs single mode glass fiber, though.


A company called Fuba makes Gigabit POF products which got quite favorable reviews. They are "media converter" type products for single runs though and not cheap, so no GBICs or switches with many ports.


That's a myth. Fiberoptic cables are actually really robust.


Well, POF is; real glass still has bend-diameter limitations, but most people aren't running the longer single-mode fiber runs that need glass.

So, yes, all the common stuff is pretty flexible, because it's plastic.


Yes and no. Even singlemode can be pretty flexible. Every manufacturer's product is different, but even 20 years ago, the singlemode we were installing could be wrapped around a Sharpie without damage (though it would lose a lot of light; that was an installation-only ability), and would happily tolerate something like twice that bend radius during operation.

That being said, I've seen people do some horrific things with copper, and it tends to survive things it shouldn't. But if you treat glass even half as gently as copper wants to be treated, they'll both be fine.

I think fiber gets a bad rep because large multi-strand cables have stiffeners to keep the jacket from kinking, and people assume the individual strands are similarly stiff. They really aren't.


Case in point from just a couple of weeks ago[1]: the fiber wasn't run properly and got pinched between some innerduct and the conduit edge.

Had 20+ dB of loss while it was pinched, but after unpinching it and massaging the kink out, it went right back to normal.

[1] https://imgur.com/a/bHX6eRJ


Do you mean 100G? 10G has just worked for many years, as long as you have relatively recent cabling (cat6). There's also 5G now if you have old cabling (https://arstechnica.com/gadgets/2016/09/5gbps-ethernet-stand...)

Depends on what your idea of relatively cheap is of course - low end 10gbase-t switches cost about $200-$500 (cheapest ones have just 2 fast ports and rest are 1G). NICs are about $100.


It's already here. I own a switch that has 10Gb SFP+ ports and I dump two NAS devices into it using twinax DAC (fused) 10Gb cables. They're cheap and easy. As for 10Gb copper PHYs, those are still a bit pricey. But fiber SFP+ and twinax DAC at 10Gb are very reasonable. All of the supporting cards are as well. If you have a need for it, I wouldn't consider it unattainable today.


You can buy Intel 2x 40GbE PCIe NICs for about $300-400 on Amazon/eBay. You can get a refurb HP enterprise 40GbE 24-48 port switch for about 2-3k. I would say it's coming sooner or later, as over the top as it may seem.

https://www.amazon.com/dp/B00OGR1TQW


And eBay is awash with 10G SFP+ Mellanox cards, many under $20. 24-port 10G SFP+ switches can be found for $300, and they all work fine with the cheap SFP+ modules from fs.com.


Within context it might be confusing for someone that isn't familiar with the field.

Networking has been measured in bits (per second), even though modern packets (within the data payload, anyway) are always octet-based (8 bits per unit).

The speed quoted is always in "marketing" (decimal engineering-notation) units, not in binary units (even though the things we presently hook up all count in binary).

A stupidly cut-rate device might use 100mbit Ethernet: ~11MBytes/sec (base 2).

A reasonable expectation is 1000Mbit (gigabit) Ethernet: ~119MBytes/sec (base 2).
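i.e., raw line rate divided by 8, then converted to base-2 megabytes (framing overhead ignored):

    #include <stdio.h>

    int main(void)
    {
        const double rates_bps[] = { 100e6, 1e9, 10e9 };  /* 100M, 1G, 10G Ethernet */
        for (int i = 0; i < 3; i++)
            printf("%6.0f Mbit/s -> %7.1f MiB/s\n",
                   rates_bps[i] / 1e6, rates_bps[i] / 8.0 / (1024.0 * 1024.0));
        return 0;
    }

Which prints ~11.9, ~119.2 and ~1192.1 MiB/s respectively, before framing overhead.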

Most /spinning/ media disks now transfer faster than gigabit Ethernet (for large file transfers); SSDs obliterate this.

10Gbit networking would eliminate a bottleneck for even 'lightly' RAIDed SSDs and would be ideal for links between switches.

According to the Wikipedia article ( https://en.wikipedia.org/wiki/10_Gigabit_Ethernet ) 10GBASE-T (50-100 meters over copper; assuming high grade) first appeared as a standard in 2006. As a conservative estimate, the patents involved will expire in 2025. THAT is probably when we'll really see 10G take off. A bunch of the fiber modes appeared in 2002 though, so in higher end fiber gear we might see those appearing more commonly next year.


Last time I looked there were no fanless 10Gb switches, but it's definitely doable with copper, and the motherboards with built-in adapters are cheaper.


My Asus XG-U2008 doesn't seem to have a fan in it, but it sure warms up. Netgear's GS110MX also claims to be fanless.


Oh interesting, thanks, the previous Netgear ones were not. Yeah, I imagine it produces a fair bit of heat.

Oh, but these only have two 10Gb ports; I was looking at 8-port 10Gb switches, e.g. the Netgear XS508M, which still has a fan. It peaks at 40W.


Don't think so. It's too expensive to manufacture and too hard to use. Copper cables are expensive and easily confused with the cheaper cables that don't work. Optics are very expensive and fragile.

And there is no use for it, of course. Home devices cannot handle 10Gb/s of incoming data.

It gets worse. Users have massively abandoned Ethernet in favor of Wifi. Many devices are manufactured without Ethernet support.


There is not much fragility to e.g. Corning's zero-bending-loss single-mode fiber. Sure, it's easier to bend, so just add something like a corrugated spring-steel tube around the fibers themselves, and cover that with something that prevents corrosion and such. And 10G SFP+ modules are not that fragile either. Actually, last I remember, for home distances (30-80m) fiber works out cheaper than copper, just because the very-short-reach fiber transceivers are low-tech and the corresponding fiber cables are cheaper than highly shielded twisted pair.


An Ethernet cable doesn't break when it's bent or when it's accidentally rolled over with chair wheels.


If you make the fiber cable as thick as an Ethernet cable, you can easily make it not break when bent or rolled over.


There’s too much black magic getting drivers configured to run at 10G, currently (at least with Intel networking hardware, in my experience).

Edit: would the one who downvoted me like to explain where I’m mistaken? If there’s a good resource I’d like to see it.


1Gbit? That sounds so ~2005, when getting more than 3Gbit was a bit of a pain. These days, with bog-standard Intel NICs and no tuning, I can pull 4Gbit over SMB off my C2000-based NAS. In pure transfer benchmarks I can get really close to line rate between two machines (with multiple streams, of course) with little or no tuning.

Sounds like there is something wrong with your config, or you're using eBay adapters that are 15+ years old. All the modern adapters have multiqueue, TSO, etc., and it really helps...


Not without a major technological advancement in cable termination. For 1 Gbps, as long as all eight wires are connected to the jack in the right order, it usually works. For 10 Gbps, as I understand it, a certain amount of competence is required to get a reliable connection.


I've found the opposite. Some cables I made where I just made sure the ends matched worked at 100Mbit, and would pass pings on GigE, but nothing larger. Took me a while before I figured out it was the cables.


Back when I looked into high-speed networking in small spaces, my idea was doing it over PCI directly, bypassing Ethernet. I really wanted to do something like SGI's NUMAlink, but PCI would be cheaper, with commodity parts available. You'd need a switch for it, though. I ended up finding a company selling them.

So, there's always that kind of thing to consider. I wonder if 10-100Gbps Ethernet has better price/performance than TCP/IP over PCI by now. I can't even remember the company or the tech's name.


I wanted to do the same thing with PCI! I thought the bandwidth was large enough that it should make a great direct interface. Unfortunately, synchronizing between two PCI masters was going to be tricky, and the high-frequency development required for PCIe needs some pricey diagnostic equipment. I ran across those off-the-shelf switches/interfaces, but they were a lot more expensive than 10Gb Ethernet at the time, and they weren't even available for high-lane-count PCIe 2 and PCIe 3.

I think now that 10Gb Ethernet prices are coming down to about $100 per card and switch port, there's no way specialized direct PCIe would compete off the shelf.

Still, I think it would be cool to have a direct 32-lane PCIe 3 or 4 link. It could operate close to RAM speeds. If you connected two 2P beasts it would be close to having a 4P system.

Of course there would probably be protocol issues and overhead as the no-two-masters issue would mean you can't use the native protocol directly.


Don't worry, the chip you are searching for is branded ExpressFabric and sold/made by Broadcom. AFAIK it's only PCIe 4, and you get ones with, IIRC, up to 96 lanes per chip. They support being connected into a mesh; even adding shared PCIe devices like network cards, GPUs, NVMe and SAS/SATA HBAs is possible, though you might have difficulties with GPUs in particular. The others, if you are careful when buying, should do well as long as they can handle SR-IOV, since (from their view) the mesh switch they are attached to is the root, and the computers attached to the mesh take the place a VM would normally occupy.

If you could get the smallest chip they make, it should not be hard to get a PCB made for it that fans out to e.g. a set of USB-C receptacles. One such receptacle can handle 2 PCIe3 lanes, and some power/USB-2 on the side. The main reason being the relatively low price, compared to e.g. mini SAS HD, which has a similar density and twice the lanes per connector, but also a slightly higher target impedance at the upper end of the PCIe spec, whereas USB-C is specifically targeting the perfect PCIe impedance.


A quick search on Digikey shows several entries for ExpressFabric which seem like what you mean:

https://www.digikey.com/products/en?keywords=expressfabric

They all have a part status of "Discontinued at Digi-Key" though. Not seeing any further info of where they _are_ available either. :/


You found the chips I was referring to. If you want them, you should probably ask Broadcom, as they should either be able to point you to where you can still buy them, or to a replacement. And in case they don't help you, you can still ask sales at competitors, as some might have something suitable. This certainly looks like they'd do a run for you if you want some, albeit with a more significant MOQ (probably in increments of one wafer, with the best price to you if you agree to take all good dies of those they make for you). According to their website [0], they are still active products. It just seems that no one really wants to buy them, due them being somewhat weird. There is a reference platform [1], which is a 32-port, as far as I can tell PCIe3 x4 on QSFP ports, 1U TOR switch. It goes with 2-port cards that go into the servers and contain re-timers. Broadcom claims one can connect the switch with the server cards using optics or copper, but the switch is $11k at mouser. The technology apparently offers virtual Ethernet NICs and 8 QOS classes.

Did you want to buy such technology?

[0]: https://www.broadcom.com/products/pcie-switches-bridges/expr... [1]: https://www.broadcom.com/applications/datacenter-networking/... (bottom of the page, grep "ExpressFabric Reference Platform")


> Did you want to buy such technology?

Nope, just curious.

For most of the use cases I can think of that these would suit, Infiniband seems like it would also work and that's fairly widely available already.

That being said, there are probably use cases this would suit better. They're just not coming to mind easily. :)


Cheaper than Infiniband, I assume. Even if you buy the TOR switch, from what I can tell.

Also, the latency is much smaller than for Infiniband, and it is nice to e.g. combine blades with NICs or similar configurations, without the NICs being special multi-host ones like Mellanox offers.


I didn't think about the specialized diagnostic gear. That could've priced me out.

As for your other idea, a distributed shared memory (DSM) setup was one of my potential applications. I was looking at Beowulf clusters and old MPP systems, trying to find ways to get more CPUs with large RAM and simplified programming. I found quite a few software implementations of the DSM model. Infiniband and Myrinet were the main options in networking. So the PCI idea was partly about connecting them with something that already had DMA. That at least led me to NUMAscale's products for scaling AMD systems.


Note this comment in this subthread: https://news.ycombinator.com/item?id=17276861


That's possible with PLX/Broadcom ExpressFabric but those switches have fallen far behind Ethernet. The fastest PCIe switch is ~768 Gbps while ~6500 Gbps Ethernet switches are available.


You need to consider that these PCIe switches you talk about are (AFAIK, at least the big ones are) single-die, compared to the probably even multi-board construction of the Ethernet switches. They do support, and are in at least one document advertised for, use in a switching fabric, where you create a mesh between the nodes you try to connect. Basically, instead of plugging every computer into the same PCIe switch, you fill the PCIe switches half full with computers - maybe even a little less, or add some that don't connect directly to computers - and add interconnections between the switches. The mathematics that describe how one would want to arrange the small switches date back at least to when automated telephone switches became common. A similar version, though with separated inputs and outputs, is known as a crossbar switch.

Did you by chance come across something smaller than a full-blown, PCIe3x4 per node, TOR switch? I don't need that much capacity, but a cheap enough high speed switch would allow using some slower (individually) servers that got kicked out from the previous owner, instead of getting new, larger machines. A certain level of microservice-thinking/scale of (distributed) Haskell/Erlang does often require much faster connections than one would like to afford with Ethernet, to keep the coding overhead for the sharding to a minimum. At a certain point one can stop worrying about minimizing transfers, and just rely on page-faulting and similar things for less-critical (in a performance sense) data accesses.


I was talking about single-chip switches. The thing about Clos is that it has non-linear scaling, so if you use a chip with lower capacity you end up needing far more chips.
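Rough illustration of that scaling with a two-tier folded Clos (leaf-spine) built from radix-k chips, each leaf using half its ports for hosts and half for uplinks (my simplification, not a vendor design):

    #include <stdio.h>

    /* Non-blocking leaf-spine from radix-k chips: k leaves, k/2 spines,
     * k/2 hosts per leaf -> k*k/2 hosts from 3k/2 chips. */
    int main(void)
    {
        const int radices[] = { 24, 48, 96 };
        for (int i = 0; i < 3; i++) {
            int k = radices[i];
            int hosts = k * k / 2;
            int chips = k + k / 2;
            printf("radix %3d: %5d hosts max from %3d chips (%.1f chips per 100 hosts)\n",
                   k, hosts, chips, 100.0 * chips / hosts);
        }
        return 0;
    }

Halving the radix roughly doubles the chips needed per host, and once k*k/2 falls below the host count you need another tier on top.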

All of these chips have smaller sizes available.


There was a product in the late nineties that did P2P links over the ATA66/100 IDE interface. No drivers needed, just write/read to a virtual filesystem on a simulated drive.


10gb on VDI is fun when your storage can absorb line rate spikes.

Things break in new ways and are fun to troubleshoot.


This sounds kind of interesting. What happened?


The constraints moved!

We had a write-intensive workload for a user community that would generate short (~5s) periods of high traffic (6-9Gb/s) against a file server - this would impact everyone because sessions would be interrupted. It was devilishly difficult to troubleshoot because it was lost in the statistics, aggregated in 30-100s chunks.
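Quick illustration of why it hides: take a 9Gb/s burst lasting 5s against a 60s stats window (numbers from the ranges above):

    #include <stdio.h>

    /* A short burst contributes almost nothing to a long averaging window. */
    int main(void)
    {
        double burst_gbps = 9.0, burst_s = 5.0, window_s = 60.0;
        printf("%.0f Gb/s for %.0f s averages out to %.2f Gb/s over %.0f s\n",
               burst_gbps, burst_s, burst_gbps * burst_s / window_s, window_s);
        return 0;
    }

0.75 Gb/s on a graph looks like nothing, while the 5-second reality is a saturated link.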

Fixing that revealed another bottleneck, this time in the file server process itself.

It’s one of these things that was interesting because the storage with fancy SSD can handle the workload no sweat. In the old days, monitoring would catch the io constraint first.


Hmm. This highlights the importance of statistics AND insane threshold edge detection. Good one to file away, thanks.


Yeah, we’re working on a way to grab real-time stats during peak load conditions to help correct for outliers lost in averages.

Another key thing is to be aware of where heavy write-load generators are on SMB, and scope the SMB shares around them to contain failure domains. There are limits, and the vendors (Microsoft or 3rd-party NAS) have a hard time finding them too.


Well, that was... bellicose. No company is your friend, but some are still better than others. It is true that companies work in their own self-interests, but those interests may differ. Apple has done quite well without selling user data, and has cultivated a brand that makes money doing something else. Facebook has not. The enemy of my enemy is my friend, even if they work in their own self-interest. It is in Apple's self-interest to keep customers, and they might lose more than they gain by selling user info. Again, no company is perfect. "Applying constant pressure" isn't perfect either, as governments are at least as bad as any company - their motivation is to stay in office. Apple is at least a better company than, say, Facebook.


Um, wrong topic?

This one is about "Some notes on high speed networking on PCs" - not sure how Apple comes into it?


Looks like a comment destined for https://news.ycombinator.com/item?id=17276184



