I'm not surprised that going back to 3Gbps works; modern high frequency signaling is so close to the boundaries of what works that even things like the length of the SATA cable and how it's oriented, and what temperature components are at, can mean the difference between a working link and one that fails when the right sequence of bits gets sent. DTLE is a way of tuning the transmitter output so that when the signal gets received at the other end, it's as "clear" as possible.
Even early SDRAM controllers had similar tunable settings for clock delays and such, because different DIMMs may vary slightly --- upon POST, the BIOS would set all the settings to nominal values, then nudge each one in one direction while reading/writing pathological data until errors occurred; then nudge them in the other direction until errors occurred, and finally settle on the average of the two extremes. I suspect a similar process needs to be done here, and if you reverse-engineered the BIOS further you would find the algorithm to do it.
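(For illustration only, that sweep looks roughly like this in C, with hypothetical set_timing()/run_pattern_test() helpers standing in for the real memory-controller interface and pattern test:)

```c
/* Hedged sketch: find the working window for one tunable timing parameter
 * by nudging down and up from a nominal value until the pattern test fails,
 * then settling on the midpoint. set_timing() and run_pattern_test() are
 * hypothetical placeholders for the real controller interface. */
#include <stdbool.h>

extern void set_timing(int param, int value);   /* hypothetical */
extern bool run_pattern_test(void);             /* hypothetical: walking 1s/0s etc. */

int calibrate_timing(int param, int nominal, int min, int max)
{
    int lo = nominal, hi = nominal;

    /* Nudge downward until the pathological pattern starts failing. */
    while (lo > min) {
        set_timing(param, lo - 1);
        if (!run_pattern_test())
            break;
        lo--;
    }

    /* Nudge upward until it fails in the other direction. */
    while (hi < max) {
        set_timing(param, hi + 1);
        if (!run_pattern_test())
            break;
        hi++;
    }

    /* Settle on the middle of the passing window. */
    int best = (lo + hi) / 2;
    set_timing(param, best);
    return best;
}
```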
Big kudos to the Purism folks for trying to figure these things out. And big fuckings to Intel for keeping this stuff NDA'd --- it only makes it harder for those trying to buy and use your products (I hope someone leaks it all eventually...)
> Even early SDRAM controllers had similar tunable settings for clock delays and such,
Define "early". F00F-era Pentium, 486, 8086...?
> upon POST, the BIOS would set all the settings to nominal values, then nudge each one in one direction while reading/writing pathological data until errors occurred; then nudge them in the other direction until errors occurred, and finally settle on the average of the two extremes.
Is this the seemingly-pointless "memory test" all computers do?
I always thought that just zeroed RAM. TIL it's doing similar things to modem line training. Wow.
Now I remember - I have an old 400MHz Celeron-based system I used years ago with an AMI BIOS that would occasionally recommend a different "RAS-to-CAS delay" on startup. I'd go and change it and reboot, and then a little while (days/weeks) later it would recommend a slightly different setting. It would always alternate between the two delays. I was pretty sure the memory in the system was on its way out, and I'd be sad whenever I saw the message, haha.
> Is this the seemingly-pointless "memory test" all computers do?
I'm not sure what your parent is referring to, but it is called "DDR memory training and calibration". I used to struggle with this when I was bringing up LPDDR2 on an i.MX6 embedded board.
Wow. Really, really impressive. How far back did BIOSes do this - has it always been done?(!)
I can completely understand it being a struggle now. At some point I was thinking of getting into tinkering with DDR3/4 on FPGAs (to play with some video capture ideas), but I'm beginning to not look forward to it... heheh
Also remember that the training also depends on the environmental temperature.
> I can completely understand it being a struggle now. At some point I was thinking of getting into tinkering with DDR3/4 on FPGAs
Understanding the different register values the tool recommends, and validating them, is a real pain, although some vendors' datasheets (Micron's, for example) explain things nicely.
Xilinx MIG has a really dense guide and you really need to understand your timing etc. It is really not for the faint-hearted.
In the case of the i.MX6 SoC, I do the training and calibration and the recommended register values are provided to U-Boot. When U-Boot starts up, it runs from the SoC's built-in SRAM, which is very small (a few kilobytes). Once the first-stage U-Boot is running, it sets up the memory controller registers and then boots using this RAM.
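(For illustration, a minimal sketch of that SPL step in C, assuming a generic memory controller --- the base address, register offsets, and calibration values below are hypothetical placeholders, not the real i.MX6 MMDC map:)

```c
/* Hedged sketch of the SPL step described above: run from on-chip SRAM,
 * program the DDR controller with the values produced by the training tool,
 * then continue controller init and load the next stage into DRAM.
 * Base address, offsets, and values here are hypothetical placeholders. */
#include <stdint.h>

#define DDRC_BASE 0x40000000u          /* assumed controller base address */

struct ddr_calibration {
    uint32_t write_leveling;
    uint32_t dqs_gating;
    uint32_t read_delay;
    uint32_t write_delay;
};

/* Values copied from the vendor's training/calibration tool output. */
static const struct ddr_calibration cal = {
    .write_leveling = 0x001F001F,
    .dqs_gating     = 0x42350235,
    .read_delay     = 0x40404040,
    .write_delay    = 0x40404040,
};

static inline void write32(uint32_t addr, uint32_t val)
{
    *(volatile uint32_t *)(uintptr_t)addr = val;
}

void spl_dram_init(void)
{
    /* Program the calibration registers before any DRAM access. */
    write32(DDRC_BASE + 0x10, cal.write_leveling);
    write32(DDRC_BASE + 0x14, cal.dqs_gating);
    write32(DDRC_BASE + 0x18, cal.read_delay);
    write32(DDRC_BASE + 0x1C, cal.write_delay);
    /* ...remaining controller init, then load full U-Boot into DRAM. */
}
```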
When the DDR/LPDDR training happens, it is supposed to be left running overnight for multiple iterations. Temperature also matters for getting proper values, which is one of the reasons the multiple-iteration testing is done overnight.
In my case I had some really incompetent board designers who just deflected blame. In the end, since there may have been layout issues, I was forced to lower the clock speeds (halve the clocks) for the DDR interface.
The original Xbox from Microsoft was basically a slightly modified x86 computer (complete with actually-USB-ports-with-a-proprietary-plug controller ports)[0], where the 64 MB of DDR SDRAM was soldered directly onto the motherboard.
Now, it's widely known within the Xbox "scene" that the quality of the RAM varies a lot between different machines. The speculation was that Microsoft bought the cheapest bottom-of-the-barrel RAM it could find in bulk, which meant they couldn't hardcode the memory timings.
Instead, on boot, it clocks the RAM at the highest speed and does a quick write-and-read test. If it fails, it clocks it down a step and repeats the test until it finds a stable frequency.
This explains why sometimes the same game would run perfectly on one Xbox while stuttering on the other one, even when swapping DVD drives.
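A minimal sketch of that loop in C, assuming hypothetical clock-setting and test helpers (the real Xbox boot code isn't public, and the step table below is made up):

```c
/* Hedged sketch of the downclock-and-retest loop described above. The clock
 * and test helpers are hypothetical, and the step table is made up; the real
 * Xbox boot code is not public. */
#include <stdbool.h>
#include <stddef.h>

extern void set_ram_clock_mhz(int mhz);   /* hypothetical */
extern bool quick_ram_test(void);         /* hypothetical write-and-read check */

static const int clock_steps_mhz[] = { 200, 183, 166, 150 };   /* made-up steps */

int pick_stable_ram_clock(void)
{
    for (size_t i = 0; i < sizeof(clock_steps_mhz) / sizeof(clock_steps_mhz[0]); i++) {
        set_ram_clock_mhz(clock_steps_mhz[i]);
        if (quick_ram_test())
            return clock_steps_mhz[i];    /* first speed that passes wins */
    }
    return -1;    /* nothing stable: report a hardware fault */
}
```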
> Is this the seemingly-pointless "memory test" all computers do?
I think that the classic case was that the computer wrote and read to/from every memory address to check that the memory reported present was actually there. I'm not positive, but I think that most machines these days do a cut-down version of that to save time.
I also think that auto-adjusting the timing is probably a separate process.
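For illustration, a minimal sketch of that classic exhaustive check in C --- real firmware also alternates more patterns and, as noted, often tests only a subset to save time:

```c
/* Hedged sketch of the classic exhaustive check: write a pattern to every
 * word, read it back, and flag anything that doesn't match. Real firmware
 * uses more patterns (walking bits, etc.) and often covers only part of
 * the address space to save time. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

bool memory_test(volatile uint32_t *base, size_t words)
{
    const uint32_t patterns[] = { 0x55555555u, 0xAAAAAAAAu };

    for (size_t p = 0; p < 2; p++) {
        for (size_t i = 0; i < words; i++)
            base[i] = patterns[p] ^ (uint32_t)i;            /* mix in the address */
        for (size_t i = 0; i < words; i++)
            if (base[i] != (patterns[p] ^ (uint32_t)i))
                return false;                               /* bad or missing memory */
    }
    return true;
}
```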
> I have an old 400MHz Celeron-based system I used years ago with an AMI BIOS that would occasionally recommend a different "RAS-to-CAS delay" on startup.
As far as I know, that's roughly the era when this was introduced --- late Pentium/early Pentium II, mid to late-90s.
The "memory test" happens after the tuning (which in my experience only touches limited portions of the whole address space) but if you watch carefully and have multiple, slightly differing (or even failing) memory modules, you may see it repeat once or twice as it encounters an error and retries the tuning.
Regarding the testing only touching limited portions of the address space, that reminds me of my early 90s vintage 486 (which I still have! :D). It would make the PC speaker tick as it counted out RAM. The first few ticks would be a little slow, then the rest would run quicker. It took me a few years to realize that it was slowly counting out the first 640K, then racing to the rest of the installed 8MB.
I remember very well (this was the first computer I owned that I've been able to keep around... somewhere, lol) that there were exactly four "slow" ticks and then the rest were faster. I definitely have to find all the bits for that machine sometime - I just tried to reproduce the ticking sound with `beep` but failed, and Googling to figure out what frequency and duration the ticks might have had was perhaps predictably useless.
On one of our boards, we had intermittent USB failures (sometimes after 5 minutes, sometimes after 13 days of continuous operation) that were ultimately caused by slightly misaligned/misspaced USB D+/D- traces. We ended up having to force USB high speed devices into full speed mode via a debug register (not unlike the article's case). Makes me feel slightly better that even people much more competent than I end up having to use such workarounds.
If I were to guess, given the timing tolerances for USB 2.0, one line was slightly longer than the other on the board. That would mean the transition on D- happened at a different time than the one on D+, causing corrupted data.
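Back-of-the-envelope, assuming a typical ~7 ps/mm propagation delay on FR-4 and a hypothetical few-millimetre mismatch (both assumptions for illustration, not measurements of that board), compared against the ~2.08 ns bit time of 480 Mb/s high speed signaling:

```c
/* Back-of-the-envelope intra-pair skew estimate for the scenario above.
 * The propagation delay and the length mismatch are assumptions for
 * illustration, not measurements of that board. */
#include <stdio.h>

int main(void)
{
    double ps_per_mm   = 7.0;           /* assumed FR-4 propagation delay */
    double mismatch_mm = 5.0;           /* hypothetical D+/D- length mismatch */
    double skew_ps     = ps_per_mm * mismatch_mm;
    double hs_bit_ps   = 1e6 / 480.0;   /* USB high speed: 480 Mb/s ~ 2083 ps/bit */

    printf("skew: %.0f ps (%.1f%% of a high-speed bit time)\n",
           skew_ps, 100.0 * skew_ps / hs_bit_ps);
    return 0;
}
```

The skew itself is small next to a whole bit time; the trouble is that the intra-pair skew budget a high speed receiver tolerates is much tighter than a full bit, which is why a few millimetres of mismatch can matter.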
Every time I've had to deal with troubleshooting weird firmware or hardware (and good luck sometimes knowing which it is), I've been continually amazed at the giant mess that's being pushed as commercial or "enterprise" equipment. It appears every product enters a special phase of a few weeks to a few months right before shipping where time constraints force whoever is responsible for the code to wildly shovel crap fixes into place until the hardware "works".
An alternate explanation is that most of this code is written by electrical engineers who, while smart, don't quite have enough experience managing larger code projects yet and are still prone to some newbie software-developer mistakes --- and probably without good mentoring on that front, because those who have already weathered this storm and learned enough to sail through smoothly have either jumped ship for greener pastures or been promoted enough to be out of the trenches, but not enough to have the sway to fix the problem.
It is definitely not lack of experience. Some of the best coders I've known have been hardcore low level EE guys. The issue is organizational priorities and available bandwidth. As soon as the product ships the core people are moved to the next project. Maintenance is done by a few guys whose job it is not to break anything.
If there is something they can't fix, it goes back to the core guys who are already busy working on the new project and can only dedicate so much time to addressing this issue.
About the open phone: knowing that KaKaRoTo is involved, I'm starting to believe this will not be vaporware. If I had enough money I would surely put some into this project. Every time somebody talked about an open phone I just thought "another failure"; now I know this one has a real possibility of happening.
As you can see from his writing, he is ready to solve this puzzle. Anyway, good job Youness, keep going strong.
I was skeptical about those Purism products that shipped with a proprietary BIOS and the claim that they are working on coreboot and will disable the Intel ME in the future.
At least they are trying to keep the coreboot part of the promise (it's coreboot, not libreboot or librecore, so it must contain binary firmware blobs... still better than nothing).
You don't need to use Intel's FSP if Intel will tell you how to do what the FSP does. FSP doesn't do anything magical; it just happens to set up a whole lot of early boot things for you, like SDRAM configuration, microcode loading (FSP knows which instructions it's not allowed to use before microcode gets loaded), and starting the ME.
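Roughly, the early-boot sequence FSP wraps looks like this (a hedged outline written as a coreboot-style romstage sketch; the function names are hypothetical wrappers, not the actual FSP entry-point signatures):

```c
/* Hedged outline of the early-boot work FSP bundles up. The function names
 * are hypothetical wrappers, not the actual FSP entry points. */

extern void setup_cache_as_ram(void);      /* temporary "RAM" before DRAM exists */
extern void load_microcode_update(void);   /* needed before certain instructions are safe */
extern void init_memory_controller(void);  /* SDRAM training/configuration */
extern void start_management_engine(void); /* bring up the ME */

void early_boot(void)
{
    setup_cache_as_ram();
    load_microcode_update();
    init_memory_controller();
    start_management_engine();
    /* ...then tear down cache-as-RAM and jump to the next stage in DRAM. */
}
```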
For Intel Bay Trail parts (which Intel publicly says you have to use FSP for), I was once told by a BIOS/firmware vendor that they didn't use Intel's FSP because it was too slow and their customers wanted very fast boot times. No idea if they were pulling my leg or not, but Intel's FSP on Bay Trail wasn't exactly super speedy.
It would be nice having a centralized list of the hardware that uses "liberated" CPUs, just in case one pops in front of us at discounted prices, flea markets etc.
Second this! Anyone tried the trackpad? I have hopes it's not as completely terrible as most PCs'. Between that and this (cool) work on open boot processes, it might become my next laptop once my MacBook Pro ages some more.
That was a very disappointing article. The issue was not solved, and the workaround has no reason to work other than it does, meaning that it could also stop working anytime.
i am not sure how anyone with a clear conscience could ship a device which works by accident
> i am not sure how anyone with a clear conscience could ship a device which works by accident
LOL, welcome to enterprise-level "Ship now, the service contract gives us at least a few weeks to work out the kinks" hardware, where understanding how the whole system actually works is a luxury most of the engineers don't even have.
Kudos to them for giving it a good shot and continuing to look. A lot of times, unless you have money or clout to throw at some of the component originators, you're out of luck if you want real answers. The few engineers who have them are probably so busy that devoting them to figuring out what's really going on isn't worthwhile for the companies without some higher-level intervention.
More interesting information on DTLE "discrete time linear equalization" here: http://cc.ee.ntu.edu.tw/~rbwu/rapid_content/course/highspeed...