Bootstrapping trust in compilers

jperkin · on Dec 8, 2015

Trust is certainly one of the issues when languages migrate to requiring themselves to bootstrap.

However, by far the most infuriating (and one I run into frequently in my line of work, hence the anger) is when you are trying to get the language running on a platform for which binary bootstraps do not yet exist.

Portability matters. If you want your language to be useful and available to as many people as possible, why would you seek to artificially limit the number of platforms it can be built on, just so you can avoid writing the bootstrap in C? I'm sure there is some amount of pride on the part of the language author when their language can bootstrap itself, but it certainly isn't a pragmatic decision.

It's especially frustrating when the bootstrap requirement itself changes so that only very recent versions of the language are sufficient (e.g. GHC), leaving the porter to have to reach back into the archives and carefully plot a path through building multiple versions from the original C-based bootstrap until they finally get to master.

This is painful, painful work, and then has to be done all over again for e.g. 32-bit vs 64-bit. It doesn't have to be like this.

iamsohungry · on Dec 8, 2015

> However, by far the most infuriating (and one I run into frequently in my line of work, hence the anger) is when you are trying to get the language running on a platform for which binary bootstraps do not yet exist.

> Portability matters. If you want your language to be useful and available to as many people as possible, why would you seek to artificially limit the number of platforms it can be built on, just so you can avoid writing the bootstrap in C? I'm sure there is some amount of pride on the part of the language author when their language can bootstrap itself, but it certainly isn't a pragmatic decision.

This problem is easily solved by having a rule that each new version of the compiler must compile with an older version of the compiler. The first few versions are written in C, and once the compiler is self-hosting, new versions of the compiler are compiled with older versions of the compiler. This gives you a path from C to the current version of the language.

In practice, this happens very naturally, because it's how compilers are usually written. Assuming you have version control and the first versions of the compiler are written in C, you usually have the ability to bootstrap up from C. The only thing missing in many projects is documentation and tooling for that process.
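
As a rough illustration of the tooling that is usually missing, here is a minimal sketch that walks such a chain back to C. The version numbers and the `MIN_BOOTSTRAP` table are made up for the example; a real project would have to record which older release can build each newer one.

```python
# Hypothetical table: for each release, the OLDEST release that can build it.
# "c" stands for a plain C toolchain. All version names are illustrative only.
MIN_BOOTSTRAP = {
    "1.0": "c",    # written in C
    "2.0": "1.0",  # first self-hosted release
    "3.0": "1.0",  # still builds with 1.0, so 2.0 can be skipped
    "4.0": "3.0",
}


def bootstrap_chain(target, requirements):
    """Return the build order from the C toolchain up to `target`."""
    chain = []
    seen = set()
    current = target
    while current != "c":
        if current in seen:
            raise ValueError(f"cycle at {current}: no path back to C")
        if current not in requirements:
            raise ValueError(f"no bootstrap record for {current}")
        seen.add(current)
        chain.append(current)
        # Jump to the oldest acceptable builder to keep the chain short.
        current = requirements[current]
    chain.append("c")
    chain.reverse()
    return chain


print(" -> ".join(bootstrap_chain("4.0", MIN_BOOTSTRAP)))
```

With the table above this prints `c -> 1.0 -> 3.0 -> 4.0`. A release with no record, or a set of releases that only build each other, raises an error instead of returning a path.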

jdudek · on Dec 8, 2015

Pardon my ignorance, but why is cross-compilation not an option in this case?

jperkin · on Dec 8, 2015

Cross-compilation is sometimes an option, but not necessarily a good one.

It is usually a significantly manual process (I now have to download and install an OS which is supported by the runtime and build a cross-compiler for my target platform, which may or may not even be possible, or even work), isn't always supported by the runtime, usually requires changes to the build procedure, often requires special patches - and at the end of all that users are still left with either having to repeat all this themselves, or trust someone else's binary bootstrap.

Compare that to a simple, portable, C-based bootstrap where any user can type 'make' and ensure provenance directly from source. For what gain?

Uninformed? Perhaps. I haven't designed and written my own language, and I'm certainly no expert. I'm just someone who gets asked to port these new languages to platforms which aren't Linux or OSX, and it's a lot of hard work. I just know that if I did write my own, it would look a lot more like perl/python/lua than ghc or go from a build perspective.

e40 · on Dec 8, 2015

It is and that's how it's done. The rant you replied to was just uninformed.

nickpsecurity · on Dec 8, 2015

Typically, yes. My method is for high trustworthiness but people merely concerned with reliability can do this. Got a tool that you've used to successfully make your other tools? Use it on the next tool. Crazy idea, eh?

nickpsecurity · on Dec 8, 2015

I described how to handle this and trustworthy compilers in general here:

https://news.ycombinator.com/item?id=10182282

My replies to "jeffreyrogers" have details and links to examples. You do it bottom-up. I disagree with using Forth as it's a weird language & that reduces the number of people that will verify it. One would be better off with P-code given it was successfully used to get Pascal on 70 or so architectures. Wirth and Jurg later used the same approach in the Lilith workstation with M-code and Modula-2. They were able to put together a CPU, high-level assembler (M-code), high-level language, compiler, OS, editor, and so on in around 2 years by keeping it simple and consistent. Something like that which maps to what people already know and do.

So, again, here's your model:

1. Portable stack or register VM that's ultra-simple plus similar to language targeting it.

2. Implementations of that diversified by authors, OS's, and HW.

3. Subset of language (or simple HLL like Modula-2) coded in whatever you need to get initial compiler working.

4. That same compiler re-coded in language of trusted VM and run on all targets to ensure same results (equivalence checks).

5. Use that binary to produce an executable from compiler's HLL source and equivalence check again.

Note: Did I word 5 less confusingly than most people do at this point? I put effort into avoiding "compile the compiler with compiler etc." ;)

6. Use the binary from No 5 to compile future versions of the compiler written in a subset of its own language. Should continue using a subset for easier understanding and correctness. Check language features with testing suite and sample applications instead of with overly complicated compiler.

So, there you go. Easy stuff already proven by Wirth et al. Not worth another 100 write-ups. Just use what we know. The real problem worth lots of discussion and investigation is certified, secure/robust compilation. That is a difficult problem open to investigation with new, interesting results each year. Bootstrapping compilers for masses? That's so 1971. ;)
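
The equivalence checks in steps 4 and 5 could be sketched roughly as below: build the same thing in several independent environments, hash each result, and accept it only if every digest matches. The environment labels and byte strings are placeholders invented for the example, not output from any real toolchain.

```python
import hashlib


def digest(data: bytes) -> str:
    """SHA-256 hex digest of a build output."""
    return hashlib.sha256(data).hexdigest()


def equivalence_check(outputs):
    """Group build outputs by digest.

    `outputs` maps an environment label to the bytes produced there.
    Returns (ok, groups): ok is True only when every environment
    produced identical bytes; groups maps digest -> sorted labels.
    """
    if not outputs:
        raise ValueError("need at least one build output to compare")
    groups = {}
    for label, data in outputs.items():
        groups.setdefault(digest(data), []).append(label)
    for labels in groups.values():
        labels.sort()
    return len(groups) == 1, groups


# Hypothetical outputs from three VM implementations (labels are made up).
builds = {
    "vm-A/os-1/hw-x": b"compiler-binary-demo",
    "vm-B/os-2/hw-y": b"compiler-binary-demo",
    "vm-C/os-3/hw-z": b"compiler-binary-demo",
}
ok, groups = equivalence_check(builds)
print("equivalent" if ok else "MISMATCH", groups)
```

A mismatch does not say which environment is wrong, only that at least one differs, so the grouped labels are returned to show where to start looking. This also assumes the builds are deterministic; timestamps or embedded paths in the output would break a byte-for-byte comparison.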

Jabbles · on Dec 8, 2015

I highly recommend this writeup of someone bootstrapping half a language from raw hexadecimal upwards:

http://homepage.ntlworld.com/edmund.grimley-evans/bcompiler....

civodul · on Dec 10, 2015

This is one of the discussions we had at the Reproducible Builds Summit last week: https://lists.gnu.org/archive/html/guix-devel/2015-12/msg001... .

In GNU Guix, we don't go as far as the author suggests (starting from a FORTH-like VM, then building a small Lisp, etc.), but we've been thinking about going in that direction: We already have Guile Scheme at the bottom, with which we can implement a range of tools, ranging from HTTP/FTP clients to ELF parsers, and more. We could imagine having (possibly feature-limited) variants of some of the bootstrap tools, written in Scheme, for the purpose of building the "real" tools.

Our current bootstrap looks like this: https://www.gnu.org/software/guix/manual/html_node/Bootstrap... .

OR13 · on Dec 8, 2015

Obligatory reference to previous HN discussion of Reflections on Trusting Trust by Ken Thompson [1].

I'd be interested in any war stories or links to compilers verified with things like: Cryptol [2], Coq [3] or Idris [4].

I've seen Cryptol prove equivalence for cryptographic algorithms written in C and Java. Would love to learn more about how this approach can or can't be applied to compilers.

1. https://news.ycombinator.com/item?id=2642486

2. http://www.cryptol.net/

3. https://coq.inria.fr/

4. http://www.idris-lang.org/

steveklabnik · on Dec 8, 2015

I enjoyed this slightly snarky response to this general issue on Reddit: https://www.reddit.com/r/rust/comments/2tdsev/compilers_with...

Regardless, this is one of the reasons that I'd really like to have a second Rust implementation exist.

xbtcdev · on Dec 8, 2015

The trouble with KTH is that somebody could go through all of this song and dance, and it still wouldn't mean a thing because how do you trust them?

rejschaap · on Dec 9, 2015

You don't trust them. The result of this endeavor is not a trustworthy compiler, the result is a procedure to generate one. Every step in the procedure can and should be verified independently. What this buys you is a procedure that produces a trustworthy compiler given your initial environment is trustworthy. The latter still being an issue of course.

Confiks · on Dec 8, 2015

Here is also a very interesting talk about reproducible builds and trusting compilers given at the Chaos Communication Congress: https://www.youtube.com/watch?v=5pAen7beYNc

They also talk about using multiple and very old compilers to bootstrap trust.

SamReidHughes · on Dec 8, 2015

A point made at https://news.ycombinator.com/item?id=6360232 is that it suffices to use an older compiler and system, assuming that the newer one was developed independently.

_lce0 · on Dec 8, 2015

Given that today hardware is cheap and fast, why are we still using compiled languages?

Plus, using languages that require source code at runtime helps lower the barrier for newcomers.

dave2000 · on Dec 8, 2015

1) nobody cares about newcomers. fuck 'em 2) you're crazy if you think there's no advantage in compiling code. scripts are fast, compiled code is faster. we love faster code, don't we kids? 3) if you wanna be an html hairdresser, sit and ponce about with some lovely, lovely html, this week's javascript framework and css. some of us are writing games for complicated consoles, (limited) android devices or emulating system calls in code converted from old, shitty languages into c++.

come back in 30 years.

paulddraper · on Dec 9, 2015

Well, ultimately, some programs must be native.

But beyond that, off the top of my head, I'd say the primary reasons are:

* performance

* memory usage (think resource restricted environments)

* performance

* no extra runtime needed

* performance

* start up overhead

* predictable performance (think real-time stuff)

* access to ubiquitous native APIs

* performance

Look at programs like Nginx or PostgreSQL.

They make crazy effective performance optimizations that are really only available to native code. Of course, not every program is a web server or a database, but they are compiled for good reasons, and those reasons apply to many other programs.

rwallace · on Dec 8, 2015

By all means use Python if that's what you prefer! But it doesn't solve the problem discussed in the article: what if the Python binary has been compromised?

_lce0 · on Dec 8, 2015

Yep I agree. I was talking about compiled _high-level_ languages.