
To someone who knows more about the subject - am I correct that actually lexing/parsing the language is a small part of creating a compiler compared to AST optimization?

Optimizing e.g. C to produce significantly better assembly (especially using some of the more exotic instructions) seems like a much more daunting task



As others have said, yes, that's mostly the case.

Even parsing C and elaborating it to a correctly typed AST is not quite as simple as others are making it out to be, though. Getting all the implicit type conversions correct is not completely trivial, and is a popular source of bugs (see some examples in http://www.cs.utah.edu/~regehr/papers/pldi11-preprint.pdf for instance).

There are also some annoying ambiguities in the grammar and in name resolution rules that mean that sometimes it's not easy to tell whether something is supposed to be a type name or a variable name: https://jhjourdan.mketjh.fr/pdf/jourdan2017simple.pdf

C is a "simple" language in many senses of the word, but it has a lot of complex details. It's fine to gloss over them when writing a compiler for a language very similar to C for learning purposes, but getting everything just right for actual C is tough.


Yes, stuff like the `a * b` ambiguity (declaration or expression?), typedef being just another storage class specifier (thus, `static int var` and `typedef int mytype` having the exact same syntactic form), the complex rules for implicit promotions and conversions, and the mess of signed/unsigned interactions.

All of these design warts of C show up clearly when attempting to write a compiler and are not very obvious to most users of the language.


Function pointer declarations are also a fun one, especially when they appear in a function declaration as the return type or as parameters.


Typedefs are a little trickier to handle (particularly the "when do they actually take effect" part), but "the `a * b` ambiguity" is not hard, because if 'a' is a typedef or qualifier/specifier or basically anything that a declaration can start with, it's a declaration.

C++, on the other hand, is definitely far harder to parse, especially if you include things like templates.


Lexing, parsing and generating the AST are comparatively straightforward and a very small part of a real optimising compiler.

The hard stuff is everything related to the code generation – register allocation, instruction selection and scheduling, dealing with any peculiarities of the target ISA and microarchitecture, etc. Your goal here is to preserve the semantics of the HLL constructs while emitting the optimal sequence of instructions for the CPU to execute.


It depends on your source and target. The two parts are practically different disciplines. If you target LLVM IR or naive assembly as output, most of your work will be in translating semantics, and depending on the language's complexity, there can be a lot to implement. Or not that much, in a simple language like C.

The back end is mostly about transforms that preserve meaning, with a long tail of attempts to prove things about the code to make things faster in certain cases.

You can also have languages with complex front ends that need lots of work, whether it's preserving runtime type information (a big serialisation job that's not strictly back end work), interesting type systems that also are trying to prove things about the code, features that require increasingly elaborate transformations before they can be handed off to any common back end (like closures, iterators, async, continuation passing style).


Absolutely. That's why LLVM exists - you write the front end (the comparatively easy part) and emit non-optimised LLVM IR, and LLVM does the hard optimisations.



