I'll assume you're in the same position as me: you want to write a compiler for fun, and to learn at least a little about each stage of it. So you don't want merely to write a plugin for an existing compiler. And you want to avoid using too many existing compiler modules, except where you can understand exactly what they're doing. In my case I am using bison, which is a slight exception because it is doing at least a few things that I'm taking for granted (I did study grammars, etc. at university, but that was a long time ago). On the other hand, parser generators are common enough that it is a compiler stage worthy of interest: bison may stop me writing much parsing code, but it is giving me a chance to write parser action code.
Contrary to some advice, I'd say you can get started without knowing everything about your input and target languages. With some exceptions, language features are not infeasibly hard to add later. One exception I've discovered is control flow: if you write most of the later manipulations to work on a tree form, it can be difficult to cater for statements like break, continue, and goto (even the structured form). So I'd recommend translating from the tree to a CFG (control-flow graph) before doing too much of that.
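To make the control-flow point concrete, here is a minimal sketch (in Python for brevity, not my actual compiler code) of flattening a statement tree into labels and jumps, which is the usual first step before cutting the code into CFG basic blocks. The tuple-based tree shape, the function name `lower`, and the "jmp" instruction are assumptions made up for this illustration; only "bz" and the "label1:" style come from the output format described in this answer.

```python
import itertools

def lower(stmts):
    """Flatten a list of tree statements into label/jump instructions."""
    counter = itertools.count(1)
    out = []

    def new_label():
        return f"label{next(counter)}"

    def walk(stmt, loop):
        # loop is None outside any loop, else (head_label, exit_label)
        kind = stmt[0]
        if kind == "while":                 # ("while", cond_var, body)
            _, cond, body = stmt
            head, exit_ = new_label(), new_label()
            out.append(f"{head}:")
            out.append(f"bz {cond} {exit_}")   # leave the loop when cond is zero
            for s in body:
                walk(s, (head, exit_))
            out.append(f"jmp {head}")          # "jmp" is an assumed opcode
            out.append(f"{exit_}:")
        elif kind == "if":                  # ("if", cond_var, body)
            _, cond, body = stmt
            end = new_label()
            out.append(f"bz {cond} {end}")
            for s in body:
                walk(s, loop)               # an if does not change the enclosing loop
            out.append(f"{end}:")
        elif kind in ("break", "continue"):
            if loop is None:
                raise SyntaxError(f"{kind} outside loop")
            head, exit_ = loop
            out.append(f"jmp {exit_ if kind == 'break' else head}")
        else:                               # ("op", text): passed through unchanged
            out.append(stmt[1])

    for s in stmts:
        walk(s, None)
    return out

program = [("while", "c", [("if", "d", [("break",)]),
                           ("op", "ADD x, x, y")])]
print("\n".join(lower(program)))
```

Once everything is jumps to labels, break and continue stop being special cases: they are just edges to the loop's exit or head, and later stages never need to know which enclosing loop they came from.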
- Write a parser for some reasonably stable subset of the input.
- Add actions that build a useful in-memory representation of it (typically a tree), and get it to print that.
- Get it to print it in a form that looks a bit like the target language. In my case, the tree node for "x = y + z;" is printed as "ADD x, y, z"; "if (c) { ... }" turns into "bz c label1", then the translation of "...", then "label1:".
- Add optional stages in the middle. These can be optimisation and/or checking stages. You might need one that prepares the representation for easy code generation: I've got a stage that reduces overly complex expressions by adding temporary variables. (This is actually necessary for the output, because the "ADD" instruction can only work on simple inputs.)
- Go back and improve any part of it. E.g. put some checks in the parser actions so errors can be detected at that stage (use of undeclared variables, for instance).
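The middle steps above (build a tree, reduce complex expressions with temporaries, print something close to the target language) can be sketched roughly as follows. This is a Python illustration under stated assumptions, not my bison-based code: the tuple tree shape, the names `compile_stmts` and `OPCODES`, the temporaries "t1", "t2", ..., and the "SUB", "MUL" and "MOV" opcodes are all invented here; only "ADD", "bz" and the label form come from the examples in the list.

```python
import itertools

# Only ADD appears in the text above; the other opcodes are assumed.
OPCODES = {"+": "ADD", "-": "SUB", "*": "MUL"}

def compile_stmts(stmts):
    """Translate tree statements into three-address-style text lines."""
    temps = itertools.count(1)
    labels = itertools.count(1)
    out = []

    def emit_binop(dest, expr):
        op, left, right = expr              # ("+", left, right)
        a = operand(left)
        b = operand(right)
        out.append(f"{OPCODES[op]} {dest}, {a}, {b}")

    def operand(expr):
        """Return a simple operand for expr, adding a temporary if needed."""
        if isinstance(expr, str):           # a variable or literal is already simple
            return expr
        t = f"t{next(temps)}"
        emit_binop(t, expr)
        return t

    def stmt(s):
        if s[0] == "assign":                # ("assign", target, expr)
            _, target, expr = s
            if isinstance(expr, str):
                out.append(f"MOV {target}, {expr}")   # assumed opcode
            else:
                emit_binop(target, expr)
        elif s[0] == "if":                  # ("if", cond_expr, body)
            _, cond, body = s
            c = operand(cond)
            label = f"label{next(labels)}"
            out.append(f"bz {c} {label}")   # skip the body when the condition is zero
            for inner in body:
                stmt(inner)
            out.append(f"{label}:")
        else:
            raise ValueError(f"unknown statement: {s[0]}")

    for s in stmts:
        stmt(s)
    return out

# x = a + b * c;  becomes two instructions via a temporary
print(compile_stmts([("assign", "x", ("+", "a", ("*", "b", "c")))]))
# prints ['MUL t1, b, c', 'ADD x, a, t1']
```

In a real compiler the temporary-introducing pass and the printer would probably be separate stages working on the same representation; they are folded together here only to keep the sketch short.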
It is surprisingly easy to get most of this done, if you take an iterative approach.
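As one example of such an iteration, a check for use of undeclared variables might start out as small as the sketch below. It is Python again and purely illustrative: the ("decl", name) node, the single flat scope, the rule that digit strings are literals, and the name `check_declared` are all assumptions for this example, not part of my compiler.

```python
def check_declared(stmts):
    """Return error messages for variables used before being declared."""
    declared = set()
    errors = []

    def names(expr):
        # Yield every leaf name in an expression tree of ("op", left, right) tuples.
        if isinstance(expr, str):
            yield expr
        else:
            _, left, right = expr
            yield from names(left)
            yield from names(right)

    def require(name):
        # Digit strings are treated as literals (an assumption of this sketch).
        if not name.isdigit() and name not in declared:
            errors.append(f"undeclared variable: {name}")

    def visit(s):
        if s[0] == "decl":                  # ("decl", name)
            declared.add(s[1])
        elif s[0] == "assign":              # ("assign", target, expr)
            _, target, expr = s
            require(target)
            for n in names(expr):
                require(n)
        elif s[0] == "if":                  # ("if", cond_expr, body)
            _, cond, body = s
            for n in names(cond):
                require(n)
            for inner in body:
                visit(inner)                # flat scope: no new scope per block

    for s in stmts:
        visit(s)
    return errors

print(check_declared([("decl", "y"),
                      ("assign", "x", ("+", "y", "z"))]))
# prints ['undeclared variable: x', 'undeclared variable: z']
```

With a parser generator, the same bookkeeping could live in the parser actions instead of a separate pass, so the error is reported as soon as the offending name is parsed.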