Compiling an AST to Assembly

情深已故 2021-02-04 19:01

I have an abstract syntax tree that I need to convert to assembly for a virtual machine. I don't know how best to do this, so I started using a chain of string templates. Pseudo

1 answer
  • 2021-02-04 19:42

    Disclaimer: I only have experience with x86 machine code. Other instruction sets may have, for example, different addressing capabilities, so parts of my advice might not apply. I'm sorry that I don't have time to research other instruction sets at the moment.


    Firstly, a compiler doesn't have to generate assembly as text. It's somewhat inefficient to serialise the code into assembly only to have it parsed straight away by the assembler, as you have probably realised. Some compilers do work that way (GCC, for example, writes textual assembly and hands it to a separate assembler), but many emit machine code directly. Having separate compilation and assembly phases is reasonable, but not essential.

    In the compilation phase, the two strategies I would consider are:

    • (a) generate the assembly as a tree / array of instruction objects which can symbolically refer to one another. In the assembly phase these need to be serialised into bytecode/machinecode. I'd recommend this method, even if it makes your compiler's architecture a little more complex.

    • (b) generate the assembly as machinecode/bytecode directly into a buffer, with some helper functions; in this case you don't really have a separate assembly phase. I've personally tried this method, and within the scope of a single function it's not bad, but it can cause extra difficulties because you don't know how large a function is going to be until it has been assembled.

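    I would guess that (a) is closer to how optimising compilers like GCC work internally, since they keep instructions in an intermediate representation before emitting anything, while (b) is the approach used by fast single-pass compilers like TCC.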
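
    To make strategy (b) a little more concrete, here is a minimal Python sketch of a byte buffer with two helper functions: one emits a short jump with a dummy offset, the other patches that offset once the destination is known. The CodeBuffer class and its method names are hypothetical, and nop bytes stand in for the real block contents.

```python
class CodeBuffer:
    """Hypothetical helper for strategy (b): code is appended straight to a byte buffer."""

    def __init__(self):
        self.buf = bytearray()

    def emit(self, *values):
        self.buf.extend(values)

    def emit_jump_placeholder(self, opcode):
        """Emit a short jump with a dummy offset; return the position of the offset byte."""
        self.buf.append(opcode)
        self.buf.append(0)
        return len(self.buf) - 1

    def patch_jump(self, offset_pos):
        """Point a previously emitted short jump at the current end of the buffer."""
        rel = len(self.buf) - (offset_pos + 1)  # measured from the end of the jump
        if not -128 <= rel <= 127:
            raise ValueError("destination too far for a short jump")
        self.buf[offset_pos] = rel & 0xFF


NOP = 0x90        # x86 nop, used here as stand-in block contents
JZ_SHORT = 0x74   # jz with an 8-bit relative offset
JMP_SHORT = 0xEB  # jmp with an 8-bit relative offset

buf = CodeBuffer()
skip_taken = buf.emit_jump_placeholder(JZ_SHORT)
buf.emit(NOP, NOP, NOP)                    # 'taken' block (3 bytes)
skip_nottaken = buf.emit_jump_placeholder(JMP_SHORT)
buf.patch_jump(skip_taken)                 # the 'not-taken' block starts here
buf.emit(NOP, NOP)                         # 'not-taken' block (2 bytes)
buf.patch_jump(skip_nottaken)              # end of the if/else

print(buf.buf.hex(" "))  # 74 05 90 90 90 eb 02 90 90
```

    The weakness mentioned above shows up in patch_jump: the jump size is fixed before the distance is known, so a block that grows past 127 bytes needs a longer jump encoding and everything after it has to move.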


    Let's again consider the if example by examining the code that an existing compiler generates for a simple if/else branch:

    Note the overlapping jumps in the disassembly - one that skips the 'taken' block and one that skips the 'not-taken' block.

    These are relative jumps, so in order to assemble them we need to know how many bytes of instructions are between the jump instruction and the destination.
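
    As a small worked example of that calculation, the Python sketch below computes the offset byte of an x86 short jump, which is measured from the end of the jump instruction (2 bytes for the short forms). The function name and the addresses are made up for illustration.

```python
def rel8(jump_address, destination_address, jump_size=2):
    """Offset byte for a short jump, measured from the end of the jump instruction."""
    rel = destination_address - (jump_address + jump_size)
    if not -128 <= rel <= 127:
        raise ValueError("destination out of range for a short jump")
    return rel & 0xFF  # two's-complement encoding as an unsigned byte


# Forward jump: a jz at address 0x10 targeting 0x1A skips 8 bytes.
print(hex(rel8(0x10, 0x1A)))  # 0x8

# Backward jump: a jmp at address 0x20 targeting 0x10 (for example, a loop).
print(hex(rel8(0x20, 0x10)))  # 0xee
```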

    Here's an example of what the compilation function might look like using strategy (a):

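    Instruction[] compile_if(IfNode n) {
        Instruction[] code;
    
        // Evaluate the condition; assumed to set the zero flag when it is false.
        code ~= compile_condition(n.condition);
    
        // Jump over the 'taken' block when the condition is false.
        JumpInstruction skip_taken = new JumpInstruction(`jz`);
        code ~= skip_taken;
    
        code ~= compile_block(n.taken_block);
    
        // After the 'taken' block, jump over the 'not-taken' block.
        JumpInstruction skip_nottaken = new JumpInstruction(`jmp`);
        code ~= skip_nottaken;
    
        // Assumes the 'not-taken' block compiles to at least one instruction.
        Instruction[] nottaken_code = compile_block(n.nottaken_block);
        skip_taken.destination = nottaken_code[0];
        code ~= nottaken_code;
    
        // Placeholder marking the end of the whole if/else.
        Instruction end = new NopInstruction();
        skip_nottaken.destination = end;
        code ~= end;
    
        return code;
    }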
    

    This should be pretty self-explanatory.

    Note how instructions refer to one another symbolically (skip_taken.destination = nottaken_code[0]), rather than by byte-offsets like in serialised machinecode. We leave those offset calculations for the assembler.

    Also note how we set the destinations of the JumpInstructions only once they become available.

    The NopInstruction at the end is just to give the skip_nottaken jump something to refer to.

    Now, how do we actually assemble these jumps into real machinecode/bytecode? Here's one possibility (a very basic example):

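    byte[2] assemble_jz(Instruction[] code, int idx) {
        // Assemble the jz instruction at code[idx].
        // Only handles forward jumps whose offset fits in a signed byte.
    
        auto jump = cast(JumpInstruction) code[idx];
        ++idx;
    
        // The offset is measured from the end of the jump instruction, so sum
        // the sizes of everything between the jump and its destination.
        int jump_offset = 0;
        while (code[idx] !is jump.destination) {
            jump_offset += size_of_instruction(code[idx]);
            ++idx;
        }
        assert(jump_offset <= byte.max, "destination too far for a short jump");
    
        byte[2] machinecode = [
            0x74, // jz short
            cast(byte) jump_offset
        ];
        return machinecode;
    }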
    

    Because the assembler has access to all the instruction objects, it can calculate the actual offsets for relative jumps by scanning ahead until it finds the destination instruction.
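
    Scanning ahead only finds forward destinations. One alternative, sketched below in Python, is a two-pass assembler: the first pass assigns every instruction a byte address, and the second emits bytes and resolves each jump from those addresses, which also covers backward jumps. The Instr class and assemble function are hypothetical names, only short jumps are handled, and each instruction object is assumed to appear in the list exactly once.

```python
class Instr:
    """Hypothetical instruction object: fixed bytes, or a short jump to another Instr."""

    def __init__(self, encoding=b"", opcode=None):
        self.encoding = bytes(encoding)  # used for ordinary instructions
        self.opcode = opcode             # set for short jumps (e.g. 0x74, 0xEB)
        self.destination = None          # filled in once the target exists

    def size(self):
        return 2 if self.opcode is not None else len(self.encoding)


def assemble(code):
    # Pass 1: assign every instruction its byte address.
    address = {}
    pos = 0
    for ins in code:
        address[ins] = pos
        pos += ins.size()

    # Pass 2: emit bytes; jump offsets are relative to the end of the jump.
    out = bytearray()
    for ins in code:
        if ins.opcode is None:
            out += ins.encoding
            continue
        rel = address[ins.destination] - (address[ins] + ins.size())
        if not -128 <= rel <= 127:
            raise ValueError("destination too far for a short jump")
        out += bytes([ins.opcode, rel & 0xFF])
    return bytes(out)


# Same layout as compile_if: jz over 'taken', jmp over 'not-taken'.
skip_taken, skip_nottaken = Instr(opcode=0x74), Instr(opcode=0xEB)
taken = [Instr(b"\x90") for _ in range(3)]     # nops stand in for real code
nottaken = [Instr(b"\x90") for _ in range(2)]
end = Instr()                                  # zero-byte label instead of a nop
skip_taken.destination = nottaken[0]
skip_nottaken.destination = end
code = [skip_taken, *taken, skip_nottaken, *nottaken, end]

print(assemble(code).hex(" "))  # 74 05 90 90 90 eb 02 90 90
```

    Using a zero-size label as the final destination is a design choice in this sketch; it avoids emitting a real nop just to give the last jump something to refer to.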


    I hope this brief introduction helps you get started with designing your own compiler backend. Obviously I'm not suggesting that you write your compiler exactly like my example, but it should give you some ideas of how to approach the general problem of compiling and assembling non-linear instruction blocks.

    You might also want to take a look at some existing assembler APIs such as https://github.com/asmjit/asmjit .

    Good luck.
