How to write a (shell) lexer by hand

后端 未结 1 702
醉梦人生
醉梦人生 2021-02-04 18:28

I\'m working on a shell, a small bash-like shell, without scripting (if while ...) I have to make the lexer/parser (LL) by hand.

So the lexer will transform the command

1条回答
  •  -上瘾入骨i
    2021-02-04 19:06

    (This explains the idea hinted by Spudd86).

    You need to implement a finite state machine. There are the following states:

    • General state
    • Inside a file name
    • Inside the && token
    • Inside the || token

    For each state and next input character, you have to decide what is the next state, and whether to output a token. For example:

    • Current state: General; character: x => next state: inside-file-name
    • Current state: inside-file-name; character: space => next state: General; output the token
    • Current state: inside-file-name; character: & => next state: inside-&&; output the token
    • Current state: inside-&&; character: & => next state: General; output the token
    • Current state: inside-&&; character: x => next state: General; syntax error
    • ... (ad nauseum)

    It's much boring work to work out all the rules (the fun starts when you must debug the resulting code), so most people use code generators to do that.


    Edit: some code (sorry if the syntax is messed-up; i usually program in C++)

    enum state {
        STATE_GENERAL,
        STATE_IN_FILENAME,
        ...
    };
    
    // Many characters are treated the same (e.g. 'x' and 'y') - so use categories
    enum character_category
    {
        CHAR_GENERAL, // can appear in filenames
        CHAR_WHITESPACE = ' ',
        CHAR_AMPERSAND = '&',
        CHAR_PIPE = '|',
        CHAR_EOF = EOF,
        ...
    };
    
    character_category translate(int c)
    {
        switch (c) {
        case '&': return CHAR_AMPERSAND;
        case ' ': case '\t': case '\n': return CHAR_WHITESPACE;
        ...
        default: return CHAR_GENERAL;
        }
    }
    
    void do_stuff()
    {
        character_category cat;
        state current_state = STATE_GENERAL;
        state next_state;
        char token[100];
        char token_length = 0;
        do {
            int c = getchar();
            cat = translate(c);
            // The following implements a switch on 2 variables
            int selector = 1000 * current_state + cat;
    
            switch (selector)
            {
            case 1000 * STATE_GENERAL + CHAR_GENERAL:
                next_state = STATE_IN_FILENAME;
                token[token_length++] = c; // append a character to a filename token
                break;
    
            case 1000 * STATE_GENERAL + CHAR_WHITESPACE:
                next_state = STATE_GENERAL; // do nothing
                break;
    
            case 1000 * STATE_GENERAL + CHAR_PIPE:
                next_state = STATE_IN_OR_TOKEN; // the first char in '||' or just '|'
                break;
    
            // Much repetitive code already; define a macro for the case constants?
            // Have to cover all states and all character categories; good luck...
    
            case 1000 * STATE_IN_FILENAME + EOF:
            case 1000 * STATE_IN_FILENAME + CHAR_WHITESPACE:
                next_state = STATE_GENERAL;
                printf("Filename token: %s\n", token);
                break;
    
            default:
                printf("Bug\n"); // forgot one of the cases?
            }
    
            current_state = next_state;
    
        } while (cat != CHAR_EOF);
    }
    

    0 讨论(0)
提交回复
热议问题