Output of Lexer

隐瞒了意图╮ 2021-01-15 11:30

I am currently writing a compiler and I'm in the lexer phase.

I know that the lexer tokenizes the input stream.

However, consider the following stream:

3 Answers
  • 2021-01-15 11:54

    In general, your lexer should produce a stream of structs that represent language elements: operators, identifiers, keywords, comments, etc. Each struct should be tagged with the type of the lexeme and carry whatever content is relevant to that type.

    To enable good error reporting, each lexeme should also carry its starting line and column, its ending line and column (some lexemes span multiple lines), and the originating source file (sometimes a parser has to handle included files as well as the main file).

    For those language elements that contain variable content (numbers, identifiers, etc.), the struct should contain the variable content.
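
    As a rough illustration of such a struct, here is a minimal Python sketch. The names `TokenKind`, `Token`, and every field name are assumptions made for this example; they are not taken from DMS or any other particular tool. The positions used below are the ones shown for `foo` and `0` in the debug trace later in this answer.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional, Union


class TokenKind(Enum):
    """Hypothetical set of lexeme types; a real language has many more."""
    KEYWORD = auto()
    IDENTIFIER = auto()
    INT_LITERAL = auto()
    OPERATOR = auto()
    COMMENT = auto()
    END_OF_FILE = auto()


@dataclass(frozen=True)
class Token:
    kind: TokenKind                    # type of the lexeme
    text: str                          # exact source text of the lexeme
    value: Optional[Union[int, str]]   # variable content, None for fixed tokens
    file: str                          # originating source file
    line: int                          # starting line (1-based)
    col: int                           # starting column (1-based)
    end_line: int                      # ending line
    end_col: int                       # ending column (exclusive)

    def describe(self) -> str:
        """Format the token the way an error message might refer to it."""
        return f"{self.file}:{self.line}:{self.col}: {self.kind.name} {self.text!r}"


foo = Token(TokenKind.IDENTIFIER, "foo", "foo", "test.c", 3, 5, 3, 8)
zero = Token(TokenKind.INT_LITERAL, "0", 0, "test.c", 4, 7, 4, 8)
print(foo.describe())  # test.c:3:5: IDENTIFIER 'foo'
```

    Keeping both `text` and `value` is a design choice: `text` preserves the spelling for diagnostics, while `value` holds the decoded content (here the integer `0` rather than the string `"0"`).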

    For compiling or program analysis, the lexer can throw whitespace and comments away. If you intend to parse/modify the code, you'll need to capture comments.
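
    The choice between discarding and keeping whitespace and comments can be a single switch in the lexer. The following is only a sketch under stated assumptions: the token table covers just enough of C to handle the small example below, and `TOKEN_SPEC`, `lex`, and `keep_trivia` are names invented for this illustration.

```python
import re

# Hypothetical token table; order matters because the first alternative wins.
TOKEN_SPEC = [
    ("COMMENT",     r"/\*.*?\*/|//[^\n]*"),
    ("WHITESPACE",  r"\s+"),
    ("INT_LITERAL", r"\d+"),
    ("IDENTIFIER",  r"[A-Za-z_]\w*"),
    ("PUNCT",       r"[=;]"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC), re.DOTALL)
KEYWORDS = {"int"}                      # illustrative subset
TRIVIA = {"COMMENT", "WHITESPACE"}


def lex(source, keep_trivia=False):
    """Yield (kind, text, line, col) tuples; line and col are 1-based."""
    line, col, pos = 1, 1, 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected {source[pos]!r} at {line}:{col}")
        kind, text = m.lastgroup, m.group()
        if kind == "IDENTIFIER" and text in KEYWORDS:
            kind = "KEYWORD"
        if keep_trivia or kind not in TRIVIA:
            yield kind, text, line, col
        # Advance the line/column counters past the matched text.
        newlines = text.count("\n")
        if newlines:
            line += newlines
            col = len(text) - text.rfind("\n")
        else:
            col += len(text)
        pos = m.end()


SOURCE = "/* My test file */\n\nint foo\n    = 0; // a declaration\n"
for token in lex(SOURCE):
    print(token)
```

    With `keep_trivia=False` only the five meaningful tokens come out; with `keep_trivia=True` the two comments and the whitespace runs are reported as well, each with its own position.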

    An example output can be instructive. For a variant of OP's example:

    /* My test file */
    
    int foo
        = 0; // a declaration
    

    ... DMS's C front end produces the following lexemes (this is debug output, which is really handy to have when designing a complex lexer):

    C:\DMS\Domains\C\GCC4\Tools\Lexer\Source>run ../domainlexer C:\temp\test.c
    Lexer Stream Display 1.5.1
    Using encoding Unicode-UTF-8?ANSI +CRLF +1 /^I
    !! Lexer:ResetLexicalModeStack
    !! after Lexer:PushLexicalMode:
    Lexical Mode Stack:
    1 C
    File "C:/temp/test.c", line 1: /* My test file */
    File "C:/temp/test.c", line 2:
    File "C:/temp/test.c", line 3: int foo
    !! Lexer:GotoLexicalMode 2 CMain
    !! Lexeme @ Line 3 Col 1 ELine 3 ECol 4 Token 23: 'int' [VOID]=0000
      <<< PreComments:
    Comment 1 Type 1 Line 1 Column 1 `/* My test file */'
    !! Lexeme @ Line 3 Col 4 ELine 3 ECol 5 Token 2: whitespace [VOID]=0000
    !! Lexeme @ Line 3 Col 5 ELine 3 ECol 8 Token 210: IDENTIFIER [STRING]=`foo'
    File "C:/temp/test.c", line 4:     = 0; // a declaration
    !! Lexer:GotoLexicalMode 1 C
    !! Lexeme @ Line 3 Col 8 ELine 4 ECol 5 Token 2: whitespace [VOID]=0000
    !! Lexer:GotoLexicalMode 2 CMain
    !! Lexeme @ Line 4 Col 5 ELine 4 ECol 6 Token 113: '=' [VOID]=0000
    !! Lexeme @ Line 4 Col 6 ELine 4 ECol 7 Token 2: whitespace [VOID]=0000
    !! Lexeme @ Line 4 Col 7 ELine 4 ECol 8 Token 138: INT_LITERAL [NATURAL]=0
    File "C:/temp/test.c", line 5:
    !! Lexeme @ Line 4 Col 8 ELine 4 ECol 9 Token 98: ';' [VOID]=0000
      >>> PostComments:
    Comment 1 Type 2 Line 4 Column 10 `// a declaration'
    File "C:/temp/test.c", line 5:
    File "C:/temp/test.c", line 6:
    File "C:/temp/test.c", line 7:
    !! Lexer:GotoLexicalMode 1 C
    !! Lexeme @ Line 4 Col 26 ELine 7 ECol 1 Token 2: whitespace [VOID]=0000
    !! Lexeme @ Line 7 Col 1 ELine 7 ECol 1 Token 4: end_of_input_stream [VOID]=0000
    !! Lexer:GotoLexicalMode 2 CMain
    !! Lexeme @ Line 7 Col 1 ELine 7 ECol 1 Token 0: EndOfFile
    11 lexemes processed.
    0 lexical errors detected.
    
    C:\DMS\Domains\C\GCC4\Tools\Lexer\Source>
    

    The main output is the lines marked !!, each of which shows the contents of one lexeme struct produced by the lexer. Each lexeme carries:

    • source file location information (the file name is not printed for the main file, "test.c" in this case, to keep the debug output a bit more readable)
    • a "token number" (lexeme type) and the human-readable token name (which makes debugging a lot easier)
    • the type of value carried by the token: [VOID] means "none", [STRING] means the token carries a string value, [NATURAL] means it carries an integral value, etc.
    • precomments: comments preceding the token. This is unusual for classic lexers, but necessary if one is trying to transform source code. You can't lose the comments! Note that the precomment is attached to a token; because comments are not semantically meaningful, one can argue about where they should be placed. This is our particular choice.
    • postcomments: comments that follow the token and belong to it.

    The last "token", EndOfFile, is implicitly defined in every DMS lexer.
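
    One possible way to attach comments to tokens is sketched below. The attachment rule is an assumption made for this example, not a description of how DMS decides: a comment on the same line as the previous token becomes a post-comment of that token, and any other comment becomes a pre-comment of the next token. The names `Token` and `attach_comments` are likewise hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Token:
    kind: str
    text: str
    line: int
    pre_comments: List[str] = field(default_factory=list)
    post_comments: List[str] = field(default_factory=list)


def attach_comments(raw: List[Tuple[str, str, int]]) -> List[Token]:
    """Fold COMMENT entries of (kind, text, line) into neighbouring tokens."""
    tokens: List[Token] = []
    pending: List[str] = []          # comments waiting for the next token
    for kind, text, line in raw:
        if kind == "COMMENT":
            if tokens and tokens[-1].line == line:
                tokens[-1].post_comments.append(text)   # trails a token
            else:
                pending.append(text)                    # precedes a token
        else:
            tokens.append(Token(kind, text, line, pre_comments=pending))
            pending = []
    # Comments after the last token have no successor; keep them as
    # post-comments so they are not lost. (A file containing only comments
    # would need an end-of-file token to carry them.)
    if pending and tokens:
        tokens[-1].post_comments.extend(pending)
    return tokens


raw = [
    ("COMMENT", "/* My test file */", 1),
    ("KEYWORD", "int", 3),
    ("IDENTIFIER", "foo", 3),
    ("PUNCT", "=", 4),
    ("INT_LITERAL", "0", 4),
    ("PUNCT", ";", 4),
    ("COMMENT", "// a declaration", 4),
]
tokens = attach_comments(raw)
print(tokens[0].pre_comments, tokens[-1].post_comments)
```

    For this input the rule gives the same attachment as the trace above: the block comment lands on `int` as a pre-comment and the line comment lands on `;` as a post-comment.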

    This debug trace also notes transitions of the lexer across lexical modes (many lexer generators have multiple modes in which they lex various parts of a language). It shows source lines as they are read.

  • 2021-01-15 12:00

    There is no real gain in having "letter" as an intermediate step. Instead, "foo" should probably be a single identifier token. Otherwise you could just as well treat int as "letter letter letter", which doesn't make much sense.
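
    A short sketch of that idea: match the whole word in one step, then decide whether it is a keyword or an identifier. The function name `classify_word` and the keyword set are assumptions for illustration only.

```python
import re

IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
KEYWORDS = {"int", "return", "if", "else"}   # illustrative subset


def classify_word(source: str, pos: int = 0):
    """Return (kind, text, end_pos) for the word at pos, or None if no word starts there."""
    m = IDENTIFIER.match(source, pos)
    if m is None:
        return None
    text = m.group()
    kind = "KEYWORD" if text in KEYWORDS else "IDENTIFIER"
    return kind, text, m.end()


print(classify_word("foo = 0"))   # ('IDENTIFIER', 'foo', 3)
print(classify_word("int foo"))   # ('KEYWORD', 'int', 3)
print(classify_word("integer"))   # ('IDENTIFIER', 'integer', 7)
```

    Because the regular expression consumes the longest possible run of word characters, `integer` comes out as one identifier rather than the keyword `int` followed by `eger`.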

  • 2021-01-15 12:16

    There is no simple answer for the general case.

    Usually it is easier to have the lexer identify "higher level" elements such as identifiers, or even types and variables, if the grammar of the language allows it. The more dynamic the grammar is, and the more the interpretation of a token depends on the internal state of the parser, the easier it may be to leave that interpretation to the parser. Otherwise the communication between lexer and parser can get overly complex. (E.g. consider a language where int is a type in one location, a valid variable name in another, and a language keyword in a third.)

    As a rule of thumb: let the lexer do all the work that keeps the grammar simple without adding complexity between lexer and parser.
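
    To make the context-dependent case concrete, here is a sketch for a made-up language in which type names are not reserved words, so `int` can be either a type or a variable name. The lexer stays context-free and emits a plain WORD token; the parser interprets each word by its position. `TYPE_NAMES`, `lex_words`, and `parse_declaration` are hypothetical names, and the grammar (`<type> <name> ;`) is an assumption chosen to keep the example small.

```python
TYPE_NAMES = {"int", "float"}   # hypothetical: type names are not reserved


def lex_words(source):
    """Context-free lexing: every word is a WORD, ';' and '=' are PUNCT."""
    tokens = []
    for piece in source.replace(";", " ; ").replace("=", " = ").split():
        kind = "PUNCT" if piece in {";", "="} else "WORD"
        tokens.append((kind, piece))
    return tokens


def parse_declaration(tokens):
    """Parse `<type> <name> ;`, deciding what each WORD means by position."""
    if len(tokens) != 3:
        raise SyntaxError("expected: <type> <name> ;")
    (k1, type_word), (k2, name), (k3, semi) = tokens
    if (k1, k2, k3) != ("WORD", "WORD", "PUNCT") or semi != ";":
        raise SyntaxError("expected: <type> <name> ;")
    if type_word not in TYPE_NAMES:
        raise SyntaxError(f"unknown type {type_word!r}")
    return {"type": type_word, "name": name}


print(parse_declaration(lex_words("int int;")))
print(parse_declaration(lex_words("float int;")))
```

    If the lexer had classified `int` as a keyword up front, the parser would need some way to tell it to stop doing so after a type name, which is the kind of lexer/parser coupling the rule of thumb advises against.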
