问题
I'm trying to write parser for c++ style header file and failing to properly configure the parser.
Lexer:
lexer grammar HeaderLexer;
SectionLineComment
: LINE_COMMENT_SIGN Section CharacterSequence
;
Pragma
: POUND 'pragma'
;
Section
: AT_SIGN 'section'
;
Define
: POUND 'define'
| LINE_COMMENT_SIGN POUND 'define'
;
Booleanliteral
: False
| True
;
QuotedCharacterSequence
: '"' .*? '"'
;
ArraySequence
: '{' .*? '}'
| '[' .*? ']'
;
IntNumber
: Digit+
;
DoubleNumber
: Digit+ POINT Digit+
| ZERO POINT Digit+
;
CharacterSequence
: Text+
;
Identifier
: [a-zA-Z_0-9]+
;
BlockComment
: '/**' .*? '*/'
;
LineComment
: LINE_COMMENT_SIGN ~[\r\n]*
;
EmptyLineComment
: LINE_COMMENT_SIGN -> skip
;
Newline
: ( '\r' '\n'?
| '\n'
)
-> skip
;
WhiteSpace
: [ \r\n\t]+ -> skip;
fragment POUND : '#';
fragment AT_SIGN : '@';
fragment LINE_COMMENT_SIGN : '//';
fragment POINT : '.';
fragment ZERO : '0';
fragment Digit
: [0-9]
;
fragment Text
: [a-zA-Z0-9.]
;
fragment False
: 'false'
;
fragment True
: 'true'
;
Parser:
parser grammar HeaderParser;
options { tokenVocab=HeaderLexer; }
compilationUnit: statement* EOF;
statement
: comment? pragmaDirective
| comment? defineDirective
| section
| comment
;
pragmaDirective
: Pragma CharacterSequence
;
defineDirective
: Define Identifier Booleanliteral LineComment?
| Define Identifier DoubleNumber LineComment?
| Define Identifier IntNumber LineComment?
| Define Identifier CharacterSequence LineComment?
| Define Identifier QuotedCharacterSequence LineComment?
| Define Identifier ArraySequence LineComment?
| Define Identifier
;
section: SectionLineComment;
comment
: BlockComment
| LineComment+
;
Text to parse:
/**
* BLOCK COMMENT
*/
#pragma once
/**
* BLOCK COMMENT
*/
#define CONFIGURATION_H_VERSION 12345
#define IDENTIFIER abcd
#define IDENTIFIER_1 abcd
#define IDENTIFIER_1 abcd.dd
#define IDENTIFIER_2 true // Line
#define IDENTIFIER_20 {ONE, TWO} // Line
#define IDENTIFIER_20_30 { 1, 2, 3, 4 }
#define IDENTIFIER_20_30_A [ 1, 2, 3, 4 ]
#define DEFAULT_A 10.0
//================================================================
//============================= INFO =============================
//================================================================
/**
* SEPARATE BLOCK COMMENT
*/
//==================================================================
//============================= INFO ===============================
//==================================================================
// Line 1
// Line 2
//
// @section test
// Line 3
#define IDENTIFIER_TWO "(ONE, TWO, THREE)" // Line 4
//#define IDENTIFIER_3 Version.h // Line 5
// Line 6
#define IDENTIFIER_THREE
With this configuration I have few issues:
- The parser could not correctly parse "#define IDENTIFIER abcd" on line 11
- "// @section test" on line 36 is parsed as a line comment, but I need to parse this as a separate token
- Parsing of the commented define directive is not working "//#define IDENTIFIER_3 Version.h // Line 5"
回答1:
Whenever there are problems when parsing, you should check what kind of tokens the lexer is producing.
Here are the tokens that you lexer produces:
BlockComment `/**\n * BLOCK COMMENT\n */`
Pragma `#pragma`
CharacterSequence `once`
BlockComment `/**\n * BLOCK COMMENT\n */`
Define `#define`
Identifier `CONFIGURATION_H_VERSION`
IntNumber `12345`
Define `#define`
CharacterSequence `IDENTIFIER`
CharacterSequence `abcd`
Define `#define`
Identifier `IDENTIFIER_1`
CharacterSequence `abcd`
Define `#define`
Identifier `IDENTIFIER_1`
CharacterSequence `abcd.dd`
Define `#define`
Identifier `IDENTIFIER_2`
Booleanliteral `true`
LineComment `// Line`
Define `#define`
Identifier `IDENTIFIER_20`
ArraySequence `{ONE, TWO}`
LineComment `// Line`
Define `#define`
Identifier `IDENTIFIER_20_30`
ArraySequence `{ 1, 2, 3, 4 }`
Define `#define`
Identifier `IDENTIFIER_20_30_A`
ArraySequence `[ 1, 2, 3, 4 ]`
Define `#define`
Identifier `DEFAULT_A`
DoubleNumber `10.0`
LineComment `//================================================================`
LineComment `//============================= INFO =============================`
LineComment `//================================================================`
BlockComment `/**\n * SEPARATE BLOCK COMMENT\n */`
LineComment `//==================================================================`
LineComment `//============================= INFO ===============================`
LineComment `//==================================================================`
LineComment `// Line 1`
LineComment `// Line 2`
LineComment `//`
LineComment `// @section test`
LineComment `// Line 3`
Define `#define`
Identifier `IDENTIFIER_TWO`
QuotedCharacterSequence `"(ONE, TWO, THREE)"`
LineComment `// Line 4`
LineComment `//#define IDENTIFIER_3 Version.h // Line 5`
LineComment `// Line 6`
Define `#define`
Identifier `IDENTIFIER_THREE`
As you can see in the list above, #define IDENTIFIER abcd
is not being parsed properly because it produces the following tokens:
Define `#define`
CharacterSequence `IDENTIFIER`
CharacterSequence `abcd`
and can therefor not match the parser rule:
defineDirective
: ...
| Define Identifier CharacterSequence LineComment?
| ...
;
As you can see, the lexer operates independently from the parser. No matter if the parser tries to match an Identifier
for the text "IDENTIFIER"
, the lexer will always produce a CharacterSequence
token for this.
The lexer creates tokens based on only 2 rules:
- try to match as much characters as possible
- if 2 (or more) lexer rules can match the same characters, the rule defined first "wins"
Because of the rules mentioned above, //#define IDENTIFIER_3 Version.h // Line 5
is tokenised as a LineComment
(rule 1 applies: match as much as possible). And input like once
is tokenised as a CharacterSequence
and not as a Identifier
(rule 2 applies: CharacterSequence
is defined before Identifier
)
To have #define
be treated the same in and outside a comment, you could use lexical modes. Whenever the lexer sees a //
, it goes into a special comment-mode, and once in this comment mode, you will also recognise #define
and @section
tokens. You leace this mode when seeing one of these tokens (or when you see a line break, of course).
A quick demo of how that could look like:
lexer grammar HeaderLexer;
SPACES : [ \r\n\t]+ -> skip;
COMMENT_START : '//' -> pushMode(COMMENT_MODE);
PRAGMA : '#pragma';
SECTION : '@section';
DEFINE : '#define';
BOOLEAN_LITERAL : 'true' | 'false';
STRING : '"' .*? '"';
IDENTIFIER : [a-zA-Z_] [a-zA-Z_0-9]*;
BLOCK_COMMENT : '/**' .*? '*/';
OTHER : .;
NUMBER : [0-9]+ ('.' [0-9]+)?;
CHAR_SEQUENCE : [a-zA-Z_] [a-zA-Z_0-9.]*;
ARRAY_SEQUENCE : '{' .*? '}' | '[' .*? ']';
mode COMMENT_MODE;
// If we match one of the followinf 3 rules, leave this comment mode
COMMENT_MODE_DEFINE : '#define' -> type(DEFINE), popMode;
COMMENT_MODE_SECTION : '@section' -> type(SECTION), popMode;
COMMENT_MODE_LINE_BREAK : [\r\n]+ -> skip, popMode;
// If none of the 3 rules above matched, consume a single
// character (which is part of the comment)
COMMENT_MODE_PART : ~[\r\n];
and a parser could then look like this:
parser grammar HeaderParser;
options { tokenVocab=HeaderLexer; }
compilationUnit
: statement* EOF
;
statement
: comment? pragmaDirective
| comment? defineDirective
| sectionLineComment
| comment
;
pragmaDirective
: PRAGMA char_sequence
;
defineDirective
: DEFINE IDENTIFIER BOOLEAN_LITERAL line_comment?
| DEFINE IDENTIFIER NUMBER line_comment?
| DEFINE IDENTIFIER char_sequence line_comment?
| DEFINE IDENTIFIER STRING line_comment?
| DEFINE IDENTIFIER ARRAY_SEQUENCE line_comment?
| DEFINE IDENTIFIER
;
sectionLineComment
: COMMENT_START COMMENT_MODE_PART? SECTION char_sequence
;
comment
: BLOCK_COMMENT
| line_comment
;
line_comment
: COMMENT_START COMMENT_MODE_PART*
;
char_sequence
: CHAR_SEQUENCE
| IDENTIFIER
;
来源:https://stackoverflow.com/questions/62637114/antlr4-parser-issues