Question
I'm having a problem with an island grammar and a non-greedy rule used to consume "everything except what I want".
Desired outcome:
My input file is a C header file, containing function declarations along with typedefs, structs, comments, and preprocessor definitions. My desired output is parsing and subsequent transformation of function declarations ONLY. I would like to ignore everything else.
Setup and what I've tried:
The header file I'm attempting to lex and parse is very uniform and consistent.
Every function declaration is preceded by a linkage macro PK_linkage_m, and all functions return the same type PK_ERROR_code_t, e.g.:
PK_linkage_m PK_ERROR_code_t PK_function(...);
These tokens don't appear anywhere other than at the start of a function declaration.
I have approached this as an island grammar, that is, function declarations in a sea of text.
I have tried to use the linkage token PK_linkage_m to mark the end of the "TEXT" and the PK_ERROR_code_t token as the start of the function declaration.
Observed problem:
While lexing and parsing a single function declaration works, it fails when I have more than one function declaration in a file. The token stream shows that everything before the PK_ERROR_code_t of the last function declaration (including all earlier function declarations) is consumed as text, and then only the last function declaration in the file is correctly parsed.
My one-line summary is: my non-greedy grammar rule to consume everything before the PK_ERROR_code_t is consuming too much.
What I perhaps incorrectly believe is the solution:
Somehow fix my lexer's non-greedy rule so that it consumes everything only until it finds the PK_linkage_m token. My non-greedy rule appears to consume too much.
What I haven't tried:
As this is my first ANTLR project, and my first language-parsing project in a very long time, I'd be more than happy to rewrite it if I'm wrong and getting wronger. I was considering using line terminators to skip everything that doesn't start with a newline, but I'm not sure how to make that work, and not sure how it's fundamentally different.
Here is my lexer file KernelLexer.g4:
lexer grammar KernelLexer;
// lexer should ignore everything except function declarations
// parser should never see tokens that are irrelevant
@lexer::members {
public static final int WHITESPACE = 1;
}
PK_ERROR: 'PK_ERROR_code_t' -> mode(FUNCTION);
PK_LINK: 'PK_linkage_m';
//Doesn't work. Once it starts consuming, it doesn't stop.
TEXT_SEA: .*? PK_LINK -> skip;
TEXT_WS: ( ' ' | '\r' | '\n' | '\t' ) -> skip;
mode FUNCTION;
//These constants must go above ID rule because we want these to match first.
CONST: 'const';
OPEN_BLOCK: '(';
CLOSE_BLOCK: ');' -> mode(DEFAULT_MODE);
COMMA: ',';
STAR: '*';
COMMENTED_NAME: '/*' ID '*/';
COMMENT_RECEIVED: '/* received */' -> skip;
COMMENT_RETURNED: '/* returned */' -> skip;
COMMENT: '/*' .*? '*/' -> skip;
ID : ID_LETTER (ID_LETTER | DIGIT)*;
fragment ID_LETTER: 'a'..'z' | 'A'..'Z' | '_';
fragment DIGIT: '0'..'9';
WS: ( ' ' | '\r' | '\n' | '\t' ) -> skip;//channel(1);
Here is my parser file KernelParser.g4:
parser grammar KernelParser;
options { tokenVocab=KernelLexer; }
file : func_decl+;
func_decl : PK_ERROR ID OPEN_BLOCK param_block CLOSE_BLOCK;
param_block: param_decl*;
param_decl: type_decl COMMENTED_NAME COMMA?;
type_decl: CONST? STAR* ID STAR* CONST?;
Here is a simple example input file:
/*some stuff*/
other stuff;
PK_linkage_m PK_ERROR_code_t PK_CLASS_ask_superclass
(
/* received */
PK_CLASS_t /*class*/, /* a class */
/* returned */
PK_CLASS_t *const /*superclass*/ /* immediate superclass of class */
);
/*some stuff*/
blar blar;
PK_linkage_m PK_ERROR_code_t PK_CLASS_is_subclass
(
/* received */
PK_CLASS_t /*may_be_subclass*/, /* a potential subclass */
PK_CLASS_t /*class*/, /* a class */
/* returned */
PK_LOGICAL_t *const /*is_subclass*/ /* whether it was a subclass */
);
more stuff;
Here is the token output:
line 28:0 token recognition error at: 'more stuff;\r\n'
[@0,312:326='PK_ERROR_code_t',<'PK_ERROR_code_t'>,18:13]
[@1,328:347='PK_CLASS_is_subclass',<ID>,18:29]
[@2,350:350='(',<'('>,19:0]
[@3,369:378='PK_CLASS_t',<ID>,21:0]
[@4,390:408='/*may_be_subclass*/',<COMMENTED_NAME>,21:21]
[@5,409:409=',',<','>,21:40]
[@6,439:448='PK_CLASS_t',<ID>,22:0]
[@7,460:468='/*class*/',<COMMENTED_NAME>,22:21]
[@8,469:469=',',<','>,22:30]
[@9,512:523='PK_LOGICAL_t',<ID>,24:0]
[@10,525:525='*',<'*'>,24:13]
[@11,526:530='const',<'const'>,24:14]
[@12,533:547='/*is_subclass*/',<COMMENTED_NAME>,24:21]
[@13,587:588=');',<');'>,25:0]
[@14,608:607='<EOF>',<EOF>,29:0]
Answer 1:
It's always difficult to cope with lexer rules "reading everything but ...", but you are on the right path.
After commenting out the skip action (TEXT_SEA: .*? PK_LINK ; //-> skip;), I observed that the first function was consumed by a second TEXT_SEA token (because the lexer always keeps the longest match, TEXT_SEA gives PK_ERROR no chance to be seen):
$ grun Kernel file -tokens input.txt
line 27:0 token recognition error at: 'more stuff;'
[@0,0:41='/*some stuff*/\n\nother stuff;\n\nPK_linkage_m',<TEXT_SEA>,1:0]
[@1,42:292=' PK_ERROR_code_t PK_CLASS_ask_superclass\n(\n/* received */\nPK_CLASS_t
/*class*/, /* a class */\n/* returned */\nPK_CLASS_t *const /*superclass*/
/* immediate superclass of class */\n);\n\n/*some stuff*/\nblar blar;\n\n\n
PK_linkage_m',<TEXT_SEA>,5:12]
[@2,294:308='PK_ERROR_code_t',<'PK_ERROR_code_t'>,17:13]
[@3,310:329='PK_CLASS_is_subclass',<ID>,17:29]
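In other words, an ANTLR lexer keeps the longest match among all rules that can match at the current position; even though TEXT_SEA is non-greedy, its match (running to the next PK_linkage_m) is far longer than PK_ERROR's fifteen characters. A rough Python analogy of that selection (plain re, not ANTLR itself, on a made-up one-line input):

```python
import re

# Rough Python analogy (plain `re`, not ANTLR) of the longest-match rule:
# two candidate lexer rules are tried at the same position, and the longer
# match wins regardless of which rule is listed first in the grammar.
text = "PK_ERROR_code_t f1(); junk; PK_linkage_m PK_ERROR_code_t f2();"

pk_error = re.match(r"PK_ERROR_code_t", text)
text_sea = re.match(r".*?PK_linkage_m", text, re.S)  # non-greedy, still long

print(len(pk_error.group()))  # 15
print(len(text_sea.group()))  # 40 -- TEXT_SEA wins and swallows f1 entirely
```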
This gave me the idea to use TEXT_SEA both as the "sea consumer" and as the starter of the function mode.
lexer grammar KernelLexer;
// lexer should ignore everything except function declarations
// parser should never see tokens that are irrelevant
@lexer::members {
public static final int WHITESPACE = 1;
}
PK_LINK: 'PK_linkage_m' ;
TEXT_SEA: .*? PK_LINK -> mode(FUNCTION);
LINE : .*? ( [\r\n] | EOF ) ;
mode FUNCTION;
//These constants must go above ID rule because we want these to match first.
CONST: 'const';
OPEN_BLOCK: '(';
CLOSE_BLOCK: ');' -> mode(DEFAULT_MODE);
COMMA: ',';
STAR: '*';
PK_ERROR : 'PK_ERROR_code_t' ;
COMMENTED_NAME: '/*' ID '*/';
COMMENT_RECEIVED: '/* received */' -> skip;
COMMENT_RETURNED: '/* returned */' -> skip;
COMMENT: '/*' .*? '*/' -> skip;
ID : ID_LETTER (ID_LETTER | DIGIT)*;
fragment ID_LETTER: 'a'..'z' | 'A'..'Z' | '_';
fragment DIGIT: '0'..'9';
WS: [ \t\r\n]+ -> channel(HIDDEN) ;
And the matching parser grammar:
parser grammar KernelParser;
options { tokenVocab=KernelLexer; }
file : ( TEXT_SEA | func_decl | LINE )+;
func_decl
: PK_ERROR ID OPEN_BLOCK param_block CLOSE_BLOCK
{System.out.println("---> Found declaration on line " + $start.getLine() + " `" + $text + "`");}
;
param_block: param_decl*;
param_decl: type_decl COMMENTED_NAME COMMA?;
type_decl: CONST? STAR* ID STAR* CONST?;
Execution :
$ grun Kernel file -tokens input.txt
[@0,0:41='/*some stuff*/\n\nother stuff;\n\nPK_linkage_m',<TEXT_SEA>,1:0]
[@1,42:42=' ',<WS>,channel=1,5:12]
[@2,43:57='PK_ERROR_code_t',<'PK_ERROR_code_t'>,5:13]
[@3,58:58=' ',<WS>,channel=1,5:28]
[@4,59:81='PK_CLASS_ask_superclass',<ID>,5:29]
[@5,82:82='\n',<WS>,channel=1,5:52]
[@6,83:83='(',<'('>,6:0]
...
[@24,249:250=');',<');'>,11:0]
[@25,251:292='\n\n/*some stuff*/\nblar blar;\n\n\nPK_linkage_m',<TEXT_SEA>,11:2]
[@26,293:293=' ',<WS>,channel=1,17:12]
[@27,294:308='PK_ERROR_code_t',<'PK_ERROR_code_t'>,17:13]
[@28,309:309=' ',<WS>,channel=1,17:28]
[@29,310:329='PK_CLASS_is_subclass',<ID>,17:29]
[@30,330:330='\n',<WS>,channel=1,17:49]
[@31,331:331='(',<'('>,18:0]
...
[@55,562:563=');',<');'>,24:0]
[@56,564:564='\n',<LINE>,24:2]
[@57,565:565='\n',<LINE>,25:0]
[@58,566:566='\n',<LINE>,26:0]
[@59,567:577='more stuff;',<LINE>,27:0]
[@60,578:577='<EOF>',<EOF>,27:11]
---> Found declaration on line 5 `PK_ERROR_code_t PK_CLASS_ask_superclass
(
PK_CLASS_t /*class*/,
PK_CLASS_t *const /*superclass*/
);`
---> Found declaration on line 17 `PK_ERROR_code_t PK_CLASS_is_subclass
(
PK_CLASS_t /*may_be_subclass*/,
PK_CLASS_t /*class*/,
PK_LOGICAL_t *const /*is_subclass*/
);`
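The mode flow above (TEXT_SEA switches into FUNCTION, and CLOSE_BLOCK switches back to the default mode) can be sketched as a miniature hand-rolled scanner. This is only a Python analogy with simplified, made-up token kinds (FUNC_DECL lumps a whole declaration into one token), not the generated ANTLR lexer:

```python
import re

# Miniature Python analogy (not the generated ANTLR lexer) of the two-mode
# scheme: a "sea" rule that ends at PK_linkage_m and switches modes, a LINE
# fallback, and a FUNCTION mode that scans up to ');' and switches back.
def tokenize(src):
    tokens, pos, mode = [], 0, "DEFAULT"
    while pos < len(src):
        if mode == "DEFAULT":
            m = re.compile(r".*?PK_linkage_m", re.S).match(src, pos)
            if m:  # TEXT_SEA: eat the sea, then enter FUNCTION mode
                tokens.append(("TEXT_SEA", m.group()))
                pos, mode = m.end(), "FUNCTION"
            else:  # LINE: no more islands, consume line by line
                m = re.compile(r".*?(?:\n|$)").match(src, pos)
                tokens.append(("LINE", m.group()))
                pos = m.end()
        else:  # FUNCTION mode: grab everything up to ');', then switch back
            m = re.compile(r".*?\);", re.S).match(src, pos)
            if not m:  # unterminated declaration: take the rest and stop
                tokens.append(("FUNC_DECL", src[pos:]))
                break
            tokens.append(("FUNC_DECL", m.group().strip()))
            pos, mode = m.end(), "DEFAULT"
    return tokens

toks = tokenize("junk;\nPK_linkage_m PK_ERROR_code_t f(\nint /*a*/\n);\nmore junk;\n")
print([t[0] for t in toks])  # ['TEXT_SEA', 'FUNC_DECL', 'LINE', 'LINE']
```

The key point is the mode variable: the sea rules and the function rules never compete for the same input, because only one set is active at a time.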
Answer 2:
Instead of including .*? at the start of a rule (which I'd always try to avoid), why don't you try to match either:
- a PK_ERROR in the default mode (and switch to another mode like you're now doing),
- or else match a single character and skip it?
Something like this:
lexer grammar KernelLexer;
PK_ERROR : 'PK_ERROR_code_t' -> mode(FUNCTION);
OTHER : . -> skip;
mode FUNCTION;
// the rest of your rules as you have them now
Note that this will also match PK_ERROR_code_t for the input "PK_ERROR_code_t_MU ...", so this would be a safer way:
lexer grammar KernelLexer;
PK_ERROR : 'PK_ERROR_code_t' -> mode(FUNCTION);
OTHER : ( [a-zA-Z_] [a-zA-Z_0-9]* | . ) -> skip;
mode FUNCTION;
// the rest of your rules as you have them now
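To see why the identifier-shaped alternative in OTHER is the safer choice: an ANTLR lexer keeps the longest match and uses rule order only to break ties, so an OTHER rule that can consume a whole identifier outgrows the PK_ERROR keyword inside longer names. A small Python sketch of that selection logic (plain re, not ANTLR; next_token is a made-up helper):

```python
import re

# Python sketch (not ANTLR) of longest-match selection between the PK_ERROR
# keyword rule and the identifier-shaped OTHER rule. The longest match wins;
# max() returns the first rule on a tie, mimicking ANTLR's rule-order tie-break.
def next_token(src):
    rules = [("PK_ERROR", r"PK_ERROR_code_t"),
             ("OTHER", r"[a-zA-Z_][a-zA-Z_0-9]*|.")]
    best = max(((name, re.match(pat, src)) for name, pat in rules),
               key=lambda nm: len(nm[1].group()) if nm[1] else -1)
    return best[0], best[1].group()

print(next_token("PK_ERROR_code_t foo"))     # ('PK_ERROR', 'PK_ERROR_code_t')
print(next_token("PK_ERROR_code_t_MU foo"))  # ('OTHER', 'PK_ERROR_code_t_MU')
```

On "PK_ERROR_code_t foo" both rules match exactly 15 characters, so the tie goes to PK_ERROR; on "PK_ERROR_code_t_MU foo" the identifier alternative matches 18 characters and wins, keeping the keyword from firing inside a longer name.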
Your parser grammar could then look like this:
parser grammar KernelParser;
options { tokenVocab=KernelLexer; }
file : func_decl+ EOF;
func_decl : PK_ERROR ID OPEN_BLOCK param_block CLOSE_BLOCK;
param_block : param_decl*;
param_decl : type_decl COMMENTED_NAME COMMA?;
type_decl : CONST? STAR* ID STAR* CONST?;
causing your example input to be parsed as a file with two func_decl children.
Source: https://stackoverflow.com/questions/65659361/antlr-island-grammars-and-a-non-greedy-rule-that-consumes-too-much