ANTLR: How to use lexer rules that start with the same prefix?

Question

I am trying to use two similar lexer rules that start with the same prefix:

TIMECONSTANT: ('0'..'9')+ ':' ('0'..'9')+;
INTEGER     : ('0'..'9')+;
COLON       : ':';

Here is my sample grammar:

grammar TestTime;

text      : (timeexpr | caseblock)*;

timeexpr  : TIME;
caseblock : INT COLON ID;

TIME      : ('0'..'9')+ ':' ('0'..'9')+;
INT       : ('0'..'9')+;
COLON     : ':';
ID        : ('a'..'z')+;

WS        : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};

When I try to parse this text:

12:44
123 : abc
123: abc

The first two lines are parsed correctly, but the third produces an error. For some reason, ANTLR tries to tokenize '123:' as TIME, which it is not.

So, is it possible to write a grammar with such lexemes?

My language needs these rules because it has both case blocks and date/time constants. For example, it is possible to write:

case MyInt of
  1: a := 01.01.2012;
  2: b := 12:44;
  3: ....
end;

Answer 1:

As soon as DIGIT+ ':' is matched, the lexer expects another DIGIT to follow so that it can complete a TIMECONSTANT. If none follows, there is no other lexer rule matching DIGIT+ ':' to fall back on, and the lexer will not give up the already-matched ':' in order to match an INTEGER.

A possible solution would be to optionally match ':' DIGIT+ at the end of the INTEGER rule and change the type of the token if this gets matched:

grammar T;  

parse
 : (t=. {System.out.printf("\%-15s '\%s'\n", tokenNames[$t.type], $t.text);})* EOF
 ;

INTEGER      : DIGIT+ ((':' DIGIT)=> ':' DIGIT+ {$type=TIMECONSTANT;})?;
COLON        : ':';
SPACE        : ' ' {skip();};

fragment DIGIT : '0'..'9';
fragment TIMECONSTANT : ;

When parsing the input:

11: 12:13 : 14

the following will be printed:

INTEGER         '11'
COLON           ':'
TIMECONSTANT    '12:13'
COLON           ':'
INTEGER         '14'
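
To make the lookahead decision concrete, here is a minimal hand-written Java sketch of the same idea. It is not the code ANTLR generates; the class name TimeLexerSketch and the "TYPE 'text'" string format are assumptions made for illustration. After reading the digits, it peeks two characters ahead and consumes the ':' only when a digit follows, which is roughly what the (':' DIGIT)=> predicate asks the lexer to do.

```java
import java.util.ArrayList;
import java.util.List;

// Hand-written illustration only; this is NOT code generated by ANTLR.
// The class name and the "TYPE 'text'" output format are hypothetical.
final class TimeLexerSketch {

    // Tokenizes digits, ':' and spaces; returns strings of the form "TYPE 'text'".
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == ' ') {
                i++; // skip spaces, like the SPACE rule
            } else if (c == ':') {
                tokens.add("COLON ':'");
                i++;
            } else if (isDigit(c)) {
                int start = i;
                while (i < input.length() && isDigit(input.charAt(i))) {
                    i++;
                }
                String type = "INTEGER";
                // Two-character lookahead, mirroring the (':' DIGIT)=> predicate:
                // commit to the colon only when a digit follows it.
                if (i + 1 < input.length()
                        && input.charAt(i) == ':'
                        && isDigit(input.charAt(i + 1))) {
                    i++; // consume ':'
                    while (i < input.length() && isDigit(input.charAt(i))) {
                        i++;
                    }
                    type = "TIMECONSTANT";
                }
                tokens.add(type + " '" + input.substring(start, i) + "'");
            } else {
                throw new IllegalArgumentException(
                        "Unexpected character '" + c + "' at index " + i);
            }
        }
        return tokens;
    }

    private static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    public static void main(String[] args) {
        // Same input as in the answer above.
        for (String token : tokenize("11: 12:13 : 14")) {
            System.out.println(token);
        }
    }
}
```

Because the colon is consumed only after the digit behind it has been seen, an input such as 123: falls through to INTEGER followed by COLON instead of failing halfway through a TIMECONSTANT.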

EDIT

"Not too nice, but works..."

True. However, this is not an ANTLR shortcoming: most lexer generators I know of will have trouble properly tokenizing such a TIMECONSTANT (when INTEGER and COLON are also present). ANTLR at least provides a way to handle it in the lexer :)

You could also let this be handled by the parser instead of the lexer:

time_const : INTEGER COLON INTEGER;
INTEGER    : '0'..'9'+;
COLON      : ':';
SPACE      : ' ' {skip();};

However, if your language's lexer ignores whitespace, then input like:

12 :    34

would also be matched by the time_const rule, of course.
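
If the spaced-out form should be rejected while still building the time constant in the parser, one option is to check that the three tokens are adjacent in the input. The Java sketch below shows only that check. The Tok class and its start/stop offsets are hypothetical stand-ins, not ANTLR's Token interface, and how the check gets wired into a grammar (for example through a predicate or an action) is left open.

```java
// Hypothetical token type for illustration; NOT ANTLR's Token interface.
final class Tok {
    final String type;
    final int start; // offset of the first character, inclusive
    final int stop;  // offset of the last character, inclusive

    Tok(String type, int start, int stop) {
        this.type = type;
        this.start = start;
        this.stop = stop;
    }
}

final class TimeConstCheck {

    // True only for INTEGER COLON INTEGER with no characters between the tokens.
    static boolean isTimeConst(Tok a, Tok colon, Tok b) {
        return a.type.equals("INTEGER")
                && colon.type.equals("COLON")
                && b.type.equals("INTEGER")
                && a.stop + 1 == colon.start
                && colon.stop + 1 == b.start;
    }

    public static void main(String[] args) {
        // Offsets for the input "12:34"
        boolean tight = isTimeConst(
                new Tok("INTEGER", 0, 1), new Tok("COLON", 2, 2), new Tok("INTEGER", 3, 4));
        // Offsets for the input "12 :    34"
        boolean spaced = isTimeConst(
                new Tok("INTEGER", 0, 1), new Tok("COLON", 3, 3), new Tok("INTEGER", 8, 9));
        System.out.println(tight);  // true
        System.out.println(spaced); // false
    }
}
```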

Answer 2:

ANTLR lexers can't backtrack, which means that once the lexer reaches the ':' in the TIMECONSTANT rule, it must complete the rule or an exception will be thrown. You can get your grammar working by using a predicate to test for the presence of a digit following the colon.

TIMECONSTANT: ('0'..'9')+ (':' '0'..'9')=> ':' ('0'..'9')+;
INTEGER     : ('0'..'9')+;
COLON       : ':';

This will force ANTLR to look beyond the colon before it decides that it is in a TIMECONSTANT rule.

Source: https://stackoverflow.com/questions/10029137/antlr-how-to-use-lexer-rules-having-same-starting

Tags: antlr, grammar, lexer