ANTLR lexer rule consumes too much

Submitted by 喜欢而已 on 2019-12-24 08:27:46

Question


ANTLR Lexer Rule Design

I have a requirement for the following token:

  • Allowable characters include uppercase, lowercase, numeric, space, and hyphen characters
  • Unfixed length (must be at least two characters in length)
  • Token must contain at least one space or hyphen
  • Token must start and end with an uppercase, lowercase, numeric, or hyphen character (it cannot begin or end with a space)
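The four requirements above can be restated as a small standalone check. This is only a sketch in Python of the prose rules, not of the ANTLR rule further down (which is somewhat stricter); the name `is_valid_token` is made up for illustration.

```python
import re

# Allowed characters only, and at least two of them.
ALLOWED = re.compile(r"[A-Za-z0-9 -]{2,}")


def is_valid_token(text: str) -> bool:
    """Check a candidate token against the four stated requirements."""
    # Requirements 1 and 2: allowed characters only, length >= 2.
    if ALLOWED.fullmatch(text) is None:
        return False
    # Requirement 4: must not begin or end with a space.
    if text[0] == " " or text[-1] == " ":
        return False
    # Requirement 3: at least one space or hyphen somewhere.
    return " " in text or "-" in text
```

For example, `is_valid_token("WATER TRANSPORTATION")` is true, while `"WATER"` (no space or hyphen) and `"WATER "` (trailing space) are rejected.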

The ANTLR lexer rule "AlphaNumericSpaceHyphen" in the grammar below almost works except for one case. Using the parser rule "sic" to test, the following input will parse (without quotes):

"STANDARD INDUSTRIAL CLASSIFICATION: WATER TRANSPORTATION[4400]"

The following input fails to parse (without quotes):

"STANDARD INDUSTRIAL CLASSIFICATION: WATER TRANSPORTATION [4400]"

The issue is that the lexer rule "AlphaNumericSpaceHyphen" consumes the space after "WATER TRANSPORTATION" and runs into the left square bracket before the lexer realizes that there is no match, by which point it has gone too far.
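To make the failure mode concrete, here is a simplified Python model of the behaviour described above. It is not how ANTLR is implemented internally; `scan_committed` and `scan_with_lookahead` are hypothetical helpers that only mimic "consume the space, then require a word" versus "peek past the space before consuming it".

```python
WORD_CHARS = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-"
)


def scan_word(text, pos):
    """Return the index just past a run of word characters starting at pos."""
    while pos < len(text) and text[pos] in WORD_CHARS:
        pos += 1
    return pos


def scan_committed(text, pos=0):
    """Once a space is consumed, a word MUST follow; there is no way back."""
    end = scan_word(text, pos)
    if end == pos:
        raise ValueError(f"no word at position {pos}")
    while end < len(text) and text[end] == " ":
        after_space = end + 1
        next_end = scan_word(text, after_space)
        if next_end == after_space:
            # The space is already consumed, so the whole token fails here.
            raise ValueError(f"unexpected character at position {after_space}")
        end = next_end
    return text[pos:end]


def scan_with_lookahead(text, pos=0):
    """Consume a space only if a word character is known to follow it."""
    end = scan_word(text, pos)
    while (end + 1 < len(text)
           and text[end] == " "
           and text[end + 1] in WORD_CHARS):
        end = scan_word(text, end + 1)
    return text[pos:end]
```

With the input `WATER TRANSPORTATION [4400]`, the committed scanner raises an error at the bracket, while the lookahead scanner stops cleanly after `WATER TRANSPORTATION` and leaves the space for a whitespace rule. The lexer rule would need to express an equivalent check before consuming the space.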

I have experimented with various types of predicates and lookaheads without any luck. Any help is greatly appreciated.

grammar T;

sic: SICSpecifier AlphaNumericSpaceHyphen  LEFTBRACKET Digits RIGHTBRACKET;

LEFTBRACKET  
:   '[';  

RIGHTBRACKET 
:   ']';

SICSpecifier: 'STANDARD INDUSTRIAL CLASSIFICATION:';

WS : (' '|'\t')+ 
{   
  $channel = HIDDEN;  
};  

fragment UCASEALPHA : 'A'..'Z';
fragment LCASEALPHA : 'a'..'z';
fragment DIGIT : '0'..'9';
Digits: DIGIT+;

AlphaNumericSpaceHyphen 
:           (UCASEALPHA|LCASEALPHA |DIGIT|'-')+  (' ' (UCASEALPHA|LCASEALPHA |DIGIT|'-')+)+   
        |   (UCASEALPHA|LCASEALPHA |DIGIT)+ ('-')+  ((' '|UCASEALPHA|LCASEALPHA |DIGIT|'-')* (UCASEALPHA|LCASEALPHA |DIGIT|'-'))?
        |   ('-')+ (UCASEALPHA|LCASEALPHA |DIGIT)+  ((UCASEALPHA|LCASEALPHA |DIGIT|'-'|' ')* (UCASEALPHA|LCASEALPHA |DIGIT|'-'))?   
        ;

Answer 1:


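Unfortunately, there is no backtracking in lexer rules: once a rule has consumed characters, it does not give them back when the rest of the rule fails to match. Take a look at this related question: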

ANTLR lexer rule consumes characters even if not matched?

You can try to adapt your grammar so that it changes the type of the token, as suggested in the solution to that question.

Hope this helps.



Source: https://stackoverflow.com/questions/23601038/antlr-lexer-rule-consumes-too-much
