Question
While working on an ANTLR 3.5 grammar for Java parsing, I noticed that the 'IDENTIFIER' rule consumes a few keywords in the ANTLR lexer grammar. The lexer grammar is:
lexer grammar JavaLexer;
options {
//k=8;
language=Java;
filter=true;
//backtrack=true;
}
@lexer::header {
package java;
}
@lexer::members {
public ArrayList<String> keywordsList = new ArrayList<String>();
}
V_DECLARATION
:
( ((MODIFIERS)=>tok1=MODIFIERS WS+)? tok2=TYPE WS+ var=V_DECLARATOR WS* )
{...};
fragment
V_DECLARATOR
:
(
tok=IDENTIFIER WS* ( ',' | ';' | ASSIGN WS* V_VALUE )
)
{...};
fragment
V_VALUE
: (IDENTIFIER (DOT WS* IDENTIFIER WS* '(' | ',' | ';'))
;
MODIFIERS
:
(PUBLIC | PRIVATE | FINAL)+
;
PRIVATE
: tok = 'private'
{ keywordsList.add($tok.getText()); }
;
PUBLIC
: tok = 'public'
{ keywordsList.add($tok.getText()); }
;
DOT
: '.'
{ keywordsList.add("."); }
;
THIS
: tok = 'this'
{ keywordsList.add($tok.getText()); }
;
ASSIGN
: '='
{ keywordsList.add("="); }
;
IDENTIFIER:
tok =Identifier
{
//System.out.println("Identifier: " + $tok.text);
}
;
fragment
Identifier
: (Letter (Letter|JavaIDDigit)*);
fragment
Letter
: '\u0024' |
'\u0041'..'\u005a' |
'\u005f' |
'\u0061'..'\u007a' |
'\u00c0'..'\u00d6' |
'\u00d8'..'\u00f6' |
'\u00f8'..'\u00ff' |
'\u0100'..'\u1fff' |
'\u3040'..'\u318f' |
'\u3300'..'\u337f' |
'\u3400'..'\u3d2d' |
'\u4e00'..'\u9fff' |
'\uf900'..'\ufaff'
;
fragment
JavaIDDigit
: '\u0030'..'\u0039' |
'\u0660'..'\u0669' |
'\u06f0'..'\u06f9' |
'\u0966'..'\u096f' |
'\u09e6'..'\u09ef' |
'\u0a66'..'\u0a6f' |
'\u0ae6'..'\u0aef' |
'\u0b66'..'\u0b6f' |
'\u0be7'..'\u0bef' |
'\u0c66'..'\u0c6f' |
'\u0ce6'..'\u0cef' |
'\u0d66'..'\u0d6f' |
'\u0e50'..'\u0e59' |
'\u0ed0'..'\u0ed9' |
'\u1040'..'\u1049'
;
WS : (' '|'\r'|'\t'|'\u000C'|'\n') {$channel=HIDDEN; skip();}
;
When I try to parse the line:
public final int inch = this.getValue();
the rule 'V_VALUE -> IDENTIFIER' also consumes the "this" keyword. This is undesirable, since keywords should also be collected into a separate list.
Is there any trick or provision in an ANTLR grammar to match keywords with their own rules without affecting other functionality such as "IDENTIFIER"?
Answer 1:
Your problem is indeed caused by a misunderstanding of what belongs in the lexer and what belongs in the parser:
- The lexer's job is to determine which words the stream of characters represents
  - e.g. that "this" is a THIS, "0" is a NUMBER and "that" is an IDENTIFIER
- The parser's job is to determine whether the sequence of words emitted from the lexer conforms to the given language, that is, whether the "sentence" made of those words makes sense
  - e.g. that a declaration consists of possible modifiers, a type, and a list of identifiers
Since the lexer's job is to determine which words are on the input, it processes the input and looks for the longest valid match (in ANTLR, if two or more rules accept the same input, the topmost one in the source grammar wins). It does not look for any "most specific" match, simply the longest one.
Example:
- Input "t"
  - Can be THIS or IDENTIFIER
- Input "h"
  - Still can be THIS or IDENTIFIER
- Input "a"
  - Can no longer be THIS, only IDENTIFIER is possible
- Input "t"
  - IDENTIFIER for sure
- Input "."
  - No longer matches IDENTIFIER, so "that" will be matched as IDENTIFIER and the last input "." will be matched as the start of the next token
And another example:
- Input "t", "h", "i", "s"
  - Can be matched as either THIS or IDENTIFIER the whole time
- Input "."
  - Can no longer be matched by anything, so "this" will be matched as THIS (topmost matching rule) rather than IDENTIFIER and "." will start a new token
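The two walkthroughs above can be sketched in plain Java. This is a simplified model of the selection rule as described in this answer (longest match, topmost rule wins ties), not the code ANTLR actually generates. The class name LongestMatchDemo, the nextToken helper, and the ASCII-only regular expressions standing in for the grammar's rules are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LongestMatchDemo {
    // Rules in "grammar order": earlier entries win ties.
    // The regexes are simplified ASCII stand-ins for the real lexer rules.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("THIS", Pattern.compile("this"));
        RULES.put("IDENTIFIER", Pattern.compile("[A-Za-z_$][A-Za-z0-9_$]*"));
        RULES.put("DOT", Pattern.compile("\\."));
    }

    /** Returns "RULE:text" for the token that starts at position pos. */
    static String nextToken(String input, int pos) {
        String bestRule = null;
        int bestLen = 0;
        for (Map.Entry<String, Pattern> e : RULES.entrySet()) {
            Matcher m = e.getValue().matcher(input);
            m.region(pos, input.length());
            // Strictly longer only: on a tie the earlier (topmost) rule is kept.
            if (m.lookingAt() && m.end() - pos > bestLen) {
                bestLen = m.end() - pos;
                bestRule = e.getKey();
            }
        }
        if (bestRule == null) {
            throw new IllegalArgumentException("No rule matches at position " + pos);
        }
        return bestRule + ":" + input.substring(pos, pos + bestLen);
    }

    public static void main(String[] args) {
        System.out.println(nextToken("that.", 0)); // IDENTIFIER:that
        System.out.println(nextToken("this.", 0)); // THIS:this
        System.out.println(nextToken("this.", 4)); // DOT:.
    }
}
```

Moving the IDENTIFIER entry above THIS in this model would make "this" come out as IDENTIFIER, which is the tie-breaking effect the second walkthrough relies on.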
And now to the important part: as long as a lexer rule is referenced from another lexer rule, it is considered to be merely a fragment of the referencing lexer rule. This means that matching it won't emit a new token, and also that it won't trigger any decision between multiple matching tokens at the end of the fragment's match. Since "this" can indeed be matched by the IDENTIFIER rule, the whole declaration conforms to the V_DECLARATION lexer rule. So unless there is another lexer rule that can match at least the same length of input and appears earlier in the grammar than this rule, this rule will apply.
You didn't provide any rule referencing THIS, so we don't know exactly how this plays out in your grammar, but the obvious cause is that the lexer can match longer input, or match with an earlier rule, than anything that uses the THIS rule.
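The effect of a composite lexer rule can be illustrated with the same simplified model (longest match, topmost rule wins ties). This is only a sketch under stated assumptions: the class CompositeRuleDemo, the LPAREN rule name, and the regex used for the composite rule (loosely modeled on part of V_VALUE, promoted to a top-level rule here) are hypothetical and not taken from the question's grammar or from ANTLR's generated code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CompositeRuleDemo {
    // Simplified ASCII-only identifier pattern (assumption, not the grammar's Letter/JavaIDDigit).
    static final String ID = "[A-Za-z_$][A-Za-z0-9_$]*";

    /** Builds the rule table in "grammar order", optionally with a composite rule on top. */
    static Map<String, Pattern> rules(boolean withComposite) {
        Map<String, Pattern> rules = new LinkedHashMap<>();
        if (withComposite) {
            // Hypothetical composite rule: IDENTIFIER DOT IDENTIFIER '('
            rules.put("V_VALUE", Pattern.compile(ID + "\\." + ID + "\\("));
        }
        rules.put("THIS", Pattern.compile("this"));
        rules.put("DOT", Pattern.compile("\\."));
        rules.put("LPAREN", Pattern.compile("\\("));
        rules.put("IDENTIFIER", Pattern.compile(ID));
        return rules;
    }

    /** Splits the input into "RULE:text" tokens using longest match, first rule wins ties. */
    static List<String> tokenize(String input, Map<String, Pattern> rules) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestRule = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> e : rules.entrySet()) {
                Matcher m = e.getValue().matcher(input);
                m.region(pos, input.length());
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestEnd = m.end();
                    bestRule = e.getKey();
                }
            }
            if (bestRule == null) {
                throw new IllegalArgumentException("No rule matches at position " + pos);
            }
            tokens.add(bestRule + ":" + input.substring(pos, bestEnd));
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The composite rule swallows "this" as part of one longer token.
        System.out.println(tokenize("this.getValue(", rules(true)));
        // prints [V_VALUE:this.getValue(]
        // Without it, THIS is emitted as its own token.
        System.out.println(tokenize("this.getValue(", rules(false)));
        // prints [THIS:this, DOT:., IDENTIFIER:getValue, LPAREN:(]
    }
}
```

In this model the keyword is only visible as a separate token when no longer rule absorbs it, which is why composite constructs such as declarations are usually better expressed as parser rules over the tokens the lexer emits.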
Source: https://stackoverflow.com/questions/42221144/identifier-rule-also-consumes-keyword-in-antlr-lexer-grammar