I'm trying to use ANTLR to make a very simple parser that basically tokenizes a series of '.'-delimited identifiers.
I've made a simple grammar:
Rules starting with a capital letter are Lexer rules.
With the following input file t.text:
.
.foobar
.foobar.baz
your grammar (in file Question.g4) produces the following tokens:
$ grun Question r -tokens -diagnostics t.text
[@0,0:0='.',<STRUCTURE_SELECTOR>,1:0]
[@1,2:8='.foobar',<STRUCTURE_SELECTOR>,2:0]
[@2,10:20='.foobar.baz',<STRUCTURE_SELECTOR>,3:0]
[@3,22:21='<EOF>',<EOF>,4:0]
The lexer (like the parser) is greedy: it tries to consume as many input characters (tokens, for the parser) as it can with a rule. The lexer rule STRUCTURE_SELECTOR: '.' (ID STRUCTURE_SELECTOR?)? can read a dot, an ID, and then, through the optional recursive reference STRUCTURE_SELECTOR?, another dot and another ID, and so on until the NL. That's why each line ends up in a single token.
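To see the greedy behaviour outside ANTLR, here is a minimal sketch using java.util.regex. It assumes that ID was defined as [_a-z0-9$]* (the character class is borrowed from the grammar further down; the original ID rule is not shown), in which case the recursive lexer rule is equivalent to the regular expression (\.[_a-z0-9$]*)+. The class and method names are made up for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyLexerDemo {
    // Assumed regex equivalent of STRUCTURE_SELECTOR: '.' (ID STRUCTURE_SELECTOR?)?
    // with ID taken to be [_a-z0-9$]* (an assumption, see above).
    static final Pattern STRUCTURE_SELECTOR = Pattern.compile("(?:\\.[_a-z0-9$]*)+");

    // Collects each longest match, the way a greedy lexer emits one token per match.
    static List<String> tokens(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = STRUCTURE_SELECTOR.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Same content as t.text; prints [., .foobar, .foobar.baz]
        System.out.println(tokens(".\n.foobar\n.foobar.baz\n"));
    }
}
```

Each line comes back as one match, just like the single STRUCTURE_SELECTOR token per line in the grun output.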
When compiling the grammar, the warning
warning(146): Question.g4:5:0: non-fragment lexer rule ID can match the empty string
appears because the repetition marker of ID is * (zero or more times) instead of + (one or more times).
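The difference between the two markers can be checked quickly with plain Java regexes, which use the same * and + notation. This is only an illustration: the character class is taken from the grammar below, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class EmptyMatchDemo {
    // Same character class, only the repetition marker differs.
    static final Pattern ID_STAR = Pattern.compile("[_a-z0-9$]*"); // zero or more
    static final Pattern ID_PLUS = Pattern.compile("[_a-z0-9$]+"); // one or more

    // True if the pattern accepts the empty string.
    static boolean matchesEmpty(Pattern p) {
        return p.matcher("").matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesEmpty(ID_STAR)); // true
        System.out.println(matchesEmpty(ID_PLUS)); // false
    }
}
```

A token rule that can match nothing at all is what warning(146) complains about.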
Now try this grammar:

grammar Question;

r
@init {System.out.println("Question last update 2135");}
    : ( structure_selector NL )+ EOF
    ;

structure_selector
    : '.'
    | '.' ID structure_selector*
    ;

ID : [_a-z0-9$]+ ;
NL : [\r\n]+ ;
WS : [ \t]+ -> skip ;
$ grun Question r -tokens -diagnostics t.text
[@0,0:0='.',<'.'>,1:0]
[@1,1:1='\n',<NL>,1:1]
[@2,2:2='.',<'.'>,2:0]
[@3,3:8='foobar',<ID>,2:1]
[@4,9:9='\n',<NL>,2:7]
[@5,10:10='.',<'.'>,3:0]
[@6,11:16='foobar',<ID>,3:1]
[@7,17:17='.',<'.'>,3:7]
[@8,18:20='baz',<ID>,3:8]
[@9,21:21='\n',<NL>,3:11]
[@10,22:21='<EOF>',<EOF>,4:0]
Question last update 2135
line 3:7 reportAttemptingFullContext d=1 (structure_selector), input='.'
line 3:7 reportContextSensitivity d=1 (structure_selector), input='.'
and
$ grun Question r -gui t.text
displays the hierarchical tree structure you are expecting.
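If you want to convince yourself that the revised lexer rules split the input into separate dot, ID and NL tokens, the sketch below mimics them with one named regex group per rule. It is only an approximation of what the ANTLR lexer does (for instance, it silently skips characters no rule matches, where ANTLR would report a token recognition error), and the names DOT, SplitLexerDemo and tokenTypes are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitLexerDemo {
    // One named group per lexer rule; DOT stands in for the implicit '.' token.
    static final Pattern TOKEN = Pattern.compile(
        "(?<DOT>\\.)|(?<ID>[_a-z0-9$]+)|(?<NL>[\\r\\n]+)|(?<WS>[ \\t]+)");

    // Returns the token types in input order, dropping WS like "-> skip" does.
    static List<String> tokenTypes(String input) {
        List<String> types = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            if (m.group("DOT") != null) {
                types.add("'.'");
            } else if (m.group("ID") != null) {
                types.add("ID");
            } else if (m.group("NL") != null) {
                types.add("NL");
            }
            // A WS match falls through and is not recorded.
        }
        return types;
    }

    public static void main(String[] args) {
        // Third line of t.text; prints ['.', ID, '.', ID, NL]
        System.out.println(tokenTypes(".foobar.baz\n"));
    }
}
```

Because the dot is now a token of its own, the nesting is left to the parser rule structure_selector, which is what produces the tree.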