I'm trying to write a simple parser for an even simpler language that I'm writing. It's composed of postfix expressions. As of now, I'm having issues with the parser. When I
The problem is that your body rule never terminates, because it's allowed to match nothing. I didn't fire up ANTLR (I really don't like to mess with it); instead I rewrote your grammar in C++ using the AXE parser generator, added print statements to trace the matches, and got the following result from parsing "2 2 * test >>":
parsed term: 2
parsed expr: 2
parsed nested: 2
parsed term: 2
parsed expr: 2
parsed nested: 2
parsed body: 2 2
parsed body:
parsed body: ... here goes your infinite loop
If you are interested in debugging this test case, the AXE grammar is shown below; set breakpoints at the print statements to step through the parser:
using namespace axe;
typedef std::string::iterator It;
auto space = r_any(" \t\n\r");
auto int_rule = r_numstr();
auto id = r_ident();
auto op = r_any("*+/%-");
auto term = int_rule
    >> e_ref([](It i1, It i2)
    {
        std::cout << "\nparsed term: " << std::string(i1, i2);
    });
auto expr = (term & *(term & op))
    >> e_ref([](It i1, It i2)
    {
        std::cout << "\nparsed expr: " << std::string(i1, i2);
    });
auto nested = (expr & *(expr & op))
    >> e_ref([](It i1, It i2)
    {
        std::cout << "\nparsed nested: " << std::string(i1, i2);
    });
auto get = (id & "<<")
    >> e_ref([](It i1, It i2)
    {
        std::cout << "\nparsed get: " << std::string(i1, i2);
    });
auto var = (nested & id & ">>")
    >> e_ref([](It i1, It i2)
    {
        std::cout << "\nparsed var: " << std::string(i1, i2);
    });
// every alternative is a zero-or-more repetition, so body can match nothing
auto body = (*(nested & space) | *(var & space) | *(get & space))
    >> e_ref([](It i1, It i2)
    {
        std::cout << "\nparsed body: " << std::string(i1, i2);
    });
auto program = +(body)
    | r_fail([](It i1, It i2)
    {
        std::cout << "\nparsing failed, parsed portion: "
                  << std::string(i1, i2);
    });
// test parser
std::ostringstream text;
text << "2 2 * test >>";
std::string str = text.str();
program(str.begin(), str.end());
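The non-termination described above does not depend on AXE or ANTLR. Here is a minimal Python sketch of the same situation, assuming a hypothetical `match_body` rule that is allowed to consume nothing. The names `match_body`, `one_or_more` and `one_or_more_guarded` are made up for illustration, and the iteration cap exists only so the naive version stops.

```python
def match_body(text, pos):
    """Hypothetical 'body' rule: consumes digits and spaces, possibly nothing."""
    end = pos
    while end < len(text) and (text[end].isdigit() or text[end] == " "):
        end += 1
    return end  # equals pos when nothing was consumed


def one_or_more(rule, text, pos, max_iterations=10):
    """Naive +(rule): repeats without checking for progress.

    Because the rule always "succeeds", only the artificial cap stops it.
    """
    iterations = 0
    while iterations < max_iterations:
        pos = rule(text, pos)
        iterations += 1
    return pos, iterations


def one_or_more_guarded(rule, text, pos):
    """+(rule) that stops as soon as the rule consumes no input."""
    iterations = 0
    while True:
        new_pos = rule(text, pos)
        iterations += 1
        if new_pos == pos:  # empty match: repeating again cannot help
            break
        pos = new_pos
    return pos, iterations


source = "2 2 * test >>"
print(one_or_more(match_body, source, 0))
print(one_or_more_guarded(match_body, source, 0))
```

The naive loop stays stuck at the `*` character and runs until the cap, which mirrors the repeated empty "parsed body:" lines in the trace. The guarded loop stops after the first empty match.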
A couple of things are not correct:
You've put the WS tokens on the HIDDEN channel, which makes them unavailable to parser rules. So all WS tokens inside your body rule are incorrect.
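To illustrate the channel point, here is a small hand-rolled Python sketch (not ANTLR's actual lexer). It assumes a simplified token set based on the input in the question. The channel constants, `tokenize` and `parser_view` are names invented for this example.

```python
import re
from collections import namedtuple

Token = namedtuple("Token", "type text channel")

# Channel numbers chosen for this sketch only.
DEFAULT, HIDDEN = 0, 99

TOKEN_SPEC = [
    ("INT", r"\d+"),
    ("ID", r"[A-Za-z][A-Za-z0-9]*"),
    ("OP", r">>|<<|[*+/%-]"),
    ("WS", r"\s+"),
]
PATTERN = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))


def tokenize(text):
    """Lex the text; whitespace tokens are created but put on HIDDEN."""
    tokens = []
    for match in PATTERN.finditer(text):
        kind = match.lastgroup
        channel = HIDDEN if kind == "WS" else DEFAULT
        tokens.append(Token(kind, match.group(), channel))
    return tokens


def parser_view(tokens):
    """What the parser gets to see: only tokens on the default channel."""
    return [t for t in tokens if t.channel == DEFAULT]


all_tokens = tokenize("2 2 * test >>")
print([t.type for t in all_tokens])
print([t.type for t in parser_view(all_tokens)])
```

The second list contains no WS entries, so a parser rule that asks for WS can never be satisfied.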
_(Your latest edit removed the left-recursion issue, but your other question has a left-recursive rule (expr), so I'll leave this info in here.)_
ANTLR is an LL parser generator, so you can't create left-recursive grammars. The following is left-recursive:
expr
: term term operator
;
term
: INT
| ID
| expr
;
because the first term inside the expr rule could possibly match an expr rule itself. Like any LL parser, an ANTLR-generated parser cannot cope with left recursion.
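To see why, here is a hand-written recursive-descent sketch in Python of the two rules above. This is not what ANTLR generates (ANTLR rejects such a grammar up front). It assumes tokens are plain strings, and `parse_expr` and `parse_term` are illustrative names.

```python
OPERATORS = {"*", "+", "/", "%", "-"}


def parse_expr(tokens, pos):
    # expr : term term operator ;
    pos = parse_term(tokens, pos)
    pos = parse_term(tokens, pos)
    if pos < len(tokens) and tokens[pos] in OPERATORS:
        return pos + 1
    raise SyntaxError("operator expected at token %d" % pos)


def parse_term(tokens, pos):
    # term : INT | ID | expr ;
    if pos < len(tokens) and (tokens[pos].isdigit() or tokens[pos].isalpha()):
        return pos + 1
    # Third alternative: call expr at the SAME position. Nothing has been
    # consumed, so expr calls term, which calls expr again, and so on.
    return parse_expr(tokens, pos)


print(parse_expr(["2", "2", "*"], 0))  # the INT alternative is enough here

try:
    parse_expr(["*"], 0)  # forces the expr alternative of term
except RecursionError:
    print("left recursion: expr -> term -> expr without consuming input")
```

The recursion only stops because Python gives up on the call stack. No amount of input fixes it, since no token is consumed between the nested calls.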
If you fix the WS issue, your body rule will produce the following error message:
(1/7) Decision can match input such as "INT" using multiple alternatives
This means that the parser cannot "see" to which rule the INT token belongs. This is due to the fact that each of your body alternatives can be repeated zero or more times, and expr and nested are repeated as well. All of them can match an INT, which is what ANTLR is complaining about. If you remove the *'s like this:
body
: nested
| var
| get
;
// ...
expr
: term (term operator)
;
nested
: expr (expr operator)
;
the errors would disappear (although that would still not cause your input to be parsed properly!).
I realize that this might still sound vague, but it's not trivial to explain (or comprehend if you're new to all this).
To properly account for a recursive expr inside expr, you'll need to steer clear of left recursion, as I explained above. You can do that like this:
expr
: term (expr operator | term operator)*
;
which is still ambiguous, but AFAIK that is unavoidable when describing a postfix expression using an LL grammar. To resolve this, you could enable global backtracking inside the options { ... } section of the grammar:
options {
language=Python;
output=AST;
backtrack=true;
}
A little demo of how to parse recursive expressions could look like:
grammar star;
options {
language=Python;
output=AST;
backtrack=true;
}
parse
: expr EOF -> expr
;
expr
: (term -> term) ( expr2 operator -> ^(operator $expr expr2)
| term operator -> ^(operator term term)
)*
;
expr2
: expr
;
term
: INT
| ID
;
operator
: ('*' | '+' | '/' | '%' | '-')
;
ID
: ('a'..'z' | 'A'..'Z') ('a'..'z' | '0'..'9' | 'A'..'Z')*
;
INT
: '0'..'9'+
;
WS
: (' ' | '\n' | '\t' | '\r') {$channel=HIDDEN;}
;
The test script:
#!/usr/bin/env python
import antlr3
from antlr3 import *
from antlr3.tree import *
from starLexer import *
from starParser import *
def print_level_order(tree, indent):
    print '{0}{1}'.format(' '*indent, tree.text)
    for child in tree.getChildren():
        print_level_order(child, indent+1)
input = "5 1 2 + 4 * + 3 -"
char_stream = antlr3.ANTLRStringStream(input)
lexer = starLexer(char_stream)
tokens = antlr3.CommonTokenStream(lexer)
parser = starParser(tokens)
tree = parser.parse().tree
print_level_order(tree, 0)
produces the following output:
-
 +
  5
  *
   +
    1
    2
   4
 3
which corresponds to the following AST:
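As a cross-check that does not involve ANTLR at all, the same tree can be built with the classic stack technique for postfix input. This is a different approach from the grammar above, shown only to confirm the shape of the AST. The tuple representation and the names `build_ast` and `preorder` are assumptions made for this sketch.

```python
OPERATORS = {"*", "+", "/", "%", "-"}


def build_ast(source):
    """Build a nested (operator, left, right) tuple from a postfix string."""
    stack = []
    for token in source.split():
        if token in OPERATORS:
            right = stack.pop()  # the most recent operand is the right child
            left = stack.pop()
            stack.append((token, left, right))
        else:
            stack.append(token)  # INT or ID leaf
    if len(stack) != 1:
        raise ValueError("malformed postfix expression")
    return stack[0]


def preorder(node):
    """Flatten the tree root-first, the order print_level_order visits it."""
    if isinstance(node, tuple):
        operator, left, right = node
        return [operator] + preorder(left) + preorder(right)
    return [node]


tree = build_ast("5 1 2 + 4 * + 3 -")
print(" ".join(preorder(tree)))
```

Visiting the root first and then the children from left to right gives the same node order as the parser's output above.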