ANTLR parsing MismatchedTokenException

问题

I'm trying to write a simple parser for an even simpler language that I'm writing. It's composed of postfix expressions. As of now, I'm having issues with the parser. When I run it on the input 2 2 * test >> I get a MismatchedTokenException.

Also, how would I go about implementing a recursive postfix parser?

Here's my code:

grammar star;

options {
    language=Python;
    output=AST;
    ASTLabelType=CommonTree;
}

tokens {DECL;}
//start
//  :   decl ;
//decl
//  :   type ID -> ^(DECL type ID)
//  ;

program
    :   (body)+
    ;

body    :   (nested WS)*
    |   (var WS)*
    |   (get WS)*
    ;

var
    :   nested ID '>>'
    ;

get
    :   ID '<<'
    ;

//expressions

term
    :   INT
    ;

expr
    :   term (term operator)*
    ;

nested
    :   expr (expr operator)*
    ;

operator 
    :   ('*' | '+' | '/' | '%' | '-')
    ;

ID
    :   ('a'..'z' | 'A'..'Z') ('a..z' | '0'..'9' | 'A'..'Z')*
    ;

INT
    :   '0'..'9'+
    ;

WS
    :   (' ' | '\n' | '\t' | '\r') {$channel=HIDDEN;}
    ;

回答1:

A couple of things are not correct:

1

You've put the WS token on the HIDDEN channel, which makes them unavailable to parser rules. So all WS tokens inside your body rule are incorrect.

2

_(~~your latest edit removed the left-recursion issue, but I'll still make a point of it~~ sorry, your other question has a left recursive rule (expr), so I'll leave this info in here)_

ANTLR is an LL parser-generator, so you can'r create left-recursive grammars. The following is left recursive:

expr
  :  term term operator
  ;

term
  :  INT
  |  ID
  |  expr
  ;

because the first term inside the expr rule could possible match an expr rule itself. Like any LL parser, ANTLR generated parser cannot cope with left recursion.

3

If you fix the WS issue, your body rule will produce the following error message:

(1/7) Decision can match input such as "INT" using multiple alternatives

This means that the parser cannot "see" to which rule the INT token belongs. This is due to the fact that all your body alternative can be repeated zero or more times and expr and nested are also repeated. And all of them can match an INT, which is what ANTLR is complaining about. If you remove the *'s like this:

body
    :   nested
    |   var
    |   get
    ;

// ...

expr
    :   term (term operator)
    ;

nested
    :   expr (expr operator)
    ;

the errors would disappear (although that would still not cause your input to be parsed properly!).

I realize that this might still sound vague, but it's not trivial to explain (or comprehend if you're new to all this).

4

To properly account for recursive expr inside expr, you'll need to stay clear of left recursion as I explained in #2. You can do that like this:

expr
  :  term (expr operator | term operator)*
  ;

which is still ambiguous, but that is in case of describing a postfix expression using an LL grammar, unavoidable AFAIK. To resolve this, you could enable global backtracking inside the options { ... } section of the grammar:

options {
  language=Python;
  output=AST;
  backtrack=true;
}

Demo

A little demo of how to parse recursive expressions could look like:

grammar star;

options {
  language=Python;
  output=AST;
  backtrack=true;
}

parse
  :  expr EOF -> expr
  ;

expr
  :  (term -> term) ( expr2 operator -> ^(operator $expr expr2) 
                    | term operator  -> ^(operator term term)
                    )*
  ;

expr2 
  :  expr
  ;

term
  :  INT
  |  ID
  ;

operator 
  :  ('*' | '+' | '/' | '%' | '-')
  ;

ID
  :  ('a'..'z' | 'A'..'Z') ('a..z' | '0'..'9' | 'A'..'Z')*
  ;

INT
  :  '0'..'9'+
  ;

WS
  :  (' ' | '\n' | '\t' | '\r') {$channel=HIDDEN;}
  ;

The test script:

#!/usr/bin/env python
import antlr3
from antlr3 import *
from antlr3.tree import *
from starLexer import *
from starParser import *

def print_level_order(tree, indent):
  print '{0}{1}'.format('   '*indent, tree.text)
  for child in tree.getChildren():
    print_level_order(child, indent+1)

input = "5 1 2 + 4 * + 3 -"
char_stream = antlr3.ANTLRStringStream(input)
lexer = starLexer(char_stream)
tokens = antlr3.CommonTokenStream(lexer)
parser = starParser(tokens)
tree = parser.parse().tree 
print_level_order(tree, 0)

produces the following output:

which corresponds to the following AST:

回答2:

The problem is that your body rule never terminates, because it's allowed to match nothing. I didn't fire up ANTLR, I really don't like to mess with it, instead I rewrote your grammar in C++ (using AXE parser generator), added print statements to trace the matches and got the following result from parsing "2 2 * test >>":

parsed term: 2
parsed expr: 2
parsed nested: 2
parsed term: 2
parsed expr: 2
parsed nested: 2
parsed body: 2 2
parsed body:
parsed body: ... here goes your infinite loop

If you are interested in debugging this test case, the AXE grammar is shown below, set breakpoints at prints to step through the parser:

using namespace axe;
typedef std::string::iterator It;

auto space = r_any(" \t\n\r");
auto int_rule = r_numstr();
auto id = r_ident();
auto op = r_any("*+/%-");
auto term = int_rule 
    >> e_ref([](It i1, It i2)
{ 
    std::cout << "\nparsed term: " << std::string(i1, i2); 
});
auto expr = (term & *(term & op)) 
    >> e_ref([](It i1, It i2)
{ 
    std::cout << "\nparsed expr: " << std::string(i1, i2); 
});
auto nested = (expr & *(expr & op))
    >> e_ref([](It i1, It i2)
{ 
    std::cout << "\nparsed nested: " << std::string(i1, i2); 
});
auto get = (id & "<<")  
    >> e_ref([](It i1, It i2)
{ 
    std::cout << "\nparsed get: " << std::string(i1, i2); 
});
auto var = (nested & id & ">>")  
    >> e_ref([](It i1, It i2)
{ 
    std::cout << "\nparsed var: " << std::string(i1, i2); 
});
auto body = (*(nested & space) | *(var & space) | *(get & space))
     >> e_ref([](It i1, It i2)
{ 
    std::cout << "\nparsed body: " << std::string(i1, i2); 
});
auto program = +(body)
    | r_fail([](It i1, It i2) 
{
    std::cout << "\nparsing failed, parsed portion: " 
        << std::string(i1, i2);
});
// test parser
std::ostringstream text;
text << "2 2 * test >>";
std::string str = text.str();
program(str.begin(), str.end());

来源：https://stackoverflow.com/questions/6359881/antlr-parsing-mismatchedtokenexception

标签

python

antlr

parser-generator

antlrworks