ANTLR parsing MismatchedTokenException

后端 未结 2 784
无人共我
无人共我 2021-01-25 00:42

I\'m trying to write a simple parser for an even simpler language that I\'m writing. It\'s composed of postfix expressions. As of now, I\'m having issues with the parser. When I

相关标签:
2条回答
  • 2021-01-25 01:18

    The problem is that your body rule never terminates, because it's allowed to match nothing. I didn't fire up ANTLR, I really don't like to mess with it, instead I rewrote your grammar in C++ (using AXE parser generator), added print statements to trace the matches and got the following result from parsing "2 2 * test >>":

    parsed term: 2
    parsed expr: 2
    parsed nested: 2
    parsed term: 2
    parsed expr: 2
    parsed nested: 2
    parsed body: 2 2
    parsed body:
    parsed body: ... here goes your infinite loop
    

    If you are interested in debugging this test case, the AXE grammar is shown below, set breakpoints at prints to step through the parser:

    using namespace axe;
    typedef std::string::iterator It;
    
    auto space = r_any(" \t\n\r");
    auto int_rule = r_numstr();
    auto id = r_ident();
    auto op = r_any("*+/%-");
    auto term = int_rule 
        >> e_ref([](It i1, It i2)
    { 
        std::cout << "\nparsed term: " << std::string(i1, i2); 
    });
    auto expr = (term & *(term & op)) 
        >> e_ref([](It i1, It i2)
    { 
        std::cout << "\nparsed expr: " << std::string(i1, i2); 
    });
    auto nested = (expr & *(expr & op))
        >> e_ref([](It i1, It i2)
    { 
        std::cout << "\nparsed nested: " << std::string(i1, i2); 
    });
    auto get = (id & "<<")  
        >> e_ref([](It i1, It i2)
    { 
        std::cout << "\nparsed get: " << std::string(i1, i2); 
    });
    auto var = (nested & id & ">>")  
        >> e_ref([](It i1, It i2)
    { 
        std::cout << "\nparsed var: " << std::string(i1, i2); 
    });
    auto body = (*(nested & space) | *(var & space) | *(get & space))
         >> e_ref([](It i1, It i2)
    { 
        std::cout << "\nparsed body: " << std::string(i1, i2); 
    });
    auto program = +(body)
        | r_fail([](It i1, It i2) 
    {
        std::cout << "\nparsing failed, parsed portion: " 
            << std::string(i1, i2);
    });
    // test parser
    std::ostringstream text;
    text << "2 2 * test >>";
    std::string str = text.str();
    program(str.begin(), str.end());
    
    0 讨论(0)
  • 2021-01-25 01:40

    A couple of things are not correct:

    1

    You've put the WS token on the HIDDEN channel, which makes them unavailable to parser rules. So all WS tokens inside your body rule are incorrect.

    2

    _(your latest edit removed the left-recursion issue, but I'll still make a point of it sorry, your other question has a left recursive rule (expr), so I'll leave this info in here)_

    ANTLR is an LL parser-generator, so you can'r create left-recursive grammars. The following is left recursive:

    expr
      :  term term operator
      ;
    
    term
      :  INT
      |  ID
      |  expr
      ;
    

    because the first term inside the expr rule could possible match an expr rule itself. Like any LL parser, ANTLR generated parser cannot cope with left recursion.

    3

    If you fix the WS issue, your body rule will produce the following error message:

    (1/7) Decision can match input such as "INT" using multiple alternatives

    This means that the parser cannot "see" to which rule the INT token belongs. This is due to the fact that all your body alternative can be repeated zero or more times and expr and nested are also repeated. And all of them can match an INT, which is what ANTLR is complaining about. If you remove the *'s like this:

    body
        :   nested
        |   var
        |   get
        ;
    
    // ...
    
    expr
        :   term (term operator)
        ;
    
    nested
        :   expr (expr operator)
        ;
    

    the errors would disappear (although that would still not cause your input to be parsed properly!).

    I realize that this might still sound vague, but it's not trivial to explain (or comprehend if you're new to all this).

    4

    To properly account for recursive expr inside expr, you'll need to stay clear of left recursion as I explained in #2. You can do that like this:

    expr
      :  term (expr operator | term operator)*
      ;
    

    which is still ambiguous, but that is in case of describing a postfix expression using an LL grammar, unavoidable AFAIK. To resolve this, you could enable global backtracking inside the options { ... } section of the grammar:

    options {
      language=Python;
      output=AST;
      backtrack=true;
    }
    

    Demo

    A little demo of how to parse recursive expressions could look like:

    grammar star;
    
    options {
      language=Python;
      output=AST;
      backtrack=true;
    }
    
    parse
      :  expr EOF -> expr
      ;
    
    expr
      :  (term -> term) ( expr2 operator -> ^(operator $expr expr2) 
                        | term operator  -> ^(operator term term)
                        )*
      ;
    
    expr2 
      :  expr
      ;
    
    term
      :  INT
      |  ID
      ;
    
    operator 
      :  ('*' | '+' | '/' | '%' | '-')
      ;
    
    ID
      :  ('a'..'z' | 'A'..'Z') ('a..z' | '0'..'9' | 'A'..'Z')*
      ;
    
    INT
      :  '0'..'9'+
      ;
    
    WS
      :  (' ' | '\n' | '\t' | '\r') {$channel=HIDDEN;}
      ;
    

    The test script:

    #!/usr/bin/env python
    import antlr3
    from antlr3 import *
    from antlr3.tree import *
    from starLexer import *
    from starParser import *
    
    def print_level_order(tree, indent):
      print '{0}{1}'.format('   '*indent, tree.text)
      for child in tree.getChildren():
        print_level_order(child, indent+1)
    
    input = "5 1 2 + 4 * + 3 -"
    char_stream = antlr3.ANTLRStringStream(input)
    lexer = starLexer(char_stream)
    tokens = antlr3.CommonTokenStream(lexer)
    parser = starParser(tokens)
    tree = parser.parse().tree 
    print_level_order(tree, 0)
    

    produces the following output:

    -
       +
          5
          *
             +
                1
                2
             4
       3

    which corresponds to the following AST:

    enter image description here

    0 讨论(0)
提交回复
热议问题