Do I have a bug in my grammar, or the parser-generation tool?

Question

The following is an EBNF-format (mostly - the actual syntax is documented here) grammar that I am attempting to generate a parser for:

expr = lambda_expr_list $;

lambda_expr_list = [ lambda_expr_list "," ] lambda_expr;

lambda_expr = conditional_expr [ "->" lambda_expr ];

conditional_expr = boolean_or_expr [ "if" conditional_expr "else" conditional_expr ];

boolean_or_expr = [ boolean_or_expr "or" ] boolean_xor_expr;

boolean_xor_expr = [ boolean_xor_expr "xor" ] boolean_and_expr;

boolean_and_expr = [ boolean_and_expr "and" ] boolean_not_expr;

boolean_not_expr = [ "not" ] relation;

relation = [ relation ( "=="
                      | "!="
                      | ">"
                      | "<="
                      | "<"
                      | ">="
                      | [ "not" ] "in"
                      | "is" [ "not" ] ) ] bitwise_or_expr;

bitwise_or_expr = [ bitwise_or_expr "|" ] bitwise_xor_expr;

bitwise_xor_expr = [ bitwise_xor_expr "^" ] bitwise_and_expr;

bitwise_and_expr = [ bitwise_and_expr "&" ] bitwise_shift_expr;

bitwise_shift_expr = [ bitwise_shift_expr ( "<<"
                                          | ">>" ) ] subtraction_expr;

subtraction_expr = [ subtraction_expr "-" ] addition_expr;

addition_expr = [ addition_expr "+" ] division_expr;

division_expr = [ division_expr ( "/"
                                | "\\" ) ] multiplication_expr;

multiplication_expr = [ multiplication_expr ( "*"
                                            | "%" ) ] negative_expr;

negative_expr = [ "-" ] positive_expr;

positive_expr = [ "+" ] bitwise_not_expr;

bitwise_not_expr = [ "~" ] power_expr;

power_expr = slice_expr [ "**" power_expr ];

slice_expr = member_access_expr { subscript };

subscript = "[" slice_defn_list "]";

slice_defn_list = [ slice_defn_list "," ] slice_defn;

slice_defn = lambda_expr
           | [ lambda_expr ] ":" [ [ lambda_expr ] ":" [ lambda_expr ] ];

member_access_expr = [ member_access_expr "." ] function_call_expr;

function_call_expr = atom { parameter_list };

parameter_list = "(" [ lambda_expr_list ] ")";

atom = identifier
     | scalar_literal
     | nary_literal;

identifier = /[_A-Za-z][_A-Za-z0-9]*/;

scalar_literal = float_literal
               | integer_literal
               | boolean_literal;

float_literal = point_float_literal
              | exponent_float_literal;

point_float_literal = /[0-9]+?\.[0-9]+|[0-9]+\./;

exponent_float_literal = /([0-9]+|[0-9]+?\.[0-9]+|[0-9]+\.)[eE][+-]?[0-9]+/;

integer_literal = dec_integer_literal
                | oct_integer_literal
                | hex_integer_literal
                | bin_integer_literal;

dec_integer_literal = /[1-9][0-9]*|0+/;

oct_integer_literal = /0[oO][0-7]+/;

hex_integer_literal = /0[xX][0-9a-fA-F]+/;

bin_integer_literal = /0[bB][01]+/;

boolean_literal = "true"
                | "false";

nary_literal = tuple_literal
             | list_literal
             | dict_literal
             | string_literal
             | byte_string_literal;

tuple_literal = "(" [ lambda_expr_list ] ")";

list_literal = "[" [ ( lambda_expr_list
                     | list_comprehension ) ] "]";

list_comprehension = lambda_expr "for" lambda_expr_list "in" lambda_expr [ "if" lambda_expr ];

dict_literal = "{" [ ( dict_element_list
                     | dict_comprehension ) ] "}";

dict_element_list = [ dict_element_list "," ] dict_element;

dict_element = lambda_expr ":" lambda_expr;

dict_comprehension = dict_element "for" lambda_expr_list "in" lambda_expr [ "if" lambda_expr ];

string_literal = /[uU]?[rR]?(\u0027(\\.|[^\\\r\n\u0027])*\u0027|\u0022(\\.|[^\\\r\n\u0022])*\u0022)/;

byte_string_literal = /[bB][rR]?(\u0027(\\[\u0000-\u007F]|[\u0000-\u0009\u000B-\u000C\u000E-\u0026\u0028-\u005B\u005D-\u007F])*\u0027|\u0022(\\[\u0000-\u007F]|[\u0000-\u0009\u000B-\u000C\u000E-\u0021\u0023-\u005B\u005D-\u007F])*\u0022)/;

The tool I'm using to generate the parser is Grako, which generates a modified Packrat parser that claims to support both direct and indirect left recursion.

When I run the generated parser on this string:

input.filter(e -> e[0] in ['t', 'T']).map(e -> (e.len().str(), e)).map(e -> '(Line length: ' + e[0] + ') ' + e[1]).list()

I get the following error:

grako.exceptions.FailedParse: (1:13) Expecting end of text. :
input.filter(e -> e[0] in ['t', 'T']).map(e -> (e.len().str(), e)).map(e -> '(Line length: ' + e[0] + ') ' + e[1]).list()
            ^
expr

Debugging has shown that the parser seems to get to the end of the first e[0], then never backtracks to/reaches a point where it will try to match the in token.

Is there some issue with my grammar such that a left recursion-supporting Packrat parser would fail on it? Or should I file an issue on the Grako issue tracker?

Answer 1:

It may be a bug in the grammar, but the error message is not telling you where it actually happens. What I always do after finishing a grammar is embed cut (~) elements throughout it (after keywords like if, after operators, after opening parentheses, everywhere it seems reasonable).

The cut element makes the Grako-generated parser commit to the option taken in the closest choice in the parse tree. That way, instead of having the parser fail at the start on an if, it will report failure at the expression it actually couldn't parse.
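To make the effect of a cut on error reporting concrete, here is a minimal hand-written sketch in Python. It is a toy model and not Grako's implementation: the two-option rule, the token-list input, and every name in it (ParseError, Committed, parse_stmt and so on) are hypothetical and exist only for illustration. The rule it models is roughly stmt = "if" ~ name "else" name | number.

```python
class ParseError(Exception):
    """An ordinary failure: the enclosing choice may try its next option."""
    def __init__(self, pos, expected):
        super().__init__(f"expecting {expected} at token {pos}")
        self.pos = pos


class Committed(Exception):
    """A failure that happened after a cut: the choice must not backtrack."""
    def __init__(self, error):
        super().__init__(str(error))
        self.error = error


def expect(tokens, pos, literal):
    if pos < len(tokens) and tokens[pos] == literal:
        return pos + 1
    raise ParseError(pos, repr(literal))


def expect_name(tokens, pos):
    if pos < len(tokens) and tokens[pos].isalpha():
        return pos + 1
    raise ParseError(pos, "a name")


def expect_number(tokens, pos):
    if pos < len(tokens) and tokens[pos].isdigit():
        return pos + 1
    raise ParseError(pos, "a number")


def parse_if(tokens, pos, use_cut):
    pos = expect(tokens, pos, "if")
    try:
        # Everything from here on is "after the cut" when use_cut is True.
        pos = expect_name(tokens, pos)
        pos = expect(tokens, pos, "else")
        return expect_name(tokens, pos)
    except ParseError as e:
        if use_cut:
            raise Committed(e) from e
        raise


def parse_stmt(tokens, use_cut):
    """Ordered choice between the 'if' form and a bare number."""
    last_error = None
    options = (
        lambda: parse_if(tokens, 0, use_cut),
        lambda: expect_number(tokens, 0),
    )
    for option in options:
        try:
            return option()
        except ParseError as e:
            last_error = e      # remember the failure, then try the next option
        except Committed as c:
            raise c.error       # a cut was passed: report this failure as-is
    raise last_error


def error_position(tokens, use_cut):
    try:
        parse_stmt(tokens, use_cut)
    except ParseError as e:
        return e.pos
    return None


tokens = ["if", "x", "y", "z"]                  # the "else" keyword is missing
print(error_position(tokens, use_cut=False))    # failure reported at token 0
print(error_position(tokens, use_cut=True))     # failure reported at token 2
```

Without the cut, the choice backtracks to the start and the reported failure comes from the last option tried, which is far from the real problem. With the cut, the failure is reported where the missing "else" was expected.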

Some bugs in grammars are difficult to spot, and for that I just go through the parse trace to find out how far in the input the parser went, and why it decided it couldn't go further.

I would not use left recursion in a PEG parser for professional work, though it may be fine for simpler, academic work. Instead, each left-recursive rule can be rewritten with a closure, for example:

boolean_or_expr = boolean_xor_expr {"or" boolean_xor_expr};

The associativity can then be handled in a semantic action.
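As a rough sketch of such a semantic action in Python: the helper below folds the flat result of the closure into a left-associative tree. The fold_left helper and the ExprSemantics class are hypothetical names, and the shape assumed for the incoming AST (the first operand followed by a list of [operator, operand] pairs) is an assumption about what the rule above produces, so check it against the real parser output before relying on it.

```python
from functools import reduce


def fold_left(first, rest):
    """Build a left-associative tree from a first operand and a list of
    (operator, operand) pairs; nodes are (operator, left, right) tuples."""
    return reduce(lambda left, pair: (pair[0], left, pair[1]), rest, first)


class ExprSemantics:
    """Hypothetical semantics class with one method per grammar rule."""

    def boolean_or_expr(self, ast):
        # Assumed AST shape: [first, [[op, operand], [op, operand], ...]]
        first, rest = ast
        return fold_left(first, [(op, operand) for op, operand in rest])


tree = ExprSemantics().boolean_or_expr(["a", [["or", "b"], ["or", "c"]]])
print(tree)  # ('or', ('or', 'a', 'b'), 'c')
```

A right-associative operator such as ** could be handled the same way by folding from the right instead.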

Also see the discussion under issue 49 against Grako. It says that the algorithm used to support left recursion will not always produce the expected associativity in the resulting AST.
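To see why the associativity of the resulting AST matters, here is a small Python illustration that evaluates the same flat operand sequence both ways. It is independent of Grako; the eval_left and eval_right helpers and the (operator, operand) pair representation are made up for this example.

```python
import operator
from functools import reduce

OPS = {"-": operator.sub, "**": operator.pow}


def eval_left(first, rest):
    """Evaluate as ((a op b) op c): left-associative."""
    return reduce(lambda acc, pair: OPS[pair[0]](acc, pair[1]), rest, first)


def eval_right(first, rest):
    """Evaluate as (a op (b op c)): right-associative."""
    if not rest:
        return first
    (op, operand), tail = rest[0], rest[1:]
    return OPS[op](first, eval_right(operand, tail))


subtraction = (10, [("-", 4), ("-", 3)])
print(eval_left(*subtraction))   # (10 - 4) - 3 = 3
print(eval_right(*subtraction))  # 10 - (4 - 3) = 9

power = (2, [("**", 3), ("**", 2)])
print(eval_left(*power))         # (2 ** 3) ** 2 = 64
print(eval_right(*power))        # 2 ** (3 ** 2) = 512
```

If the parser groups the operands differently from what the grammar author expected, an expression like 10 - 4 - 3 silently evaluates to the wrong value, which is why it is worth checking the tree shape explicitly.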

Source: https://stackoverflow.com/questions/29044806/do-i-have-a-bug-in-my-grammar-or-the-parser-generation-tool

Tags

python

parsing

grammar

grako