Eliminating grammar ambiguity when a rule covers a subset of another

Question

I am trying to build a small bison grammar, but am having an issue with part of the definition. Functions can be called with any expression legal on the right side (expression_list in the grammar) as arguments.

The issue arises because on the left side, functions can be defined by assigning to them (an identifier followed by a list of identifiers - assignment_expression and identifier_list in the grammar)

My question is how I can eliminate the ambiguity in my grammar, since the statements legal on the left side are a subset of those legal on the right.

The grammar is written in bison (v. 2.4.1)

The output from the command was:

2 shift/reduce, 2 reduce/reduce
warning: rule useless in parser due to conflicts: assignment_expression: IDENTIFIER LPAREN RPAREN

Here is the complete grammar:

expression:
    assignment_expression
    | expression DECORATOR IDENTIFIER

value:
    IDENTIFIER
    | HEX 
    | BIN 
    | OCT
    | SCI 
    | FLOAT 
    | INT
    ;

constant_expression:
    value
    | LPAREN constant_expression RPAREN
    | constant_expression OR constant_expression
    | constant_expression XOR constant_expression
    | constant_expression AND constant_expression
    | constant_expression LSHIFT constant_expression
    | constant_expression RSHIFT constant_expression
    | constant_expression PLUS constant_expression
    | constant_expression MINUS constant_expression
    | constant_expression MUL constant_expression
    | constant_expression DIV constant_expression
    | constant_expression MOD constant_expression
    | constant_expression POW constant_expression
    | constant_expression FACTORIAL
    | NOT constant_expression
    | IDENTIFIER LPAREN RPAREN
    | IDENTIFIER LPAREN constant_expression RPAREN
    | IDENTIFIER LPAREN expression_list RPAREN
    ;

expression_list:
    constant_expression COMMA constant_expression
    | expression_list COMMA constant_expression
    ;

assignment_expression:
    constant_expression
    | IDENTIFIER EQUAL assignment_expression
    | IDENTIFIER LPAREN RPAREN
    | IDENTIFIER LPAREN IDENTIFIER RPAREN
    | IDENTIFIER LPAREN identifier_list RPAREN
    ;

identifier_list:
    IDENTIFIER COMMA IDENTIFIER
    | identifier_list COMMA IDENTIFIER
    ;

Here are the relevant sections of bison's output from verbose mode (-v):

State 34 conflicts: 2 shift/reduce
State 35 conflicts: 2 reduce/reduce

state 34

3 value: IDENTIFIER .
25 constant_expression: IDENTIFIER . LPAREN RPAREN
26                    | IDENTIFIER . LPAREN constant_expression RPAREN
27                    | IDENTIFIER . LPAREN expression_list RPAREN
33 assignment_expression: IDENTIFIER LPAREN IDENTIFIER . RPAREN
35 identifier_list: IDENTIFIER . COMMA IDENTIFIER

COMMA   shift, and go to state 53
LPAREN  shift, and go to state 39
RPAREN  shift, and go to state 54

COMMA     [reduce using rule 3 (value)]
RPAREN    [reduce using rule 3 (value)]
$default  reduce using rule 3 (value)


state 35

25 constant_expression: IDENTIFIER LPAREN RPAREN .
32 assignment_expression: IDENTIFIER LPAREN RPAREN .

$end       reduce using rule 25 (constant_expression)
$end       [reduce using rule 32 (assignment_expression)]
DECORATOR  reduce using rule 25 (constant_expression)
DECORATOR  [reduce using rule 32 (assignment_expression)]
$default   reduce using rule 25 (constant_expression)

As requested, here is a minimal grammar that reproduces the issue:

assignment_expression:
    constant_expression
    | IDENTIFIER LPAREN identifier_list RPAREN
    ;

value:
    IDENTIFIER
    | INT
    ;

constant_expression:
    value
    | IDENTIFIER LPAREN expression_list RPAREN
    ;

expression_list:
    constant_expression COMMA constant_expression
    | expression_list COMMA constant_expression
    ;

identifier_list:
    IDENTIFIER COMMA IDENTIFIER
    | identifier_list COMMA IDENTIFIER
    ;

Answer 1:

Your text and your grammar don't quite line up. Or maybe I'm not understanding your text correctly. You say:

on the left side, functions can be defined by assigning to them (an identifier followed by a list of identifiers - assignment_expression and identifier_list in the grammar)

In my head, I imagine an example of that to be something like:

comb(n, r) = n! / (r! * (n-r)!)

But your grammar reads:

assignment_expression:
    constant_expression
    | IDENTIFIER EQUAL assignment_expression
    | IDENTIFIER LPAREN RPAREN
    | IDENTIFIER LPAREN IDENTIFIER RPAREN
    | IDENTIFIER LPAREN identifier_list RPAREN

Which will not parse the above definition, because the only thing which can appear on the left-hand side of EQUAL is IDENTIFIER. The right recursion allows any number of repetitions of IDENTIFIER = before an assignment_expression, but the last thing must be either a constant_expression or one of the three prototype productions. So this would be matched:

c = r = f(a,b)

But so would this:

c = r = f(2, 7)

I'd say that makes your grammar inherently ambiguous, but it is probably an error. What you probably meant was:

assignment_expression: rvalue
                     | lvalue '=' assignment_expression

rvalue: constant_expression

lvalue: IDENTIFIER
      | IDENTIFIER '(' ')'
      | IDENTIFIER '(' identifier_list ')'

I note in passing that your definition of identifier_list as requiring at least two identifiers is unnecessarily complicated, so I've assumed above that the actual definition of identifier_list is:

identifier_list: IDENTIFIER | identifier_list ',' IDENTIFIER

That's not sufficient to solve the problem, though. It still leaves the parser not knowing whether:

comb(n      | lookahead ','

is the start of

comb(n, r) = ...

or just a function call

comb(n, 4)

So to fix that, we need to pull out some heavy artillery.

We can start with the simple solution. This grammar is not ambiguous, since an lvalue must be followed by =. When we finally reach the =, we can tell whether what we have so far is an rvalue or an lvalue, even if they happen to look identical. (comb(n, r), for example.) The only issue is that the = may be an unlimited distance from where we happen to be.

With bison, if we have an unambiguous grammar and we cannot be bothered to fix the lookahead problem, we can ask for a GLR parser. The GLR parser is slightly less efficient because it needs to maintain all possible parses in parallel, but it is still linear complexity for most unambiguous grammars. (GLR parsers can parse even ambiguous grammars in O(N³) but the bison implementation doesn't tolerate ambiguity. It's designed to parse programming languages, after all.)

So to do that, you merely need to add

%glr-parser

and read the section of the bison manual about how semantic actions are affected. (Summary: they are stored up until the parse is disambiguated, so they may not happen as early during the parse as they would in an LALR(1) parser.)

A second simple solution, which is fairly common in practice, is to have the parser accept a superset of the desired language, and then add what is arguably a syntactic check in the semantic action. So you could just write the grammar to allow anything which looks like a call_expression to be on the left-hand-side of an assignment, but when you actually build the AST node for the assignment/definition, verify that the argument list for the call is actually a simple list of identifiers.

Not only does that simplify your grammar without much implementation cost, it makes it possible to generate accurate error messages to describe the syntax error, something which is not easy with a standard LALR(1) parser.
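
As a rough sketch of that second approach, the check in the semantic action could look something like the following. The node layout and every name here (struct node, check_assignment_target, and so on) are hypothetical and not part of the original grammar; the function reports the index of the first offending argument so that the error message can point at it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical AST node kinds; the names are illustrative only. */
enum node_kind { NODE_IDENTIFIER, NODE_NUMBER, NODE_OPERATOR, NODE_CALL };

struct node {
    enum node_kind kind;
    struct node **args;   /* arguments when kind == NODE_CALL, else NULL */
    size_t nargs;         /* number of entries in args */
};

enum target_check { TARGET_OK, TARGET_NOT_ASSIGNABLE, TARGET_BAD_PARAMETER };

/*
 * Decide whether `lhs` may appear to the left of '='.
 * A bare identifier, or a call whose arguments are all bare identifiers
 * (including an empty argument list), is accepted. For a call with a
 * non-identifier argument, *bad_index (if not NULL) receives the
 * zero-based position of the first such argument.
 */
enum target_check check_assignment_target(const struct node *lhs,
                                          size_t *bad_index)
{
    assert(lhs != NULL);
    if (lhs->kind == NODE_IDENTIFIER)
        return TARGET_OK;
    if (lhs->kind != NODE_CALL)
        return TARGET_NOT_ASSIGNABLE;
    for (size_t i = 0; i < lhs->nargs; i++) {
        if (lhs->args[i]->kind != NODE_IDENTIFIER) {
            if (bad_index != NULL)
                *bad_index = i;
            return TARGET_BAD_PARAMETER;
        }
    }
    return TARGET_OK;
}
```

The action for the assignment rule would call this on the left-hand node and report a syntax error for anything other than TARGET_OK, for instance "parameter 2 of the definition is not an identifier".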

Still, there is an LALR(1) grammar for your language (or, rather, for what I imagine your language to be). In order to produce it, we need to avoid forcing a reduction which would distinguish between an lvalue and an rvalue until we know which one it is.

So the issue will be that an IDENTIFIER could either be part of an expression_list or part of an identifier_list. And we don't know which one, even when we see the ). Consequently, we need to special case IDENTIFIER '(' identifier_list ')', to allow it to be part of both lvalue and rvalue. In other words, we need something like:

lvalue: IDENTIFIER | prototype
rvalue: expression_other_than_lvalue | lvalue

Which leaves the question of how we define expression_other_than_lvalue.

Much of the time, the solution is simple: constants, operator expressions, parenthesized expressions; none of these can be lvalues. A call with a parenthesized list which includes an expression_other_than_identifier is also an expression_other_than_lvalue. The only thing which won't count is precisely IDENTIFIER(IDENTIFIER,IDENTIFIER,...)

So let's rewrite the grammar as far as we can. (I've changed constant_expression to rvalue because it was shorter to type. And I've replaced many token names with the actual symbols, which I find more readable. But much of the following is the same as your original.)

value_not_identifier: HEX | BIN | OCT | SCI | FLOAT | INT

expr_not_lvalue:
    value_not_identifier
    | '(' rvalue ')'
    | rvalue OR rvalue
    | ...
    | IDENTIFIER '(' list_not_id_list ')'

lvalue:
    IDENTIFIER
    | IDENTIFIER '(' ')'
    | IDENTIFIER '(' identifier_list ')'

identifier_list:
    IDENTIFIER | identifier_list ',' IDENTIFIER

Now, aside from the detail that we haven't defined list_not_id_list, everything will fall into place. lvalue and expr_not_lvalue are disjoint, so we can finish up with:

rvalue:
    lvalue
    | expr_not_lvalue

assignment_expression:
    rvalue
    | lvalue '=' assignment_expression

And we only need to deal with expression lists which are not identifier lists. As noted above, that is something like:

expr_not_identifier:
    expr_not_lvalue
    | lvalue_not_identifier

list_not_id_list:
    expr_not_identifier
    | list_not_id_list ',' rvalue
    | identifier_list ',' expr_not_identifier

So while parsing a list, the first time we find something which is not an identifier, we remove the list from the identifier_list production. If we get through the whole list, then we might still find ourselves with an lvalue when an rvalue is desired, but that decision (finally) can be made when we see the = or statement terminator.
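
To make that bookkeeping concrete, here is a small C model of what the two list nonterminals track. It is only an illustration of the idea, not something bison generates, and the names (arg_kind, classify_list, decide_form) are made up for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* For list elements, only "bare identifier or not" matters. */
enum arg_kind { ARG_IDENTIFIER, ARG_OTHER };

/* Which nonterminal the list seen so far reduces to. */
enum list_kind { LIST_IDENTIFIERS, LIST_NOT_IDENTIFIERS };

/*
 * Mirror the productions:
 *   identifier_list:  IDENTIFIER | identifier_list ',' IDENTIFIER
 *   list_not_id_list: expr_not_identifier
 *                   | list_not_id_list ',' rvalue
 *                   | identifier_list ',' expr_not_identifier
 * The first non-identifier moves the list to list_not_id_list, and it
 * stays there whatever follows. (The empty list is a separate
 * production in the grammar; here it simply stays LIST_IDENTIFIERS.)
 */
enum list_kind classify_list(const enum arg_kind *args, size_t n)
{
    assert(args != NULL || n == 0);
    enum list_kind state = LIST_IDENTIFIERS;
    for (size_t i = 0; i < n; i++) {
        if (args[i] != ARG_IDENTIFIER)
            state = LIST_NOT_IDENTIFIERS;
    }
    return state;
}

/* What can be concluded once the token after ')' is known. */
enum form { FORM_DEFINITION, FORM_CALL, FORM_ERROR };

enum form decide_form(enum list_kind list, bool followed_by_equal)
{
    if (!followed_by_equal)
        return FORM_CALL;   /* used as an rvalue */
    return list == LIST_IDENTIFIERS ? FORM_DEFINITION : FORM_ERROR;
}
```

With this model, comb(n, r) followed by = comes out as a definition, comb(n, 4) is a call, and comb(n, 4) followed by = is rejected, which is the behaviour the grammar is meant to encode.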

So the correct (I hope) complete grammar is:

expression:
    assignment_expression
    | expression DECORATOR IDENTIFIER

assignment_expression:
    rvalue
    | lvalue '=' assignment_expression

value_not_identifier: HEX | BIN | OCT | SCI | FLOAT | INT

expr_not_lvalue:
    value_not_identifier
    | '(' rvalue ')'
    | rvalue OR rvalue
    | rvalue XOR rvalue
    | rvalue AND rvalue
    | rvalue LSHIFT rvalue
    | rvalue RSHIFT rvalue
    | rvalue '+' rvalue
    | rvalue '-' rvalue
    | rvalue '*' rvalue
    | rvalue '/' rvalue
    | rvalue '%' rvalue
    | rvalue POW rvalue
    | rvalue '!'
    | NOT rvalue
    | IDENTIFIER '(' list_not_id_list ')'

lvalue_not_identifier:
    IDENTIFIER '(' ')'
    | IDENTIFIER '(' identifier_list ')'

lvalue:
    lvalue_not_identifier
    | IDENTIFIER

rvalue:
    lvalue
    | expr_not_lvalue

identifier_list:
    IDENTIFIER | identifier_list ',' IDENTIFIER

list_not_id_list:
    expr_not_identifier
    | list_not_id_list ',' rvalue
    | identifier_list ',' expr_not_identifier

expr_not_identifier:
    expr_not_lvalue
    | lvalue_not_identifier

Given the availability of simple solutions, and the inelegance of the transformations required to implement the precise grammar, it is little wonder that you rarely see this sort of construction. However, you will find it used extensively in the ECMA-262 standard (which defines ECMAScript, aka JavaScript). The grammar formalism used in that standard includes a kind of macro feature which simplifies the above transformation, but it doesn't make the grammar any easier to read (imho), and I don't know of a parser generator which implements that feature.

Source: https://stackoverflow.com/questions/35248575/eliminating-grammar-ambiguity-when-a-rule-covers-a-subset-of-another

Tags

parsing

grammar

bison

ambiguous