Question
I am trying to build a small bison grammar, but am having an issue with part of the definition. Functions can be called with any expression legal on the right side (expression_list in the grammar) as arguments.
The issue arises because on the left side, functions can be defined by assigning to them (an identifier followed by a list of identifiers - assignment_expression and identifier_list in the grammar).
My question is how I can eliminate the ambiguity in my grammar, since the statements legal on the left side are a subset of those legal on the right.
The grammar is written in bison (v. 2.4.1).
The output from the command was:
2 shift/reduce, 2 reduce/reduce
warning: rule useless in parser due to conflicts: assignment_expression: IDENTIFIER LPAREN RPAREN
Here is the complete grammar:
expression:
assignment_expression
| expression DECORATOR IDENTIFIER
;
value:
IDENTIFIER
| HEX
| BIN
| OCT
| SCI
| FLOAT
| INT
;
constant_expression:
value
| LPAREN constant_expression RPAREN
| constant_expression OR constant_expression
| constant_expression XOR constant_expression
| constant_expression AND constant_expression
| constant_expression LSHIFT constant_expression
| constant_expression RSHIFT constant_expression
| constant_expression PLUS constant_expression
| constant_expression MINUS constant_expression
| constant_expression MUL constant_expression
| constant_expression DIV constant_expression
| constant_expression MOD constant_expression
| constant_expression POW constant_expression
| constant_expression FACTORIAL
| NOT constant_expression
| IDENTIFIER LPAREN RPAREN
| IDENTIFIER LPAREN constant_expression RPAREN
| IDENTIFIER LPAREN expression_list RPAREN
;
expression_list:
constant_expression COMMA constant_expression
| expression_list COMMA constant_expression
;
assignment_expression:
constant_expression
| IDENTIFIER EQUAL assignment_expression
| IDENTIFIER LPAREN RPAREN
| IDENTIFIER LPAREN IDENTIFIER RPAREN
| IDENTIFIER LPAREN identifier_list RPAREN
;
identifier_list:
IDENTIFIER COMMA IDENTIFIER
| identifier_list COMMA IDENTIFIER
;
Here are the relevant sections of bison's output from verbose mode (-v):
State 34 conflicts: 2 shift/reduce
State 35 conflicts: 2 reduce/reduce
state 34
3 value: IDENTIFIER .
25 constant_expression: IDENTIFIER . LPAREN RPAREN
26 | IDENTIFIER . LPAREN constant_expression RPAREN
27 | IDENTIFIER . LPAREN expression_list RPAREN
33 assignment_expression: IDENTIFIER LPAREN IDENTIFIER . RPAREN
35 identifier_list: IDENTIFIER . COMMA IDENTIFIER
COMMA shift, and go to state 53
LPAREN shift, and go to state 39
RPAREN shift, and go to state 54
COMMA [reduce using rule 3 (value)]
RPAREN [reduce using rule 3 (value)]
$default reduce using rule 3 (value)
state 35
25 constant_expression: IDENTIFIER LPAREN RPAREN .
32 assignment_expression: IDENTIFIER LPAREN RPAREN .
$end reduce using rule 25 (constant_expression)
$end [reduce using rule 32 (assignment_expression)]
DECORATOR reduce using rule 25 (constant_expression)
DECORATOR [reduce using rule 32 (assignment_expression)]
$default reduce using rule 25 (constant_expression)
As per request here is a minimal grammar with the issue:
assignment_expression:
constant_expression
| IDENTIFIER LPAREN identifier_list RPAREN
;
value:
IDENTIFIER
| INT
;
constant_expression:
value
| IDENTIFIER LPAREN expression_list RPAREN
;
expression_list:
constant_expression COMMA constant_expression
| expression_list COMMA constant_expression
;
identifier_list:
IDENTIFIER COMMA IDENTIFIER
| identifier_list COMMA IDENTIFIER
;
Answer 1:
Your text and your grammar don't quite line up. Or maybe I'm not understanding your text correctly. You say:
on the left side, functions can be defined by assigning to them (an identifier followed by a list of identifiers - assignment_expression and identifier_list in the grammar)
In my head, I imagine an example of that to be something like:
comb(n, r) = n! / (r! * (n-r)!)
But your grammar reads:
assignment_expression:
constant_expression
| IDENTIFIER EQUAL assignment_expression
| IDENTIFIER LPAREN RPAREN
| IDENTIFIER LPAREN IDENTIFIER RPAREN
| IDENTIFIER LPAREN identifier_list RPAREN
Which will not parse the above definition, because the only thing which can appear on the left-hand side of EQUAL is IDENTIFIER. The right-recursion allows any number of repetitions of IDENTIFIER = before an assignment_expression, but the last thing must be either a constant_expression or one of the three prototype productions. So this would be matched:
c = r = f(a,b)
But so would this:
c = r = f(2, 7)
I'd say that makes your grammar inherently ambiguous, but it is probably an error. What you probably meant was:
assignment_expression: rvalue
| lvalue '=' assignment_expression
rvalue: constant_expression
lvalue: IDENTIFIER
| IDENTIFIER '(' ')'
| IDENTIFIER '(' identifier_list ')'
I note in passing that your definition of identifier_list as requiring at least two identifiers is unnecessarily complicated, so I've assumed above that the actual definition of identifier_list is:
identifier_list: IDENTIFIER | identifier_list ',' IDENTIFIER
That's not sufficient to solve the problem, though. It still leaves the parser not knowing whether:
comb(n | lookahead ','
is the start of
comb(n, r) = ...
or just a function call
comb(n, 4)
So to fix that, we need to pull out some heavy artillery.
We can start with the simple solution. This grammar is not ambiguous, since an lvalue must be followed by =. When we finally reach the =, we can tell whether what we have so far is an rvalue or an lvalue, even if they happen to look identical (comb(n, r), for example). The only issue is that the = may be an unlimited distance from where we happen to be.
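To make the lookahead problem concrete, here is a minimal hand-written sketch in Python. It is not how bison works internally; the function name and the token lists are invented for illustration. It decides between "definition" and "expression" only by scanning to the matching ) and peeking at the next token, which is exactly the unbounded lookahead an LALR(1) parser does not have.

```python
def classify_statement(tokens):
    """Classify a token list of the form NAME '(' ... ')' [...].

    Hypothetical helper: returns "definition" if the token right after the
    parenthesis that closes the first '(' is '=', otherwise "expression".
    """
    depth = 0
    for i, tok in enumerate(tokens):
        if tok == "(":
            depth += 1
        elif tok == ")":
            depth -= 1
            if depth == 0:
                # Only here, arbitrarily far from the first argument,
                # do we learn which kind of statement this is.
                nxt = tokens[i + 1] if i + 1 < len(tokens) else None
                return "definition" if nxt == "=" else "expression"
    return "expression"


# Both inputs share the prefix: comb ( n ,
call = ["comb", "(", "n", ",", "4", ")"]
definition = ["comb", "(", "n", ",", "r", ")", "=", "n", "!"]
```

The two sample inputs agree on their first four tokens, so a parser limited to one token of lookahead cannot yet tell them apart at the point where it must decide what the n is.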
With bison, if we have an unambiguous grammar and we cannot be bothered to fix the lookahead problem, we can ask for a GLR parser. The GLR parser is slightly less efficient because it needs to maintain all possible parses in parallel, but it is still linear complexity for most unambiguous grammars. (GLR parsers can parse even ambiguous grammars in O(N^3) but the bison implementation doesn't tolerate ambiguity. It's designed to parse programming languages, after all.)
So to do that, you merely need to add
%glr-parser
and read the section of the bison manual about how semantic actions are affected. (Summary: they are stored up until the parse is disambiguated, so they may not happen as early during the parse as they would in an LALR(1) parser.)
A second simple solution, which is fairly common in practice, is to have the parser accept a superset of the desired language, and then add what is arguably a syntactic check in the semantic action. So you could just write the grammar to allow anything which looks like a call_expression to be on the left-hand side of an assignment, but when you actually build the AST node for the assignment/definition, verify that the argument list for the call is actually a simple list of identifiers.
Not only does that simplify your grammar without much implementation cost, it makes it possible to generate accurate error messages to describe the syntax error, something which is not easy with a standard LALR(1) parser.
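A minimal sketch of that check, written in Python rather than in a bison semantic action so that it is self-contained. The AST classes (Name, Num, Call) and the function check_assignment_target are assumptions made up for this example, not part of the grammar in the question.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Name:          # hypothetical AST node for an IDENTIFIER
    ident: str


@dataclass
class Num:           # hypothetical AST node for a numeric literal
    value: int


@dataclass
class Call:          # hypothetical AST node for NAME '(' args ')'
    func: str
    args: List["Node"]


Node = Union[Name, Num, Call]


def check_assignment_target(target):
    """Raise SyntaxError unless target is NAME or NAME(NAME, NAME, ...)."""
    if isinstance(target, Name):
        return
    if isinstance(target, Call):
        for pos, arg in enumerate(target.args, start=1):
            if not isinstance(arg, Name):
                raise SyntaxError(
                    f"parameter {pos} of '{target.func}' must be an identifier"
                )
        return
    raise SyntaxError("cannot assign to this expression")
```

Because the check runs on a fully built node, the error can name the offending parameter, which is the kind of message that is hard to get out of a plain LALR(1) syntax error.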
Still, there is an LALR(1) grammar for your language (or, rather, for what I imagine your language to be). In order to produce it, we need to avoid forcing a reduction which would distinguish between an lvalue and an rvalue until we know which one it is.
So the issue will be that an IDENTIFIER could either be part of an expression_list or part of an identifier_list. And we don't know which one, even when we see the ). Consequently, we need to special-case IDENTIFIER '(' identifier_list ')', to allow it to be part of both lvalue and rvalue. In other words, we need something like:
lvalue: IDENTIFIER | prototype
rvalue: expression_other_than_lvalue | lvalue
Which leaves the question of how we define expression_other_than_lvalue.
Much of the time, the solution is simple: constants, operator expressions, parenthesized expressions; none of these can be lvalues. A call with a parenthesized list which includes an expression_other_than_identifier is also an expression_other_than_lvalue. The only thing which won't count is precisely IDENTIFIER(IDENTIFIER,IDENTIFIER,...).
So let's rewrite the grammar as far as we can. (I've changed constant_expression to rvalue because it was shorter to type, and replaced many token names with the actual symbols, which I find more readable. But much of the following is the same as your original.)
value_not_identifier: HEX | BIN | OCT | SCI | FLOAT | INT
expr_not_lvalue:
value_not_identifier
| '(' rvalue ')'
| rvalue OR rvalue
| ...
| IDENTIFIER '(' list_not_id_list ')'
lvalue:
IDENTIFIER
| IDENTIFIER '(' ')'
| IDENTIFIER '(' identifier_list ')'
identifier_list:
IDENTIFIER | identifier_list ',' IDENTIFIER
Now, aside from the detail that we haven't defined list_not_id_list, everything will fall into place. lvalue and expr_not_lvalue are disjoint, so we can finish up with:
rvalue:
lvalue
| expr_not_lvalue
assignment_expression:
rvalue
| lvalue '=' assignment_expression
And we only need to deal with expression lists which are not identifier lists. As noted above, that is something like:
expr_not_identifier:
expr_not_lvalue
| lvalue_not_identifier
list_not_id_list:
expr_not_identifier
| list_not_id_list ',' rvalue
| identifier_list ',' expr_not_identifier
So while parsing a list, the first time we find something which is not an identifier, the list stops being an identifier_list and becomes a list_not_id_list. If we get through the whole list without that happening, then we might still find ourselves with an lvalue when an rvalue is desired, but that decision (finally) can be made when we see the = or statement terminator.
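The left-to-right behaviour of the identifier_list and list_not_id_list productions can be mimicked with a small fold. This is only an illustrative Python sketch: the function name and the two string tags are assumptions, and a real parser would of course work on tokens and reductions rather than on pre-classified items.

```python
def classify_argument_list(kinds):
    """Return the nonterminal that would derive a non-empty argument list.

    Each entry of `kinds` is one of two hypothetical tags:
    "IDENTIFIER" or "expr_not_identifier".
    """
    if not kinds:
        raise ValueError("the empty list is covered by IDENTIFIER '(' ')'")
    state = None
    for kind in kinds:
        if kind not in ("IDENTIFIER", "expr_not_identifier"):
            raise ValueError(f"unknown item kind: {kind!r}")
        if state in (None, "identifier_list") and kind == "IDENTIFIER":
            # identifier_list: IDENTIFIER | identifier_list ',' IDENTIFIER
            state = "identifier_list"
        else:
            # list_not_id_list: expr_not_identifier
            #                 | identifier_list ',' expr_not_identifier
            #                 | list_not_id_list ',' rvalue
            state = "list_not_id_list"
    return state
```

For example, a list of two identifiers stays an identifier_list, while an identifier followed by a non-identifier expression ends up as a list_not_id_list, and once that switch has happened it never switches back.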
So the correct (I hope) complete grammar is:
expression:
assignment_expression
| expression DECORATOR IDENTIFIER
assignment_expression:
rvalue
| lvalue '=' assignment_expression
value_not_identifier: HEX | BIN | OCT | SCI | FLOAT | INT
expr_not_lvalue:
value_not_identifier
| '(' rvalue ')'
| rvalue OR rvalue
| rvalue XOR rvalue
| rvalue AND rvalue
| rvalue LSHIFT rvalue
| rvalue RSHIFT rvalue
| rvalue '+' rvalue
| rvalue '-' rvalue
| rvalue '*' rvalue
| rvalue '/' rvalue
| rvalue '%' rvalue
| rvalue POW rvalue
| rvalue '!'
| NOT rvalue
| IDENTIFIER '(' list_not_id_list ')'
lvalue_not_identifier:
IDENTIFIER '(' ')'
| IDENTIFIER '(' identifier_list ')'
lvalue:
lvalue_not_identifier
| IDENTIFIER
rvalue:
lvalue
| expr_not_lvalue
identifier_list:
IDENTIFIER | identifier_list ',' IDENTIFIER
list_not_id_list:
expr_not_identifier
| list_not_id_list ',' rvalue
| identifier_list ',' expr_not_identifier
expr_not_identifier:
expr_not_lvalue
| lvalue_not_identifier
Given the availability of simple solutions, and the inelegance of the transformations required to implement the precise grammar, it is little wonder that you rarely see this sort of construction. However, you will find it used extensively in the ECMA-262 standard (which defines ECMAScript, aka JavaScript). The grammar formalism used in that standard includes a kind of macro feature which simplifies the above transformation, but it doesn't make the grammar any easier to read (imho), and I don't know of a parser generator which implements that feature.
Source: https://stackoverflow.com/questions/35248575/eliminating-grammar-ambiguity-when-a-rule-covers-a-subset-of-another