How to understand and fix conflicts in PLY

最后都变了 - posted on 2019-12-05 20:12:48

In LR parsing, we often talk about "items": an item is a production with a progress marker, usually written with a • but sometimes with a plain period (.), as in the items quoted below. A state is just a collection of items; in effect, the state tells you the set of productions the parse might be inside.

There is one particularly special type of item: the item with a dot at the end:

(134) attribute_instance_optional_list -> attribute_instance_list .

This represents a production which could be finished, since the progress marker is at the end. If that is the correct production, the parser must then substitute the right-hand side for the left-hand side: this is the action referred to as "reducing" (since it is the opposite of "producing", which is what a "production" does).

However, the mere fact that you are in a state containing a completed item does not mean that the reduction can actually be performed. It is also necessary that the next token be consistent with the result of the reduction. If the next token could not follow the reduced non-terminal (in the context of the parser's state), then the reduction cannot be performed, so the parser will attempt a shift if one is possible.

Shifts are really simple. A shift is possible if one or more items in the state have the dot before the current lookahead symbol. Here, there is no question about additional lookahead, because PLY (like many LALR parser generators) only creates LALR(1) parsers, which look at a single lookahead token in any state. So the only thing we have to go on is the symbol we are currently looking at, and it is reasonably obvious that we can only process it if some available item has that symbol in the next position.

If a given state with a given lookahead symbol can both shift and reduce, then you have a shift-reduce conflict; the parser doesn't know what to do. (If it has neither a shift nor a reduce available, that indicates that the input has a syntax error. That's how LR parsers identify syntax errors.)
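The decision just described can be sketched in a few lines of Python. This is only an illustration, not how PLY is implemented: the `Item` tuple and the `actions` function are names invented for this example, the state is the three-item state used throughout this answer, and the lookahead set attached to item 134 (including the `MODULE` token name) is an assumption, since PLY does not print lookahead sets.

```python
from collections import namedtuple

# Hypothetical item representation: production, dot position, lookahead set.
Item = namedtuple("Item", "lhs rhs dot lookahead")

def actions(state, token):
    """Return the list of actions available in `state` for lookahead `token`."""
    found = []
    for item in state:
        if item.dot == len(item.rhs):
            # Dot at the end: reduce only if the token is in the lookahead set.
            if token in item.lookahead:
                found.append(("reduce", item.lhs))
        elif item.rhs[item.dot] == token:
            # Dot just before the lookahead token: a shift is possible.
            if ("shift", token) not in found:
                found.append(("shift", token))
    return found

state = [
    Item("attribute_instance_optional_list", ("attribute_instance_list",), 1,
         frozenset({"LPAREN", "MODULE"})),  # assumed lookahead set
    Item("attribute_instance_list",
         ("attribute_instance_list", "attribute_instance"), 1, frozenset()),
    Item("attribute_instance",
         ("LPAREN", "ASTERISK", "attr_spec_list", "ASTERISK", "RPAREN"), 0,
         frozenset()),
]

for token in ("LPAREN", "MODULE", "COMMA"):
    found = actions(state, token)
    kind = "conflict" if len(found) > 1 else "error" if not found else "ok"
    print(token, kind, found)
```

With these assumed lookaheads, `LPAREN` yields both a reduce and a shift (the conflict), `MODULE` yields only the reduce, and `COMMA` yields nothing, which is how a syntax error shows up.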

One important aspect of LR parsing is that a reduction must be performed immediately if it is going to be performed at all. That is, if we are in a state with a possible reduction, and the item's lookahead set indicates that the lookahead token is feasible, we must perform the reduction. We can't wait and see if it would be possible later, because there is no later for a reduction. In other words, anything to the left of the • in an item has already been reduced as much as it could be. (This is the R in LR parsing: the parser builds a rightmost derivation in reverse. If the use of "rightmost" doesn't make sense, don't worry about it; I only mention this fact in case you were wondering.)

Another thing which I might as well mention is that in LALR parsing ("Lookahead LR parsing"), a state is precisely defined by its set of items. Each item has an applicable lookahead set, but the lookahead sets don't form part of the state's identity. If the parser generator ends up producing two states with the same items but different lookahead sets, it must merge them into a single state, taking the union of the lookahead sets for each item. For full LR parsing, this limitation doesn't exist; you can (and do) have more than one state for a given set of items, and the result is that the parsing table is much larger and slightly more powerful.
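To make the merging concrete, here is a minimal sketch. It assumes that states are represented as dictionaries from items to lookahead sets, and `merge_by_core` is a name made up for this example. Real LALR generators usually compute the merged lookaheads more directly instead of building every LR(1) state first, so treat this as a picture of the result rather than of the algorithm.

```python
from collections import defaultdict

def merge_by_core(lr1_states):
    """Merge LR(1) states that have the same items, ignoring lookaheads.

    Each state is a dict mapping an item (lhs, rhs, dot) to its lookahead set.
    """
    merged = defaultdict(lambda: defaultdict(set))
    for state in lr1_states:
        core = frozenset(state)          # the items alone identify the state
        for item, lookahead in state.items():
            merged[core][item] |= lookahead
    return [dict(items) for items in merged.values()]

# Two hypothetical LR(1) states: same single item, different lookaheads.
item = ("simple_identifier", ("ID",), 1)
s1 = {item: {"COMMA"}}
s2 = {item: {"ASTERISK"}}

result = merge_by_core([s1, s2])
print(len(result), sorted(result[0][item]))
```

The two input states collapse into one whose lookahead set for the item is the union of both. That union is exactly where LALR can introduce a conflict that full LR would not have.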

Now, if a shift action is possible, you can mechanically figure out which state will be active after the shift. For example, from

(134) attribute_instance_optional_list -> attribute_instance_list .
(136) attribute_instance_list -> attribute_instance_list . attribute_instance
(138) attribute_instance -> . LPAREN ASTERISK attr_spec_list ASTERISK RPAREN

after shifting an LPAREN, the next state will have just one item:

(138) attribute_instance -> LPAREN . ASTERISK attr_spec_list ASTERISK RPAREN

(Note how the dot has moved.)
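Moving the dot is purely mechanical, as this small sketch suggests. Items are represented here as `(lhs, rhs, dot)` tuples and the helper names `shift_kernel` and `show` are invented for the example; this is not PLY's internal representation.

```python
def shift_kernel(state, symbol):
    """Items of `state` with the dot moved over `symbol` (before adding closure)."""
    return [(lhs, rhs, dot + 1)
            for lhs, rhs, dot in state
            if dot < len(rhs) and rhs[dot] == symbol]

def show(item):
    """Format an item the way it appears in this answer."""
    lhs, rhs, dot = item
    return lhs + " -> " + " ".join(rhs[:dot] + (".",) + rhs[dot:])

state = [
    ("attribute_instance_optional_list", ("attribute_instance_list",), 1),
    ("attribute_instance_list",
     ("attribute_instance_list", "attribute_instance"), 1),
    ("attribute_instance",
     ("LPAREN", "ASTERISK", "attr_spec_list", "ASTERISK", "RPAREN"), 0),
]

for item in shift_kernel(state, "LPAREN"):
    print(show(item))
```

Only item 138 has the dot before LPAREN, so it is the only item that survives into the next state.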

That was a simple case, since the next symbol is a terminal, ASTERISK. Most of the time, the next symbol after a shift will be a non-terminal, and in that case we need to add all of the productions for that non-terminal, with the dot at the beginning. (That's how states end up with more than one item.) So, for example, given the new state with one item and an input of ASTERISK (anything else will be an error, since this state has no reduction possibilities), then we will shift into a state which has the shifted item:

(138) attribute_instance -> LPAREN ASTERISK . attr_spec_list ASTERISK RPAREN

plus all the productions for attr_spec_list:

(139)   attr_spec_list -> . attr_spec_list COMMA attr_spec
(140)   attr_spec_list -> . attr_spec

plus all the productions for attr_spec (since we just added an item with the dot before attr_spec):

(141)   attr_spec -> . attr_name
(142)   attr_spec -> . attr_name EQUALS constant_expression

plus the production for attr_name:

(143)   attr_name -> . identifier

and so on until we stop seeing new non-terminals:

(297)   identifier -> . simple_identifier
(298)   identifier -> . escaped_identifier
(350)   simple_identifier -> . ID
(279)   escaped_identifier -> . ESCAPED_ID
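The expansion just walked through is usually called the closure of the state. Here is a rough sketch of it, using only the productions quoted above; the `GRAMMAR` dictionary layout and the `closure` function are assumptions of this example rather than PLY's code, and lookahead sets are left out entirely.

```python
# Grammar fragment copied from the productions quoted above.
GRAMMAR = {
    "attr_spec_list": [("attr_spec_list", "COMMA", "attr_spec"), ("attr_spec",)],
    "attr_spec": [("attr_name",), ("attr_name", "EQUALS", "constant_expression")],
    "attr_name": [("identifier",)],
    "identifier": [("simple_identifier",), ("escaped_identifier",)],
    "simple_identifier": [("ID",)],
    "escaped_identifier": [("ESCAPED_ID",)],
}

def closure(kernel):
    """Add, for every non-terminal right after a dot, all of its productions."""
    items = list(kernel)
    for lhs, rhs, dot in items:          # the list grows while we iterate
        if dot < len(rhs) and rhs[dot] in GRAMMAR:
            for production in GRAMMAR[rhs[dot]]:
                new_item = (rhs[dot], production, 0)
                if new_item not in items:
                    items.append(new_item)
    return items

kernel = [("attribute_instance",
           ("LPAREN", "ASTERISK", "attr_spec_list", "ASTERISK", "RPAREN"), 2)]
state = closure(kernel)
for lhs, rhs, dot in state:
    print(lhs, "->", " ".join(rhs[:dot] + (".",) + rhs[dot:]))
```

Starting from the single shifted item, this produces the same ten items as the walk-through: the kernel item plus nine items with the dot at the beginning.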

OK, now the next token will have to be ID or ESCAPED_ID. Suppose it is ID. Now what? Well, we will shift into a state

(350)   simple_identifier -> ID .

with a possible reduction; assuming the lookahead symbol matches the lookahead set (I haven't explained how lookahead sets are computed for each state and don't intend to; there's an algorithm but its details aren't relevant here), the ID will be reduced to simple_identifier. Then where does the parser go? Logically, it goes back to the state which generated the simple_identifier production, and shifts the simple_identifier. As it happens, that state is the one we just created:

(138)   attribute_instance -> LPAREN ASTERISK . attr_spec_list ASTERISK RPAREN
(139)   attr_spec_list -> . attr_spec_list COMMA attr_spec
(140)   attr_spec_list -> . attr_spec
(141)   attr_spec -> . attr_name
(142)   attr_spec -> . attr_name EQUALS constant_expression
(143)   attr_name -> . identifier
(297)   identifier -> . simple_identifier
(298)   identifier -> . escaped_identifier
(350)   simple_identifier -> . ID
(279)   escaped_identifier -> . ESCAPED_ID

and after we shift the simple_identifier, we end up with

(297)   identifier -> simple_identifier .

which is a state that calls for a reduction to identifier. So once again we go back to the same state and shift the identifier, after which we find ourselves in

(143)   attr_name -> identifier . 

and then

(141)   attr_spec -> attr_name .
(142)   attr_spec -> attr_name . EQUALS constant_expression

But how did the parser know which state to go back to on each of those reductions? The answer is that the parser pushes the current state onto the parsing stack with every symbol. When it does a reduction, it pops the symbols from the right-hand side, discarding each associated state number, until it gets to the beginning of the right-hand-side, at which point the stack indicates which state that right-hand side came from. It then takes a look at that state, shifts the reduced non-terminal, and pushes the new shifted state onto the parse stack.
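As a rough sketch of that stack discipline (the state numbers, the `GOTO` table and the `apply_reduction` helper are all hypothetical; PLY's real tables and numbering will differ):

```python
# Hypothetical state numbers; PLY's real numbering will differ.
# GOTO[state][non_terminal] is the state entered after shifting that non-terminal.
GOTO = {
    10: {"simple_identifier": 12, "identifier": 13, "attr_name": 14},
}

def apply_reduction(stack, lhs, rhs_length):
    """Pop one entry per right-hand-side symbol, then shift `lhs` from the
    state that is uncovered at the top of the stack."""
    del stack[len(stack) - rhs_length:]
    origin = stack[-1][1]
    stack.append((lhs, GOTO[origin][lhs]))

# Each stack entry is (symbol, state reached after shifting that symbol).
stack = [("ASTERISK", 10), ("ID", 11)]

apply_reduction(stack, "simple_identifier", 1)   # simple_identifier -> ID
apply_reduction(stack, "identifier", 1)          # identifier -> simple_identifier
apply_reduction(stack, "attr_name", 1)           # attr_name -> identifier
print(stack)
```

Each reduction uncovers state 10 (standing in for the ten-item state above), which is why the parser keeps "going back to the same state" in the walk-through.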

So I think that answers the questions "What do the lines at the beginning of the state description mean?" and "What state does the parser go to after a reduction?" The other two questions are easy to answer: "No, it doesn't compute all the possible predecessor states", and "Yes, it could (although it might end up adding predecessors which are actually not possible with any input) but it isn't useful for the parse." Since they're not particularly relevant to solving the shift-reduce conflict, I won't explain those answers further.

Going back to the actual shift-reduce conflict, the situation is that we are in the state

(134) attribute_instance_optional_list -> attribute_instance_list .
(136) attribute_instance_list -> attribute_instance_list . attribute_instance
(138) attribute_instance -> . LPAREN ASTERISK attr_spec_list ASTERISK RPAREN

which has a possible reduction, and we are considering the case where we see an LPAREN, for which there is a possible shift. It turns out that the lookahead set for the first item also includes LPAREN. Although the lookahead set is not shown in the PLY output, we can dig around in the grammar to see where it might have come from. The immediate source is attribute_instance_optional_list, of course, and we can find that in the grammar, although there are quite a few possibilities:

(27)    module_nonansi_header -> attribute_instance_optional_list module_keyword lifetime module_identifier package_import_declaration_list parameter_port_list list_of_ports SEMICOLON
(28)    module_ansi_header -> attribute_instance_optional_list module_keyword lifetime module_identifier package_import_declaration_list parameter_port_list list_of_port_declarations_list SEMICOLON
(29)    module_implicit_header -> attribute_instance_optional_list module_keyword lifetime module_identifier LPAREN DOT ASTERISK RPAREN SEMICOLON
(36)    port_declaration -> attribute_instance_optional_list inout_declaration
(37)    port_declaration -> attribute_instance_optional_list input_declaration
(38)    port_declaration -> attribute_instance_optional_list output_declaration
(39)    port_declaration -> attribute_instance_optional_list ref_declaration
(40)    port_declaration -> attribute_instance_optional_list interface_port_declaration
(125)   struct_union_member -> attribute_instance_optional_list data_type_or_void list_of_variable_decl_assignments
(126)   struct_union_member -> attribute_instance_optional_list random_qualifier data_type_or_void list_of_variable_decl_assignments
(144)   inc_or_dec_expression -> inc_or_dec_operator attribute_instance_optional_list variable_lvalue
(145)   inc_or_dec_expression -> variable_lvalue attribute_instance_optional_list inc_or_dec_operator
(146)   conditional_expression -> cond_predicate INTERROGATION attribute_instance_optional_list expression COLON expression
(148)   constant_expression -> unary_operator attribute_instance_optional_list constant_primary
(149)   constant_expression -> constant_expression binary_operator attribute_instance_optional_list constant_expression
(150)   constant_expression -> constant_expression INTERROGATION attribute_instance_optional_list constant_expression COLON constant_expression
(167)   expression -> unary_operator attribute_instance_optional_list primary
(170)   expression -> expression binary_operator attribute_instance_optional_list expression
(181)   module_path_conditional_expression -> module_path_expression INTERROGATION attribute_instance_optional_list module_path_expression COLON module_path_expression
(183)   module_path_expression -> unary_module_path_operator attribute_instance_optional_list module_path_primary
(184)   module_path_expression -> module_path_expression binary_module_path_operator attribute_instance_optional_list module_path_expression

As far as I can see, attribute_instance_optional_list does not appear at the end of any of those productions, which simplifies working out where the LPAREN conflict comes from. In all those cases, it is followed by a non-terminal, the possibilities being:

module_keyword
inout_declaration
input_declaration
output_declaration
ref_declaration
interface_port_declaration
data_type_or_void
random_qualifier
variable_lvalue
inc_or_dec_operator
constant_primary
constant_expression
primary
expression
module_path_primary
module_path_expression  

Now, if any of those non-terminals could start with an LPAREN, we have a possible shift-reduce conflict. And a couple of culprits spring out of the list: expression and similar.
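That check can be automated. The sketch below uses a tiny, made-up set of productions, because the real right-hand sides for expression, primary and the others are not shown in this answer; the production bodies, the `MODULE`, `PLUS` and `ID` token names, and the `can_start_with` helper are all assumptions. It also looks only at the first symbol of each production, so it ignores non-terminals that can be empty.

```python
# Hypothetical, heavily simplified productions: the real grammar is not shown
# in this answer, so these right-hand sides are assumptions.
GRAMMAR = {
    "expression": [("primary",),
                   ("unary_operator", "attribute_instance_optional_list", "primary")],
    "primary": [("LPAREN", "expression", "RPAREN"), ("ID",)],
    "unary_operator": [("PLUS",)],
    "module_keyword": [("MODULE",)],
}

def can_start_with(symbol, token, seen=None):
    """True if `symbol` can derive a string beginning with `token`.

    Simplification: only the first symbol of each production is examined,
    so empty (nullable) non-terminals are not handled.
    """
    seen = set() if seen is None else seen
    if symbol not in GRAMMAR:            # a terminal
        return symbol == token
    if symbol in seen:                   # already being explored
        return False
    seen.add(symbol)
    return any(can_start_with(rhs[0], token, seen) for rhs in GRAMMAR[symbol])

for follower in ("module_keyword", "primary", "expression"):
    print(follower, can_start_with(follower, "LPAREN"))
```

Running the same kind of check over the real grammar for each of the followers listed above would show which ones put LPAREN into the lookahead set.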

So, there is the problem, in summary: an attribute_instance can start with a parenthesis, but an attribute_instance_list can also be followed by a parenthesis. So when you're in the middle of an attribute_instance_list and you see a (, you have no way of knowing whether to shift or reduce.
