Bison/Flex, reduce/reduce, identifier in different production

Submitted by 隐身守侯 on 2019-12-13 04:43:28

Question


I am doing a parser in bison/flex.

This is part of my code:

I want to implement the assignment production, so an identifier can be either a boolean_expr or an expr; its type will be checked against a symbol table. That would allow something like:

int a = 1;
boolean b = true;
if(b) ...

However, I get a reduce/reduce conflict if I include identifier in both term and boolean_expr. Is there any way to solve this problem?


Answer 1:


Essentially, what you are trying to do is to inject semantic rules (type information) into your syntax. That's possible, but it is not easy. More importantly, it's rarely a good idea. It's almost always best if syntax and semantics are well delineated.

All the same, as presented your grammar is unambiguous and LALR(1). However, the latter feature is fragile, and you will have difficulty maintaining it as you complete the grammar.

For example, you don't include your assignment syntax in your question, but it would presumably look something like this:

assignment: identifier '=' expr
          | identifier '=' boolean_expr
          ;

Unlike the rest of the grammar shown, that production is ambiguous. Consider:

x = y

Without knowing anything about y, the parser could reduce y to either term or boolean_expr.

A possibly more interesting example is the addition of parentheses to the grammar. The obvious way of doing that would be to add two productions:

term: '(' expr ')'

boolean_expr: '(' boolean_expr ')'

The resulting grammar is not ambiguous, but it is no longer LALR(1). Consider the two following declarations:

boolean x = (y) < 7
boolean x = (y)

In the first one, y must be an int so that (y) can be reduced to a term; in the second one y must be boolean so that (y) can be reduced to a boolean_expr. There is no ambiguity; once the < is seen (or not), it is entirely clear which reduction to choose. But < is not the lookahead token, and in fact it could be arbitrarily distant from y:

boolean x = ((((((((((((((((((((((y...

So the resulting unambiguous grammar is not LALR(k) for any k.


One way you could solve the problem would be to inject the type information at the lexical level, by giving the scanner access to the symbol table. Then the scanner could look up each scanned identifier token in the symbol table and use the information it finds there to decide between one of three token types (or more, if you have more datatypes): undefined_variable, integer_variable, and boolean_variable. Then you would have, for example:

declaration: "int" undefined_variable '=' expr
           | "boolean" undefined_variable '=' boolean_expr
           ;

term: integer_variable
    | ...
    ;

boolean_expr: boolean_variable
            | ...
            ; 

That will work, but it should be obvious that this is not scalable: every time you add a type, you'll have to extend both the grammar and the lexical description, because now the semantics is not only mixed up with the syntax, it has even gotten intermingled with the lexical analysis. Once you let semantics out of its box, it tends to contaminate everything.
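The scanner-side lookup described above could be sketched roughly as follows. This is only an illustration: TokenKind, DeclaredType, SymbolTable and ClassifyIdentifier are hypothetical names, and in a real Bison/Flex project the token kinds would come from the %token declarations rather than from a hand-written enum.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical token kinds; a real project would use the codes that Bison
// generates from its %token declarations.
enum class TokenKind { UndefinedVariable, IntegerVariable, BooleanVariable };

// Hypothetical declared types recorded in the symbol table.
enum class DeclaredType { Int, Boolean };

using SymbolTable = std::unordered_map<std::string, DeclaredType>;

// Would be called from the scanner's identifier rule: the token kind
// depends on what the symbol table currently knows about the name.
TokenKind ClassifyIdentifier(const SymbolTable& symbols,
                             const std::string& name) {
  auto it = symbols.find(name);
  if (it == symbols.end()) return TokenKind::UndefinedVariable;
  return it->second == DeclaredType::Int ? TokenKind::IntegerVariable
                                         : TokenKind::BooleanVariable;
}
```

Note that even this small sketch needs a new enumerator and a new branch for every additional datatype, which is the scalability problem in miniature.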

There are languages for which this really is the most convenient solution: C parsing, for example, is much easier if typedef names and identifier names are distinguished so that you can tell whether (t)*x is a cast or a multiplication. (But it doesn't work so easily for C++, which has much more complicated name lookup rules, and also much more need for semantic analysis in order to find the correct parse.)

But, honestly, I'd suggest that you do not use C -- and much less C++ -- as a model of how to design a language. Languages which are hard for compilers to parse are also hard for human beings to parse. The "most vexing parse" continues to be a regular source of pain for C++ newcomers, and even sometimes trips up relatively experienced programmers:

class X {
  public:
    X(int n = 0) : data_is_available_(n) {}
    operator bool() const { return data_is_available_; }
    // ...
  private:
    bool data_is_available_;
    // ...
};

X my_x_object();
// ...
if (!my_x_object) {
  // This code is unreachable. Can you see why?
}
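For readers who want to check their answer: the declaration X my_x_object(); is parsed as a function declaration, so a test on that name examines a function's address rather than an X. The following sketch makes the compiler confirm this. The name my_real_object and the static_assert checks are additions for illustration only.

```cpp
#include <type_traits>

class X {
 public:
  X(int n = 0) : data_is_available_(n) {}
  operator bool() const { return data_is_available_; }
 private:
  bool data_is_available_;
};

// Parsed as the declaration of a function that takes no arguments and
// returns an X, not as a default-constructed object.
X my_x_object();

// Brace initialization cannot be read as a function declaration, so this
// really is an object of type X.
X my_real_object{};

static_assert(std::is_function<decltype(my_x_object)>::value,
              "my_x_object names a function");
static_assert(std::is_same<decltype(my_real_object), X>::value,
              "my_real_object names an X");
```

The function is never called, so the missing definition causes no link error; the pitfall is purely in how the declaration is parsed.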

In short, you're best off with a language which can be parsed into an AST without any semantic information at all. Once the parser has produced the AST, you can do semantic analyses in separate passes, one of which will check type constraints. That's far and away the cleanest solution. Without explicit typing, the grammar is slightly simplified, because an expr now can be any expr:

 expr:        conjunction | expr "or" conjunction ;
 conjunction: comparison  | conjunction "and" comparison ;
 comparison:  sum         | sum '<' sum ;
 sum:         product     | sum '+' product ;
 product:     term        | product '*' term ;
 term:        identifier
     |        constant
     |        '(' expr ')'
     ;

Each action in the above would simply create a new AST node and set $$ to the new node. At the end of the parse, the AST is walked to verify that all exprs have the correct type.
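A separate type-checking pass over such an AST might look roughly like the sketch below. Everything in it is an assumption made for illustration: the Node layout, the Op and Type enumerations, the helper constructors Leaf and Binary, and the typing rules (arithmetic and '<' require int operands, "and" and "or" require boolean operands).

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

enum class Type { Int, Boolean, Error };
enum class Op { Identifier, IntConstant, BoolConstant, Add, Mul, Less, And, Or };

// Hypothetical AST node: leaves use `name`, binary operators use left/right.
struct Node {
  Op op;
  std::string name;
  std::unique_ptr<Node> left, right;
};
using NodePtr = std::unique_ptr<Node>;
using SymbolTable = std::unordered_map<std::string, Type>;

NodePtr Leaf(Op op, std::string name = "") {
  return NodePtr(new Node{op, std::move(name), nullptr, nullptr});
}

NodePtr Binary(Op op, NodePtr left, NodePtr right) {
  assert(left && right);  // binary nodes always have both operands
  return NodePtr(new Node{op, "", std::move(left), std::move(right)});
}

// Walks the tree bottom-up and returns the type of the expression, or
// Type::Error if any operand has the wrong type or a name is undeclared.
Type TypeOf(const Node& node, const SymbolTable& symbols) {
  switch (node.op) {
    case Op::IntConstant:  return Type::Int;
    case Op::BoolConstant: return Type::Boolean;
    case Op::Identifier: {
      auto it = symbols.find(node.name);
      return it == symbols.end() ? Type::Error : it->second;
    }
    default: break;  // binary operators are handled below
  }
  const Type lhs = TypeOf(*node.left, symbols);
  const Type rhs = TypeOf(*node.right, symbols);
  const bool both_int = lhs == Type::Int && rhs == Type::Int;
  const bool both_bool = lhs == Type::Boolean && rhs == Type::Boolean;
  switch (node.op) {
    case Op::Add:
    case Op::Mul:  return both_int ? Type::Int : Type::Error;
    case Op::Less: return both_int ? Type::Boolean : Type::Error;
    default:       return both_bool ? Type::Boolean : Type::Error;  // And, Or
  }
}
```

With this arrangement the grammar never needs to know which identifiers are boolean: the parser builds the same kind of node for every identifier, and only TypeOf consults the symbol table.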

If that seems like overkill for your project, you can do the semantic checks in the reduction actions, effectively intermingling the AST walk with the parse. That might seem convenient for immediate evaluation, but it also requires including explicit type information in the parser's semantic type, which adds unnecessary overhead (and, as mentioned, the inelegance of letting semantics interfere with the parser). In that case, every action would look something like this:

expr : expr '+' expr { CheckArithmeticCompatibility($1, $3);
                       $$ = NewArithmeticNode('+', $1, $3);
                     }
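The two helpers called in that action are not defined in the answer. One possible shape for them is sketched below, assuming that every node records its type when it is built. The Node layout, the NewLeaf helper, the g_error_count counter and both signatures are assumptions; how such nodes would be stored in Bison's semantic value is left aside here.

```cpp
#include <cstdio>
#include <memory>
#include <utility>

enum class Type { Int, Boolean, Error };

// Hypothetical semantic value: each node carries the type computed when
// the node was created.
struct Node {
  char op;  // '+', '*', ... or '\0' for a leaf
  Type type;
  std::shared_ptr<Node> left, right;
};
using NodePtr = std::shared_ptr<Node>;

int g_error_count = 0;  // number of type errors reported so far

NodePtr NewLeaf(Type type) {
  return std::make_shared<Node>(Node{'\0', type, nullptr, nullptr});
}

// Reports a type error unless both operands are int. Operands that are
// already marked Type::Error are not reported a second time.
bool CheckArithmeticCompatibility(const NodePtr& lhs, const NodePtr& rhs) {
  if (lhs->type == Type::Int && rhs->type == Type::Int) return true;
  if (lhs->type != Type::Error && rhs->type != Type::Error) {
    std::fprintf(stderr, "type error: arithmetic operands must be int\n");
    ++g_error_count;
  }
  return false;
}

// Builds the node and records its type, so that enclosing actions can run
// their own checks without walking the subtree again.
NodePtr NewArithmeticNode(char op, NodePtr lhs, NodePtr rhs) {
  const bool ok = lhs->type == Type::Int && rhs->type == Type::Int;
  return std::make_shared<Node>(
      Node{op, ok ? Type::Int : Type::Error, std::move(lhs), std::move(rhs)});
}
```

Marking a failed node as Type::Error lets the parse continue after the first mistake without producing a cascade of follow-on messages.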


Source: https://stackoverflow.com/questions/25818877/bison-flex-reduce-reduce-identifier-in-different-production
