How to modify parsing grammar to allow assignment and non-assignment statements?

丶灬走出姿态 posted on 2020-01-24 15:46:11

Question


So the question is about the grammar below. I'm working on a mini-interpreted language for fun (we learned about some compiler design in class, so I want to take it to the next level and try something on my own). I'm stuck trying to make the non-terminal symbol Expr.

Statement ::= Expr SC
Expr ::=           /* I need help here */
Assign ::= Name EQUAL Expr
AddSub ::= MulDiv {(+|-) AddSub}
MulDiv ::= Primary {(*|/) MulDiv}
Primary ::= INT | FLOAT | STR | LP Expr RP | Name
Name ::= ID {. Name}

Expr has to be made such that Statement must allow for the two cases:

  1. x = 789; (regular assignment, followed by semicolon)
  2. x+2; (no assignment, just calculation, discarded; followed by a semicolon)

The purpose of the second case is to set up the foundation for more changes in the future. I was thinking about unary increment and decrement operators, and also function calls, neither of which requires assignment to be meaningful.

I've looked at other grammars (C# namely), but it was too complicated and lengthy to understand. Naturally I'm not looking for solutions, but only for guidance on how I could modify my grammar.

All help is appreciated.

EDIT: I should say that my initial thought was Expr ::= Assign | AddSub, but that wouldn't work because it would create ambiguity: both alternatives can start with the non-terminal symbol Name. I have made my tokenizer such that it allows one token of lookahead (peek), but I have not made such a thing for the non-terminals, since it would be trying to fix a problem that could be avoided (ambiguity). In the grammar, the terminals are the ones that are all-caps.


Answer 1:


The simplest solution is the one actually taken by the designers of C, and thus by the various C derivatives: treat assignment simply as yet another operator, without restricting it to being at the top-level of a statement. Hence, in C, the following is unproblematic:

while ((ch = getchar()) != EOF) { ... }

Not everyone will consider that good style, but it is certainly common (particularly in the clauses of the for statement, whose syntax more or less requires that assignment be an expression).

There are two small complications, both of which are relatively easy to handle:

  1. Logically, and unlike most operators, assignment associates to the right so that a = b = 0 is parsed as a = (b = 0) and not (a = b) = 0 (which would be highly unexpected). It also binds very weakly, at least to the right.

    Opinions vary as to how tightly it should bind to the left. In C, for the most part a strict precedence model is followed so that a = 2 + b = 3 is rejected since it is parsed as a = ((2 + b) = 3). a = 2 + b = 3 might seem like terrible style, but consider also a < b ? (x = a) : (y = a). In C++, where the result of the ternary operator can be a reference, you could write that as (a < b ? x : y) = a in which the parentheses are required even though assignment has lower precedence than the ternary operator.

    None of these options are difficult to implement in a grammar, though.

  2. In many languages, the left-hand side of an assignment has a restricted syntax. In C++, which has reference values, the restriction could be considered semantic, and I believe it is usually implemented with a semantic check, but in many C derivatives lvalue can be defined syntactically. Such definitions are unambiguous, but they are often not amenable to parsing with a top-down grammar, and they can create complications even for a bottom-up grammar. Doing the check post-parse is always a simple solution.
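
As a rough sketch of both points, assume a hypothetical toy language with single-letter variables, single-digit integers, '+', '=' and parentheses (the names Result, parse_assign and eval are illustrative and do not come from the question's grammar). Assignment is parsed as a right-recursive operator with the lowest precedence, and the target is checked only after it has been parsed as an ordinary expression:

```c
#include <ctype.h>

typedef struct {
    int value;      /* computed value of the expression */
    int is_lvalue;  /* 1 only when the expression is a bare name */
    char name;      /* variable letter when is_lvalue is set */
} Result;

static const char *src;   /* remaining input */
static int vars[26];      /* storage for variables 'a'..'z' */
static int failed;        /* set on any syntax or lvalue error */

static char peek(void) {
    while (*src == ' ') src++;
    return *src;
}

static Result parse_assign(void);

/* primary ::= NAME | DIGIT | '(' assign ')' */
static Result parse_primary(void) {
    Result r = {0, 0, 0};
    char c = peek();
    if (islower((unsigned char)c)) {
        src++;
        r.value = vars[c - 'a'];
        r.is_lvalue = 1;
        r.name = c;
    } else if (isdigit((unsigned char)c)) {
        src++;
        r.value = c - '0';
    } else if (c == '(') {
        src++;
        r = parse_assign();
        r.is_lvalue = 0;   /* simplification: "(x) = 1" is rejected here */
        if (peek() == ')') src++; else failed = 1;
    } else {
        failed = 1;
    }
    return r;
}

/* add ::= primary { '+' primary }   (the loop groups to the left) */
static Result parse_add(void) {
    Result left = parse_primary();
    while (!failed && peek() == '+') {
        src++;
        Result right = parse_primary();
        left.value += right.value;
        left.is_lvalue = 0;   /* a sum is never assignable */
    }
    return left;
}

/* assign ::= add [ '=' assign ]
 * Recursing on the right makes '=' right-associative and weakest. */
static Result parse_assign(void) {
    Result left = parse_add();
    if (!failed && peek() == '=') {
        src++;
        if (!left.is_lvalue) {   /* check made after the target was parsed */
            failed = 1;
            return left;
        }
        Result right = parse_assign();
        if (!failed) vars[left.name - 'a'] = right.value;
        left.value = right.value;
        left.is_lvalue = 0;
    }
    return left;
}

/* Evaluates one expression; returns 1 on success and stores its value. */
int eval(const char *text, int *out) {
    src = text;
    failed = 0;
    Result r = parse_assign();
    if (peek() != '\0') failed = 1;
    if (!failed) *out = r.value;
    return !failed;
}
```

With this sketch, eval("a = b = 5", &v) assigns 5 to both variables, while eval("a = 2 + b = 3", &v) fails because 2 + b is not assignable.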

If you really want to distinguish assignment statements from expression statements, then you indeed run into the problem of prediction failure (not ambiguity) if you use a top-down parsing technique such as recursive descent. Since the grammar is not ambiguous, a simple solution is to use an LALR(1) parser generator such as bison/yacc, which has no problems parsing such a grammar since it does not require an early decision as to which kind of statement is being parsed. On the whole, the use of LALR(1) or even GLR parser generators simplifies implementation of a parser by allowing you to specify a grammar in a form which is easily readable and corresponds to the syntactic analysis. (For example, an LALR(1) parser can handle left-associative operators naturally, while an LL(1) grammar can only produce right-associative parses and therefore requires some kind of reconstruction of the syntax tree.)

A recursive descent parser is a computer program, not a grammar, and its expressiveness is thus not limited by the formal constraints of LL(1) grammars. That is both a strength and a weakness: the strength is that you can find solutions which are not limited by the limitations of LL(1) grammars; the weakness is that it is much more complicated (even, sometimes, impossible) to extract a clear statement about the precise syntax of the language. This power, for example, allows recursive descent parsers to handle left associativity in a more-or-less natural way despite the restriction mentioned above.
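
To illustrate the associativity point, here is a minimal sketch with hypothetical names, handling only non-negative integers and '-'. One function follows a right-recursive rule literally, the other replaces the recursion with a loop, which is the usual way a hand-written recursive descent parser obtains left-associative grouping:

```c
#include <ctype.h>

static const char *src;   /* remaining input */

static char peek(void) {
    while (*src == ' ') src++;
    return *src;
}

/* Reads a non-negative integer; yields 0 if no digit is present. */
static int number(void) {
    int n = 0;
    peek();   /* called only to skip leading spaces */
    while (isdigit((unsigned char)*src)) n = n * 10 + (*src++ - '0');
    return n;
}

/* Literal translation of  Sub ::= NUM [ '-' Sub ]
 * The recursive call groups everything to the right. */
static int sub_right(void) {
    int left = number();
    if (peek() == '-') {
        src++;
        return left - sub_right();
    }
    return left;
}

/* Loop form of  Sub ::= NUM { '-' NUM }
 * Each iteration folds the next operand into the running result,
 * so the grouping is to the left. */
static int sub_left(void) {
    int left = number();
    while (peek() == '-') {
        src++;
        left = left - number();
    }
    return left;
}

int eval_right(const char *text) { src = text; return sub_right(); }
int eval_left(const char *text)  { src = text; return sub_left(); }
```

For the input 8 - 3 - 2, eval_left computes (8 - 3) - 2 = 3, whereas eval_right computes 8 - (3 - 2) = 7.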

If you want to go down this road, then the solution is simple enough. You will have some sort of function:

/* This function parses and returns a single expression */
Node expr() {
  Node left = value();
  while (true) {
    switch (lookahead) {
      /* handle each possible operator token. I left out
       * the detail of handling operator precedence since it's
       * not relevant here
       */
      case OP_PLUS: {
        accept(lookahead);
        left = MakeNode(OP_PLUS, left, value());
        break;
      }
      /* If no operator found, return the current expression */
      default:
        return left;
    }
  }
}

That can easily be modified to parse both expressions and statements. First, refactor the function so that it parses the "rest" of an expression, given the first operand. (The only change is a new prototype and the deletion of the first line in the body.)

/* This function parses and returns a single expression
 * after the first value has been parsed. The value must be
 * passed as an argument.
 */
Node expr_rest(Node left) {
  while (true) {
    switch (lookahead) {
      /* handle each possible operator token. I left out
       * the detail of handling operator precedence since it's
       * not relevant here
       */
      case OP_PLUS: {
        accept(lookahead);
        left = MakeNode(OP_PLUS, left, value());
        break;
      }
      /* If no operator found, return the current expression */
      default:
        return left;
    }
  }
}

With that in place, it is straightforward to implement both expr and stmt:

Node expr() {
  return expr_rest(value());
}

Node stmt() {
  /* Check lookahead for statements which start with
   * a keyword. Omitted for simplicity.
   */

  /* either first value in an expr or target of assignment */
  Node left = value();

  switch (lookahead) {
    case OP_ASSIGN: {
      accept(lookahead);
      return MakeAssignment(left, expr());
    }
    /* Handle += and other mutating assignments if desired */
    default: {
      /* Not an assignment, just an expression */
      return MakeExpressionStatement(expr_rest(left));
    }
  }
}
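
For experimenting with the idea, the following is a compilable sketch of the same structure. It makes several assumptions that are not in the code above: tokens are single characters (lowercase letters as names, digits as integers, '+', '=' and ';'), expressions are evaluated directly rather than built with MakeNode, and the names Operand and StmtKind are hypothetical. The target of an assignment is checked after it has been read as an ordinary value:

```c
#include <ctype.h>

typedef enum { STMT_ERROR, STMT_ASSIGN, STMT_EXPR } StmtKind;

typedef struct {
    int value;   /* current value of the operand */
    char name;   /* variable letter if the operand is a bare name, else 0 */
} Operand;

static const char *src;   /* remaining input */
static int vars[26];      /* storage for variables 'a'..'z' */
static int failed;        /* set on any syntax error */

static char lookahead(void) {
    while (*src == ' ') src++;
    return *src;
}

/* value ::= NAME | DIGIT */
static Operand value(void) {
    Operand v = {0, 0};
    char c = lookahead();
    if (islower((unsigned char)c)) {
        v.name = c;
        v.value = vars[c - 'a'];
        src++;
    } else if (isdigit((unsigned char)c)) {
        v.value = c - '0';
        src++;
    } else {
        failed = 1;
    }
    return v;
}

/* Parses the rest of an expression whose first operand is already known. */
static int expr_rest(Operand left) {
    int total = left.value;
    while (lookahead() == '+') {
        src++;
        total += value().value;
    }
    return total;
}

static int expr(void) {
    return expr_rest(value());
}

/* stmt ::= NAME '=' expr ';'  |  expr ';'
 * One token of lookahead after the first value decides the kind. */
StmtKind stmt(const char *text, int *result) {
    StmtKind kind;
    src = text;
    failed = 0;
    Operand left = value();
    if (lookahead() == '=') {
        src++;
        if (left.name == 0) failed = 1;   /* target must be a name */
        *result = expr();
        kind = STMT_ASSIGN;
    } else {
        *result = expr_rest(left);        /* left was the first operand */
        kind = STMT_EXPR;
    }
    if (lookahead() == ';') src++; else failed = 1;
    if (failed) return STMT_ERROR;
    if (kind == STMT_ASSIGN) vars[left.name - 'a'] = *result;
    return kind;
}
```

Here stmt("x = 7 + 2;", &r) is reported as an assignment and stores 9 in x, while a following stmt("x + 2;", &r) is reported as an expression statement whose value, 11, is simply discarded by the caller.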


Source: https://stackoverflow.com/questions/40777094/how-to-modify-parsing-grammar-to-allow-assignment-and-non-assignment-statements
