Representing statement-terminating newlines in a grammar?

让人想犯罪 __ 提交于 2019-12-10 18:36:19

问题


A lot of programming languages have statements terminated by line-endings. Usually, though, line endings are allowed in the middle of a statement if the parser can't make sense of the line; for example,

a = 3 +
4

...will be parsed in Ruby and Python* as the statement a = 3+4, since a = 3+ doesn't make any sense. In other words, the newline is ignored since it leads to a parsing error.

My question is: how can I simply/elegantly accomplish that same behavior with a tokenizer and parser? I'm using Lemon as a parser generator, if it makes any difference (though I'm also tagging this question as yacc since I'm sure the solution applies equally to both programs).

Here's how I'm doing it now: allow a statement terminator to occur optionally in any case where there wouldn't be syntactic ambiguity. In other words, something like

expression ::= identifier PLUS identifier statement_terminator.
expression ::= identifier PLUS statement_terminator identifier statement_terminator.

... in other words, it's ok to use a newline after the plus because that won't have any effect on the ambiguity of the grammar. My worry is that this would balloon the size of the grammar and I have a lot of opportunities to miss cases or introduce subtle bugs in the grammar. Is there an easier way to do this?

EDIT*: Actually, that code example won't work for Python. Python does in fact ignore the newline if you pass in something like this, though:

print (1, 2,
3)

回答1:


You could probably make a parser generator get this right, but it would probably require modifying the parser generator's skeleton.

There are three plausible algorithms I know of; none is perfect.

  1. Insert an explicit statement terminator at the end of the line if:

    a. the previous token wasn't a statement terminator, and

    b. it would be possible to shift the statement terminator.

  2. Insert an explicit statement terminator prior to an unshiftable token (the "offending token", in Ecmascript speak) if:

    a. the offending token is at the beginning of a line, or is a } or is the end-of-input token, and

    b. shifting a statement terminator will not cause a reduction by the empty-statement production. [1]

  3. Make an inventory of all token pairs. For every token pair, decide whether it is appropriate to replace a line-end with a statement terminator. You might be able to compute this table by using one of the above algorithms.

Algorithm 3 is the easiest to implement, but the hardest to work out. And you may need to adjust the table every time you modify the grammar, which will considerably increase the difficulty of modifying the grammar. If you can compute the table of token pairs, then inserting statement terminators can be handled by the lexer. (If your grammar is an operator precedence grammar, then you can insert a statement terminator between any pair of tokens which do not have a precedence relationship. However, even then you may wish to make some adjustments for restricted contexts.)

Algorithms 1 and 2 can be implemented in the parser if you can query the parser about the shiftability of a token without destroying the context. Recent versions of bison allow you to specify what they call "LAC" (LookAhead Correction), which involves doing just that. Conceptually, the parser stack is copied and the parser attempts to handle a token; if the token is eventually shifted, possibly after some number of reductions, without triggering an error production, then the token is part of the valid lookahead. I haven't looked at the implementation, but it's clear that it's not actually necessary to copy the stack to compute shiftability. Regardless, you'd have to reverse-engineer the facility into Lemon if you wanted to use it, which would be an interesting exercise, probably not too difficult. (You'd also need to modify the bison skeleton to do this, but it might be easier starting with the LAC implementation. LAC is currently only used by bison to generate better error messages, but it does involve testing shiftability of every token.)

One thing to watch out for, in all of the above algorithms, is statements which may start with parenthesized expressions. Ecmascript, in particular, gets this wrong (IMHO). The Ecmascript example, straight out of the report:

a = b + c
(d + e).print()

Ecmascript will parse this as a single statement, because c(d + e) is a syntactically valid function call. Consequently, ( is not an offending token, because it can be shifted. It's pretty unlikely that the programmer intended that, though, and no error will be produced until the code is executed, if it is executed.

Note that Algorithm 1 would have inserted a statement terminator at the end of the first line, but similarly would not flag the ambiguity. That's more likely to be what the programmer intended, but the unflagged ambiguity is still annoying.

Lua 5.1 would treat the above example as an error, because it does not allow new lines in between the function object and the ( in a call expression. However, Lua 5.2 behaves like Ecmascript.

Another classical ambiguity is return (and possibly other statements) which have an optional expression. In Ecmascript, return <expr> is a restricted production; a newline is not permitted between the keyword and the expression, so a return at the end of a line has a semicolon automatically inserted. In Lua, it's not ambiguous because a return statement cannot be followed by another statement.


Notes:

  1. Ecmascript also requires that the statement terminator token be parsed as a statement terminator, although it doesn't quite say that; it does not allow the semicolons in the iterator clause of a for statement to be inserted automatically. Its algorithm also includes mandatory semicolon insertion in two context: after a return/throw/continue/break token which appears at the end of a line, and before a ++/-- token which appears at the beginning of a line.


来源:https://stackoverflow.com/questions/17646002/representing-statement-terminating-newlines-in-a-grammar

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!