问题
TL;DR VERSION: Is there a parser generator that supports the following: when some rule is reduced (I assume LALR(1) parser), then reduction isn't performed, but parser backs off and replaces input with different code using values from this rule and parses that code. Repeat if needed. So if code is "i++" and rule is "expr POST_INCR", I can do more or less:
expr POST_INCR -> "(tmp = expr; expr = expr + 1; tmp)"
So basicaly code rewriting using macros?
LONG VERSION:
I wrote yet another simple interpreted language (in Java for simplicity). It works ok, but it raised some question. Introduction is pretty long, but simple and helps to shows my problem clearly (I think).
I have "while" loop. It is pretty simple, given:
WHILE LPARE boolean_expression RPAREN statement
I generate more or less the following:
new WhileNode(boolean_expression, statement);
This creates new node that, when visited later, generates code for my virtual machine. But I also have the following:
FOR LPAREN for_init_expr SEMICOLON boolean_expression SEMICOLON for_post_expr RPAREN statement
This is "for loop" known from Java or C. From aforementioned rule I create more or less the following:
new ListNode(
for_init_expr,
new WhileNode(
boolean_expression,
new ListNode(
statement,
new ListNode(for_post_expr, null))))
This is of course simple transformation, from:
for (for_init ; boolean_expression ; for_post_expr)
statement
to:
for_init
while (boolean_expression) {
statement
for_post_expr;
}
All is fine and dandy, but things get hairy for the following:
FOR LPAREN var_decl COLON expression RPAREN statement
This if well known and liked:
for (int x : new int[] { 1, 2 })
print(x);
I refrain from posting code that generates AST, since basic for loop was already a little bit long, and what we get here is ever worse. This construction is equal to:
int[] tmp = new int[] { 1, 2 };
for (int it = 0 ; it < tmp.length; it = it + 1) {
int x = tmp[it];
print(x);
}
And since I'm not using types, I simply assume that "expression" (so right side, after COLON) is something that I can iterate over (and arrays are not iterable), I call a function on a result of this "expression" which returns instance of Iterable. So, in fact, my rewritten code isn't as simple as one above, is more or less this:
Iterator it = makeIterable(new int[] { 1, 2 });
while (it.hasNext()) {
int x = it.next();
print(x);
}
It doesn't look THAT bad, but note that AST for this generates three function calls and while loop. To show you what mess it is, I post what I have now:
113 | FOR LPAREN var_decl_name.v PIPE simple_value_field_or_call.o RPAREN statement.s
114 {: Symbol sv = ext(_symbol_v, _symbol_o);
115 String autoVarName = generateAutoVariableName();
116 Node iter = new StatementEndNode(sv, "",
117 new BinNode(sv, CMD.SET, "=",
118 new VarDeclNode(sv, autoVarName),
119 new CallNode(sv, "()",
120 new BinNode(sv, CMD.DOT, ".",
121 new MkIterNode(sv, o),
122 new PushConstNode(sv, "iterator")))));
123 Node varinit = new StatementEndNode(sv, "",
124 new BinNode(sv, CMD.SET, "=",
125 v,
126 new PushConstNode(sv, "null")));
127 Node hasnext = new CallNode(sv, "()",
128 new BinNode(sv, CMD.DOT, ".",
129 new VarNode(sv, autoVarName),
130 new PushConstNode(sv, "hasNext")));
131 Node vargennext = new StatementEndNode(sv, "",
132 new BinNode(sv, CMD.SET, "=",
133 new VarNode(sv, v.name),
134 new CallNode(sv, "()",
135 new BinNode(sv, CMD.DOT, ".",
136 new VarNode(sv, autoVarName),
137 new PushConstNode(sv, "next")))));
138 return new ListNode(sv, "",
139 new ListNode(sv, "",
140 new ListNode(sv, "",
141 iter
142 ),
143 varinit
144 ),
145 new WhileNode(sv, "while",
146 hasnext,
147 new ListNode(sv, "",
148 new ListNode(sv, "",
149 vargennext
150 ),
151 s)));
To answer your questions: yes, I am ashamed of this code.
QUESTION: Is there are parser generator that let's me do something about it, namely given rule:
FOR LPAREN var_decl COLON expr RPAREN statement
tell parser to rewrite it as if it was something else. I imagine that this would require some kind of LISP's macro mechanism (which is easy in lisp due to basically lack of grammar whatsoever), maybe similar to this:
FOR LPAREN var_decl COLON expr RPAREN statement =
{ with [ x = generateAutoName(); ]
emit [ "Iterator $x = makeIterable($expr).iterator();"
"while (${x}.hasNext()) {"
"$var_decl = ${x}.next();"
"$statement"
"}"
]
}
I don't know if this is a well known problem or not, I simply don't even know what to look for - the most similar question that I found is this one: Any software for pattern-matching and -rewriting source code? but it isn't anywhere close to what I need, since it is supposed to work as a separate step and not during compilation, so it doesn't qualify.
Any help will be appreciated.
回答1:
I think you are trying to bend the parser too much. You can simply build the tree with macro in it, and then post-process the tree to replace the macros with whatever substitution you want.
You can do this by walking the resulting tree, detecting the macro nodes (or places where you want to do substitutions), and simply splicing in replacements with procedural tree hackery. Not pretty but workable. You should able to do this with the result of any parser generator/AST building machinery.
If you want a more structured approach, you could build your AST and then use source-to-source transformations to "rewrite" the macros to their content. Out DMS Software Reengineering Toolkit can do this, you can read more details about what the transforms look like.
Using the DMS approach, your concept:
expr POST_INCR -> "(tmp = expr; expr = expr + 1; tmp)"
requires that you parse the original text in the conventional way with the grammar rule:
term = expr POST_INCR ;
You would give all these grammar rules to DMS and let it parse the source and build your AST according to the grammar.
Then you apply a DMS rewrite to the resulting tree:
domain MyToyLanguage; -- tells DMS to use your grammar to process the rule below
rule replace_POST_INCR(e: expr): term->term
= "\e POST_INCR" -> " (tmp = \e; \e = \e + 1; tmp) ";
The quote marks here are "domain meta quotes" rather than string literal quotes. The text outside the double-quotes is DMS rule-syntax. The text inside the quotes is syntax from your language (MyToyLangauge), and is parsed using the parser you provided, some special escapes for pattern variables like \e. (You don't have to do anything to your grammar to get this pattern-parsing capability; DMS takes care of that).
By convention with DMS, we often name literal tokens like POST_INCR with a quoted equivalent '++' in the lexer, rather than using such a name. Instead of
#token POST_INCR "\+\+"
the lexer rule then looks like:
#token '++' "\+\+"
If you do that, then your grammar rule reads like:
term = expr '++' ;
and your rewrite rule now looks like:
rule replace_POST_INCR(e: expr): term->term
= "\e++" -> " (tmp = \e; \e = \e + 1; tmp) ";
Following this convention, the grammar (lexer and BNF) is (IMHO) a lot more readable, and the rewrite rules are more readable too, since they stay extremely close to the actual language syntax.
回答2:
Perhaps you are looking for something like ANTLR's tree-rewriting rules.
You could probably make your AST construction syntax more readable by defining some helper functions. To my eye, there is a lot of redundancy (why do you need both an enumeration and a character string for an operator?) but I'm not a Java programmer.
One approach you might take:
Start with your parser, which already produces an AST. Add a lexical syntax or two to handle template arguments and gensyms. Then write an AST walker which serializes the AST into the code (either Java or bytecode) needed to regenerate the AST. Using that, you can generate the macro templates functions using your own parser, which means that it will automatically stay in sync with any changes you might make to your AST.
来源:https://stackoverflow.com/questions/36494637/source-code-parsing-and-macro-like-handling-of-similar-constructions