I'm trying to learn ANTLR and at the same time use it for a current project.
I've gotten to the point where I can run the lexer on a chunk of code and output it to a C
In ANTLR 4 there is a new facility using parse tree listeners and TokenStreamRewriter (note the name difference) that can be used to observe or transform trees. (The replies suggesting TokenRewriteStream apply to ANTLR 3 and will not work with ANTLR 4.)
In ANTLR 4 an XXXBaseListener class is generated for you with callbacks for entering and exiting each non-terminal node in the grammar (e.g. enterClassDeclaration()).
You can use the Listener in two ways:
1) As an observer - By simply overriding the methods to produce arbitrary output related to the input text - e.g. override enterClassDeclaration() and output a line for each class declared in your program.
2) As a transformer, using TokenStreamRewriter to modify the original text as it passes through. To do this, you use the rewriter to modify tokens (insert, delete, replace) in the callback methods, and at the end you use the rewriter to output the modified text.
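The transformer idea in option 2 can be sketched without ANTLR at all. The following is only a toy stand-in written with the Python standard library, not the ANTLR API: `MiniRewriter`, `tokenize` and their methods are hypothetical names made up for illustration. It shows the general pattern as I understand it: the original tokens are never modified, edits are queued against token indices, and the modified text is produced once at the end.

```python
import re
from dataclasses import dataclass, field

# Whitespace runs, word runs, or any other single character, so that
# joining the tokens reproduces the source exactly.
TOKEN_RE = re.compile(r"\s+|\w+|[^\w\s]")


def tokenize(source):
    """Split source text into a lossless list of token strings."""
    return TOKEN_RE.findall(source)


@dataclass
class MiniRewriter:
    """Toy rewriter: queues edits by token index, leaves tokens untouched."""
    tokens: list
    replacements: dict = field(default_factory=dict)
    insertions: dict = field(default_factory=dict)

    def replace(self, index, text):
        self.replacements[index] = text

    def delete(self, index):
        self.replacements[index] = ""

    def insert_after(self, index, text):
        self.insertions[index] = self.insertions.get(index, "") + text

    def get_text(self):
        """Render the token list with all queued edits applied."""
        out = []
        for i, tok in enumerate(self.tokens):
            out.append(self.replacements.get(i, tok))
            out.append(self.insertions.get(i, ""))
        return "".join(out)


tokens = tokenize("class Demo { int x; }")
rewriter = MiniRewriter(tokens)
for i, tok in enumerate(tokens):
    # In a real listener these decisions would be made in the enter/exit callbacks.
    if tok == "{":
        rewriter.insert_after(i, " /* inserted */")
    elif tok == "int":
        rewriter.replace(i, "long")

result = rewriter.get_text()
print(result)  # class Demo { /* inserted */ long x; }
```

Because the edits are kept separate from the token list, the untouched parts of the input (including whitespace) come out exactly as they went in.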
See the following examples from the ANTLR 4 book for how to do transformations:
https://github.com/mquinn/ANTLR4/blob/master/book_code/tour/InsertSerialIDListener.java
and
https://github.com/mquinn/ANTLR4/blob/master/book_code/tour/InsertSerialID.java
I've used the sample Java grammar to create an ANTLR script to process an R.java file and rewrite all the hex values in a decompiled Android app with values of the form R.string.*, R.id.*, R.layout.* and so forth.
The key is using TokenStreamRewriter to process the tokens and then output the result.
The project (Python) is called RestoreR.
I parse with a listener to read in the R.java file and create a mapping from integer to string, and then replace the hex values as I parse the program's Java files with a different listener containing a rewriter instance.
from antlr4 import ParseTreeListener
from antlr4.TokenStreamRewriter import TokenStreamRewriter
# JavaParser is the parser class that ANTLR generates from the Java grammar.

class RValueReplacementListener(ParseTreeListener):
    replacements = 0
    r_mapping = {}
    rewriter = None

    def __init__(self, tokens):
        self.rewriter = TokenStreamRewriter(tokens)

    # Code removed for the sake of brevity

    # Enter a parse tree produced by JavaParser#integerLiteral.
    def enterIntegerLiteral(self, ctx:JavaParser.IntegerLiteralContext):
        hex_literal = ctx.HEX_LITERAL()
        if hex_literal is not None:
            int_literal = int(hex_literal.getText(), 16)
            if int_literal in self.r_mapping:
                # print('Replace: ' + ctx.getText() + ' with ' + self.r_mapping[int_literal])
                self.rewriter.replaceSingleToken(ctx.start, self.r_mapping[int_literal])
                self.replacements += 1
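The first pass, which builds the integer-to-name mapping, is not shown above. As a rough stand-in, here is a sketch that uses regular expressions instead of a listener. It assumes R.java follows the usual `public static final int name = 0x...;` layout inside nested `public static final class` blocks; `build_r_mapping` and `replace_hex_literals` are hypothetical names, not part of RestoreR. Note that the replacement step here is a plain global substitution, so unlike the listener it has no notion of parse context.

```python
import re

# Assumed R.java layout: nested static classes holding hex int constants.
CLASS_RE = re.compile(r"public\s+static\s+final\s+class\s+(\w+)")
FIELD_RE = re.compile(
    r"public\s+static\s+final\s+int\s+(\w+)\s*=\s*(0x[0-9a-fA-F]+)\s*;"
)
HEX_RE = re.compile(r"\b0x[0-9a-fA-F]+\b")


def build_r_mapping(r_java_source):
    """Map each integer resource id to a name such as 'R.string.app_name'."""
    mapping = {}
    current_class = None
    for line in r_java_source.splitlines():
        class_match = CLASS_RE.search(line)
        if class_match:
            current_class = class_match.group(1)
            continue
        field_match = FIELD_RE.search(line)
        if field_match and current_class:
            name, value = field_match.groups()
            mapping[int(value, 16)] = f"R.{current_class}.{name}"
    return mapping


def replace_hex_literals(java_source, mapping):
    """Replace every hex literal that has an entry in the mapping."""
    def substitute(match):
        return mapping.get(int(match.group(0), 16), match.group(0))
    return HEX_RE.sub(substitute, java_source)


sample_r_java = """
public final class R {
    public static final class string {
        public static final int app_name = 0x7f040000;
    }
    public static final class id {
        public static final int button1 = 0x7f050001;
    }
}
"""

mapping = build_r_mapping(sample_r_java)
rewritten = replace_hex_literals("setText(0x7f040000); x = 0x10;", mapping)
print(rewritten)  # setText(R.string.app_name); x = 0x10;
```

Literals with no entry in the mapping (such as `0x10` here) are left alone, which mirrors the `if int_literal in self.r_mapping` check in the listener.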
The other given example of changing the text in the lexer works well if you want to globally replace the text in all situations. However, you often only want to replace a token's text in certain situations.
Using the TokenRewriteStream gives you the flexibility of changing the text only in certain contexts.
This can be done using a subclass of the token stream class you were using. Instead of using the CommonTokenStream class you can use the TokenRewriteStream.
So you'd have the TokenRewriteStream consume the lexer and then you'd run your parser.
In your grammar typically you'd do the replacement like this:
/** Convert "int foo() {...}" into "float foo();" */
function
    :
        {
            RefTokenWithIndex t(LT(1)); // copy the location of the token you want to replace
            engine.replace(t, "float");
        }
        type id:ID LPAREN (formalParameter (COMMA formalParameter)*)? RPAREN
        block[true]
    ;
Here we've replaced the token int that we matched with the text float. The location information is preserved but the text it "matches" has been changed.
To check your token stream afterwards, you would use the same code as before.
ANTLR has a way to do this in its grammar file.
Let's say you're parsing a string consisting of numbers and strings delimited by commas. A grammar would look like this:
grammar Foo;

parse
  : value ( ',' value )* EOF
  ;

value
  : Number
  | String
  ;

String
  : '"' ( ~( '"' | '\\' ) | '\\\\' | '\\"' )* '"'
  ;

Number
  : '0'..'9'+
  ;

Space
  : ( ' ' | '\t' ) {skip();}
  ;
This should all look familiar to you. Let's say you want to wrap square brackets around all integer values. Here's how to do that:
grammar Foo;

options {output=template; rewrite=true;}

parse
  : value ( ',' value )* EOF
  ;

value
  : n=Number -> template(num={$n.text}) "[<num>]"
  | String
  ;

String
  : '"' ( ~( '"' | '\\' ) | '\\\\' | '\\"' )* '"'
  ;

Number
  : '0'..'9'+
  ;

Space
  : ( ' ' | '\t' ) {skip();}
  ;
As you see, I've added some options at the top, and added a rewrite rule (everything after the ->) after the Number in the value parser rule.
Now to test it all, compile and run this class:
import org.antlr.runtime.*;

public class FooTest {
  public static void main(String[] args) throws Exception {
    String text = "12, \"34\", 56, \"a\\\"b\", 78";
    System.out.println("parsing: "+text);
    ANTLRStringStream in = new ANTLRStringStream(text);
    FooLexer lexer = new FooLexer(in);
    CommonTokenStream tokens = new TokenRewriteStream(lexer); // Note: a TokenRewriteStream!
    FooParser parser = new FooParser(tokens);
    parser.parse();
    System.out.println("tokens: "+tokens.toString());
  }
}
which produces:
parsing: 12, "34", 56, "a\"b", 78
tokens: [12],"34",[56],"a\"b",[78]
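To make it clearer why the quoted "34" is left alone while the bare numbers get brackets, here is a small Python sketch of the same behaviour. It is not ANTLR and does not use templates; it simply mimics the Foo grammar's String, Number and Space token rules with a regular expression, and `bracket_numbers` is a hypothetical name. The rewrite is driven by the token type, so digits inside a String token are never touched.

```python
import re

# Token rules mirroring the Foo grammar: String, Number, ',' and Space.
TOKEN_RE = re.compile(r'''
    (?P<String>"(?:[^"\\]|\\\\|\\")*")
  | (?P<Number>[0-9]+)
  | (?P<Comma>,)
  | (?P<Space>[ \t]+)
''', re.VERBOSE)


def bracket_numbers(text):
    """Wrap Number tokens in square brackets; drop Space tokens, as skip() does."""
    out = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}: {text[pos]!r}")
        kind = match.lastgroup
        if kind == "Number":
            out.append(f"[{match.group()}]")
        elif kind != "Space":
            out.append(match.group())
        pos = match.end()
    return "".join(out)


text = '12, "34", 56, "a\\"b", 78'
print("parsing: " + text)
print("tokens: " + bracket_numbers(text))  # tokens: [12],"34",[56],"a\"b",[78]
```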