Is Antlr suitable for parsing data from streams that don\'t have EOF right after the text to parse? According to my observation, the lexer does not emit the current token until
The ANTLR 4 book was originally going to contain an example of parsing a streaming input, but I argued against it due to the severe complications that will inevitably arise from the use of an adaptive unlimited lookahead parser for something like this.
ANTLR 4 has no guaranteed lookahead bound (and no way to tell it to look for or even attempt to enforce one), so any implementation that operates on a blocking stream has the possibility of deadlock without returning information about the parse leading up to that point. I wouldn't even entertain the possibility of parsing a streaming input unless I saw an intermediate buffer in place first.
String
or char[]
.ANTLRInputStream
for the buffer.The result of the parse will tell you whether to discard the results to that point, or hold on to them to retry when more data is available:
If no syntax error occurs, the input was successfully parsed, and you can parse the next section of input when it becomes available later.
If a syntax error is reported before the EOF token is consumed, then a syntax error appears in the actual input so you'll want to handle it (report it to the user, etc...).
If a syntax error is reported at the point where the EOF token is consumed then additional input may resolve the problem - ignore the results of the current parse, and then retry once more data is available from the input stream.
I think you're using the unbuffered streams correctly and what you see is the expected, desired result of using those streams. But I think you may have expectations of them that they aren't obligated to meet.
Below is test code for us to poke with sticks. I'm using System.in
for the input, so I modified the grammar to account for the newline characters between the word tokens.
grammar Streaming;
fox : 'quick' NL 'brown' NL 'fox' NL DONE NL;
DONE : 'done';
NL : '\r'? '\n';
import org.antlr.v4.runtime.CommonToken;
import org.antlr.v4.runtime.CommonTokenFactory;
import org.antlr.v4.runtime.Token;
import org.antlr.v4.runtime.UnbufferedCharStream;
import org.antlr.v4.runtime.UnbufferedTokenStream;
import org.antlr.v4.runtime.tree.TerminalNode;
public class StreamingTest {
public static void main(String[] args) throws Exception {
lex();
parse();
}
private static void lex() {
System.out.println("-> Reading from lexer:");
UnbufferedCharStream input = new UnbufferedCharStream(System.in);
StreamingLexer lexer = new StreamingLexer(input);
lexer.setTokenFactory(new CommonTokenFactory(true));
Token t;
//read each token until hitting input "done"
while ((t = lexer.nextToken()).getType() != StreamingLexer.DONE){
if (t.getText().trim().length() == 0){
System.out.println("-> " + StreamingLexer.tokenNames[t.getType()]);
} else {
System.out.println("-> " + t.getText());
}
}
}
private static void parse() {
System.out.println("-> Reading from parser:");
UnbufferedCharStream input = new UnbufferedCharStream(System.in);
StreamingLexer lexer = new StreamingLexer(input);
lexer.setTokenFactory(new CommonTokenFactory(true));
StreamingParser parser = new StreamingParser(new UnbufferedTokenStream<CommonToken>(lexer));
parser.addParseListener(new StreamingBaseListener(){
@Override
public void visitTerminal(TerminalNode t) {
if (t.getText().trim().length() == 0){
System.out.println("-> " + StreamingLexer.tokenNames[t.getSymbol().getType()]);
} else {
System.out.println("-> " + t.getText());
}
}
});
parser.fox();
}
}
Below is a mix of the input and output as they're provided to/received from the lexer and parser in the program above. Each line of output is prefixed with ->
. I'll explain why things are the way they are after that.
-> Reading from lexer:
quick
-> quick
brown
-> NL
-> brown
fox
-> NL
-> fox
done
-> NL
-> Reading from parser:
quick
brown
-> quick
-> NL
fox
-> brown
-> NL
done
-> fox
-> NL
-> done
-> NL
The first thing I notice is that the lexer immediately received quick
NL
for input, but only provided a token for quick
. The reason for this discrepancy is that the UnbufferedCharStream
reads ahead one more character (even though it has a perfectly good NL
token ready for me!) because it won't sit on an empty look-ahead character buffer. Alas, the unbuffered stream is buffered. According to the Javadoc comment in the class itself:
"Unbuffered" here refers to fact that it doesn't buffer all data, not that's it's on demand loading of char.
This extra read translates into waiting on the stream for more input, which explains why the lexer is one token behind for the rest of the input.
Now on to the parser. Why does it lag behind two tokens to the lexer's one? Simple: because UnbufferedTokenStream
won't sit on an empty look-ahead buffer either. But it can't create that next token until a) it has a spare token from the lexer and b) the lexer's UnbufferedCharStream
reads its own look-ahead character. In effect, it's the lexer's one-character "lag" plus a one-token "lag."
It appears that getting "lag-free," data-on-demand streams in ANTLR v4 means writing your own. But it seems to me that the existing streams work as expected.
Is Antlr suitable for parsing data from streams that don't have EOF right after the text to parse?
I can't answer that with confidence for ANTLR 4 yet. It seems easy enough to write a token stream that doesn't buffer ahead until it's needed (override UnbufferedTokenStream
's consume
to skip calling sync
), but the character stream gets called by classes that do their own reading ahead regardless of anyone's buffering. Or so it seems. I'll keep digging into this as best I can, but it may require learning the official way to do this.
Apparently the root of the issue is not in Unbuffered*Streams. It's in Interpreters, like LexerATNSimulator.execATN() method. That method interprets the lexer as a state machine, moving from one tag to another once the first character of next tag is consumed. The similar algorithm is used in ParserATNSimulator, which deals with Tokens recognized by the Lexer. That's what causes that double lag. So, now I'm pretty much confident that Antlr 4 as it's implemented now cannot be used for parsing continuous, interactive data. Unlike Flex/Bison, where the lexer returns the tag right when the last characters possibly matching the tag. As the result - the parse() function ends right when the portion of data matching the grammar arrives. That provides nice ability to read exact amount of data, determined by the data structure when the size is not defined otherwise.