Different lexer rules in different state

不羁的心 提交于 2019-12-03 00:40:37
Bart Kiers

You could let lexer rules match using gated semantic predicates where you test for a certain boolean expression.

A little demo:

freemarker_simple.g

grammar freemarker_simple;

options {
  output=AST;
  ASTLabelType=CommonTree;
}

tokens {
  FILE;
  OUTPUT;
  RAW_BLOCK;
}

@parser::members {

  // merge a given list of tokens into a single AST
  private CommonTree merge(List tokenList) {
    StringBuilder b = new StringBuilder();
    for(int i = 0; i < tokenList.size(); i++) {
      Token token = (Token)tokenList.get(i);
      b.append(token.getText());
    }
    return new CommonTree(new CommonToken(RAW, b.toString()));
  }
}

@lexer::members {
  private boolean mmode = false;
}

parse
  :  content* EOF -> ^(FILE content*)
  ;

content
  :  (options {greedy=true;}: t+=RAW)+ -> ^(RAW_BLOCK {merge($t)})
  |  if_stat
  |  output
  ;

if_stat
  :  TAG_START IF expression TAG_END raw_block TAG_END_START IF TAG_END -> ^(IF expression raw_block)
  ;

output
  :  OUTPUT_START expression OUTPUT_END -> ^(OUTPUT expression)
  ;

raw_block
  :  (t+=RAW)* -> ^(RAW_BLOCK {merge($t)})
  ;

expression
  :  eq_expression
  ;

eq_expression
  :  atom (EQUALS^ atom)* 
  ;

atom
  :  STRING
  |  ID
  ;

// these tokens denote the start of markup code (sets mmode to true)
OUTPUT_START  : '${'  {mmode=true;};
TAG_START     : '<#'  {mmode=true;};
TAG_END_START : '</' ('#' {mmode=true;} | ~'#' {$type=RAW;});

// these tokens denote the end of markup code (sets mmode to false)
OUTPUT_END    : {mmode}?=> '}' {mmode=false;};
TAG_END       : {mmode}?=> '>' {mmode=false;};

// valid tokens only when in "markup mode"
EQUALS        : {mmode}?=> '==';
IF            : {mmode}?=> 'if';
STRING        : {mmode}?=> '"' ~'"'* '"';
ID            : {mmode}?=> ('a'..'z' | 'A'..'Z')+;
SPACE         : {mmode}?=> (' ' | '\t' | '\r' | '\n')+ {skip();};

RAW           : . ;

which parses your input:

test.html

${abc}
<html> 
<head> 
  <title>Welcome!</title> 
</head> 
<body> 
  <h1> 
    Welcome ${user}<#if user == "Big Joe">, our beloved leader</#if>! 
  </h1> 
  <p>Our latest product: <a href="${latestProduct}">${latestProduct}</a>!</p>
</body> 
</html>

into the following AST:

as you can test yourself with the class:

Main.java

import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;

public class Main {
  public static void main(String[] args) throws Exception {
    freemarker_simpleLexer lexer = new freemarker_simpleLexer(new ANTLRFileStream("test.html"));
    freemarker_simpleParser parser = new freemarker_simpleParser(new CommonTokenStream(lexer));
    CommonTree tree = (CommonTree)parser.parse().getTree();
    DOTTreeGenerator gen = new DOTTreeGenerator();
    StringTemplate st = gen.toDOT(tree);
    System.out.println(st);
  }
}

EDIT 1

When I run your example input with a parser generated from the second grammar you posted, the following are wthe first 5 lines being printed to the console (not counting the many warnings that are generated):

ID abc 2
RAW 
 0
RAW < 0
ID html 0
...

EDIT 2

Bood wrote:

Also tried the 2nd approach with Bart's grammar, still didn't work the 'html' is recognized as an ID, which should be 4 RAWs. When mmode=false, shouldn't RAW get matched first? Or the lexer still chooses the longest match here?

Yes, that is correct: ANTLR chooses the longer match in that case.

But now that I (finally :)) see what you're trying to do, here's a last proposal: you could let the RAW rule match characters as long as the rule can't see one of the following character sequences ahead: "<#", "</#" or "${". Note that the rule must still stay at the end in the grammar. This check is performed inside the lexer. Also, in that case you don't need the merge(...) method in the parser:

grammar freemarker_simple;

options {
  output=AST;
  ASTLabelType=CommonTree;
}

tokens {
  FILE;
  OUTPUT;
  RAW_BLOCK;
}

@lexer::members {

  private boolean mmode = false;

  private boolean rawAhead() {
    if(mmode) return false;
    int ch1 = input.LA(1), ch2 = input.LA(2), ch3 = input.LA(3);
    return !(
        (ch1 == '<' && ch2 == '#') ||
        (ch1 == '<' && ch2 == '/' && ch3 == '#') ||
        (ch1 == '$' && ch2 == '{')
    );
  }
}

parse
  :  content* EOF -> ^(FILE content*)
  ;

content
  :  RAW
  |  if_stat
  |  output
  ;

if_stat
  :  TAG_START IF expression TAG_END RAW TAG_END_START IF TAG_END -> ^(IF expression RAW)
  ;

output
  :  OUTPUT_START expression OUTPUT_END -> ^(OUTPUT expression)
  ;

expression
  :  eq_expression
  ;

eq_expression
  :  atom (EQUALS^ atom)*
  ;

atom
  :  STRING
  |  ID
  ;

OUTPUT_START  : '${'  {mmode=true;};
TAG_START     : '<#'  {mmode=true;};
TAG_END_START : '</' ('#' {mmode=true;} | ~'#' {$type=RAW;});

OUTPUT_END    : '}' {mmode=false;};
TAG_END       : '>' {mmode=false;};

EQUALS        : '==';
IF            : 'if';
STRING        : '"' ~'"'* '"';
ID            : ('a'..'z' | 'A'..'Z')+;
SPACE         : (' ' | '\t' | '\r' | '\n')+ {skip();};

RAW           : ({rawAhead()}?=> . )+;

The grammar above will produce the following AST from the input posted at the start of this answer:

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!