Different lexer rules in different states


悲哀的现实 2021-02-04 20:12

I've been working on a parser for a template language embedded in HTML (FreeMarker); a piece of an example is here:

${abc}
 
 
           
        </div>
      </div>
      
      <div class="fly-panel detail-box" id="flyReply">
        <fieldset class="layui-elem-field layui-field-title" style="text-align: center;">
          <legend>1 answer</legend>        </fieldset>

        <ul class="jieda" id="jieda">
                    <li data-id="111" class="jieda-daan">
            <a name="item-1111111111"></a>
            <div class="detail-about detail-about-reply">
                         <a class="fly-avatar" href="">
                <img src="https://www.e-learn.cn/qa/data/avatar/000/00/00/small_000000043.jpg" alt=" 小蘑菇 ">
              </a>
              <div class="fly-detail-user">
                <a href="" class="fly-link">
                  <cite> 小蘑菇</cite>
                                             
                </a>
                
                <span>(original poster)</span>
            
              </div>              <div class="detail-hits">
                <span>2021-02-04 20:55</span>
              </div>

            </div>
            <div class="detail-body jieda-body photos">
              <p>          
<p>You could guard the lexer rules with gated semantic predicates, so that a rule can only match while a certain boolean expression is true (here, a flag that records whether the lexer is inside markup).</p>
<p>A little demo:</p>
<h3>freemarker_simple.g</h3>

<pre><code>grammar freemarker_simple;

options {
  output=AST;
  ASTLabelType=CommonTree;
}

tokens {
  FILE;
  OUTPUT;
  RAW_BLOCK;
}

@parser::members {

  // merge a given list of tokens into a single AST
  private CommonTree merge(List tokenList) {
    StringBuilder b = new StringBuilder();
    for(int i = 0; i < tokenList.size(); i++) {
      Token token = (Token)tokenList.get(i);
      b.append(token.getText());
    }
    return new CommonTree(new CommonToken(RAW, b.toString()));
  }
}

@lexer::members {
  private boolean mmode = false;
}

parse
  :  content* EOF -> ^(FILE content*)
  ;

content
  :  (options {greedy=true;}: t+=RAW)+ -> ^(RAW_BLOCK {merge($t)})
  |  if_stat
  |  output
  ;

if_stat
  :  TAG_START IF expression TAG_END raw_block TAG_END_START IF TAG_END -> ^(IF expression raw_block)
  ;

output
  :  OUTPUT_START expression OUTPUT_END -> ^(OUTPUT expression)
  ;

raw_block
  :  (t+=RAW)* -> ^(RAW_BLOCK {merge($t)})
  ;

expression
  :  eq_expression
  ;

eq_expression
  :  atom (EQUALS^ atom)* 
  ;

atom
  :  STRING
  |  ID
  ;

// these tokens denote the start of markup code (sets mmode to true)
OUTPUT_START  : '${'  {mmode=true;};
TAG_START     : '<#'  {mmode=true;};
TAG_END_START : '</' ('#' {mmode=true;} | ~'#' {$type=RAW;});

// these tokens denote the end of markup code (sets mmode to false)
OUTPUT_END    : {mmode}?=> '}' {mmode=false;};
TAG_END       : {mmode}?=> '>' {mmode=false;};

// valid tokens only when in "markup mode"
EQUALS        : {mmode}?=> '==';
IF            : {mmode}?=> 'if';
STRING        : {mmode}?=> '"' ~'"'* '"';
ID            : {mmode}?=> ('a'..'z' | 'A'..'Z')+;
SPACE         : {mmode}?=> (' ' | '\t' | '\r' | '\n')+ {skip();};

RAW           : . ;
</code></pre>
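<p>To see the effect of the <code>mmode</code> flag outside of ANTLR, here is a minimal hand-written sketch in plain Java. It is not what ANTLR generates: the class <code>ModeLexerSketch</code> and its method <code>tokenize</code> are hypothetical names, and the sketch simplifies the grammar (for example, a <code>&lt;/</code> that is not followed by <code>#</code> is emitted one character at a time). It only mimics the gating idea: markup rules are tried only while the flag is set, and everything else falls through to a one-character RAW token.</p>

```java
import java.util.ArrayList;
import java.util.List;

class ModeLexerSketch {

    // Same character class as the ID rule: ('a'..'z' | 'A'..'Z')
    static boolean isAsciiLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    // Returns tokens rendered as "TYPE:text"; whitespace inside markup is skipped.
    static List<String> tokenize(String in) {
        List<String> out = new ArrayList<>();
        boolean mmode = false; // true while inside ${...} or <#...> / </#...>
        int i = 0;
        while (i < in.length()) {
            char c = in.charAt(i);
            if (in.startsWith("${", i)) {            // openers switch the flag on
                out.add("OUTPUT_START:${"); mmode = true; i += 2;
            } else if (in.startsWith("</#", i)) {
                out.add("TAG_END_START:</#"); mmode = true; i += 3;
            } else if (in.startsWith("<#", i)) {
                out.add("TAG_START:<#"); mmode = true; i += 2;
            } else if (mmode && c == '}') {          // closers switch it off again
                out.add("OUTPUT_END:}"); mmode = false; i++;
            } else if (mmode && c == '>') {
                out.add("TAG_END:>"); mmode = false; i++;
            } else if (mmode && Character.isWhitespace(c)) {
                i++;                                  // like SPACE ... {skip();}
            } else if (mmode && in.startsWith("==", i)) {
                out.add("EQUALS:=="); i += 2;
            } else if (mmode && c == '"') {
                int end = in.indexOf('"', i + 1);
                if (end < 0) end = in.length() - 1;   // unterminated: take the rest
                out.add("STRING:" + in.substring(i, end + 1)); i = end + 1;
            } else if (mmode && isAsciiLetter(c)) {
                int j = i;
                while (j < in.length() && isAsciiLetter(in.charAt(j))) j++;
                String word = in.substring(i, j);
                out.add((word.equals("if") ? "IF:" : "ID:") + word);
                i = j;
            } else {
                out.add("RAW:" + c); i++;             // fallback, like RAW : . ;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String token : tokenize("Hi ${user}<#if user == \"Big Joe\">!")) {
            System.out.println(token);
        }
    }
}
```

<p>Outside markup, the letters of a word such as "Hi" each become a separate RAW token, which is why the first grammar merges consecutive RAW tokens in the parser.</p>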
<p>which parses your input:</p>
<h3>test.html</h3>

<pre><code>${abc}
<html> 
<head> 
  <title>Welcome! 
 
 
   
    Welcome ${user}<#if user == "Big Joe">, our beloved leader! 
   
  Our latest product: ${latestProduct}!
</code></pre>
<p>into the following AST:</p>
<p>[image: AST of the parsed input]</p>
<p>as you can test yourself with the class:</p>
<h3>Main.java</h3>

<pre><code>import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;

public class Main {
  public static void main(String[] args) throws Exception {
    freemarker_simpleLexer lexer = new freemarker_simpleLexer(new ANTLRFileStream("test.html"));
    freemarker_simpleParser parser = new freemarker_simpleParser(new CommonTokenStream(lexer));
    CommonTree tree = (CommonTree)parser.parse().getTree();
    DOTTreeGenerator gen = new DOTTreeGenerator();
    StringTemplate st = gen.toDOT(tree);
    System.out.println(st);
  }
}
</code></pre>

<p>EDIT 1</p>
<p>When I run your example input with a parser generated from the second grammar you posted, the following are the first 5 lines printed to the console (not counting the many warnings that are generated):</p>

<pre><code>ID abc 2
RAW 
 0
RAW < 0
ID html 0
...
</code></pre>

<p>EDIT 2</p>
<p>Bood wrote:</p>
<p>"Also tried the 2nd approach with Bart's grammar, still didn't work the 'html' is recognized as an ID, which should be 4 RAWs. When mmode=false, shouldn't RAW get matched first? Or the lexer still chooses the longest match here?"</p>
<p>Yes, that is correct: ANTLR chooses the longer match in that case.</p>
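<p>The "longer match wins" behaviour can be pictured with a small stand-alone Java sketch. It does not reproduce ANTLR's actual prediction machinery, and the names <code>LongestMatchSketch</code>, <code>idLength</code>, <code>rawLength</code> and <code>nextToken</code> are hypothetical. It only shows why, with an ungated ID rule next to a one-character RAW rule, the input "html" comes out as a single ID rather than four RAW tokens.</p>

```java
class LongestMatchSketch {

    static boolean isAsciiLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    // Number of characters the ID rule ('a'..'z' | 'A'..'Z')+ could match at pos.
    static int idLength(String s, int pos) {
        int j = pos;
        while (j < s.length() && isAsciiLetter(s.charAt(j))) j++;
        return j - pos;
    }

    // The RAW rule ( . ) matches exactly one character, if any is left.
    static int rawLength(String s, int pos) {
        return pos < s.length() ? 1 : 0;
    }

    // Picks the rule with the longer match; in this sketch a tie goes to ID.
    static String nextToken(String s, int pos) {
        int id = idLength(s, pos);
        int raw = rawLength(s, pos);
        if (id == 0 && raw == 0) return "EOF";
        return id >= raw
            ? "ID:" + s.substring(pos, pos + id)
            : "RAW:" + s.substring(pos, pos + raw);
    }

    public static void main(String[] args) {
        String input = "<html";
        System.out.println(nextToken(input, 0)); // prints "RAW:<"
        System.out.println(nextToken(input, 1)); // prints "ID:html"
    }
}
```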

<p>But now that I (finally :)) see what you're trying to do, here's a last proposal: you could let the RAW rule match characters as long as the rule can't see one of the following character sequences ahead: "<#", "</#" or "${". Note that the rule must still stay at the end of the grammar. This check is performed inside the lexer. Also, in that case you don't need the merge(...) method in the parser:</p>


<pre><code>grammar freemarker_simple;

options {
  output=AST;
  ASTLabelType=CommonTree;
}

tokens {
  FILE;
  OUTPUT;
  RAW_BLOCK;
}

@lexer::members {

  private boolean mmode = false;

  // true if the next character may be consumed as raw text, i.e. we are
  // outside markup and no markup opener starts at the current position
  private boolean rawAhead() {
    if(mmode) return false;
    int ch1 = input.LA(1), ch2 = input.LA(2), ch3 = input.LA(3);
    return !(
        (ch1 == '<' && ch2 == '#') ||
        (ch1 == '<' && ch2 == '/' && ch3 == '#') ||
        (ch1 == '$' && ch2 == '{')
    );
  }
}

parse
  :  content* EOF -> ^(FILE content*)
  ;

content
  :  RAW
  |  if_stat
  |  output
  ;

if_stat
  :  TAG_START IF expression TAG_END RAW TAG_END_START IF TAG_END -> ^(IF expression RAW)
  ;

output
  :  OUTPUT_START expression OUTPUT_END -> ^(OUTPUT expression)
  ;

expression
  :  eq_expression
  ;

eq_expression
  :  atom (EQUALS^ atom)*
  ;

atom
  :  STRING
  |  ID
  ;

OUTPUT_START  : '${'  {mmode=true;};
TAG_START     : '<#'  {mmode=true;};
TAG_END_START : '</' ('#' {mmode=true;} | ~'#' {$type=RAW;});

OUTPUT_END    : '}' {mmode=false;};
TAG_END       : '>' {mmode=false;};

EQUALS        : '==';
IF            : 'if';
STRING        : '"' ~'"'* '"';
ID            : ('a'..'z' | 'A'..'Z')+;
SPACE         : (' ' | '\t' | '\r' | '\n')+ {skip();};

RAW           : ({rawAhead()}?=> . )+;
</code></pre>
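<p>The lookahead check in <code>rawAhead()</code> can also be tried on a plain string, outside the generated lexer. The sketch below is an illustration under assumptions, not part of the grammar: <code>RawAheadSketch</code>, <code>markupAhead</code> and <code>rawRun</code> are hypothetical names, a Java <code>String</code> stands in for ANTLR's input stream, and only the case <code>mmode == false</code> is modelled. It shows how repeating the check before every character turns a run of template text into one RAW chunk that stops right before the next markup opener.</p>

```java
class RawAheadSketch {

    // True if one of the markup openers "<#", "</#" or "${" starts at pos.
    static boolean markupAhead(String in, int pos) {
        return in.startsWith("<#", pos)
            || in.startsWith("</#", pos)
            || in.startsWith("${", pos);
    }

    // Loosely mirrors RAW : ({rawAhead()}?=> . )+ outside markup:
    // returns the text consumed from pos until a markup opener or the end of input.
    static String rawRun(String in, int pos) {
        int j = pos;
        while (j < in.length() && !markupAhead(in, j)) j++;
        return in.substring(pos, j);
    }

    public static void main(String[] args) {
        String input = "<h1>Welcome ${user}</h1>";
        System.out.println(rawRun(input, 0));  // prints "<h1>Welcome "
        System.out.println(rawRun(input, 19)); // prints "</h1>"
    }
}
```

<p>An ordinary closing tag such as <code>&lt;/h1&gt;</code> stays inside the raw run, because only <code>&lt;/</code> followed by <code>#</code> counts as markup.</p>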

The grammar above will produce the following AST from the input posted at the start of this answer: