Different lexer rules in different state

前端 未结 1 453
悲哀的现实
悲哀的现实 2021-02-04 20:12

I\'ve been working on a parser for some template language embeded in HTML (FreeMarker), piece of example here:

${abc}
 
 
           
<script async src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
<ins class="adsbygoogle"
     style="display:block"
     data-ad-client="ca-pub-5408099190056760"
     data-ad-slot="7305827575"
     data-ad-format="auto"
     data-full-width-responsive="true"></ins>
<script>
     (adsbygoogle = window.adsbygoogle || []).push({});
</script>        </div>
                      <div class="relativetags">
              相关标签:
       </div>
      </div>
      
      <div class="fly-panel detail-box" id="flyReply">
        <fieldset class="layui-elem-field layui-field-title" style="text-align: center;">
          <legend>1条回答</legend>        </fieldset>

        <ul class="jieda" id="jieda">
                         				            <li data-id="111">
            <a name="item-1111111111"></a>
           
            <div class="detail-about detail-about-reply">
                              <a class="fly-avatar" href="https://www.e-learn.cn/qa/u-43.html">
                <img src="https://www.e-learn.cn/qa/data/avatar/000/00/00/small_000000043.jpg" alt=" ">
              </a>
              <div class="fly-detail-user">
                <a href="https://www.e-learn.cn/qa/u-43.html" class="fly-link">
                  <cite>小蘑菇 </cite>       
                </a>
              </div>
                            <div class="detail-hits">
                <span>2021-02-04 20:55</span>
              </div>
            </div>
            <div class="detail-body jieda-body photos">
                                                                       
<p>You could let lexer rules match using gated semantic predicates where you test for a certain boolean expression.</p>
<p>A little demo:</p>
<h3>freemarker_simple.g</h3>

<pre><code>grammar freemarker_simple;

options {
  output=AST;
  ASTLabelType=CommonTree;
}

tokens {
  FILE;
  OUTPUT;
  RAW_BLOCK;
}

@parser::members {

  // merge a given list of tokens into a single AST
  private CommonTree merge(List tokenList) {
    StringBuilder b = new StringBuilder();
    for(int i = 0; i < tokenList.size(); i++) {
      Token token = (Token)tokenList.get(i);
      b.append(token.getText());
    }
    return new CommonTree(new CommonToken(RAW, b.toString()));
  }
}

@lexer::members {
  private boolean mmode = false;
}

parse
  :  content* EOF -> ^(FILE content*)
  ;

content
  :  (options {greedy=true;}: t+=RAW)+ -> ^(RAW_BLOCK {merge($t)})
  |  if_stat
  |  output
  ;

if_stat
  :  TAG_START IF expression TAG_END raw_block TAG_END_START IF TAG_END -> ^(IF expression raw_block)
  ;

output
  :  OUTPUT_START expression OUTPUT_END -> ^(OUTPUT expression)
  ;

raw_block
  :  (t+=RAW)* -> ^(RAW_BLOCK {merge($t)})
  ;

expression
  :  eq_expression
  ;

eq_expression
  :  atom (EQUALS^ atom)* 
  ;

atom
  :  STRING
  |  ID
  ;

// these tokens denote the start of markup code (sets mmode to true)
OUTPUT_START  : '${'  {mmode=true;};
TAG_START     : '<#'  {mmode=true;};
TAG_END_START : '</' ('#' {mmode=true;} | ~'#' {$type=RAW;});

// these tokens denote the end of markup code (sets mmode to false)
OUTPUT_END    : {mmode}?=> '}' {mmode=false;};
TAG_END       : {mmode}?=> '>' {mmode=false;};

// valid tokens only when in "markup mode"
EQUALS        : {mmode}?=> '==';
IF            : {mmode}?=> 'if';
STRING        : {mmode}?=> '"' ~'"'* '"';
ID            : {mmode}?=> ('a'..'z' | 'A'..'Z')+;
SPACE         : {mmode}?=> (' ' | '\t' | '\r' | '\n')+ {skip();};

RAW           : . ;
</code></pre>
<p>which parses your input:</p>
<h3>test.html</h3>

<pre><code>${abc}
<html> 
<head> 
  <title>Welcome!</title> 
</head> 
<body> 
  <h1> 
    Welcome ${user}<#if user == "Big Joe">, our beloved leader</#if>! 
  </h1> 
  <p>Our latest product: <a href="${latestProduct}">${latestProduct}</a>!</p>
</body> 
</html>
</code></pre>
<p>into the following AST:</p>
<p><img src="https://i.stack.imgur.com/qd1vu.png" alt="enter image description here" /></p>
<p>as you can test yourself with the class:</p>
<h3>Main.java</h3>

<pre><code>import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;

public class Main {
  public static void main(String[] args) throws Exception {
    freemarker_simpleLexer lexer = new freemarker_simpleLexer(new ANTLRFileStream("test.html"));
    freemarker_simpleParser parser = new freemarker_simpleParser(new CommonTokenStream(lexer));
    CommonTree tree = (CommonTree)parser.parse().getTree();
    DOTTreeGenerator gen = new DOTTreeGenerator();
    StringTemplate st = gen.toDOT(tree);
    System.out.println(st);
  }
}
</code></pre>
<hr />
<h2>EDIT 1</h2>
<p>When I run your example input with a parser generated from the second grammar you posted, the following are wthe first 5 lines being printed to the console (not counting the many warnings that are generated):</p>

<pre><code>ID abc 2
RAW 
 0
RAW < 0
ID html 0
...
</code></pre>
<hr />
<h2>EDIT 2</h2>
<blockquote>
<p><em><strong>Bood wrote:</strong></em></p>
<p>Also tried the 2nd approach with Bart's grammar, still didn't work the 'html' is recognized as an ID, which should be 4 RAWs. When mmode=false, shouldn't RAW get matched first? Or the lexer still chooses the longest match here?</p>
</blockquote>
<p>Yes, that is correct: ANTLR chooses the longer match in that case.</p>
<p>But now that I (finally :)) see what you're trying to do, here's a last proposal: you could let the <code>RAW</code> rule match characters as long as the rule can't see one of the following character sequences ahead: <code>"<#"</code>, <code>"</#"</code> or <code>"${"</code>. Note that the rule must still stay at the end in the grammar. This check is performed inside the lexer. Also, in that case you don't need the <code>merge(...)</code> method in the parser:</p>
<pre class="lang-java prettyprint-override"><code>grammar freemarker_simple;

options {
  output=AST;
  ASTLabelType=CommonTree;
}

tokens {
  FILE;
  OUTPUT;
  RAW_BLOCK;
}

@lexer::members {
  
  private boolean mmode = false;
  
  private boolean rawAhead() {
    if(mmode) return false;
    int ch1 = input.LA(1), ch2 = input.LA(2), ch3 = input.LA(3);
    return !(
        (ch1 == '<' && ch2 == '#') ||
        (ch1 == '<' && ch2 == '/' && ch3 == '#') ||
        (ch1 == '$' && ch2 == '{')
    );
  }
}

parse
  :  content* EOF -> ^(FILE content*)
  ;

content
  :  RAW
  |  if_stat
  |  output
  ;

if_stat
  :  TAG_START IF expression TAG_END RAW TAG_END_START IF TAG_END -> ^(IF expression RAW)
  ;

output
  :  OUTPUT_START expression OUTPUT_END -> ^(OUTPUT expression)
  ;

expression
  :  eq_expression
  ;

eq_expression
  :  atom (EQUALS^ atom)*
  ;

atom
  :  STRING
  |  ID
  ;

OUTPUT_START  : '${'  {mmode=true;};
TAG_START     : '<#'  {mmode=true;};
TAG_END_START : '</' ('#' {mmode=true;} | ~'#' {$type=RAW;});

OUTPUT_END    : '}' {mmode=false;};
TAG_END       : '>' {mmode=false;};

EQUALS        : '==';
IF            : 'if';
STRING        : '"' ~'"'* '"';
ID            : ('a'..'z' | 'A'..'Z')+;
SPACE         : (' ' | '\t' | '\r' | '\n')+ {skip();};

RAW           : ({rawAhead()}?=> . )+;
</code></pre>
<p>The grammar above will produce the following AST from the input posted at the start of this answer:</p>
<p><img src="https://i.stack.imgur.com/uKjsn.png" alt="enter image description here" /></p>
                                                                        <div class="appendcontent">
                                                        </div>
            </div>
            <div class="jieda-reply">
              <span class="jieda-zan button_agree" type="zan" data-id='2067184'>
                <i class="iconfont icon-zan"></i>
                <em>0</em>
              </span>
                 <span type="reply" class="showpinglun" data-id="2067184">
                <i class="iconfont icon-svgmoban53"></i>
               讨论(0)
              </span>
              
                                                   
              <div class="jieda-admin">
                                                            </div>
            </div>
                      <div class="comments-mod "  style="display: none; float:none;padding-top:10px;" id="comment_2067184">
                    <div class="areabox clearfix">

<form class="layui-form" action="">
               
            <div class="layui-form-item">
    <label class="layui-form-label" style="padding-left:0px;width:60px;">发布评论:</label>
    <div class="layui-input-block" style="margin-left:90px;">
         <input type="text" placeholder="不少于5个字" AUTOCOMPLETE="off" class="comment-input layui-input" name="content" />
                        <input type='hidden' value='0' name='replyauthor' />
    </div>
    <div class="mar-t10"><span class="fr layui-btn layui-btn-sm addhuidapinglun" data-id="2067184">提交评论 </span></div>
  </div>
  
</form>
                    </div>
                    <hr>
                    <ul class="my-comments-list nav">
                        <li class="loading">
                        <img src='https://www.e-learn.cn/qa/static/css/default/loading.gif' align='absmiddle' />
                         加载中...
                        </li>
                    </ul>
                </div>
          </li>
          	          
           <style>
   
.laypage-main a, .laypage-main span {
    display: inline-block;
}
        </style>                  </ul>
        
        <div class="layui-form layui-form-pane">
          <form id="huidaform"  name="answerForm"  method="post">
            
            <div class="layui-form-item layui-form-text">
              <a name="comment"></a>
              <div class="layui-input-block">
            
    
<script type="text/javascript" src="https://www.e-learn.cn/qa/static/js/neweditor/ueditor.config.js"></script>
<script type="text/javascript" src="https://www.e-learn.cn/qa/static/js/neweditor/ueditor.all.js"></script>
<script type="text/plain" id="editor"  name="content"  style="width:100%;height:200px;"></script>                                 
<script type="text/javascript">
                                 var isueditor=1;
            var editor = UE.getEditor('editor',{
                //这里可以选择自己需要的工具按钮名称,此处仅选择如下五个
                toolbars:[['source','fullscreen',  '|', 'undo', 'redo', '|', 'bold', 'italic', 'underline', 'fontborder', 'strikethrough', 'removeformat', 'formatmatch', 'autotypeset', 'blockquote', 'pasteplain', '|', 'forecolor', 'backcolor', 'insertorderedlist', 'insertunorderedlist', 'selectall', 'cleardoc', '|', 'rowspacingtop', 'rowspacingbottom', 'lineheight', '|', 'customstyle', 'paragraph', 'fontfamily', 'fontsize', '|', 'indent', '|', 'justifyleft', 'justifycenter', 'justifyright', 'justifyjustify', '|', 'link', 'unlink', 'anchor', '|', 'simpleupload', 'insertimage', 'scrawl', 'insertvideo', 'attachment', 'map', 'insertcode', '|', 'horizontal', '|', 'preview', 'searchreplace', 'drafts']],
            
                initialContent:'',
                //关闭字数统计
                wordCount:false,
                zIndex:2,
                //关闭elementPath
                elementPathEnabled:false,
                //默认的编辑区域高度
                initialFrameHeight:250
                //更多其他参数,请参考ueditor.config.js中的配置项
                //更多其他参数,请参考ueditor.config.js中的配置项
            });
                        editor.ready(function() {
            	editor.setDisabled();
            	});
                            $("#editor").find("*").css("max-width","362px");
        </script>              </div>
            </div>
                          
    

        
         <div class="layui-form-item">
                <label for="L_vercode" class="layui-form-label">验证码</label>
                <div class="layui-input-inline">
                  <input type="text"  id="code" name="code"   value="" required lay-verify="required" placeholder="图片验证码" autocomplete="off" class="layui-input">
                </div>
                <div class="layui-form-mid">
                  <span style="color: #c00;"><img class="hand" src="https://www.e-learn.cn/qa/user/code.html" onclick="javascript:updatecode();" id="verifycode"><a class="changecode"  href="javascript:updatecode();"> 看不清?</a></span>
                </div>
              </div>
                                  <div class="layui-form-item">
                    <input type="hidden" value="1015321" id="ans_qid" name="qid">
   <input type="hidden" id="tokenkey" name="tokenkey" value=''/>
                <input type="hidden" value="Different lexer rules in different state" id="ans_title" name="title"> 
             
              <div class="layui-btn    layui-btn-disabled"  id="ajaxsubmitasnwer" >提交回复</div>
            </div>
          </form>
        </div>
      </div>
      <input type="hidden" value="1015321" id="adopt_qid"	name="qid" /> 
      <input type="hidden" id="adopt_answer" value="0"	name="aid" />
    </div>
    <div class="layui-col-md4">
          
 <!-- 热门讨论问题 -->
     
 <dl class="fly-panel fly-list-one">
        <dt class="fly-panel-title">热议问题</dt>
            <!-- 本周热门讨论问题显示10条-->