How to catch nested {% if … %}{% endif %} statements with regex

你离开我真会死。 Submitted on 2019-12-02 14:23:54

Don't use regexen; use the existing Twig parser. Here's a sample of an extractor I wrote which parses for custom tags and extracts them: https://github.com/deceze/Twig-extensions/tree/master/lib/Twig/Extensions/Extension/Gettext

The job of the lexer is to turn Twig source code into a stream of token objects; you can extend it if you need to hook into that process:

class My_Twig_Lexer extends Twig_Lexer {

    ...

    /**
     * Overrides lexComment by saving comment tokens into $this->commentTokens
     * instead of just ignoring them.
     */
    protected function lexComment() {
        if (!preg_match($this->regexes['lex_comment'], $this->code, $match, PREG_OFFSET_CAPTURE, $this->cursor)) {
            throw new Twig_Error_Syntax('Unclosed comment', $this->lineno, $this->filename);
        }
        $value = substr($this->code, $this->cursor, $match[0][1] - $this->cursor);
        $token = new Twig_Extensions_Extension_Gettext_Token(Twig_Extensions_Extension_Gettext_Token::COMMENT, $value, $this->lineno);
        $this->commentTokens[] = $token;
        $this->moveCursor($value . $match[0][0]);
    }

    ...

}

Twig normally discards comment tokens; this lexer saves them instead.

However, your main concern will be to work with the parser:

$twig   = new Twig_Environment(new Twig_Loader_String);
$lexer  = new My_Twig_Lexer($twig);
$parser = new Twig_Parser($twig);

$source = file_get_contents($file);
$tokens = $lexer->tokenize($source);
$node   = $parser->parse($tokens);
processNode($node);

$node here is the root node of a tree of nodes which represent the Twig source in an object oriented fashion, all correctly parsed already. You just need to process this tree without having to worry about the exact syntax which was used to produce it:

 function processNode(Twig_NodeInterface $node) {
      switch (true) {
          case $node instanceof Twig_Node_Expression_Function :
              processFunctionNode($node);
              break;
          case $node instanceof Twig_Node_Expression_Filter :
              processFilterNode($node);
              break;
      }

      foreach ($node as $child) {
          if ($child instanceof Twig_NodeInterface) {
              processNode($child);
          }
      }
 }

Just traverse it until you find the kind of node you're looking for and get its information. Play around with it a bit. This example code may be a bit outdated, so you'll have to dig into the Twig parser source code anyway to understand it.
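
To see the lex, parse, walk shape in isolation, here is a toy sketch that does not use Twig at all. Every name in it (ToyNode, toyTokenize, toyParse, collectConditions) is hypothetical and made up for this illustration; it only understands if/endif and is nowhere near what the real Twig lexer and parser handle. It just shows why nesting stops being a problem once you have a tree:

```php
<?php
// Toy illustration only: NOT Twig's lexer or parser. All names are hypothetical.

final class ToyNode {
    public $kind;            // 'root', 'if', 'tag' or 'text'
    public $value;           // condition, tag name or raw text
    public $children = [];

    public function __construct($kind, $value = '') {
        $this->kind  = $kind;
        $this->value = $value;
    }
}

/** Split the source into raw text chunks and {% ... %} tag tokens. */
function toyTokenize($source) {
    $parts  = preg_split('/(\{%.*?%\})/s', $source, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
    $tokens = [];
    foreach ($parts as $part) {
        if (preg_match('/^\{%\s*(\w+)\s*(.*?)\s*%\}$/s', $part, $m)) {
            $tokens[] = ['type' => 'tag', 'name' => $m[1], 'args' => $m[2]];
        } else {
            $tokens[] = ['type' => 'text', 'value' => $part];
        }
    }
    return $tokens;
}

/** Build a tree: "if" opens a node, "endif" closes the innermost open one. */
function toyParse(array $tokens) {
    $root  = new ToyNode('root');
    $stack = [$root];
    foreach ($tokens as $token) {
        $current = end($stack);
        if ($token['type'] === 'text') {
            $current->children[] = new ToyNode('text', $token['value']);
        } elseif ($token['name'] === 'if') {
            $node = new ToyNode('if', $token['args']);
            $current->children[] = $node;
            $stack[] = $node;
        } elseif ($token['name'] === 'endif') {
            if (count($stack) === 1) {
                throw new RuntimeException('endif without a matching if');
            }
            array_pop($stack);
        } else {
            $current->children[] = new ToyNode('tag', $token['name']);
        }
    }
    if (count($stack) > 1) {
        throw new RuntimeException('Unclosed if');
    }
    return $root;
}

/** Same recursive shape as processNode() above: inspect the node, then descend. */
function collectConditions(ToyNode $node, $depth = 0) {
    $found = [];
    if ($node->kind === 'if') {
        $found[] = [$depth, $node->value];
        $depth++;
    }
    foreach ($node->children as $child) {
        $found = array_merge($found, collectConditions($child, $depth));
    }
    return $found;
}

$tree = toyParse(toyTokenize('{% if a %}x{% if b %}y{% endif %}z{% endif %}'));
print_r(collectConditions($tree)); // pairs of [nesting depth, condition]
```

With the real Twig classes you get the tokenizing and tree building for free and only write the last function.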

It is almost trivial to change your pattern into a recursive pattern:

{% if(.+?) %}((?>(?R)|.)*?){% endif %}

Working example: https://regex101.com/r/gX8rM0/1
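
If you want to try it from PHP rather than regex101, a minimal sketch could look like this. The ~ delimiters and the s flag (so that . also matches newlines) are my assumptions, and the template string is just a made-up test input:

```php
<?php
// The recursive pattern from above. The "s" flag is an assumption so that
// "." can cross line breaks in multi-line templates.
$pattern = '~{% if(.+?) %}((?>(?R)|.)*?){% endif %}~s';

// Made-up input with one level of nesting.
$template = '{% if a %}x{% if b %}y{% endif %}z{% endif %}';

if (preg_match($pattern, $template, $m)) {
    echo 'whole match: ' . $m[0] . "\n";       // outer block, inner block included
    echo 'condition:   ' . trim($m[1]) . "\n"; // condition of the outer if
    echo 'body:        ' . $m[2] . "\n";       // everything between outer if and endif
}
```

(?R) re-enters the whole pattern at the current position, so an inner {% if %}…{% endif %} is consumed as one unit before the outer {% endif %} is looked for.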

However, that would be a bad idea: the pattern is missing many cases, which are really bugs in your parser. Just a few common examples:

  • Comments:

    {% if aaa %}
    123
    <!-- {% endif %} -->
    {% endif %}
    
  • String literals:

    {% if aaa %}a = "{% endif %}"{% endif %}
    
    {% if $x == "{% %}" %}...{% endif %}
    
  • Escaped characters (you do need escaped characters, right?):

    <p>To start a condition, use <code>\{% if aaa %}</code></p>
    
  • Invalid input:
    It would be nice if the parser could work relatively well on invalid input and point to the correct position of the error.
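
A quick way to see the first two problems is to run the recursive pattern over those inputs. This is only a sketch: the s flag and the $cases / $results names are my own, and whether a comment or a string literal should hide a tag depends on the rules of your template language.

```php
<?php
// Same recursive pattern as above; the "s" flag is an assumption.
$pattern = '~{% if(.+?) %}((?>(?R)|.)*?){% endif %}~s';

// Inputs taken from the list above.
$cases = [
    'comment'   => "{% if aaa %}\n123\n<!-- {% endif %} -->\n{% endif %}",
    'string'    => '{% if aaa %}a = "{% endif %}"{% endif %}',
    'condition' => '{% if $x == "{% %}" %}...{% endif %}',
];

$results = [];
foreach ($cases as $name => $template) {
    preg_match($pattern, $template, $m);
    $results[$name] = ['match' => $m[0], 'condition' => trim($m[1])];
    printf("%-10s matched %d of %d bytes, condition: %s\n",
        $name, strlen($m[0]), strlen($template), trim($m[1]));
}
```

In the first two cases the match ends at the {% endif %} inside the comment or the string, so the real closing tag is left over. In the third case the whole block is matched, but the captured condition is cut off at the %} inside the string literal.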
