This is what I have so far:
/{% if(.+?) %}(.*?){% endif %}/gusi
It catches multiple if statements, etc., just fine.
Don't use regexes; use the existing Twig parser. Here's a sample of an extractor I wrote that parses for custom tags and extracts them: https://github.com/deceze/Twig-extensions/tree/master/lib/Twig/Extensions/Extension/Gettext
The job of the lexer is to turn Twig source code into objects; you can extend it if you need to hook into that process:
class My_Twig_Lexer extends Twig_Lexer {
    ...
    /**
     * Overrides lexComment by saving comment tokens into $this->commentTokens
     * instead of just ignoring them.
     */
    protected function lexComment() {
        if (!preg_match($this->regexes['lex_comment'], $this->code, $match, PREG_OFFSET_CAPTURE, $this->cursor)) {
            throw new Twig_Error_Syntax('Unclosed comment', $this->lineno, $this->filename);
        }
        $value = substr($this->code, $this->cursor, $match[0][1] - $this->cursor);
        $token = new Twig_Extensions_Extension_Gettext_Token(Twig_Extensions_Extension_Gettext_Token::COMMENT, $value, $this->lineno);
        $this->commentTokens[] = $token;
        $this->moveCursor($value . $match[0][0]);
    }
    ...
}
Twig normally discards comment tokens; this lexer saves them.
However, your main concern will be to work with the parser:
$twig = new Twig_Environment(new Twig_Loader_String);
$lexer = new My_Twig_Lexer($twig);
$parser = new Twig_Parser($twig);
$source = file_get_contents($file);
$tokens = $lexer->tokenize($source);
$node = $parser->parse($tokens);
processNode($node);
$node here is the root node of a tree of nodes which represent the Twig source in an object-oriented fashion, all correctly parsed already. You just need to process this tree without having to worry about the exact syntax that was used to produce it:
function processNode(Twig_NodeInterface $node) {
    switch (true) {
        case $node instanceof Twig_Node_Expression_Function :
            processFunctionNode($node);
            break;
        case $node instanceof Twig_Node_Expression_Filter :
            processFilterNode($node);
            break;
    }
    // Recurse into child nodes
    foreach ($node as $child) {
        if ($child instanceof Twig_NodeInterface) {
            processNode($child);
        }
    }
}
Just traverse it until you find the kind of node you're looking for and get its information. Play around with it a bit. This example code may be a bit outdated, and you'll have to dig into the Twig parser source code anyway to understand it.
It is almost trivial to change your pattern into a recursive pattern:
{% if(.+?) %}((?>(?R)|.)*?){% endif %}
Working example: https://regex101.com/r/gX8rM0/1
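For anyone who wants to try this from PHP rather than regex101, here is a minimal sketch comparing the two patterns. The sample template is made up for illustration, and the g flag is dropped because PHP's preg_* functions do not accept it (preg_match_all() covers global matching).

```php
<?php
// Sample template with one nested if block (made up for illustration).
$template = '{% if a %}x{% if b %}y{% endif %}z{% endif %}';

// Original pattern from the question, minus the "g" flag.
$flat = '/{% if(.+?) %}(.*?){% endif %}/usi';

// Recursive variant: (?R) re-enters the whole pattern for nested blocks.
$recursive = '/{% if(.+?) %}((?>(?R)|.)*?){% endif %}/usi';

preg_match($flat, $template, $flatMatch);
preg_match($recursive, $template, $recursiveMatch);

// The flat pattern stops at the first endif it sees.
echo "flat body:      ", $flatMatch[2], "\n";      // x{% if b %}y

// The recursive pattern consumes the inner block as a unit.
echo "recursive body: ", $recursiveMatch[2], "\n"; // x{% if b %}y{% endif %}z
```

The inner block still has to be extracted by running the pattern again on the captured body, since capture groups set inside the recursion are not kept after it returns.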
However, that would be a bad idea: the pattern misses many cases, each of which is effectively a bug in your parser. Just a few common examples:
Comments:
{% if aaa %}
123
<!-- {% endif %} -->
{% endif %}
String literals:
{% if aaa %}a = "{% endif %}"{% endif %}
{% if $x == "{% %}" %}...{% endif %}
Escaped characters (you do need escaped characters, right?):
<p>To start a condition, use <code>\{% if aaa %}</code></p>
Invalid input:
It would be nice if the parser could cope reasonably well with invalid input and point to the correct position of the error.
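To make the string-literal cases concrete, here is a small sketch that runs the recursive pattern from earlier against those two inputs in PHP. The variable names are only illustrative; the inputs are the ones listed under "String literals".

```php
<?php
// Recursive pattern from earlier, written as a PHP pattern string.
$recursive = '/{% if(.+?) %}((?>(?R)|.)*?){% endif %}/usi';

// Case 1: the quoted {% endif %} is mistaken for the real closing tag.
$quoted = '{% if aaa %}a = "{% endif %}"{% endif %}';
preg_match($recursive, $quoted, $m1);
echo $m1[0], "\n"; // {% if aaa %}a = "{% endif %}

// Case 2: the condition contains "{% %}", so the condition capture
// ends at the first " %}" it finds, inside the string literal.
$condition = '{% if $x == "{% %}" %}...{% endif %}';
preg_match($recursive, $condition, $m2);
echo $m2[1], "\n"; // prints: $x == "{% (with a leading space)
```

In both cases the regex reports a successful match, so nothing signals that the block was cut in the wrong place. A real parser that tokenizes string literals first does not have this problem.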