Parsing markup into Abstract Syntax Tree using Regular Expression

Submitted by 房东的猫 on 2019-12-25 05:01:41


This question is supplementary to: Recursive processing of markup using Regular Expression and DOMDocument

The code supplied by the selected answer has been a great help in understanding how to build a basic syntax tree. However, I am now having trouble tightening the regular expressions so that they match only my syntax, rather than { followed by any character (but not {{). Ideally I would like it to match only my syntax, which is:


Two tags, a and small, also require end tags that differ from their start tags. I have tried modifying $re_closetag from the original code sample to reflect this, but it still matches too much as text.

For example: {<>} bang {>smäll<} boom

My test string is:

tëstïng {{ 汉字/漢字 }} testing {<>} bang {>smäll<} boom {* strông{/ ëmphäsïs {- strïkë {| côdë |} -} /} *} {*wôw*} 1, 2, 3


You can either control this in the RE itself or after a match.

In the RE, to control which tags may be "open", modify this part of $re_next:

(?:\{(?P<opentag>[^{\s]))  # match an open tag
      #which is "{" followed by anything other than whitespace or another "{"

Currently it looks for any character which is not { or whitespace. Simply replace the negated class [^{\s] with a class that lists only your own tag characters.


Now it looks for only your specific open tags.
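
As a rough illustration of restricting the open-tag class, the sketch below assumes the allowed tag characters are the six that appear in the test string (*, /, -, |, < and >). The variable name $re_opentag, the ~ delimiter and the shortened sample string are assumptions made for the demo; they are not taken from the original $re_next.

```php
<?php
// Assumption: allowed open-tag characters are * / - | < >
// (the hyphen is placed last in the class so it is read literally).
$re_opentag = '~\{(?P<opentag>[*/|<>-])~u';

// Shortened variant of the test string, plus one disallowed tag "{@".
$s = 'tëstïng {{ 汉字/漢字 }} testing {<>} bang {>smäll<} boom {*wôw*} {@ nope';

preg_match_all($re_opentag, $s, $matches);

// Only "{<", "{>" and "{*" qualify; "{{", "{ " and "{@" are skipped.
print_r($matches['opentag']); // expected entries: '<', '>', '*'
```

With a class like this, anything such as {{ or {@ never reaches the open-tag branch at all, so it is left for the text branch of the pattern to consume.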

The close tag portion only matches a single character at a time, depending on which tag is open in the current context. (This is what the $opentag argument is for.) So to close a tag with a different character than the one that opened it, simply change the $opentag value passed to the recursive call. E.g.:

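        if (isset($m['opentag']) && $m['opentag'][1] !== -1) {
            list($newopen, $_) = $m['opentag'];

            // choose the close character to look for in the new context;
            // most tags close with the same character, but > and < swap
            $newclose = $newopen;
            if ($newopen === '>') $newclose = '<';
            else if ($newopen === '<') $newclose = '>';

            // recurse with the expected close character, but label the
            // node with the character that actually opened it
            list($subast, $offset) = str_to_ast($s, $offset, array(), $newclose);
            $ast[] = array($newopen, $subast);
        } else if (isset($m['text']) && $m['text'][1] !== -1) {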
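
If more asymmetric tags are added later, the if/else chain could be replaced by a lookup table. The following is only a sketch: close_char_for() and close_tag_regex() are hypothetical helpers that do not exist in the original code, and the way the close-tag pattern is built here is an assumption about how $re_closetag might be derived.

```php
<?php
// Hypothetical helper: map an open-tag character to the character
// expected immediately before the closing "}".
function close_char_for($open)
{
    // Only tags whose close character differs need an entry.
    static $pairs = array('>' => '<', '<' => '>');
    return isset($pairs[$open]) ? $pairs[$open] : $open;
}

// Hypothetical helper: build a close-tag pattern for the current context.
function close_tag_regex($open)
{
    // preg_quote() escapes characters such as * and | that are special in a regex.
    return '~' . preg_quote(close_char_for($open), '~') . '\}~u';
}

// Inside a {>...<} context the close tag is "<}", not ">}".
var_dump(preg_match(close_tag_regex('>'), 'smäll<} boom')); // int(1)
var_dump(preg_match(close_tag_regex('*'), 'smäll<} boom')); // int(0)
```

Keeping the mapping in one place means the recursive call only needs close_char_for($newopen), and symmetric tags such as * or | fall through unchanged.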

Alternatively, you can keep the RE as-is and decide what to do with the match after the fact. For example, if you match an @ character but {@ is not an allowed open tag, you can raise a parse error, treat it as a text node (attaching array('#text', '{@') to the AST), or do anything in between.
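
The post-match approach might look roughly like the sketch below. The function name handle_open_candidate(), its $strict flag and the list of allowed characters (taken from the test string) are all assumptions for illustration; none of them come from the original code.

```php
<?php
// Hypothetical post-match check for a character captured after "{".
// Returns null for a real open tag (the caller would recurse as usual),
// or a '#text' node when the character is not an allowed tag.
function handle_open_candidate($char, $strict = false)
{
    // Assumption: the allowed tag characters are those in the test string.
    $allowed = array('*', '/', '-', '|', '<', '>');

    if (in_array($char, $allowed, true)) {
        return null;
    }
    if ($strict) {
        // Strict mode: report the unknown tag as a parse error.
        throw new UnexpectedValueException('Unknown open tag: {' . $char);
    }
    // Lenient mode: keep the two characters as literal text.
    return array('#text', '{' . $char);
}

var_dump(handle_open_candidate('*')); // NULL
var_dump(handle_open_candidate('@')); // array('#text', '{@')
```

Whether to throw or to fall back to text is a design choice: throwing surfaces typos in the markup early, while the lenient path lets stray braces in ordinary prose pass through untouched.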

