Regex / DOMDocument - match and replace text not in a link

轮回少年 2020-12-01 07:06

I need to find and replace all matches of a piece of text case-insensitively, unless the text is inside an anchor tag. For example:

Match this text and re

7 answers
  • 2020-12-01 07:25
    $a='<p>Match this text and replace it</p>
    <p>Don\'t <a href="/">match this text</a></p>
    <p>We still need to match this text and replace it</p>';
    
    echo preg_replace('~match this text(?![^<]*</a>)~i','replacement',$a);
    

    The negative lookahead ensures the replacement happens only if the next tag is not a closing link tag (</a>). It works fine with your example, though it won't work if you happen to use other tags inside your links.
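
    To make that limitation concrete, here is a minimal sketch that runs the same pattern on two made-up inputs (the sample strings and variable names are assumptions for illustration, not part of the question):

```php
<?php
// Same pattern as above: skip a match when the next tag after it is </a>.
$pattern = '~match this text(?![^<]*</a>)~i';

// Plain link: the lookahead finds "</a>" right after the text, so nothing is replaced.
$plain = preg_replace($pattern, 'replacement', '<a href="/">match this text</a>');

// Nested tag inside the link: the next tag is </strong>, not </a>,
// so the lookahead does not block the match and the link text is replaced.
$nested = preg_replace($pattern, 'replacement', '<a href="/"><strong>match this text</strong></a>');

echo $plain, "\n", $nested, "\n";
```

    The second call rewrites text that sits inside a link, which is the case where a DOM-based approach is the safer choice.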

  • 2020-12-01 07:27

    You can use PHP Simple HTML DOM Parser. It is similar to DOMDocument, but in my opinion it's simpler to use. Here is the alternative in parallel with Netcoder's DomDocument solution:

    function replaceWithSimpleHtmlDom($html_content, $search, $replace, $excludedParents = array()) {
        require_once('simple_html_dom.php');
        $html = str_get_html($html_content);
        foreach ($html->find('text') as $element) {
            if (!in_array($element->parent()->tag, $excludedParents))
                $element->innertext = str_ireplace($search, $replace, $element->innertext);
        }
        return (string)$html;
    }
    

    I have just profiled this code against my DOMDocument solution (which prints the exact same output), and DOMDocument is (not surprisingly) way faster (~4 ms versus ~77 ms).

  • 2020-12-01 07:31

    This is the stackless non-recursive approach using pre-order traversal of the DOM tree.

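      libxml_use_internal_errors(TRUE);   // collect parser warnings instead of printing them
      $dom=new DOMDocument('1.0','UTF-8');
    
      $dom->substituteEntities=FALSE;
      $dom->recover=TRUE;
      $dom->strictErrorChecking=FALSE;
    
      // $file (path to the HTML file) and $replacement are assumed to be set by the caller
      $dom->loadHTMLFile($file);
      $root=$dom->documentElement;
      $node=$root;
      $flag=FALSE;   // TRUE right after climbing back to a parent whose subtree is done
      for (;;) {
          if (!$flag) {
              // first visit of this node: replace text unless it sits directly in a link
              if ($node->nodeType==XML_TEXT_NODE &&
                  $node->parentNode->tagName!='a') {
                  $node->nodeValue=preg_replace(
                      '/match this text/is',
                      $replacement, $node->nodeValue
                  );
              }
              if ($node->firstChild) {
                  $node=$node->firstChild;
                  continue;
              }
          }
          if ($node->isSameNode($root)) break;
          if ($node->nextSibling) {
              $node=$node->nextSibling;   // visit the sibling's subtree next
              $flag=FALSE;
          } else {
              $node=$node->parentNode;    // subtree finished: climb without descending again
              $flag=TRUE;
          }
      }
      echo $dom->saveHTML();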
    

    libxml_use_internal_errors(TRUE); and the three property assignments after $dom=new DOMDocument; should let the parser cope with most malformed HTML.
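
    As a rough sketch of what that error handling looks like in isolation: the snippet below loads a deliberately broken fragment (the sample markup is an assumption for illustration) and then reads back whatever libxml recorded, using libxml_get_errors() and libxml_clear_errors(). Which messages appear, if any, depends on the libxml version.

```php
<?php
// Buffer libxml's parse warnings internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$dom = new DOMDocument('1.0', 'UTF-8');
$dom->recover = true;
$dom->strictErrorChecking = false;

// Deliberately malformed sample: unclosed <b> and <a>, stray </div>.
$html = '<p>Match this text <b>and <a href="/">match this text</p></div>';
$loaded = $dom->loadHTML($html);

// Read the buffered errors, then clear the buffer so it does not grow.
$errors = libxml_get_errors();
libxml_clear_errors();

foreach ($errors as $error) {
    echo trim($error->message), ' (line ', $error->line, ")\n";
}

// The document is still usable after recovery.
$paragraphs = $dom->getElementsByTagName('p')->length;
```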

  • 2020-12-01 07:32

    Try this one:

    $dom = new DOMDocument;
    $dom->loadHTML($html_content);
    
    function preg_replace_dom($regex, $replacement, DOMNode $dom, array $excludeParents = array()) {
      if (!empty($dom->childNodes)) {
        foreach ($dom->childNodes as $node) {
          if ($node instanceof DOMText && 
              !in_array($node->parentNode->nodeName, $excludeParents)) 
          {
            $node->nodeValue = preg_replace($regex, $replacement, $node->nodeValue);
          } 
          else
          {
            preg_replace_dom($regex, $replacement, $node, $excludeParents);
          }
        }
      }
    }
    
    preg_replace_dom('/match this text/i', 'IT WORKS', $dom->documentElement, array('a'));
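
    Note that the function above only looks at the direct parent of each text node, so text in a tag nested inside a link (for example a strong inside an a) would still be replaced. A possible variant, sketched here under the hypothetical name preg_replace_dom_pruned, simply does not descend into excluded elements at all:

```php
<?php
// Hypothetical variant: never descend into elements listed in $excluded,
// so text at any depth inside them is left untouched.
function preg_replace_dom_pruned(string $regex, string $replacement, DOMNode $node, array $excluded = []): void
{
    if (!$node->hasChildNodes()) {
        return;
    }
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMText) {
            $child->nodeValue = preg_replace($regex, $replacement, $child->nodeValue);
        } elseif (!in_array($child->nodeName, $excluded, true)) {
            preg_replace_dom_pruned($regex, $replacement, $child, $excluded);
        }
    }
}

$dom = new DOMDocument;
$dom->loadHTML('<p>Match this text</p><p><a href="#">a <strong>match this text</strong> link</a></p>');

preg_replace_dom_pruned('/match this text/i', 'IT WORKS', $dom->documentElement, ['a']);
$result = $dom->saveHTML();
echo $result;
```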
    
  • 2020-12-01 07:33

    Parsing HTML with regexes is a huge challenge, and the patterns can very easily end up too complex and take up loads of memory. I would say the best way is to do this:

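    // $html is assumed to hold the markup being processed
    $html = preg_replace('/match this text/i', 'replacement text', $html);
    $html = preg_replace('/(<a\b[^>]*>(?:(?!<\/a>).)*?)replacement text((?:(?!<\/a>).)*<\/a>)/is', '${1}match this text${2}', $html);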
    

    If your replacement text is something which might occur otherwise, you might want to add an intermediate step with some unique identifier.
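
    One way to sketch the unique-identifier idea is a variant of the above: instead of replacing everywhere and reverting inside links, each link is first swapped for a unique token, the replacement runs on what is left, and the links are put back afterwards. The function name replace_outside_links and the token format are assumptions for illustration, and the sketch assumes the search text contains at least one character that is neither a digit nor a NUL byte, so it cannot match part of a token.

```php
<?php
// Hypothetical helper: case-insensitive replacement that skips <a>...</a> blocks.
function replace_outside_links(string $html, string $search, string $replacement): string
{
    $stash = [];

    // 1. Swap every link for a unique token (NUL + index + NUL) and remember the original.
    $protected = preg_replace_callback(
        '~<a\b[^>]*>.*?</a>~is',
        function (array $m) use (&$stash): string {
            $token = "\x00" . count($stash) . "\x00";
            $stash[$token] = $m[0];
            return $token;
        },
        $html
    );

    // 2. Replace in the remaining markup; the links are out of reach.
    $replaced = str_ireplace($search, $replacement, $protected);

    // 3. Restore the original links, untouched and in their original case.
    return $stash ? strtr($replaced, $stash) : $replaced;
}

$out = replace_outside_links(
    '<p>Match this text and replace it</p><p>Don\'t <a href="/">match this text</a></p>',
    'match this text',
    'replacement'
);
echo $out, "\n";
```

    Like any regex-based handling of HTML, this still assumes that links are not nested and that the markup is reasonably well formed.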

  • Here is a UTF-8-safe solution, which works not only with properly formatted documents but also with document fragments.

    The mb_convert_encoding call is needed because loadHtml() seems to have a bug with UTF-8 encoding (see the references below).

    The mb_substr call trims the body tag from the output, so you get back your original content without any additional markup.

    <?php
    $html = '<p>Match this text and replace it</p>
    <p>Don\'t <a href="/">match this text</a></p>
    <p>We still need to match this text and replace itŐŰ</p>
    <p>This is <a href="#">a link <span>with <strong>don\'t match this text</strong> content</span></a></p>';
    
    $dom = new DOMDocument();
    // loadXml needs properly formatted documents, so it's better to use loadHtml, but it needs a hack to properly handle UTF-8 encoding
    $dom->loadHtml(mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8"));
    
    $xpath = new DOMXPath($dom);
    
    foreach($xpath->query('//text()[not(ancestor::a)]') as $node)
    {
        $replaced = str_ireplace('match this text', 'MATCH', $node->wholeText);
        $newNode  = $dom->createDocumentFragment();
        $newNode->appendXML($replaced);
        $node->parentNode->replaceChild($newNode, $node);
    }
    
    // get only the body tag with its contents, then trim the body tag itself to get only the original content
    echo mb_substr($dom->saveXML($xpath->query('//body')->item(0)), 6, -7, "UTF-8");
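
    As a side note, the 'HTML-ENTITIES' target of mb_convert_encoding is deprecated as of PHP 8.2. A commonly used alternative, sketched below as an assumption rather than as part of the solution above, is to prepend an XML-style encoding hint to the fragment and to rebuild the fragment from the children of body instead of trimming the body tag with mb_substr:

```php
<?php
libxml_use_internal_errors(true);

$html = '<p>Match this text and replace itŐŰ</p><p>Second paragraph</p>';

$dom = new DOMDocument();
// Assumption: the prolog only serves as an encoding hint for libxml;
// it is not part of the fragment itself.
$dom->loadHTML('<?xml encoding="UTF-8">' . $html);

// Serialize the children of <body> one by one, so no wrapper tag needs trimming.
$body = $dom->getElementsByTagName('body')->item(0);
$fragment = '';
foreach ($body->childNodes as $child) {
    $fragment .= $dom->saveHTML($child);
}

// The text is decoded as UTF-8, so the accented characters survive.
$firstParagraph = $dom->getElementsByTagName('p')->item(0)->textContent;
echo $fragment, "\n";
```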
    

    References:
    1. find and replace keywords by hyperlinks in an html fragment, via php dom
    2. Regex / DOMDocument - match and replace text not in a link
    3. php problem with russian language
    4. Why Does DOM Change Encoding?

    I read dozens of answers on the subject, so I am sorry if I forgot somebody (please leave a comment and I will add yours as well).

    Thanks to Gordon and stillstanding for commenting on my other answer.
