preg_replace only OUTSIDE tags ? (… we're not talking full 'html parsing', just a bit of markdown)

问题

What is the easiest way of applying highlighting of some text excluding text within OCCASIONAL tags "<...>"?

CLARIFICATION: I want the existing tags PRESERVED!

$t = 
preg_replace(
  "/(markdown)/",
  "<strong>$1</strong>",
"This is essentially plain text apart from a few html tags generated with some
simplified markdown rules: <a href=markdown.html>[see here]</a>");

Which should display as:

"This is essentially plain text apart from a few html tags generated with some simplified markdown rules: see here"

... BUT NOT MESS UP the text inside the anchor tag (i.e. <a href=markdown.html> ).

I've heard the arguments of not parsing html with regular expressions, but here we're talking essentially about plain text except for minimal parsing of some markdown code.

回答1:

Actually, this seems to work ok:

<?php
$item="markdown";
$t="This is essentially plain text apart from a few html tags generated 
with some simplified markdown rules: <a href=markdown.html>[see here]</a>";

//_____1. apply emphasis_____
$t = preg_replace("|($item)|","<strong>$1</strong>",$t);

// "This is essentially plain text apart from a few html tags generated 
// with some simplified <strong>markdown</strong> rules: <a href=
// <strong>markdown</strong>.html>[see here]</a>"

//_____2. remove emphasis if WITHIN opening and closing tag____
$t = preg_replace("|(<[^>]+?)(<strong>($item)</strong>)([^<]+?>)|","$1$3$4",$t);

// this preserves the text before ($1), after ($4) 
// and inside <strong>..</strong> ($2), but without the tags ($3)

// "This is essentially plain text apart from a few html tags generated
// with some simplified <strong>markdown</strong> rules: <a href=markdown.html>
// [see here]</a>"

?>

A string like $item="odd|string" would cause some problems, but I won't be using that kind of string anyway... (probably needs htmlentities(...) or the like...)

回答2:

You could split the string into tag‍/‍no-tag parts using preg_split:

$parts = preg_split('/(<(?:[^"\'>]|"[^"<]*"|\'[^\'<]*\')*>)/', $str, -1, PREG_SPLIT_DELIM_CAPTURE);

Then you can iterate the parts while skipping every even part (i.e. the tag parts) and apply your replacement on it:

for ($i=0, $n=count($parts); $i<$n; $i+=2) {
    $parts[$i] = preg_replace("/(markdown)/", "<strong>$1</strong>", $parts[$i]);
}

At the end put everything back together with implode:

$str = implode('', $parts);

But note that this is really not the best solution. You should better use a proper HTML parser like PHP’s DOM library. See for example these related questions:

Highlight keywords in a paragraph
Regex / DOMDocument - match and replace text not in a link

回答3:

You could split your string into an array at every '<' or '>' using preg_split(), then loop through that array and replace only in entries not beginning with an '>'. Afterwards you combine your array to an string using implode().

回答4:

This regex should strip all HTML opening and closing tags: /(<[.*?]>)+/

You can use it with preg_replace like this:

$test = "Hello <strong>World!</strong>";
$regex = "/(<.*?>)+/";


$result = preg_replace($regex,"",$test);

回答5:

actually this is not very efficient, but it worked for me

$your_string = '...';

$search = 'markdown';
$left = '<strong>';
$right = '</strong>';

$left_Q = preg_quote($left, '#');
$right_Q = preg_quote($right, '#');
$search_Q = preg_quote($search, '#');
while(preg_match('#(>|^)[^<]*(?<!'.$left_Q.')'.$search_Q.'(?!'.$right_Q.')[^>]*(<|$)#isU', $your_string))
  $your_string = preg_replace('#(^[^<]*|>[^<]*)(?<!'.$left_Q.')('.$search_Q.')(?!'.$right_Q.')([^>]*<|[^>]*$)#isU', '${1}'.$left.'${2}'.$right.'${3}', $your_string);

echo $your_string;

来源：https://stackoverflow.com/questions/4603780/preg-replace-only-outside-tags-were-not-talking-full-html-parsing-jus

标签

php

html

preg-replace

Markdown

markup