Question
I am trying to parse some HTML snippets and want to clean them up for various reasons (XSS et al).
I am currently trying to remove all of the attributes on any tag, except for the href on an anchor. I am doing this with a sequence of eregi_replace calls. I am sure there is a smarter way to do it with preg_replace in just a couple of lines of code, but I have not been able to get it to work. Can anyone help?
Current code:
$data_item = eregi_replace("<p[^>]*>","<p>", $data_item);
$data_item = eregi_replace("<h2[^>]*>","<h2>", $data_item);
$data_item = eregi_replace("<h3[^>]*>","<h3>", $data_item);
$data_item = eregi_replace("<h4[^>]*>","<h4>", $data_item);
$data_item = eregi_replace("<h5[^>]*>","<h5>", $data_item);
$data_item = eregi_replace("<h6[^>]*>","<h6>", $data_item);
$data_item = eregi_replace("<ul[^>]*>","<ul>", $data_item);
$data_item = eregi_replace("<ol[^>]*>","<ol>", $data_item);
$data_item = eregi_replace("<li[^>]*>","<li>", $data_item);
$data_item = preg_replace("/<a([^>]*)( href=\S+)([^>]*)>/i", '<a$2 rel="nofollow">', $data_item);
(I only need to parse a subset of HTML tags, as I strip out any undesirable ones before this step.)
Answer 1:
Why not use a general regex that matches any tag, and then use preg_replace_callback() to decide what each matched tag should be replaced with? That way you can have a simple function that checks whether the matched tag is an <a> tag: if so, it keeps the href and drops the other attributes; otherwise it drops every attribute.
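The callback approach described above could look roughly like the following. This is a minimal sketch, not the answerer's own code: the function name strip_tag_attributes is hypothetical, and it assumes attribute values never contain a literal ">" character. It also does not validate the href value itself (for example against javascript: URLs), so it is not a complete XSS filter.

```php
<?php
// Hypothetical helper name; not part of the original post.
function strip_tag_attributes(string $html): string
{
    // Group 1 is the tag name, group 2 is everything between the name and '>'.
    // Closing tags such as </p> do not match, because '/' is not a letter.
    $result = preg_replace_callback(
        '/<([a-z][a-z0-9]*)\b([^>]*)>/i',
        function (array $m): string {
            $tag = strtolower($m[1]);
            // For anchors, keep only the href (quoted or unquoted) and add rel="nofollow".
            if ($tag === 'a'
                && preg_match('/\shref\s*=\s*("[^"]*"|\'[^\']*\'|[^\s>]+)/i', $m[2], $href)
            ) {
                return '<a href=' . $href[1] . ' rel="nofollow">';
            }
            // Every other tag (and anchors without an href) loses all attributes.
            return '<' . $tag . '>';
        },
        $html
    );

    // preg_replace_callback() returns null on a regex error; fall back to the input.
    return $result ?? $html;
}

echo strip_tag_attributes(
    '<p class="x">Hi <a class="y" href="http://example.com" onclick="evil()">link</a></p>'
);
```

With the sample input shown, the class and onclick attributes are removed and only the href survives on the anchor.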
Alternatively, you could do something like this:
$data_item = preg_replace('/<(p|h2|h3|h4|h5|h6|ul|ol|li)[^>]*>/i', '<$1>', $data_item);
Here the () group in the regex captures the type of tag matched, | is the "or" operator that matches any of the listed tags, and $1 in the replacement text substitutes what was matched by the first (and only) capture group in the pattern.
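One thing to watch with an alternation like this: because the tag name is followed directly by [^>]*, the p alternative would also match a tag such as <pre>, rewriting it to <p>. If such tags can still be present at this point (an assumption; the question says other tags are stripped beforehand), a \b word boundary after the group avoids that. A small sketch with made-up sample input:

```php
<?php
// Sample input invented for illustration.
$data_item = '<p class="intro">Text</p><li style="color:red">Item</li><pre class="c">code</pre>';

// \b after the capture group stops "p" from matching the start of "pre".
// h[2-6] is shorthand for h2|h3|h4|h5|h6.
$data_item = preg_replace('/<(p|h[2-6]|ul|ol|li)\b[^>]*>/i', '<$1>', $data_item);

echo $data_item;
```

Here the <p> and <li> tags lose their attributes, while the <pre> tag is left untouched.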
Source: https://stackoverflow.com/questions/1818262/converting-an-eregi-replace-to-a-preg-replace