I'm currently using PHP and a regular expression to strip out all HTML comments from a page. The script works well... a little too well. It strips out all comments includin
I'm not sure if PHP's regex engine will like the following, but try this pattern:
'/<!--(.|\s)*(\[if .*\]){0}(.|\s)*?-->/'
Since comments cannot be nested in HTML, a regex can do the job, in theory. Still, using some kind of parser would be the better choice, especially if your input is not guaranteed to be well-formed.
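As a rough illustration of the parser route, here is a minimal sketch that uses PHP's bundled DOM extension. It assumes PHP 7 or later with ext-dom available; the function name strip_comments_with_dom is made up for this example. It treats a comment as conditional when its text starts with "[if " or "<![endif]", which is a heuristic and not an official rule. Downlevel-revealed markers such as <![if ...]> are not ordinary comments, and how the parser handles them is not covered here.

```php
<?php
// Hypothetical helper: drop ordinary comments, keep conditional ones.
function strip_comments_with_dom(string $html): string
{
    $doc = new DOMDocument();

    // Real-world markup often triggers parser warnings; collect them quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Copy the node list into an array before removing nodes from the tree.
    $comments = iterator_to_array($xpath->query('//comment()'));

    foreach ($comments as $comment) {
        $text = ltrim($comment->nodeValue);

        // Heuristic (an assumption of this sketch): conditional comments
        // begin with "[if " or carry the closing "<![endif]" marker.
        $isConditional = strpos($text, '[if ') === 0
            || strpos($text, '<![endif]') === 0;

        if (!$isConditional) {
            $comment->parentNode->removeChild($comment);
        }
    }

    return $doc->saveHTML();
}

echo strip_comments_with_dom(
    '<p>Hello</p><!-- plain --><!--[if IE]><p>IE only</p><![endif]-->'
);
```

Note that saveHTML() serializes a whole document, so a doctype and html/body wrappers are added around a fragment like the one above.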
Here is my attempt at it. The following matches only normal comments. It has become quite a monster, sorry for that. I have tested it fairly extensively and it seems to work well, but I give no warranty.
<!--(?!\s*(?:\[if [^\]]+]|<!|>))(?:(?!-->).)*-->
Explanation:
<!--                  #01: "<!--"
(?!                   #02: look-ahead: a position not followed by:
  \s*                 #03:   any amount of whitespace
  (?:                 #04:   non-capturing group, any of:
    \[if [^\]]+]      #05:     "[if ...]"
    |<!               #06:     or "<!"
    |>                #07:     or ">"
  )                   #08:   end non-capturing group
)                     #09: end look-ahead
(?:                   #10: non-capturing group:
  (?!-->)             #11:   a position not followed by "-->"
  .                   #12:   eat the following char, it's part of the comment
)*                    #13: end non-capturing group, repeat
-->                   #14: "-->"
Steps #02 and #11 are crucial. #02 makes sure that the following characters do not indicate a conditional comment. After that, #11 makes sure that the following characters do not indicate the end of the comment, while #12 and #13 cause the actual matching.
Apply with "global" and "dotall" flags.
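For PHP specifically: preg_replace() already replaces every match, so "global" needs no flag, and "dotall" is the s modifier. A minimal sketch of applying the pattern above, assuming PHP 7 or later; the function name strip_normal_comments and the sample markup are made up for illustration.

```php
<?php
// Hypothetical helper: remove normal comments, leave conditional ones alone.
function strip_normal_comments(string $html): string
{
    // The pattern from above, wrapped in "/" delimiters.
    // s = "dotall": lets "." match newlines, so multi-line comments match.
    $pattern = '/<!--(?!\s*(?:\[if [^\]]+]|<!|>))(?:(?!-->).)*-->/s';

    $result = preg_replace($pattern, '', $html);

    // preg_replace() returns null on error (for example when a PCRE limit
    // is hit on very large input); fall back to the untouched markup.
    return $result === null ? $html : $result;
}

$html = <<<'HTML'
<!-- plain comment -->
<p>Hello</p>
<!--[if IE]><link rel="stylesheet" href="/css/ie.css" /><![endif]-->
<!--[if !IE]><!-->
<link rel="stylesheet" href="/css/screen.css" type="text/css" media="screen" />
<!-- <![endif]-->
<!-- multi
line comment -->
HTML;

echo strip_normal_comments($html);
```

On this sample the plain and multi-line comments are removed, while both conditional blocks, including the one with the extra <!-->, stay in place.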
To do the opposite (match only conditional comments), it would be something like this:
<!(--)?(?=\[)(?:(?!<!\[endif\]\1>).)*<!\[endif\]\1>
Explanation:
<!                    #01: "<!"
(--)?                 #02: two dashes, optional
(?=\[)                #03: a position followed by "["
(?:                   #04: non-capturing group:
  (?!                 #05:   a position not followed by
    <!\[endif\]\1>    #06:     "<![endif]>" or "<![endif]-->" (depends on #02)
  )                   #07:   end of look-ahead
  .                   #08:   eat the following char, it's part of the comment
)*                    #09: end of non-capturing group, repeat
<!\[endif\]\1>        #10: "<![endif]>" or "<![endif]-->" (depends on #02)
Again, apply with "global" and "dotall" flags.
Step #02 is because of the "downlevel-revealed" syntax, see: "MSDN - About Conditional Comments".
I'm not entirely sure where spaces are allowed or expected. Add \s* to the expression where appropriate.
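A sketch of how the conditional-comment pattern might be used from PHP, assuming PHP 7 or later; the function name find_conditional_comments is made up for illustration. One deliberate deviation from the pattern as written: in PCRE, a back-reference to a group that did not take part in the match fails, so with (--)? the downlevel-revealed form would never match. The sketch therefore moves the optional part inside the group, ((?:--)?), so that \1 is always set, possibly to an empty string.

```php
<?php
// Hypothetical helper: return every conditional comment found in $html.
function find_conditional_comments(string $html): array
{
    // ((?:--)?) always captures, either "--" or the empty string, so the
    // back-reference \1 behaves as intended under PCRE.
    // s = "dotall": lets "." match newlines.
    $pattern = '/<!((?:--)?)(?=\[)(?:(?!<!\[endif\]\1>).)*<!\[endif\]\1>/s';

    if (preg_match_all($pattern, $html, $matches) === false) {
        return []; // a PCRE error occurred
    }

    return $matches[0]; // full matches only
}

$html = <<<'HTML'
<!-- plain -->
<!--[if IE]><p>IE only</p><![endif]-->
<![if !IE]><p>not IE</p><![endif]>
HTML;

print_r(find_conditional_comments($html));
```

The plain comment is skipped; the downlevel-hidden and the downlevel-revealed blocks are both returned.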
Something like this might work:
/<!--[^\[](.|\s)*?-->/
It's the same as yours, except that it ignores comments that have an opening bracket immediately following the comment start tag.
If you can't get it to work with one regular expression or you find you want to preserve more comments you could use preg_replace_callback. You can then define a function to handle the comments individually.
<?php
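// Output-buffer callback: passes every comment to comment_replace_func().
function callback($buffer) {
    // U = ungreedy, s = let "." match newlines so multi-line comments are found
    return preg_replace_callback('/<!--.*-->/Us', 'comment_replace_func', $buffer);
}
// Keep comments that start with "<!--[if !" and drop all others.
function comment_replace_func($m) {
    if (preg_match('/^<!--\[if !/i', $m[0])) {
        return $m[0];
    }
    return '';
}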
ob_start("callback");
?>
... HTML source goes here ...
<?php ob_end_flush(); ?>
In summary this seems to be the best solution:
<?php
function callback($buffer) {
    return preg_replace('/<!--[^\[](.|\s)*?-->/', '', $buffer);
}
ob_start("callback");
?>
... HTML source goes here ...
<?php ob_end_flush(); ?>
It strips out all comments and leaves conditionals with the exception of the top one:
<!--[if !IE]><!-->
<link rel="stylesheet" href="/css/screen.css" type="text/css" media="screen" />
<!-- <![endif]-->
where the additional <!-- seems to be causing the problem.
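A small sketch that reproduces the problem in isolation (assuming PHP 5.3 or later for the nowdoc syntax; the variable names are just for illustration). The first <!-- is skipped because a "[" follows it, but the second one, in <!-->, is followed by ">", so the match starts there and runs on to the next "-->", which is the one closing <![endif]-->.

```php
<?php
$pattern = '/<!--[^\[](.|\s)*?-->/';

$html = <<<'HTML'
<!--[if !IE]><!-->
<link rel="stylesheet" href="/css/screen.css" type="text/css" media="screen" />
<!-- <![endif]-->
HTML;

// Everything from the second "<!--" to the final "-->" is removed,
// taking the <link> element with it.
$stripped = preg_replace($pattern, '', $html);

echo $stripped; // prints: <!--[if !IE]>
```

So the stylesheet link disappears and an unclosed <!--[if !IE]> is left behind, which matches the behaviour described above.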
If anyone can suggest a regex that takes this into account and leaves that conditional in place too, that would be perfect.
Tomalak's solution looks good, but as a newbie with no further guidance I don't know how to implement it. I would like to try it if anyone can elaborate on how to apply it.
Thanks