strip_tags disallow some tags

Question

Based on the strip_tags documentation, the second parameter takes the allowable tags. However, in my case, I want to do the reverse. Say I'll accept the tags that strip_tags normally accepts (by default), but strip only the <script> tag. Is there any possible way to do this?

I'm not asking for somebody to code it for me; rather, input on possible ways to achieve this (if it is possible) would be greatly appreciated.

Answer 1:

EDIT

To use the HTML Purifier HTML.ForbiddenElements config directive, it seems you would do something like:

require_once '/path/to/HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.ForbiddenElements', array('script','style','applet'));
$purifier = new HTMLPurifier($config);
$clean_html = $purifier->purify($dirty_html);

http://htmlpurifier.org/docs

HTML.ForbiddenElements should be set to an array. What I don't know is what form the array members should take:

array('script','style','applet')

Or:

array('<script>','<style>','<applet>')

Or... Something else?

I think it's the first form, without delimiters; HTML.AllowedElements uses a form of configuration string somewhat similar to TinyMCE's valid_elements syntax:

tinyMCE.init({
    ...
    valid_elements : "a[href|target=_blank],strong/b,div[align],br",
    ...
});

So my guess is it's just the term, and no attributes should be provided (since you're banning the element... although there is a HTML.ForbiddenAttributes, too). But that's a guess.

I'll add this note from the HTML.ForbiddenAttributes docs, as well:

Warning: This directive complements %HTML.ForbiddenElements, accordingly, check out that directive for a discussion of why you should think twice before using this directive.

Blacklisting is just not as "robust" as whitelisting, but you may have your reasons. Just beware and be careful.

Without testing, I'm not sure what to tell you. I'll keep looking for an answer, but I will likely go to bed first. It is very late. :)

Although I think you really should use HTML Purifier and utilize its HTML.ForbiddenElements configuration directive, I think a reasonable alternative, if you really, really want to use strip_tags(), is to derive a whitelist from the blacklist. In other words, remove what you don't want and then use what's left.

For instance:

function blacklistElements($blacklisted = '', &$errors = array()) {
    if ((string)$blacklisted == '') {
        $errors[] = 'Empty string.';
        return array();
    }

    $html5 = array(
        "<menu>","<command>","<summary>","<details>","<meter>","<progress>",
        "<output>","<keygen>","<textarea>","<option>","<optgroup>","<datalist>",
        "<select>","<button>","<input>","<label>","<legend>","<fieldset>","<form>",
        "<th>","<td>","<tr>","<tfoot>","<thead>","<tbody>","<col>","<colgroup>",
        "<caption>","<table>","<math>","<svg>","<area>","<map>","<canvas>","<track>",
        "<source>","<audio>","<video>","<param>","<object>","<embed>","<iframe>",
        "<img>","<del>","<ins>","<wbr>","<br>","<span>","<bdo>","<bdi>","<rp>","<rt>",
        "<ruby>","<mark>","<u>","<b>","<i>","<sup>","<sub>","<kbd>","<samp>","<var>",
        "<code>","<time>","<data>","<abbr>","<dfn>","<q>","<cite>","<s>","<small>",
        "<strong>","<em>","<a>","<div>","<figcaption>","<figure>","<dd>","<dt>",
        "<dl>","<li>","<ul>","<ol>","<blockquote>","<pre>","<hr>","<p>","<address>",
        "<footer>","<header>","<hgroup>","<aside>","<article>","<nav>","<section>",
        "<body>","<noscript>","<script>","<style>","<meta>","<link>","<base>",
        "<title>","<head>","<html>"
    );

    $list = trim(strtolower($blacklisted));
    $list = preg_replace('/[^a-z ]/i', '', $list);
    $list = '<' . str_replace(' ', '> <', $list) . '>';
    $list = array_map('trim', explode(' ', $list));

    return array_diff($html5, $list);
}

Then run it:

$blacklisted = '<html> <bogus> <EM> em li ol';
$errors = array(); // passed by reference so the function can report problems
$whitelist = blacklistElements($blacklisted, $errors);

if (count($errors)) {
    echo "There were errors.\n";
    print_r($errors);
    echo "\n";
} else {
    // Do strip_tags() ...
}

http://codepad.org/LV8ckRjd

So if you pass in what you don't want to allow, it will give you back the HTML5 element list in an array form that you can then feed into strip_tags() after joining it into a string:

$stripped = strip_tags($html, implode('', $whitelist));
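
One consequence of this approach is worth seeing concretely: strip_tags() removes the tags themselves but leaves the text between them. The following is only a minimal sketch; the short $known list is a hypothetical stand-in for the full $html5 array above.

```php
<?php
// Hypothetical short list standing in for the full $html5 array above.
$known = array('<p>', '<em>', '<strong>', '<script>');

// Derive the whitelist by removing the blacklisted tag.
$whitelist = array_diff($known, array('<script>'));

$html = '<p>Hello <em>world</em></p><script>alert(1);</script>';

// strip_tags() expects the allowed tags as one string, e.g. "<p><em><strong>".
$stripped = strip_tags($html, implode('', $whitelist));

echo $stripped, "\n";
// The <script> tags are gone, but their content survives as plain text:
// <p>Hello <em>world</em></p>alert(1);
```

So if the goal is to get rid of the script body as well, strip_tags() alone is not enough.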

Caveat Emptor

Now, I've kind of hacked this together and I know there are some issues I haven't thought out yet. For instance, from the strip_tags() man page for the $allowable_tags argument:

Note:

This parameter should not contain whitespace. strip_tags() sees a tag as a case-insensitive string between < and the first whitespace or >. It means that strip_tags("<br/>", "<br>") returns an empty string.

It's late and for some reason I can't quite figure out what this means for this approach. So I'll have to think about that tomorrow. I also compiled the HTML element list in the function's $html5 array from this MDN documentation page. Sharp-eyed readers might notice all of the tags are in this form:

<tagName>

I'm not sure how this will affect the outcome, or whether I need to take into account variations such as the short tag <tagName/> and some of the, ahem, odder variations. And, of course, there are more tags out there.

So it's probably not production ready. But you get the idea.

Answer 2:

First, see what others have said on this topic:

Strip <script> tags and everything in between with PHP?

and

remove script tag from HTML content

It seems you have two choices. One is a regex solution, which both of the links above provide. The second is to use HTML Purifier.

If you are stripping the script tag for some reason other than sanitizing user content, the regex could be a good solution. However, as everyone has warned, it is a good idea to use HTML Purifier if you are sanitizing input.

Answer 3:

PHP (5 or greater) solution:

If you want to remove <script> tags (or any others) together with the content inside them, you can use:

OPTION 1 (simplest):

preg_replace('#<script(.*?)>(.*?)</script>#is', '', $text);
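
As a quick illustration of what the two modifiers do, here is a small sketch with a made-up input string; the variable names are only for this example.

```php
<?php
// Made-up input: an upper-case, multi-line script block between two paragraphs.
$text = "<p>Your HTML code</p><SCRIPT type=\"text/javascript\">\nalert('x');\n</SCRIPT><p>More</p>";

// "i" makes the match case-insensitive; "s" lets "." match newlines,
// so multi-line script bodies are removed as well.
$clean = preg_replace('#<script(.*?)>(.*?)</script>#is', '', $text);

echo $clean, "\n"; // <p>Your HTML code</p><p>More</p>
```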

OPTION 2 (more versatile):

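<?php

$html = "<p>Your HTML code</p><script>With malicious code</script>";

$dom = new DOMDocument();

$dom->loadHTML($html);

$script = $dom->getElementsByTagName('script');

// The node list is live, so collect the nodes first and remove them afterwards;
// removing while iterating over it would skip elements.
$remove = [];
foreach ($script as $item)
{
  $remove[] = $item;
}

foreach ($remove as $item)
{
  $item->parentNode->removeChild($item);
}

$html = $dom->saveHTML();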

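Then $html will contain the paragraph without the script. Note that loadHTML() and saveHTML() also wrap the fragment in a DOCTYPE, <html>, and <body>, so the relevant part of the output is:

"<p>Your HTML code</p>"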
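
If you only want the fragment back, without the wrappers that loadHTML() adds, one possible approach is to serialize just the children of <body>. This is a rough sketch rather than a hardened sanitizer; removeElements() is a hypothetical helper name, and it assumes the input is a body-level HTML fragment.

```php
<?php
// Hypothetical helper: remove every element with one of the given tag names.
function removeElements($html, array $tagNames)
{
    $previous = libxml_use_internal_errors(true); // silence parser warnings
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($tagNames as $tagName) {
        // Copy the live node list into a plain array before removing anything.
        $nodes = iterator_to_array($dom->getElementsByTagName($tagName));
        foreach ($nodes as $node) {
            if ($node->parentNode !== null) {
                $node->parentNode->removeChild($node);
            }
        }
    }

    // Serialize only the children of <body>, skipping the added wrappers.
    $body = $dom->getElementsByTagName('body')->item(0);
    $out = '';
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $out .= $dom->saveHTML($child);
        }
    }

    return $out;
}

echo removeElements('<p>Your HTML code</p><script>With malicious code</script>', array('script')), "\n";
```

For the sample input this should leave just the paragraph. Non-ASCII input may need extra care with encodings, which this sketch does not address.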

Answer 4:

This is what I use to strip out a list of forbidden tags. It can remove either just the tags wrapping some content or the tags together with their content, and it trims off leftover whitespace.

$description = trim(preg_replace([
    # Strip tags around content
    '/\<(.*)doctype(.*)\>/i',
    '/\<(.*)html(.*)\>/i',
    '/\<(.*)head(.*)\>/i',
    '/\<(.*)body(.*)\>/i',
    # Strip tags and content inside
    '/\<(.*)script(.*)\>(.*)<\/script>/i',
], '', $description));

Input example:

$description = '<html>
<head>
</head>
<body>
    <p>This distinctive Mini Chopper with Desire styling has a powerful wattage and high capacity which makes it a very versatile kitchen accessory. It also comes equipped with a durable glass bowl and lid for easy storage.</p>
    <script type="application/javascript">alert(\'Hello world\');</script>
</body>
</html>';

Output result:

<p>This distinctive Mini Chopper with Desire styling has a powerful wattage and high capacity which makes it a very versatile kitchen accessory. It also comes equipped with a durable glass bowl and lid for easy storage.</p>

Answer 5:

I use the following:

function strip_tags_with_forbidden_tags($input, $forbidden_tags)
{
    foreach (explode(',', $forbidden_tags) as $tag) {
        $tag = preg_replace(array('/^</', '/>$/'), array('', ''), $tag);
        $input = preg_replace(sprintf('/<%s[^>]*>([^<]+)<\/%s>/', $tag, $tag), '$1', $input);
    }

    return $input;
}

Then you can do:

echo strip_tags_with_forbidden_tags('<cancel>abc</cancel>xpto<p>def></p><g>xyz</g><t>xpto</t>', 'cancel,g');

Output: 'abcxpto<p>def></p>xyz<t>xpto</t>'

echo strip_tags_with_forbidden_tags('<cancel>abc</cancel> xpto <p>def></p> <g>xyz</g> <t>xpto</t>', 'cancel,g');

Outputs: 'abc xpto <p>def></p> xyz <t>xpto</t>'
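
The function above only matches a forbidden tag that has no attributes-free nesting, i.e. one whose content is plain text with no other tags inside. A possible variant, sketched here under the assumption that you just want the forbidden tags themselves dropped and everything they wrap kept, is to match opening and closing tags independently. stripForbiddenTags() is a hypothetical name, not part of the original answer.

```php
<?php
// Hypothetical variant: strip the forbidden tags themselves, keep whatever they wrap.
function stripForbiddenTags($input, array $forbiddenTags)
{
    if (count($forbiddenTags) === 0) {
        return $input; // nothing to strip
    }

    // Quote each name so it is treated literally inside the pattern.
    $names = array_map(function ($tag) {
        return preg_quote(trim($tag, " <>/"), '#');
    }, $forbiddenTags);

    // Match an opening or closing tag, with optional attributes.
    $pattern = '#</?(?:' . implode('|', $names) . ')\b[^>]*>#i';

    return preg_replace($pattern, '', $input);
}

echo stripForbiddenTags('<cancel>abc</cancel>xpto<p>def></p><g>xyz</g><t>xpto</t>', array('cancel', 'g')), "\n";
// abcxpto<p>def></p>xyz<t>xpto</t>

echo stripForbiddenTags('<g class="x"><b>bold</b></g>', array('g')), "\n";
// <b>bold</b>
```

Like the original, this keeps the content of the stripped tags, so it is not suitable on its own for removing <script> bodies.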

Source: https://stackoverflow.com/questions/12362426/strip-tags-disallow-some-tags

Tags

php

html

strip-tags