What is the correct way to detect whether string inputs contain HTML or not?

Submitted by £可爱£侵袭症+ on 2019-12-29 11:35:30

Question


When receiving user input on forms, I want to detect whether fields like "username" or "address" contain markup that has a special meaning in XML (RSS feeds) or (X)HTML (when displayed).

So which of these is the correct way to detect whether the input entered doesn't contain any special characters in HTML and XML context?

if (mb_strpos($data, '<') === FALSE AND mb_strpos($data, '>') === FALSE)

or

if (htmlspecialchars($data, ENT_NOQUOTES, 'UTF-8') === $data)

or

if (preg_match("/[^\p{L}\-.']/u", $text)) // problem: also catches symbols

Have I missed anything else, like byte sequences or other tricky ways to get markup tags around things like "javascript:"? As far as I'm aware, all XSS and CSRF attacks require < or > around the values to get the browser to execute the code (well, at least from Internet Explorer 6 or later anyway) - is this correct?

I am not looking for something to reduce or filter input. I just want to locate dangerous character sequences when used in XML or HTML context. (strip_tags() is horribly unsafe. As the manual says, it doesn't check for malformed HTML.)

Update

I think I need to clarify that there are a lot of people mistaking this question for a question about basic security via "escaping" or "filtering" dangerous characters. This is not that question, and most of the simple answers given wouldn't solve that problem anyway.

Update 2: Example

  • User submits input
  • if (mb_strpos($data, '<') === FALSE AND mb_strpos($data, '>') === FALSE)
  • I save it

Now that the data is in my application I do two things with it - 1) display in a format like HTML - or 2) display inside a format element for editing.

The first one is safe in XML and HTML context:

<h2><?php print $input; ?></h2>
<xml><item><?php print $input; ?></item></xml>

The second form is more dangerous, but it should still be safe:

<input value="<?php print htmlspecialchars($input, ENT_QUOTES, 'UTF-8');?>">

Update 3: Working Code

You can download the gist I created and run the code as a text or HTML response to see what I'm talking about. This simple check passes the http://ha.ckers.org XSS Cheat Sheet, and I can't find anything that makes it through. (I'm ignoring Internet Explorer 6 and below).

I started another bounty to award someone that can show a problem with this approach or a weakness in its implementation.

Update 4: Ask a DOM

It's the DOM that we want to protect - so why not just ask it? Timur's answer led to this:

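function not_markup($string)
{
    libxml_use_internal_errors(true);
    // Compare with false explicitly: an empty parsed element also evaluates to false.
    $xml = simplexml_load_string("<root>$string</root>");
    if ($xml === false)
    {
        return false; // not well-formed, so treat it as markup
    }
    return $xml->children()->count() === 0;
}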

if (not_markup($_POST['title'])) ...

Answer 1:


I don't think you need to implement a huge algorithm to check if string has unsafe data - filters and regular expressions do the work. But, if you need a more complex check, maybe this will fit your needs:

<?php
$strings = array();
$strings[] = <<<EOD
    ';alert(String.fromCharCode(88,83,83))//\';alert(String.fromCharCode(88,83,83))//";alert(String.fromCharCode(88,83,83))//\";alert(String.fromCharCode(88,83,83))//--></SCRIPT>">'><SCRIPT>alert(String.fromCharCode(88,83,83))</SCRIPT>
EOD;
$strings[] = <<<EOD
    '';!--"<XSS>=&{()}
EOD;
$strings[] = <<<EOD
    <SCRIPT SRC=http://ha.ckers.org/xss.js></SCRIPT>
EOD;
$strings[] = <<<EOD
    This is a safe text
EOD;
$strings[] = <<<EOD
    <IMG SRC="javascript:alert('XSS');">
EOD;
$strings[] = <<<EOD
    <IMG SRC=javascript:alert('XSS')>
EOD;
$strings[] = <<<EOD
    <IMG SRC=&#106;&#97;&#118;&#97;&#115;&#99;&#114;&#105;&#112;&#116;&#58;&#97;&#108;&#101;&#114;&#116;&#40;&#39;&#88;&#83;&#83;&#39;&#41;>
EOD;
$strings[] = <<<EOD
    perl -e 'print "<IMG SRC=java\0script:alert(\"XSS\")>";' > out
EOD;
$strings[] = <<<EOD
    <SCRIPT/XSS SRC="http://ha.ckers.org/xss.js"></SCRIPT>
EOD;
$strings[] = <<<EOD
    </TITLE><SCRIPT>alert("XSS");</SCRIPT>
EOD;



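libxml_use_internal_errors(true);
$sourceXML = '<root><element>value</element></root>';
$sourceXMLDocument = simplexml_load_string($sourceXML);
// Baseline: one <element> under <root>, and no child elements inside <element>.
$sourceCount = $sourceXMLDocument->children()->count();
$sourceInnerCount = $sourceXMLDocument->element->children()->count();

foreach( $strings as $string ){
    $unsafe = false;
    $XML = '<root><element>'.$string.'</element></root>';
    $XMLDocument = simplexml_load_string($XML);
    if( $XMLDocument===false ){
        $unsafe = true;
    }else{
        // Check both levels: well-formed input such as <b>x</b> adds a child
        // inside <element>, which the root-level count alone would miss.
        $count = $XMLDocument->children()->count();
        $innerCount = $XMLDocument->element->children()->count();
        if( $count!=$sourceCount || $innerCount!=$sourceInnerCount ){
            $unsafe = true;
        }
    }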

    echo ($unsafe?'Unsafe':'Safe').': <pre>'.htmlspecialchars($string,ENT_QUOTES,'utf-8').'</pre><br />'."\n";
}
?>



Answer 2:


In a comment above, you wrote:

Just stop the browser from treating the string as markup.

This is an entirely different problem to the one in the title. The approach in the title is usually wrong. Stripping out tags just mangles input and can lead to data loss. Ever tried to talk about HTML on a blog that strips tags? Frustrating.

The solution that is usually the correct one is to do as you said in your comment - to stop the browser from treating the string as markup. This - literally taken - is not possible. What you do instead is encode the content as HTML.

Consider the following data:

<strong>Test</strong>

Now, you can look at this one of two ways. You can look at it as literal data - a sequence of characters. Or you can look at it as HTML - markup that includes strongly emphasised text.

If you just dump that out into an HTML document, you are treating it as HTML. You can't treat it as literal data in that context. What you need is HTML that will output the literal data. You need to encode it as HTML.

Your problem is not that you have too much HTML - it's that you have too little. When you output <, you are outputting raw data in an HTML context. You need to convert it to &lt;, which is the HTML representation of that data before outputting it.

PHP offers a few different options for doing this. The most direct is to use htmlspecialchars() to convert it into HTML, and then nl2br() to convert the line breaks into <br> elements.
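
As a rough sketch of that last step, encoding at output time might look like this. The helper name render_text is made up for illustration and is not part of PHP; UTF-8 input is assumed.

```php
<?php
// Hypothetical helper: encode literal text as HTML at output time.
function render_text(string $text): string
{
    // Turn &, <, >, " and ' into entities so the browser shows them literally.
    $encoded = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
    // Then mark the user's line breaks with <br> elements.
    return nl2br($encoded, false);
}

echo render_text("<strong>Test</strong>\nsecond line");
```

The order matters: calling nl2br() first would let htmlspecialchars() encode the <br> elements it had just inserted.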




Answer 3:


If you're just "looking for protection for print '<h3>' . $name . '</h3>'", then yes, at least the second approach is adequate, since it checks whether the value would be interpreted as markup if it weren't escaped. (In this case, the area where $name would appear is element content, and only the characters &, <, and > have special meaning when they appear in element content.) (For href and similar attributes, the check for "JavaScript: " may be necessary, but as you stated in a comment, that isn't a goal.)

For official sources, I can refer to the XML specification:

  • Content production in section 3.1: Here, content consists of elements, CDATA sections, processing instructions, and comments (which must begin with <), references (which must begin with &), and character data (which contains any other legal character). (Although a leading > is treated as character data in element content, many people usually escape it along with <, and it's better safe than sorry to treat it as special.)

  • Attribute value production in section 2.3: A valid attribute value consists of either references (which must begin with &) or character data (which contains any other legal character, but not < or the quote symbol used to wrap the attribute value). If you need to place string inputs in attributes in addition to element content, the characters " and ' need to be checked in addition to &, <, and possibly > (and other characters illegal in XML).

  • Section 2.2: Defines what Unicode code points are legal in XML. In particular, null is illegal in an XML document and may not display properly in HTML.

HTML5 (the latest working draft, which is a work in progress) describes a very elaborate parsing algorithm for HTML documents:

  • Element content corresponds to the "data state" in the parsing algorithm. Here, the string input should not contain a null character, < (which begins a new tag), or & (which begins a character reference).
  • Attribute values correspond to the "before attribute value state" in the parsing algorithm. For simplicity, we assume the attribute value is wrapped in double quotation marks. In that case, the parser moves to the "attribute value (double-quoted) state". In this case, the string input should not contain a null character, " (which ends the attribute value), or & (which begins a character reference).
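
To make the two HTML5 cases above concrete, here is a minimal sketch. The function name is_plain_in_context and the context labels 'element' and 'attribute_dq' are invented for this example; only the character sets come from the parsing states described above.

```php
<?php
// Hypothetical helper: true when $input has no character that is special
// in the given HTML5 context (names and labels are illustrative only).
function is_plain_in_context(string $input, string $context): bool
{
    $special = [
        'element'      => '/[<&\x00]/',  // "data state": <, & and null
        'attribute_dq' => '/["&\x00]/',  // double-quoted attribute value: ", & and null
    ];
    if (!isset($special[$context])) {
        throw new InvalidArgumentException("Unknown context: $context");
    }
    return preg_match($special[$context], $input) === 0;
}
```

Note that a string such as 'a > b' passes the element-content check here, since > is not special in the data state, while 'say "hi"' passes for element content but not for a double-quoted attribute.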

If string inputs are to be placed in attribute values (unless placing them there is solely for display purposes), there are additional considerations to keep in mind. For example, HTML 4 specifies:

User agents should interpret attribute values as follows:

  • Replace character entities with characters,
  • Ignore line feeds,
  • Replace each carriage return or tab with a single space.

User agents may ignore leading and trailing white space in CDATA attribute values[.]

Attribute value normalization is also specified in the XML specification, but apparently not in HTML5.


EDIT (Apr. 25, 2019): Also, be suspicious of inputs containing—

  • the null code point (as it can cause parse errors in certain places, as specified in the HTML5 specification), or
  • any code point illegal in XML (as it will cause parse errors upon reading the XML document),

...assuming htmlspecialchars doesn't escape those code points already.
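
A possible way to test for such code points, sketched under the assumption that the input is meant to be UTF-8. The function name has_illegal_xml_chars is hypothetical; the ranges are those of the Char production in section 2.2 of the XML 1.0 specification, which excludes the null code point.

```php
<?php
// Hypothetical helper: true when $input contains a code point outside the
// XML 1.0 Char production (section 2.2), which also covers the null code point.
function has_illegal_xml_chars(string $input): bool
{
    $legal = '\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}';
    $result = preg_match('/[^' . $legal . ']/u', $input);
    // preg_match() returns false for invalid UTF-8 under the /u modifier;
    // this sketch treats that case as illegal too.
    return $result !== 0;
}
```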




Answer 4:


I think you answered your own question. The function htmlspecialchars() does exactly what you need, but you should not use it until you write the user input to a page. To store it in a database there are other functions, like mysqli_real_escape_string().

As a rule of thumb, one can say that you should escape user input only when needed, for the given target system:

  1. Escaping user input often means a loss of the original data, and different target systems (HTML output / SQL / execution) need different escaping. They can even conflict with each other.
  2. You have to escape the data for the given purpose anyway, always. You should not trust even the entries from your database. So escaping when reading from user input does not have any big advantage, but double escaping can lead to invalid data.

In contrast to escaping, validating the content is a good thing to do early. If you expect an integer, only accept integers, otherwise refuse the user input.
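
A small sketch of "validate early, escape late". The function names parse_age and escape_html and the 0-150 range are assumptions chosen for the example, not anything from the question.

```php
<?php
// Validate early: accept the value only if it really is an integer in range.
// (parse_age and the 0-150 range are hypothetical.)
function parse_age($raw): ?int
{
    $age = filter_var($raw, FILTER_VALIDATE_INT, [
        'options' => ['min_range' => 0, 'max_range' => 150],
    ]);
    return $age === false ? null : $age;
}

// Escape late: encode for HTML only at the moment of output.
function escape_html(string $value): string
{
    return htmlspecialchars($value, ENT_QUOTES, 'UTF-8');
}
```

The stored value stays exactly as the user typed it; each target system (HTML, SQL, and so on) applies its own escaping when the value is written out.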




Answer 5:


The correct way to detect whether string inputs contain HTML tags, or any other markup that has a special meaning in XML or (X)HTML when displayed (other than being an entity) is simply

if (mb_strpos($data, '<') === FALSE AND mb_strpos($data, '>') === FALSE)

You are correct! All XSS and CSRF attacks require < or > around the values to get the browser to execute the code (at least from IE6+).

Considering the output context given, this is sufficient to safely display in a format like HTML:

<h2><?php print $input; ?></h2>
<xml><item><?php print $input; ?></item></xml>

Of course, if we have any entity in the input, like &aacute;, a browser will not output it as &aacute;, but as á, unless we use a function like htmlspecialchars when doing the output. In this case, even the < and > would be also safe.

In the case of using the string input as the value of an attribute, the safety depends on the attribute.

If the attribute is an input value, we must quote it and use a function like htmlspecialchars in order to have the same content back for editing.

<input value="<?php print htmlspecialchars($input, ENT_QUOTES, 'UTF-8');?>">

Again, even the < and > characters would be safe here.

We may conclude that we do not have to do any kind of detection and rejection of the input, if we always use htmlspecialchars to output it and our context always fits the above cases (or equally safe ones).

[And we also have a number of ways to safely store it in the database, preventing SQL exploits.]

What if the user wants his "username" to be &amp; is not an &? It does not contain < nor >... will we detect and reject it? Will we accept it? How will we display it? (This input gives interesting results in the new bounty!)
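
To see what happens with that username, here is a small sketch. html_entity_decode() is used only to approximate what a browser would display, and strpos() stands in for mb_strpos(), which gives the same answer for the single-byte characters < and > in UTF-8.

```php
<?php
$input = '&amp; is not an &';

// The angle-bracket check from the question accepts this input.
$passesCheck = strpos($input, '<') === false && strpos($input, '>') === false;

// Printed raw, the browser decodes the entity, so the reader no longer
// sees what the user typed.
$seenIfRaw = html_entity_decode($input, ENT_QUOTES, 'UTF-8');

// Encoded at output time, the displayed text matches the original input.
$encoded = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');
$seenIfEncoded = html_entity_decode($encoded, ENT_QUOTES, 'UTF-8');
```

So the input passes detection, yet it only displays correctly when it is encoded on output.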

Finally, if our context expands, and we will use the string input as an anchor href, then our whole approach suddenly changes dramatically. But this scenario is not included in the question.
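
Purely to illustrate why the href case is different, one possible policy is to allow only absolute http and https URLs. This is a sketch under that assumption; is_allowed_href is a made-up name, and the value would still need htmlspecialchars() when written into the attribute.

```php
<?php
// Hypothetical policy: only absolute http/https URLs may be used as an href.
function is_allowed_href(string $url): bool
{
    $scheme = parse_url($url, PHP_URL_SCHEME);
    if (!is_string($scheme)) {
        return false; // no scheme, or the URL could not be parsed
    }
    return in_array(strtolower($scheme), ['http', 'https'], true);
}
```

Under this policy a value like "javascript:alert(1)" is rejected even though it contains neither < nor >. Relative URLs are rejected as well, which may or may not suit a given application.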

(It is worth mentioning that, even when using htmlspecialchars, the output of a string input may differ if the character encodings are different at each step.)




Answer 6:


HTML Purifier does a good job and is very easy to implement. You could also use a Zend Framework filter like Zend_Filter_StripTags.

HTML Purifier doesn't just fix HTML.




Answer 7:


I am certainly not a security expert, but from what I gather something like your suggested

if (htmlspecialchars($data, ENT_NOQUOTES, 'UTF-8') === $data)

should work to prevent you from passing on contaminated strings, given you got your encoding right there.
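
Wrapped in a function, the check could look like the sketch below. The name is_unchanged_by_escaping is hypothetical. The encoding remark matters because, with these flags, htmlspecialchars() returns an empty string for input that is not valid UTF-8, so such input fails the comparison.

```php
<?php
// Hypothetical wrapper around the check quoted above.
function is_unchanged_by_escaping(string $data): bool
{
    // With ENT_NOQUOTES only &, < and > are converted; invalid UTF-8 makes
    // htmlspecialchars() return '' (no ENT_SUBSTITUTE or ENT_IGNORE is set).
    return htmlspecialchars($data, ENT_NOQUOTES, 'UTF-8') === $data;
}
```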

XSS attacks that don't require '<' or '>' rely on the string being handled in a JavaScript block right there and then, which, from how I read your question, is not what you are concerned with in this situation.




Answer 8:


I suggest you take a look at the xss_clean function from CodeIgniter. I know you don't want to clean, sanitize, or filter anything. You just want to "detect bad behaviour" and reject it. That's exactly why I recommend you look at this function's code.

IMO, it encodes deep knowledge of XSS vulnerabilities, including everything you want and need for your question.

Then, my short / direct answer to you would be:

if (xss_clean($data) === $data)

Now, you don't need to use the whole CodeIgniter framework just because you need this single function, of course. But I believe you may want to grab the whole CI_Security class (at /system/core/Security.php) and do a few modifications to eliminate other dependencies.

As you will see, the xss_clean code is quite complex, as XSS vulnerabilities really are, and I would just trust it and not try to "reinvent this wheel"... IMHO, you can't get rid of XSS vulnerabilities by merely detecting a dozen characters.




Answer 9:


You could use a regular expression if you know the character sets that are allowed. If the username contains a character that isn't allowed, then throw an error:

[a-zA-Z0-9_.-]

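Test your regular expressions here: http://www.perlfect.com/articles/regextutor.shtml

<?php
$username = "abcdef";
// Anchor the pattern so that every character must be in the allowed set;
// unanchored, it would match as soon as one allowed character appears.
$pattern = '/\A[a-zA-Z0-9_.-]+\z/';
if (preg_match($pattern, $username) !== 1) {
    throw new InvalidArgumentException('Username contains characters that are not allowed.');
}
?>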



Answer 10:


If the reason for the question is XSS prevention, note that there are several ways to exploit an XSS vulnerability. A great cheat sheet on this is the XSS Cheat Sheet at ha.ckers.org.

But, detection is useless in this case. You only need prevention, and the correct use of htmlspecialchars/htmlentities on your text inputs before saving them to your database is faster and better than detecting bad input.




Answer 11:


filter_input + FILTER_SANITIZE_STRING (there are lots of flags you can choose from; note that this filter is deprecated as of PHP 8.1)

:- http://www.php.net/manual/en/filter.filters.sanitize.php




Answer 12:


Regex is still the most efficient way of solving your problem. It doesn't matter what frameworks you plan to use, or are advised to use; the most efficient way would still be a custom regex. You can test the string with a regex, and remove (or convert) the affected section using the htmlspecialchars() function.
No need to install any other framework, or use some long-winded application.




Answer 13:


You can make use of the strip_tags function in PHP. This function will strip HTML and PHP tags from given data.

For example, $data is the variable which holds your content then you can use this like this:

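// Assumes this runs inside a function that reports whether $data is tag-free.
if ($data !== strip_tags($data)) {
    return false; // strip_tags() removed something, so tags were present
}
return true;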

It will check stripped content against the original content. If both are equal then we can hope there aren't any HTML tags, and it returns true. Otherwise, it returns false as it found some HTML tags.



Source: https://stackoverflow.com/questions/8419038/what-is-the-correct-way-to-detect-whether-string-inputs-contain-html-or-not
