What is the correct way to detect whether string inputs contain HTML or not?

旧时难觅i 2020-12-23 15:08

When receiving user input on forms I want to detect whether fields like "username" or "address" do not contain markup that has a special meaning in XML (RSS feeds) or

13 answers
  • 2020-12-23 15:17

    You could use a regular expression if you know which characters are allowed. If the username contains a character that isn't allowed, throw an error:

    [a-zA-Z0-9_.-]
    

    Test your regular expressions here: http://www.perlfect.com/articles/regextutor.shtml

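    <?php
    $username = "abcdef";
    // Anchor the pattern so the whole string must consist of allowed characters;
    // an unanchored character class would match any string containing one of them.
    $pattern = '/\A[a-zA-Z0-9_.-]+\z/';
    if (preg_match($pattern, $username) === 1) {
        echo "Valid username\n";
    } else {
        echo "Invalid username\n";
    }
    ?>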
    
  • 2020-12-23 15:19

    I think you answered your own question. The function htmlspecialchars() does exactly what you need, but you should not use it until you write the user input to a page. To store it in a database there are other functions, like mysqli_real_escape_string().

    As a rule of thumb, one can say that you should escape user input only when needed, for the given target system:

    1. Escaping user input often means a loss of the original data, and different target systems (HTML output / SQL / execution) need different escaping. They can even conflict with each other.
    2. You have to escape the data for the given purpose anyway, always. You should not trust even the entries from your database. So escaping when reading from user input does not have any big advantage, but double escaping can lead to invalid data.

    In contrast to escaping, validating the content is a good thing to do early. If you expect an integer, only accept integers, otherwise refuse the user input.
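
    The split between early validation and late escaping might look like the sketch below. The helper names validateAge() and renderName() and the 0 to 150 range are assumptions made for illustration; they are not from the question.

    ```php
    <?php
    // Hypothetical helper: validate early and accept only integers in a plausible range.
    function validateAge(string $raw): ?int
    {
        $age = filter_var($raw, FILTER_VALIDATE_INT, [
            'options' => ['min_range' => 0, 'max_range' => 150],
        ]);
        return $age === false ? null : $age;
    }

    // Hypothetical helper: escape only at the moment the value is written into HTML.
    function renderName(string $name): string
    {
        return '<p>' . htmlspecialchars($name, ENT_QUOTES, 'UTF-8') . '</p>';
    }

    var_dump(validateAge('42'));    // int(42)
    var_dump(validateAge('42abc')); // NULL
    echo renderName('<b>Bob</b> & co'), "\n"; // <p>&lt;b&gt;Bob&lt;/b&gt; &amp; co</p>
    ```

    The stored value stays unescaped, so the same data can later be escaped differently for SQL, XML or HTML.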

  • 2020-12-23 15:20

    I don't think you need to implement a huge algorithm to check whether a string has unsafe data; filters and regular expressions do the work. But if you need a more complex check, maybe this will fit your needs:

    <?php
    $strings = array();
    $strings[] = <<<EOD
        ';alert(String.fromCharCode(88,83,83))//\';alert(String.fromCharCode(88,83,83))//";alert(String.fromCharCode(88,83,83))//\";alert(String.fromCharCode(88,83,83))//--></SCRIPT>">'><SCRIPT>alert(String.fromCharCode(88,83,83))</SCRIPT>
    EOD;
    $strings[] = <<<EOD
        '';!--"<XSS>=&{()}
    EOD;
    $strings[] = <<<EOD
        <SCRIPT SRC=http://ha.ckers.org/xss.js></SCRIPT>
    EOD;
    $strings[] = <<<EOD
        This is a safe text
    EOD;
    $strings[] = <<<EOD
        <IMG SRC="javascript:alert('XSS');">
    EOD;
    $strings[] = <<<EOD
        <IMG SRC=javascript:alert('XSS')>
    EOD;
    $strings[] = <<<EOD
        <IMG SRC=&#106;&#97;&#118;&#97;&#115;&#99;&#114;&#105;&#112;&#116;&#58;&#97;&#108;&#101;&#114;&#116;&#40;&#39;&#88;&#83;&#83;&#39;&#41;>
    EOD;
    $strings[] = <<<EOD
        perl -e 'print "<IMG SRC=java\0script:alert(\"XSS\")>";' > out
    EOD;
    $strings[] = <<<EOD
        <SCRIPT/XSS SRC="http://ha.ckers.org/xss.js"></SCRIPT>
    EOD;
    $strings[] = <<<EOD
        </TITLE><SCRIPT>alert("XSS");</SCRIPT>
    EOD;
    
    
    
    libxml_use_internal_errors(true);
    $sourceXML = '<root><element>value</element></root>';
    $sourceXMLDocument = simplexml_load_string($sourceXML);
    $sourceCount = $sourceXMLDocument->children()->count();
    
    foreach( $strings as $string ){
        $unsafe = false;
        $XML = '<root><element>'.$string.'</element></root>';
        $XMLDocument = simplexml_load_string($XML);
        if( $XMLDocument===false ){
            $unsafe = true;
        }else{
    
            $count = $XMLDocument->children()->count();
            if( $count!=$sourceCount ){
                $unsafe = true;
            }
        }
    
        echo ($unsafe?'Unsafe':'Safe').': <pre>'.htmlspecialchars($string,ENT_QUOTES,'utf-8').'</pre><br />'."\n";
    }
    ?>
    
  • 2020-12-23 15:21

    The correct way to detect whether string inputs contain HTML tags, or any other markup that has a special meaning in XML or (X)HTML when displayed (other than being an entity) is simply

    if (mb_strpos($data, '<') === FALSE AND mb_strpos($data, '>') === FALSE)

    You are correct! When the value is printed as element content, XSS attacks require < or > around the values to get the browser to execute the code (at least from IE6+).

    Considering the output context given, this is sufficient to safely display in a format like HTML:

    <h2><?php print $input; ?></h2>
    <xml><item><?php print $input; ?></item></xml>

    Of course, if the input contains an entity such as &aacute;, a browser will not display it as &aacute; but as á, unless we use a function like htmlspecialchars when producing the output. In that case, even < and > would also be safe.

    In the case of using the string input as the value of an attribute, the safety depends on the attribute.

    If the attribute is an input value, we must quote it and use a function like htmlspecialchars in order to have the same content back for editing.

    <input value="<?php print htmlspecialchars($input, ENT_QUOTES, 'UTF-8');?>">

    Again, even the < and > characters would be safe here.

    We may conclude that we do not have to detect and reject anything in the input, as long as we always use htmlspecialchars to output it and our context always fits the above cases (or equally safe ones).

    [And we also have a number of ways to safely store it in the database, preventing SQL exploits.]

    What if the user wants the "username" to be "&amp; is not an &"? It contains neither < nor >... will we detect and reject it? Will we accept it? How will we display it? (This input gives interesting results in the new bounty!)
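
    As a minimal sketch of what happens to that input when it is escaped for output (the variable names are illustrative only):

    ```php
    <?php
    $input = '&amp; is not an &';

    // The angle-bracket check accepts this input: it contains neither < nor >.
    var_dump(mb_strpos($input, '<') === false && mb_strpos($input, '>') === false); // bool(true)

    // Escaped for HTML output, every literal & becomes &amp;.
    $escaped = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');
    echo $escaped, "\n"; // &amp;amp; is not an &amp;

    // Decoding the escaped form once, as a browser does when rendering, restores the input.
    var_dump(htmlspecialchars_decode($escaped, ENT_QUOTES) === $input); // bool(true)
    ```

    Printed without escaping, the same input would be displayed as "& is not an &", which is not what the user typed.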

    Finally, if our context expands, and we will use the string input as an anchor href, then our whole approach suddenly changes dramatically. But this scenario is not included in the question.
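
    To illustrate why the href case differs: a value such as javascript:alert(1) contains no < or > and passes through htmlspecialchars unchanged. One possible sketch, assuming a hypothetical helper safeHref() that allows only http and https URLs:

    ```php
    <?php
    // Hypothetical helper: return an HTML-escaped URL for use in href, or null if the scheme is not allowed.
    function safeHref(string $url): ?string
    {
        $url = trim($url);
        $scheme = parse_url($url, PHP_URL_SCHEME);
        // parse_url() yields null or false when there is no usable scheme; reject those too.
        if (!is_string($scheme) || !in_array(strtolower($scheme), ['http', 'https'], true)) {
            return null;
        }
        return htmlspecialchars($url, ENT_QUOTES, 'UTF-8');
    }

    var_dump(safeHref('javascript:alert(1)'));          // NULL
    var_dump(safeHref('https://example.com/?a=1&b=2')); // string: https://example.com/?a=1&amp;b=2
    ```

    This is an allowlist on the scheme, not a complete URL validator.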

    (It is worth mentioning that even when using htmlspecialchars, the output of a string input may differ if the character encodings differ between steps.)

  • 2020-12-23 15:21

    Regex is still the most efficient way of solving your problem. Whatever framework you plan to use, or are advised to use, a custom regex is still the most efficient approach. You can test the string with a regex, and remove (or convert) the affected section using an HTML-escaping function such as htmlspecialchars().
    No need to install any other framework, or use some long-winded application.
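
    The test-then-convert idea can be sketched as follows. The helper name containsMarkupChars() and the choice to treat <, > and & as the special characters are assumptions for illustration.

    ```php
    <?php
    // Hypothetical helper: true if the string contains a character with special meaning in HTML/XML.
    function containsMarkupChars(string $s): bool
    {
        return preg_match('/[<>&]/', $s) === 1;
    }

    $input = '<IMG SRC=x>';
    if (containsMarkupChars($input)) {
        // Convert the special characters instead of rejecting the input.
        $input = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');
    }
    echo $input, "\n"; // &lt;IMG SRC=x&gt;
    ```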

  • 2020-12-23 15:23

    If the reason for the question is XSS prevention, there are several ways to exploit an XSS vulnerability. A great cheatsheet about this is the XSS Cheatsheet at ha.ckers.org.

    But detection is useless in this case. You only need prevention, and the correct use of htmlspecialchars/htmlentities on your text inputs before saving them to your database is faster and better than detecting bad input.
