Strip HTML from Text JavaScript

北荒 2020-11-21 05:08

Is there an easy way to take a string of html in JavaScript and strip out the html?

30 Answers
  • 2020-11-21 05:37

    If you want to keep the links and the structure of the content (h1, h2, etc.), then you should check out TextVersionJS. You can use it with any HTML, although it was created to convert an HTML email to plain text.

    The usage is very simple. For example in node.js:

    var createTextVersion = require("textversionjs");
    var yourHtml = "<h1>Your HTML</h1><ul><li>goes</li><li>here.</li></ul>";
    
    var textVersion = createTextVersion(yourHtml);
    

    Or in the browser with pure js:

    <script src="textversion.js"></script>
    <script>
      var yourHtml = "<h1>Your HTML</h1><ul><li>goes</li><li>here.</li></ul>";
      var textVersion = createTextVersion(yourHtml);
    </script>
    

    It also works with require.js:

    define(["textversionjs"], function(createTextVersion) {
      var yourHtml = "<h1>Your HTML</h1><ul><li>goes</li><li>here.</li></ul>";
      var textVersion = createTextVersion(yourHtml);
    });
    
  • 2020-11-21 05:38

    I would like to share an edited version of Shog9's accepted answer.

    As Mike Samuel pointed out in a comment, that function can execute inline JavaScript code.
    But Shog9 is right when saying "let the browser do it for you..."

    So here is my edited version, using DOMParser:

    function strip(html){
       let doc = new DOMParser().parseFromString(html, 'text/html');
       return doc.body.textContent || "";
    }
    

    Here is the code to test the inline JavaScript:

    strip("<img onerror='alert(\"could run arbitrary JS here\")' src=bogus>")
    

    Also, it does not request resources (like images) when parsing:

    strip("Just text <img src='https://assets.rbl.ms/4155638/980x.jpg'>")
    
  • An improvement to the accepted answer.

    function strip(html)
    {
       var tmp = document.implementation.createHTMLDocument("New").body;
       tmp.innerHTML = html;
       return tmp.textContent || tmp.innerText || "";
    }
    

    This way, running something like this will do no harm:

    strip("<img onerror='alert(\"could run arbitrary JS here\")' src=bogus>")
    

    Firefox, Chromium and Internet Explorer 9+ are safe. Opera Presto is still vulnerable. Also, images mentioned in the string are not downloaded in Chromium and Firefox, which saves HTTP requests.

  • 2020-11-21 05:38
    var text = html.replace(/<\/?("[^"]*"|'[^']*'|[^>])*(>|$)/g, "");
    

    This is a regex version, which is more resilient to malformed HTML, like:

    Unclosed tags

    Some text <img

    "<", ">" inside tag attributes

    Some text <img alt="x > y">

    Newlines

    Some <a href="http://google.com">

    The code

    var html = '<br>This <img alt="a>b" \r\n src="a_b.gif" />is > \nmy<>< > <a>"text"</a'
    var text = html.replace(/<\/?("[^"]*"|'[^']*'|[^>])*(>|$)/g, "");
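
The cases listed above can be checked with a small sketch. The pattern is the one from this answer; the wrapper name `stripTags` and the sample strings are only illustrative assumptions, and nothing here touches the DOM, so it should also run outside the browser.

```javascript
// Pattern copied from the answer above; the names below are hypothetical.
var TAG_PATTERN = /<\/?("[^"]*"|'[^']*'|[^>])*(>|$)/g;

function stripTags(html) {
  // Coerce to a string so null/undefined do not throw.
  return String(html == null ? "" : html).replace(TAG_PATTERN, "");
}

var samples = [
  "Some text <img",                               // unclosed tag
  'Some text <img alt="x > y">',                  // ">" inside an attribute
  'Some <a\nhref="http://google.com">link</a>',   // newline inside a tag
  "1 < 2"                                         // a bare "<" in plain text
];

samples.forEach(function (s) {
  console.log(JSON.stringify(stripTags(s)));
});
```

Note that the pattern treats every "<" as the start of a tag, so plain text such as "1 < 2" loses everything from the "<" onward, and character entities such as `&amp;` are left undecoded.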
    
  • 2020-11-21 05:40

    Another, admittedly less elegant solution than nickf's or Shog9's, would be to recursively walk the DOM starting at the <body> tag and append each text node.

    var bodyContent = document.getElementsByTagName('body')[0];
    var result = appendTextNodes(bodyContent);
    
    function appendTextNodes(element) {
        var text = '';
    
        // Loop through the childNodes of the passed in element
        for (var i = 0, len = element.childNodes.length; i < len; i++) {
            // Get a reference to the current child
            var node = element.childNodes[i];
            // Append the node's value if it's a text node
            if (node.nodeType == 3) {
                text += node.nodeValue;
            }
            // Recurse through the node's children, if there are any,
            // and append the text they return
            if (node.childNodes.length > 0) {
                text += appendTextNodes(node);
            }
        }
        // Return the final result
        return text;
    }
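
The same walk can also be written without recursion, using an explicit stack. This is only a sketch: `collectText` is a hypothetical name, and because the function relies on nothing but `nodeType`, `nodeValue` and `childNodes`, it is exercised here against plain objects standing in for DOM nodes (an assumption made so that it runs outside the browser).

```javascript
var TEXT_NODE = 3; // same numeric value as Node.TEXT_NODE in browsers

function collectText(root) {
  var parts = [];
  var stack = [root];
  while (stack.length > 0) {
    var node = stack.pop();
    if (node.nodeType === TEXT_NODE) {
      parts.push(node.nodeValue);
    }
    var children = node.childNodes || [];
    // Push children in reverse so they are popped in document order.
    for (var i = children.length - 1; i >= 0; i--) {
      stack.push(children[i]);
    }
  }
  return parts.join("");
}

// Plain-object stand-ins for DOM nodes (hypothetical test data, not a real DOM).
function el(children) {
  return { nodeType: 1, nodeValue: null, childNodes: children };
}
function txt(value) {
  return { nodeType: 3, nodeValue: value, childNodes: [] };
}

// Roughly <body><p>Hello <b>big</b> world</p></body>
var body = el([el([txt("Hello "), el([txt("big")]), txt(" world")])]);
console.log(collectText(body));
```

In a browser, passing `document.body` should work the same way, since a NodeList supports `length` and index access. Like the recursive version, this collects every text node, including the contents of `<script>` and `<style>` elements.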
    
  • 2020-11-21 05:42

    As an extension to the jQuery method: if your string might not contain HTML (e.g. if you are trying to remove HTML from a form field), then

    jQuery(html).text();
    

    will return an empty string if there is no HTML

    Use:

    jQuery('<p>' + html + '</p>').text();
    

    instead.

    Update: As has been pointed out in the comments, in some circumstances this solution will execute JavaScript contained within html. If the value of html could be influenced by an attacker, use a different solution.
