Jsoup - How to clean HTML by escaping, not deleting, the unwanted HTML?

谎友^ 2021-02-15 11:02

Is there a way of getting jsoup to clean a string with HTML in it by escaping the unwanted HTML rather than removing it completely? My example:

String dirty = \         


        
1 Answer
  • 2021-02-15 11:43

    Assuming you are parsing Strings rather than whole HTML documents (as in your question), this method will work:

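    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.nodes.TextNode;
    import org.jsoup.safety.Whitelist;
    import org.jsoup.select.Elements;
    public String escapeHtml(String source) {
        Document doc = Jsoup.parseBodyFragment(source);
        // Swap each <b> element for a text node holding its markup; jsoup escapes text on output.
        Elements elements = doc.select("b");
        for (Element element : elements) {
            element.replaceWith(new TextNode(element.toString(), ""));
        }
        // Remove all remaining markup except <a> and its listed attributes.
        return Jsoup.clean(doc.body().toString(), new Whitelist().addTags("a").addAttributes("a", "href", "name", "rel", "target"));
    }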
    

    You could turn the hard-coded "b" selector into a parameter, so that the method accepts a list of the tags you wish to escape.

    The associated passing JUnit test:

    @Test
    public void testHtmlEscaping() throws Exception {
        String source = "This is <b>REALLY</b> dirty code from <a href=\"www.rubbish.url.zzzz\">haxors-r-us</a>";
        String expected = "This is &lt;b&gt;REALLY&lt;/b&gt; dirty code from \n<a href=\"www.rubbish.url.zzzz\">haxors-r-us</a>";
        String transformed = transformer.escapeHtml(source);
        assertEquals(expected, transformed);
    }
    

    Note that the "expected" String in my test has a line break ("\n") before your "a" tag, because jsoup pretty-prints its output.
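
    The two remarks above (passing the tags in as an argument, and jsoup's output formatting) can be sketched together. This is only a rough illustration, not part of the original answer: the class name `TagEscaper` and the method `escapeTags` are made up for the example, and it assumes a recent jsoup release in which `Whitelist` has been renamed `Safelist` and `TextNode` takes a single String argument. Pretty-printing is switched off so that no line break is inserted before the "a" tag.

    ```java
    import java.util.List;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.nodes.TextNode;
    import org.jsoup.safety.Safelist;

    public class TagEscaper {

        // Hypothetical helper: escapes the given tags and removes all other markup except <a>.
        public static String escapeTags(String source, List<String> tagsToEscape) {
            Document doc = Jsoup.parseBodyFragment(source);
            // Turn off pretty-printing so jsoup adds no line breaks or indentation.
            Document.OutputSettings compact = new Document.OutputSettings().prettyPrint(false);
            doc.outputSettings(compact);

            if (!tagsToEscape.isEmpty()) {
                // A single combined selector ("b, i") yields matches in document order, so an
                // outer element is captured with its original inner markup before any child
                // is replaced, which avoids escaping the same markup twice.
                for (Element element : doc.select(String.join(", ", tagsToEscape))) {
                    element.replaceWith(new TextNode(element.outerHtml()));
                }
            }

            // Same allow-list as in the answer: only <a> with a few attributes survives.
            Safelist allowed = new Safelist()
                    .addTags("a")
                    .addAttributes("a", "href", "name", "rel", "target");
            return Jsoup.clean(doc.body().html(), "", allowed, compact);
        }

        public static void main(String[] args) {
            String source = "This is <b>REALLY</b> dirty code from "
                    + "<a href=\"www.rubbish.url.zzzz\">haxors-r-us</a>";
            System.out.println(escapeTags(source, List.of("b")));
        }
    }
    ```

    Tag names are used directly as CSS selectors here, so this sketch only suits plain names such as "b" or "script"; anything coming from user input would need validating first.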
