Jsoup unescapes special characters

一整个雨季 2021-01-18 15:26

I'm using Jsoup to remove all the images from an HTML page. I'm receiving the page through an HTTP response, which also contains the content charset.

The problem

1 answer
  • 2021-01-18 16:09

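    Here is a workaround that involves no charset other than the one specified in the HTTP header. It masks every escape sequence before parsing, so Jsoup never decodes it, and restores the sequences in the serialized output.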

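    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Entities.EscapeMode;

    // Mask every entity ("&name;" becomes "**name;") so Jsoup does not decode it.
    String check = "<html><head></head><body><p>isn&rsquo;t</p></body></html>".replaceAll("&([^;]+?);", "**$1;");

    Document doc = Jsoup.parse(check);
    doc.outputSettings().prettyPrint(false).escapeMode(EscapeMode.extended);

    // Restore the masked entities in the serialized HTML.
    System.out.println(doc.outerHtml().replaceAll("\\*\\*([^;]+?);", "&$1;"));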
    

    OUTPUT

    <html><head></head><body><p>isn&rsquo;t</p></body></html>
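
The two string substitutions in the workaround can be pulled out into helpers and checked without Jsoup. This is only a sketch: the class name `EntityMask` and the method names `mask` and `unmask` are made up for illustration, and the regular expressions are the ones used above.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of Jsoup): hides entities before parsing
// and restores them after serialization.
final class EntityMask {
    // Same patterns as in the workaround above.
    private static final Pattern ENTITY = Pattern.compile("&([^;]+?);");
    private static final Pattern MASKED = Pattern.compile("\\*\\*([^;]+?);");

    // "&rsquo;" becomes "**rsquo;", which an HTML parser treats as plain text.
    static String mask(String html) {
        return ENTITY.matcher(html).replaceAll("**$1;");
    }

    // "**rsquo;" becomes "&rsquo;" again.
    static String unmask(String html) {
        return MASKED.matcher(html).replaceAll("&$1;");
    }

    public static void main(String[] args) {
        String masked = mask("<p>isn&rsquo;t</p>");
        System.out.println(masked);         // <p>isn**rsquo;t</p>
        System.out.println(unmask(masked)); // <p>isn&rsquo;t</p>
    }
}
```

Note that the marker is not collision-safe: text that already contains something like `**note;` would come back as `&note;`, so a marker that cannot occur in the page may be a better choice.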
    

    DISCUSSION

    I wish there was a solution in Jsoup's API - @dlv

    Using Jsoup's API would require you to write a custom NodeVisitor, which would mean (re)inventing code that already exists inside Jsoup. The custom NodeVisitor would emit an HTML escape sequence instead of the Unicode character.
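
To give an idea of the per-text-node step such a visitor would have to perform, here is a rough sketch in plain Java. The Jsoup wiring is left out, the class `NamedEntityEscaper` is hypothetical, and the table holds only four entities for illustration; a usable version would need the full HTML entity list and would also have to escape `&`, `<` and `>`.

```java
import java.util.Map;

// Hypothetical helper: turns selected Unicode characters back into named entities.
final class NamedEntityEscaper {
    // Tiny illustrative table (code point -> entity name).
    private static final Map<Integer, String> NAMES = Map.of(
            0x2019, "rsquo",
            0x2018, "lsquo",
            0x2014, "mdash",
            0x00A9, "copy");

    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            String name = NAMES.get(cp);
            if (name != null) {
                out.append('&').append(name).append(';'); // known entity
            } else {
                out.appendCodePoint(cp);                  // leave everything else as is
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("isn\u2019t")); // isn&rsquo;t
    }
}
```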

    Another option would involve writing a custom character encoder. The default UTF-8 character encoder can encode the character that &rsquo; stands for, which is why Jsoup does not preserve the original escape sequence in the final HTML code.
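
The encoder point can be checked with the JDK alone. The sketch below assumes that the serializer decides per character by asking the output charset's encoder whether it can encode that character; the class `EncoderCheck` and its method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: compares what two standard encoders can represent.
final class EncoderCheck {
    static boolean utf8CanEncode(char c) {
        return StandardCharsets.UTF_8.newEncoder().canEncode(c);
    }

    static boolean asciiCanEncode(char c) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        char rsquo = '\u2019'; // the character that &rsquo; stands for
        System.out.println(utf8CanEncode(rsquo));  // true: can be written as is
        System.out.println(asciiCanEncode(rsquo)); // false: would need an escape
    }
}
```

Under that assumption, an encoder that reports a character as not encodable would push it back to an escape sequence, but it would do so for every such character on the page, not only for the ones that were escaped in the original.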

    Either of the two options above represents a big coding effort. Ultimately, an enhancement could be added to Jsoup to let us choose how characters are written in the final HTML code: as a hexadecimal escape (&#xAB;), a decimal escape (&#151;), the original escape sequence (&rsquo;), or the encoded character itself (which is the case in your post).
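
For reference, the two numeric forms are easy to produce from a code point. A minimal sketch, with `NumericEscapes` as a made-up class name and nothing here taken from Jsoup:

```java
// Hypothetical helper: builds numeric character references for a code point.
final class NumericEscapes {
    // Hexadecimal form, e.g. &#x2019;
    static String hex(int codePoint) {
        return String.format("&#x%X;", codePoint);
    }

    // Decimal form, e.g. &#8217;
    static String decimal(int codePoint) {
        return "&#" + codePoint + ";";
    }

    public static void main(String[] args) {
        int rsquo = "\u2019".codePointAt(0);
        System.out.println(hex(rsquo));     // &#x2019;
        System.out.println(decimal(rsquo)); // &#8217;
    }
}
```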
