Jsoup unescapes special characters

一整个雨季 2021-01-18 15:26

I'm using Jsoup to remove all the images from an HTML page. I'm receiving the page through an HTTP response, which also contains the content charset.

The problem

1 Answer
  •  清酒与你
    2021-01-18 16:09

    Here is a workaround that does not involve any charset other than the one specified in the HTTP header. It masks every entity before parsing and restores it after serialization.

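        import org.jsoup.Jsoup;
        import org.jsoup.nodes.Document;
        import org.jsoup.nodes.Entities.EscapeMode;

        // Sample input assumed here: any markup containing a named entity.
        // Mask each entity (&name;) as **name; so the parser leaves it untouched.
        String check = "<p>isn&rsquo;t</p>".replaceAll("&([^;]+?);", "**$1;");
        Document doc = Jsoup.parse(check);
        doc.outputSettings().prettyPrint(false).escapeMode(EscapeMode.extended);
        // Turn the masked entities back into real ones in the serialized output.
        System.out.println(doc.outerHtml().replaceAll("\\*\\*([^;]+?);", "&$1;"));

    OUTPUT

        <html><head></head><body><p>isn&rsquo;t</p></body></html>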
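The two `replaceAll` calls in the workaround can be pulled out into a pair of helpers, which makes the round trip easier to check in isolation. This is only a sketch using the standard library: the class name `EntityMasker` and its methods are hypothetical, and the `**` marker is simply the one chosen in the answer.

```java
import java.util.regex.Pattern;

final class EntityMasker {
    // Matches an entity reference such as &rsquo; or &#8217;
    private static final Pattern ENTITY = Pattern.compile("&([^;]+?);");
    // Matches the masked form produced by mask(), e.g. **rsquo;
    private static final Pattern MASKED = Pattern.compile("\\*\\*([^;]+?);");

    // Replace the leading '&' of every entity with "**" before parsing.
    static String mask(String html) {
        return ENTITY.matcher(html).replaceAll("**$1;");
    }

    // Restore the leading '&' after the document has been serialized.
    static String unmask(String html) {
        return MASKED.matcher(html).replaceAll("&$1;");
    }

    public static void main(String[] args) {
        String masked = mask("<p>isn&rsquo;t</p>");
        System.out.println(masked);          // <p>isn**rsquo;t</p>
        System.out.println(unmask(masked));  // <p>isn&rsquo;t</p>
    }
}
```

Note that the pattern is deliberately loose: a bare `&` followed later by a `;`, or text that already contains `**name;`, would be rewritten too, so a less common marker may be safer for real pages.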

    DISCUSSION

    I wish there was a solution in Jsoup's API - @dlv

    Using Jsoup's API would require you to write a custom NodeVisitor, which would mean reinventing code that already exists inside Jsoup. The custom NodeVisitor would emit an HTML escape sequence instead of the Unicode character.

    Another option would be to write a custom character encoder. The default UTF-8 character encoder can encode ’, which is why Jsoup doesn't preserve the original escape sequence in the final HTML code.
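The encoder argument can be checked with the JDK alone. The sketch below assumes, as the answer implies, that a serializer writes a character literally whenever the output charset's encoder can represent it and falls back to an escape otherwise. `EncoderCheck` is a hypothetical name, and the code does not touch Jsoup itself.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncoderCheck {
    // True if the charset can represent the character directly.
    static boolean canEncode(Charset charset, char c) {
        return charset.newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        char rsquo = '\u2019'; // RIGHT SINGLE QUOTATION MARK, the character behind &rsquo;
        System.out.println(canEncode(StandardCharsets.UTF_8, rsquo));    // true
        System.out.println(canEncode(StandardCharsets.US_ASCII, rsquo)); // false
    }
}
```

Under that assumption, a UTF-8 output never needs an escape for this character, whereas a narrower charset such as US-ASCII would force one.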

    Either of the two options above represents a significant coding effort. Ultimately, an enhancement could be added to Jsoup to let us choose how characters are generated in the final HTML code: hexadecimal escape (&#xAB;), decimal escape (&#171;), the original escape sequence (&laquo;), or the encoded character itself (which is the case in your post).
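For reference, the numeric forms listed above are easy to produce from a code point with plain Java. This is a minimal sketch and not a Jsoup feature: `EscapeForms` and its methods are hypothetical names. The named form is left out because it needs an entity lookup table.

```java
import java.util.Locale;

final class EscapeForms {
    // Hexadecimal numeric character reference, e.g. &#xAB;
    static String hex(int codePoint) {
        return "&#x" + Integer.toHexString(codePoint).toUpperCase(Locale.ROOT) + ";";
    }

    // Decimal numeric character reference, e.g. &#171;
    static String decimal(int codePoint) {
        return "&#" + codePoint + ";";
    }

    // The character itself, to be written in the output charset.
    static String literal(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        int laquo = 0xAB; // LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
        System.out.println(hex(laquo));     // &#xAB;
        System.out.println(decimal(laquo)); // &#171;
        System.out.println(literal(laquo));
    }
}
```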
