I'm using Jsoup to remove all the images from an HTML page. I'm receiving the page through an HTTP response, which also contains the content charset.
The problem
Here is a workaround not involving any charset except the one specified in the HTTP header.
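import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities.EscapeMode;

// Mask every entity ("&name;" becomes "**name;") so the parser leaves it alone.
String check = "isn&rsquo;t".replaceAll("&([^;]+?);", "**$1;");
Document doc = Jsoup.parse(check);
doc.outputSettings().prettyPrint(false).escapeMode(EscapeMode.extended);
// Restore the masked entities in the serialized HTML.
System.out.println(doc.outerHtml().replaceAll("\\*\\*([^;]+?);", "&$1;"));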
OUTPUT
isn&rsquo;t
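The two regex steps of the workaround can be checked on their own, without Jsoup. The sketch below is only an illustration: the class and method names (EntityMask, mask, unmask) are hypothetical, and it assumes the input contains an entity such as &rsquo; and never contains a literal "**name;" sequence, which would be wrongly turned into an entity on the way back.

```java
import java.util.regex.Pattern;

public class EntityMask {
    // Same patterns as in the workaround; the "**" marker is arbitrary.
    private static final Pattern ENTITY = Pattern.compile("&([^;]+?);");
    private static final Pattern MASKED = Pattern.compile("\\*\\*([^;]+?);");

    /** Hide entities from the parser: "&name;" becomes "**name;". */
    static String mask(String html) {
        return ENTITY.matcher(html).replaceAll("**$1;");
    }

    /** Put the entities back: "**name;" becomes "&name;". */
    static String unmask(String html) {
        return MASKED.matcher(html).replaceAll("&$1;");
    }

    public static void main(String[] args) {
        String masked = mask("isn&rsquo;t");
        System.out.println(masked);          // isn**rsquo;t
        System.out.println(unmask(masked));  // isn&rsquo;t
    }
}
```

In the real workaround the Jsoup parse and serialization happen between mask and unmask; since the masked text contains no "&", the parser has nothing to decode.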
DISCUSSION
I wish there was a solution in Jsoup's API - @dlv
Using Jsoup's API would require you to write a custom NodeVisitor, which would mean (re)inventing code that already exists inside Jsoup. The custom NodeVisitor would write an HTML escape code back out instead of the Unicode character.
Another option would involve writing a custom character encoder. The default UTF-8 character encoder can encode ’. This is why Jsoup doesn't preserve the original escape sequence in the final HTML code.
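The encoder point can be seen with the JDK alone. This is a minimal sketch, not Jsoup's actual code: the class and method names are hypothetical, and it only assumes that the serializer asks the output charset's encoder whether a character is representable before deciding to escape it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodableCheck {
    /** True if the given charset can represent the character directly. */
    static boolean canEncode(Charset charset, char c) {
        return charset.newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        char quote = '\u2019'; // right single quotation mark
        System.out.println(canEncode(StandardCharsets.UTF_8, quote));    // true
        System.out.println(canEncode(StandardCharsets.US_ASCII, quote)); // false
    }
}
```

With UTF-8 the character is encodable, so there is no need to write it as an escape sequence; a custom encoder would have to report it as unencodable to change that.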
Either of the two options above represents a significant coding effort. Ultimately, an enhancement could be added to Jsoup to let us choose how characters are written in the final HTML code: as a hexadecimal escape (such as &#xAB;), a decimal escape, the original escape sequence (&rsquo;), or the encoded character itself (which is the case in your post).
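To make the numeric forms concrete, here is a small JDK-only sketch that builds the hexadecimal and decimal escapes for a code point. The class and method names are hypothetical and this is not part of Jsoup; the named form (&rsquo;) would need an entity lookup table and is left out.

```java
public class CharRefForms {
    /** Hexadecimal numeric character reference, e.g. &#x2019; */
    static String hexEscape(int codePoint) {
        return String.format("&#x%X;", codePoint);
    }

    /** Decimal numeric character reference, e.g. &#8217; */
    static String decimalEscape(int codePoint) {
        return "&#" + codePoint + ";";
    }

    public static void main(String[] args) {
        int cp = "\u2019".codePointAt(0); // right single quotation mark
        System.out.println(hexEscape(cp));     // &#x2019;
        System.out.println(decimalEscape(cp)); // &#8217;
    }
}
```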