Jsoup unescapes special characters


I'm using Jsoup to remove all the images from an HTML page. I'm receiving the page through an HTTP response - which also contains the content charset.

The problem


1 Answer

清酒与你

2021-01-18 16:09
Here is a workaround not involving any charset except the one specified in the HTTP header.
```
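import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities.EscapeMode;

// Mask every entity (&name;) as **name; so Jsoup does not decode it while parsing.
String check = "<html><head></head><body><p>isn&rsquo;t</p></body></html>".replaceAll("&([^;]+?);", "**$1;");

Document doc = Jsoup.parse(check);

doc.outputSettings().prettyPrint(false).escapeMode(EscapeMode.extended);

// Restore the masked entities in the serialized output.
System.out.println(doc.outerHtml().replaceAll("\\*\\*([^;]+?);", "&$1;"));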
```
OUTPUT
```
<html><head></head><body><p>isn&rsquo;t</p></body></html>
```
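The mask/restore idea can also be pulled out into a small helper with a stricter pattern, so that only real entity references (named, decimal, or hex) are touched and stray text such as "a & b;" is left alone. This is only a sketch: `EntityMasker` and the `@@ENT:` marker are made-up names, not part of Jsoup, and the marker is assumed never to occur in the input page.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not a Jsoup class): hides entity references from the
// parser and restores them after serialization.
final class EntityMasker {
    // Matches named (&rsquo;), decimal (&#8217;) and hex (&#x2019;) references only.
    private static final Pattern ENTITY = Pattern.compile("&(#?[A-Za-z0-9]+);");
    // Assumed marker; it must not appear anywhere in the original document.
    private static final String MARK = "@@ENT:";
    private static final Pattern MASKED = Pattern.compile(MARK + "(#?[A-Za-z0-9]+);");

    // Replace every entity reference with MARK + name + ";".
    static String mask(String html) {
        return ENTITY.matcher(html).replaceAll(MARK + "$1;");
    }

    // Turn every masked reference back into "&" + name + ";".
    static String unmask(String html) {
        return MASKED.matcher(html).replaceAll("&$1;");
    }
}
```

You would call `EntityMasker.mask(...)` on the page before `Jsoup.parse(...)` and `EntityMasker.unmask(...)` on the result of `doc.outerHtml()`, exactly where the two `replaceAll` calls sit in the snippet above.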
DISCUSSION

I wish there was a solution in Jsoup's API - @dlv

Using Jsoup's API would require you to write a custom NodeVisitor. That would lead to (re)inventing code that already exists inside Jsoup. The custom NodeVisitor would write an HTML escape code back out instead of a Unicode character.

Another option would be to write a custom character encoder. The default UTF-8 character encoder can encode ’, so Jsoup writes the character itself instead of preserving the original escape sequence in the final HTML code.

Either of the two options above is a big coding effort. Ultimately, an enhancement could be added to Jsoup to let us choose how characters are written in the final HTML code: as a hexadecimal escape (&#x2019;), a decimal escape (&#8217;), the original escape sequence (&rsquo;), or the encoded character itself (which is the case in your post).
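
To make the four output styles concrete, here is a rough sketch of what such a switch could look like as a post-processing step on already-serialized HTML. `EscapeStyles`, `Style`, and `render` are hypothetical names; nothing like this exists in Jsoup as far as this answer is concerned, and the name table only holds a few entries for illustration. Only non-ASCII characters are rewritten, on the assumption that the markup itself is plain ASCII.

```java
import java.util.Map;

// Hypothetical post-processing helper, not a Jsoup API.
final class EscapeStyles {
    enum Style { HEX, DECIMAL, NAMED, RAW }

    // Tiny illustrative table; a real one would cover every named entity.
    private static final Map<Integer, String> NAMES =
            Map.of(0x2019, "rsquo", 0x2018, "lsquo", 0xAB, "laquo");

    // Rewrites each non-ASCII code point in the chosen style; ASCII passes through.
    static String render(String text, Style style) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp < 0x80 || style == Style.RAW) {
                out.appendCodePoint(cp);
            } else if (style == Style.HEX) {
                out.append("&#x").append(Integer.toHexString(cp)).append(';');
            } else if (style == Style.NAMED && NAMES.containsKey(cp)) {
                out.append('&').append(NAMES.get(cp)).append(';');
            } else {
                // DECIMAL, or NAMED with no known name: fall back to a decimal escape.
                out.append("&#").append(cp).append(';');
            }
        });
        return out.toString();
    }
}
```

For the string from the question, `render("isn’t", Style.NAMED)` gives back `isn&rsquo;t`, while `Style.HEX` and `Style.DECIMAL` give `isn&#x2019;t` and `isn&#8217;t`.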