Removing HTML entities while preserving line breaks with JSoup

情书的邮戳

I have been using JSoup to parse lyrics and it has been great so far, but I have run into a problem.

I can use Node.html() to return the full HTML of t

2 Answers

無奈伤痛

2020-12-21 06:53

Based on another answer from Stack Overflow, I added a few fixes and came up with:

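    // Mark <br> tags and raw newlines with a placeholder that survives text() whitespace normalization.
    String text = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", "br2nl").replaceAll("\n", "br2nl")).text();
    // Turn each placeholder (and the space text() may leave after it) back into a line break.
    text = text.replaceAll("br2nl ", "\n").replaceAll("br2nl", "\n").trim();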

Hope this helps
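
To make the placeholder round trip easier to follow, here is a minimal sketch that runs without JSoup on the classpath. The class name `Br2nlSketch`, the method `textWithLineBreaks`, and the `extractText` parameter are hypothetical names introduced for illustration; the regex and the `br2nl` marker are the ones from the snippet above. The tag-stripping lambda in `main` is only a crude stand-in for JSoup's text extraction, not a replacement for it.

```java
import java.util.function.UnaryOperator;

class Br2nlSketch {
    // Marker taken from the answer; it must not occur in the real text.
    static final String MARKER = "br2nl";

    /**
     * Swaps <br> tags and raw newlines for MARKER, runs the supplied
     * text extractor, then turns each marker back into a newline.
     * The extractor is a parameter (an assumption of this sketch) so the
     * round trip can be tried without a JSoup dependency.
     */
    static String textWithLineBreaks(String html, UnaryOperator<String> extractText) {
        String marked = html
                .replaceAll("(?i)<br[^>]*>", MARKER)  // matches <br>, <BR/>, <br class="x">
                .replaceAll("\n", MARKER);
        String text = extractText.apply(marked);
        return text
                .replaceAll(MARKER + " ", "\n")  // marker followed by a leftover space
                .replaceAll(MARKER, "\n")
                .trim();
    }

    public static void main(String[] args) {
        // Crude stand-in for real HTML-to-text extraction: drop tags, collapse whitespace.
        UnaryOperator<String> standIn =
                h -> h.replaceAll("<[^>]*>", "").replaceAll("\\s+", " ");
        String html = "<p>First line<br>Second line<BR/>Third line</p>";
        System.out.println(textWithLineBreaks(html, standIn));
    }
}
```

With JSoup available, the extractor passed in would be `h -> Jsoup.parse(h).text()`, which is the same call the answer uses.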


隐瞒了意图╮

2020-12-21 07:04

(Disclaimer) I haven't used this API, but a quick look at the docs suggests that you could visit each descendant node and write out its text content, inserting a line break whenever you encounter a tag such as <br>.

The TextNode.getWholeText() call also looks useful.
