Removing HTML entities while preserving line breaks with JSoup

情书的邮戳 2020-12-21 06:34

I have been using JSoup to parse lyrics, and it has worked well so far, but I have now run into a problem.

I can use Node.html() to return the full HTML of t

2 Answers
  • 2020-12-21 06:53

    Based on another answer from Stack Overflow, I added a few fixes and came up with:

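        // Swap <br> tags and raw newlines for a placeholder so they survive text(), which collapses whitespace.
        String text = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", "br2nl").replaceAll("\n", "br2nl")).text();
        // Turn each placeholder (and the space text() may leave after it) back into a newline.
        text = text.replaceAll("br2nl ", "\n").replaceAll("br2nl", "\n").trim();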
    

    Hope this helps.

  • 2020-12-21 07:04

    (Disclaimer: I haven't used this API.) A quick look at the docs suggests that you could visit each descendant node and append its text content to a buffer, inserting a line break whenever a special tag such as <br> is encountered.

    The TextNode.getWholeText() call also looks useful.
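
    A minimal sketch of that idea, assuming a reasonably recent jsoup on the classpath. The class name LineBreakText and the helper textWithBreaks are made up for illustration and are not part of jsoup. It walks the parsed body with Node.traverse(), appends TextNode.getWholeText() for text nodes, and emits a newline for every <br>. Entities such as &amp; are already decoded by the parser, so no separate unescaping step should be needed.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

// Hypothetical helper class, named here only for illustration.
class LineBreakText {

    // Returns the text of the HTML fragment with each <br> turned into '\n'.
    static String textWithBreaks(String html) {
        Document doc = Jsoup.parse(html);
        final StringBuilder out = new StringBuilder();

        doc.body().traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                if (node instanceof TextNode) {
                    // getWholeText() keeps the text as parsed, without whitespace normalization.
                    out.append(((TextNode) node).getWholeText());
                } else if ("br".equals(node.nodeName())) {
                    // The HTML parser lower-cases tag names, so <BR> also matches here.
                    out.append('\n');
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // Nothing to do when leaving a node.
            }
        });

        return out.toString().trim();
    }
}
```

    For example, LineBreakText.textWithBreaks("Hello &amp; welcome<br>second line") should give "Hello & welcome", a newline, then "second line". Because getWholeText() does not normalize whitespace, any newlines already present in the source HTML are kept as well, which may or may not be what you want for lyrics.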
