I have been using JSoup to parse lyrics and it has been great until now, but I have run into a problem.
I can use Node.html()
to return the full HTML of t
Based on another answer from Stack Overflow, I added a few fixes and came up with:
String text = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", "br2nl").replaceAll("\n", "br2nl")).text();
text = text.replaceAll("br2nl ", "\n").replaceAll("br2nl", "\n").trim();
Hope this helps
(Disclaimer) I haven't used this API, but a quick look at the docs suggests that you could visit each descendant node and dump out its text contents. Breaks could be inserted when special tags like <br> are encountered.
The TextNode.getWholeText() call also looks useful.
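For what it's worth, the node-visiting idea might look roughly like the sketch below. It is untested against real lyrics pages and rests on a few assumptions: the class name LyricsText and the helper textWithBreaks are made up for illustration, only <br> is treated as a line break, and TextNode.text() (whitespace-normalized) is used instead of getWholeText() so that newlines in the HTML source do not leak into the result.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

public class LyricsText {

    // Hypothetical helper (not part of Jsoup): walks every node under <body>
    // and builds the text, emitting a newline whenever a <br> is encountered.
    static String textWithBreaks(String html) {
        final StringBuilder out = new StringBuilder();

        Jsoup.parse(html).body().traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                if (node instanceof TextNode) {
                    // text() collapses runs of whitespace; swap in
                    // getWholeText() to keep the source whitespace as-is.
                    out.append(((TextNode) node).text());
                } else if ("br".equals(node.nodeName())) {
                    out.append('\n');
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // Nothing to do when leaving a node.
            }
        });

        return out.toString().trim();
    }
}
```

Calling LyricsText.textWithBreaks(html) on the element's HTML should then give one line of text per <br>-separated line. If the page also separates verses with <p> or <div>, the same head() check could be extended to those tag names.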