I need to convert HTML to plain text. My only formatting requirement is to retain new lines in the plain text. New lines should be displayed not only in the case of <
Building on your example, with a hint from the "html to plain text?" message:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class TestJsoup
{
    public void simpleParse()
    {
        try
        {
            // Fetch and parse the page
            Document doc = Jsoup.connect("http://www.particle.kth.se/~lindsey/JavaCourse/Book/Part1/Java/Chapter09/scannerConsole.html").get();
            // Trick for better formatting: wrap the body in <pre> (assumed wrapper markup),
            // intended to keep text() from collapsing the original line breaks
            doc.body().wrap("<pre></pre>");
            String text = doc.text();
            // Converting nbsp entities to ordinary spaces
            text = text.replace('\u00A0', ' ');
            System.out.print(text);
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }

    public static void main(String[] args)
    {
        TestJsoup tjs = new TestJsoup();
        tjs.simpleParse();
    }
}
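
If the extracted text still comes out with stray whitespace, a small post-processing step may help. The following is only a sketch using the standard library, not anything jsoup provides: the class name PlainTextCleanup and the method cleanUp are made up for illustration. It assumes you want non-breaking spaces turned into ordinary spaces, line endings normalised to \n, trailing blanks dropped, and runs of three or more new lines collapsed into one blank line, while single new lines are kept.

```java
import java.util.regex.Pattern;

// Hypothetical helper for tidying text after HTML extraction (not part of jsoup).
class PlainTextCleanup {
    // Spaces or tabs that sit directly before a line break or the end of input
    private static final Pattern TRAILING_BLANKS = Pattern.compile("[ \\t]+(?=\\n|$)");
    // Three or more consecutive line breaks, i.e. more than one blank line
    private static final Pattern EXTRA_BLANK_LINES = Pattern.compile("\\n{3,}");

    static String cleanUp(String text) {
        // Convert non-breaking spaces (nbsp) to ordinary spaces
        String s = text.replace('\u00A0', ' ');
        // Normalise Windows and old Mac line endings to a single line feed
        s = s.replace("\r\n", "\n").replace('\r', '\n');
        // Drop blanks at the end of each line
        s = TRAILING_BLANKS.matcher(s).replaceAll("");
        // Keep at most one blank line between paragraphs
        s = EXTRA_BLANK_LINES.matcher(s).replaceAll("\n\n");
        return s;
    }

    public static void main(String[] args) {
        String raw = "First\u00A0line  \r\n\r\n\r\n\r\nSecond line\t\n";
        System.out.print(cleanUp(raw));
    }
}
```

In the answer's code you would call it on the string returned by doc.text() in place of the single nbsp replacement. Whether collapsing blank lines is desirable depends on the page, so treat that rule as optional.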