I download tika-core and tika-parser libraries, but I could not find the example codes to parse HTML documents to string. I have to get rid of all html tags of source of a web p
Do you want a plain text version of a html file? If so, all you need is something like:
InputStream input = new FileInputStream("myfile.html");
ContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
new HtmlParser().parse(input, handler, metadata, new ParseContext());
String plainText = handler.toString();
The BodyContentHandler, when created with no constructor arguments or with a character limit, will capture the text (only) of the body of the html and return it to you.
You can also you Tika AutoDetectParser to parse any type of files such as HTML. Here is a simple example of that:
try {
InputStream input = new FileInputStream(new File(path));
ContentHandler textHandler = new BodyContentHandler();
Metadata metadata = new Metadata();
AutoDetectParser parser = new AutoDetectParser();
ParseContext context = new ParseContext();
parser.parse(input, textHandler, metadata, context);
System.out.println("Title: " + metadata.get(metadata.TITLE));
System.out.println("Body: " + textHandler.toString());
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
} catch (SAXException e) {
e.printStackTrace();
} catch (TikaException e) {
e.printStackTrace();
}