Question
I am trying to extract all the text from various documents, and I am using Apache Tika 1.4 to do it.
RecursiveTikaParser parser = new RecursiveTikaParser(new AutoDetectParser());
ParseContext parseContext = new ParseContext();
parseContext.set(Parser.class, parser);
RecursiveTikaParser is just a wrapper around AutoDetectParser.
Its parse method looks something like this:
ContentHandler content = new BodyContentHandler(-1);
Metadata metadata = new Metadata();
super.parse(stream, content, metadata, context);
System.out.println("Parsed text is " + content.toString());
This code has to handle many kinds of files, which is why I am using AutoDetectParser.
In my testing I noticed that, given an XML file, I can only extract the text between the tags, not the comments or the tags themselves.
Is it possible to extract everything from the file with my current approach?
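Answer 1:
Try something like this: detect the MIME type first and, if the file is XML, parse it with TXTParser so the whole file, tags and comments included, is treated as plain text.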
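// DETECTOR is assumed to be a shared Detector instance, e.g. new DefaultDetector()
Metadata metadata = new Metadata();
// Wrap the stream so it supports mark/reset, which detection relies on
stream = TikaInputStream.get(stream, null);
String mimeType = DETECTOR.detect(stream, metadata).toString();
Parser parser;
if (mimeType.equalsIgnoreCase("application/xml")) {
    // Treat XML as plain text so tags and comments are kept
    parser = new TXTParser();
} else {
    parser = new AutoDetectParser();
}
// Use new BodyContentHandler(-1) instead to lift the default 100,000-character limit
ContentHandler content = new BodyContentHandler();
parser.parse(stream, content, metadata, new ParseContext());
System.out.println(content.toString());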
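For context on why the comments and tags go missing in the first place: as far as I know, Tika's XML parser is SAX-based, and a SAX content handler only receives character data, while comments are reported through a separate LexicalHandler interface. Here is a minimal JDK-only sketch of that behaviour. It is not Tika's actual implementation, and the names XmlTextDemo and saxText are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of Tika.
public class XmlTextDemo {

    // Collects only the character data that SAX delivers to a content handler.
    static String saxText(String xml) throws Exception {
        StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<note><!-- draft --><to>Bob</to></note>";
        // The SAX view contains only the text between the tags.
        System.out.println("SAX text: " + saxText(xml));
        // Reading the same input as plain text keeps tags and comments.
        System.out.println("Raw text: " + xml);
    }
}
```

The comment never reaches characters(), which is why switching to a plain-text parser for XML input keeps it.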
Source: https://stackoverflow.com/questions/21175172/extract-text-from-xml-tags-in-an-xml-file-using-apach-tika-parser