There is a PDF document that has attachments (here: joboptions) that should not be extracted by Tika. The contents should not be sent to Solr. Is there any way to exclude certa
Implement a custom org.apache.tika.extractor.DocumentSelector and set it on the ParseContext. The DocumentSelector is called with the metadata of each embedded document to decide whether that embedded document should be parsed.
Example DocumentSelector:
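import org.apache.tika.extractor.DocumentSelector;
import org.apache.tika.metadata.Metadata;

public class CustomDocumentSelector implements DocumentSelector {
    @Override
    public boolean select(Metadata metadata) {
        // Name of the embedded file; may be null if the container does not record one.
        String resourceName = metadata.get(Metadata.RESOURCE_NAME_KEY);
        // Parse everything except attachments whose name ends in ".joboptions".
        return resourceName == null || !resourceName.endsWith(".joboptions");
    }
}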
Register it at the ParseContext:
parseContext.set(DocumentSelector.class, new CustomDocumentSelector());
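The suffix check in the selector above is case-sensitive and covers a single extension. Here is a minimal sketch of a more forgiving name check, written in plain Java without any Tika dependency so it can be tested on its own. The class name EmbeddedNameFilter and its method shouldParse are hypothetical and not part of Tika. Trimming and lower-casing the name are assumptions about how attachment names might vary.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Tika: decides from the file name alone
// whether an embedded document should be parsed.
final class EmbeddedNameFilter {

    // Extensions to skip, lower-case, including the leading dot.
    private static final List<String> SKIPPED_EXTENSIONS = Arrays.asList(".joboptions");

    private EmbeddedNameFilter() {
    }

    // Returns true when the embedded document should be parsed.
    // A missing name is treated as "parse it", matching the selector above.
    static boolean shouldParse(String resourceName) {
        if (resourceName == null) {
            return true;
        }
        // Assumption: names may differ in case or carry stray whitespace.
        String name = resourceName.trim().toLowerCase(Locale.ROOT);
        for (String extension : SKIPPED_EXTENSIONS) {
            if (name.endsWith(extension)) {
                return false;
            }
        }
        return true;
    }
}
```

With this in place, select(Metadata) could simply return EmbeddedNameFilter.shouldParse(metadata.get(Metadata.RESOURCE_NAME_KEY)), and further extensions could be added to the list without touching the selector.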
@gagravarr, we changed that behavior via TIKA-2096 in Tika 1.15. The default is now "extract all embedded documents". To avoid parsing embedded documents, call:
parseContext.set(Parser.class, new EmptyParser());
Or implement EmbeddedDocumentExtractor so that it does nothing, and pass that in via the ParseContext.
If you were using Solr DIH's TikaEntityProcessor, I'd set extractEmbedded to false, but you aren't; and please don't. :)
So, I don't think there's an easy way to turn off parsing of embedded documents only for PDFs, and I'm not sure you'd want to. What if there were an MSWord file attached to a PDF, for example?
If you want to ignore .joboptions, you could use a custom EmbeddedDocumentExtractor.