Tika Parser: Exclude PDF Attachments

There is a PDF document that has attachments (here: joboptions) that should not be extracted by Tika. The contents should not be sent to Solr. Is there any way to exclude certa

2 Answers

星月不相逢

2021-01-23 01:40
Implement a custom org.apache.tika.extractor.DocumentSelector and set it on the ParseContext. Tika calls the DocumentSelector with the metadata of each embedded document to decide whether that document should be parsed.

Example DocumentSelector:
```
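import org.apache.tika.extractor.DocumentSelector;
import org.apache.tika.metadata.Metadata;

public class CustomDocumentSelector implements DocumentSelector {

  @Override
  public boolean select(Metadata metadata) {
    // Name of the embedded document, if the parser supplied one.
    // (Tika 1.x constant; in Tika 2.x it is TikaCoreProperties.RESOURCE_NAME_KEY.)
    String resourceName = metadata.get(Metadata.RESOURCE_NAME_KEY);
    // Parse everything except attachments whose name ends in ".joboptions".
    return resourceName == null || !resourceName.endsWith(".joboptions");
  }
}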
```
Register it on the ParseContext:
```
parseContext.set(DocumentSelector.class, new CustomDocumentSelector());
```
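The selector above compares the name case-sensitively, so an attachment named `PRINT.JOBOPTIONS` would still be parsed. As a rough sketch, the name check can be pulled into a small helper that ignores case and accepts several extensions. `EmbeddedNameFilter` and its `accept` method are hypothetical names made up for this illustration and are not part of Tika; the helper uses only the JDK.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper (not a Tika class): decides by file name whether to parse an embedded document. */
final class EmbeddedNameFilter {

  // Lower-cased extensions, without the leading dot, that should be skipped.
  private final Set<String> excluded = new HashSet<>();

  EmbeddedNameFilter(String... excludedExtensions) {
    for (String ext : excludedExtensions) {
      excluded.add(ext.toLowerCase(Locale.ROOT));
    }
  }

  /** Returns true if the embedded document should be parsed. */
  boolean accept(String resourceName) {
    if (resourceName == null) {
      return true; // no name available: parse it, matching the selector above
    }
    int dot = resourceName.lastIndexOf('.');
    if (dot < 0 || dot == resourceName.length() - 1) {
      return true; // no extension to compare against
    }
    String ext = resourceName.substring(dot + 1).toLowerCase(Locale.ROOT);
    return !excluded.contains(ext);
  }
}
```

Inside `select(Metadata metadata)` the helper could then be called as `filter.accept(metadata.get(Metadata.RESOURCE_NAME_KEY))`, with `filter` created once as `new EmbeddedNameFilter("joboptions")`.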

名媛妹妹

2021-01-23 01:56

@gagravarr, we changed that behavior in Tika 1.15 via TIKA-2096. The default is now "extract all embedded documents". To avoid parsing embedded documents, call:

parseContext.set(Parser.class, new EmptyParser())

Or subclass EmbeddedDocumentExtractor to do nothing and send that in via the ParseContext.

If you were using Solr DIH's TikaEntityProcessor, I'd set extractEmbedded to false, but you aren't; and please don't. :)

So, I don't think there's an easy way to turn off parsing of embedded documents only for PDFs, and I'm not sure you'd want to. What if there were an MSWord file attached to a PDF, for example?

If you want to ignore .joboptions, you could use a custom EmbeddedDocumentExtractor.
