Question
The Apache Tika Server provides a REST API to extract text from a document. It is also possible to set specific request header parameters such as X-Tika-PDFOcrStrategy, e.g.:
$ curl -T test/Dokument01.pdf http://localhost:9998/tika --header "X-Tika-PDFOcrStrategy: ocr_only"
Across various documents about Tika, I found these additional header parameters documented:
X-Tika-OCRLanguage: eng
X-Tika-PDFextractInlineImages: true | false
X-Tika-PDFOcrStrategy: ocr_only | ocr_and_text_extraction
X-Tika-OCRoutputType: hocr
But there seems to be no documentation on how to use the X-Tika-..... header parameters, or on which parameters are supported and which are not.
For example, I wonder whether it is possible to override the ImageType mode or the DPI with something like:
X-Tika-PDFocrImageType: rgb
X-Tika-PDFocrDPI: 100
My question is: which header parameters are supported, and which naming convention do these parameters follow?
Answer 1:
The code that handles the X-Tika-OCR and X-Tika-PDF headers is TikaResource.processHeaderConfig.
Those header suffixes and values are then mapped onto the TesseractOCRConfig and PDFParserConfig configuration objects via reflection.
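To make the reflection step a bit more concrete, here is a rough Python analog. It is not Tika's Java code: PdfConfigSketch and apply_headers are hypothetical names, and the rule of upper-casing the first letter of the suffix before prepending "set" is an assumption, inferred from the fact that the headers quoted in the question appear with both an upper-case and a lower-case first letter.

```python
class PdfConfigSketch:
    """Hypothetical stand-in for a config object that exposes setters."""

    def __init__(self):
        self.extract_inline_images = False
        self.ocr_strategy = None

    def setExtractInlineImages(self, value):
        # Header values arrive as strings, so convert explicitly.
        self.extract_inline_images = value.lower() == "true"

    def setOcrStrategy(self, value):
        self.ocr_strategy = value


def apply_headers(config, headers, prefix="X-Tika-PDF"):
    """Call config.set<Suffix>(value) for every header starting with prefix."""
    for name, value in headers.items():
        if not name.startswith(prefix):
            continue  # unrelated header, e.g. Accept
        suffix = name[len(prefix):]
        if not suffix:
            continue
        # Assumption: the first letter is upper-cased to form the setter name.
        setter_name = "set" + suffix[0].upper() + suffix[1:]
        setter = getattr(config, setter_name, None)  # lookup by name
        if setter is None:
            raise ValueError(f"no setter {setter_name} for header {name}")
        setter(value)
    return config


config = apply_headers(
    PdfConfigSketch(),
    {
        "X-Tika-PDFextractInlineImages": "true",
        "X-Tika-PDFOcrStrategy": "ocr_only",
        "Accept": "text/plain",
    },
)
print(config.extract_inline_images, config.ocr_strategy)
```

Under this sketch, a header whose suffix does not match any setter is rejected, which is why checking the config class for the exact option name matters.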
So, to see which X-Tika headers you can set, look up the options on the config class you want to tweak (Tesseract or PDF), build the header name from the option name, then set the header. If you are not sure what an option does or what values it takes, look at the JavaDocs for the underlying setter method that will be called.
For example, setExtractInlineImages on the PDF config maps to X-Tika-PDFextractInlineImages.
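The name-building step can be sketched as a small helper. This is only an illustration: tika_header_name is a hypothetical function, not part of Tika, and lower-casing the first letter of the option simply follows the setExtractInlineImages example (some headers in the question are written with an upper-case first letter instead).

```python
def tika_header_name(prefix: str, setter: str) -> str:
    """Build an X-Tika header name from a config setter name.

    prefix is "PDF" or "OCR"; setter is a method name such as
    "setExtractInlineImages".
    """
    if not setter.startswith("set") or len(setter) <= 3:
        raise ValueError(f"not a setter name: {setter!r}")
    option = setter[3:]  # drop the leading "set"
    # Lower-case the first letter, following the answer's example.
    return f"X-Tika-{prefix}{option[0].lower()}{option[1:]}"


print(tika_header_name("PDF", "setExtractInlineImages"))
# prints X-Tika-PDFextractInlineImages
```

Whether a given setter actually exists still has to be checked against the JavaDocs of the config class; the helper only formats the name.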
Source: https://stackoverflow.com/questions/62011038/apache-tika-server-request-header-parameters