Question
How do I configure Tika OCR with Solr 6.4.1? I indexed documents including PDFs, images, and MS Office documents, but Tika did not extract text from the images, nor from images embedded in the PDF and MS Office documents. From what I have read, Tika OCR is meant for this. I am installing tika-app-1.7.jar and Tesseract for that purpose, but I don't know how to configure them with my Solr core.
Answer 1:
You don't need to do anything special. Simply get the Tesseract OCR setup for your distro and install it on the system. Make sure your PATH variable has an entry for the Tesseract home directory, and that the TESSDATA_PREFIX variable is set and also points to the Tesseract home directory. Restart Solr and you're good to go. You should be able to see the OCR component at work when you push documents to the index through the /update/extract handler.
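To sanity-check the setup described above before restarting Solr, something like the following Python sketch may help. It only verifies that a tesseract executable is reachable through PATH and that TESSDATA_PREFIX points at an existing directory, then builds an /update/extract URL. The core name "mycore", the document id "doc1", the file name "scan.pdf", the base URL, and all function names are hypothetical placeholders, not part of the original answer.

```python
import os
import shutil
import urllib.parse
import urllib.request


def tesseract_env_problems(env):
    """Return a list of problems found in the given environment mapping."""
    problems = []
    # Tesseract's directory should be listed in PATH.
    if shutil.which("tesseract", path=env.get("PATH", "")) is None:
        problems.append("tesseract executable not found on PATH")
    # TESSDATA_PREFIX should be set and point at a real directory.
    prefix = env.get("TESSDATA_PREFIX")
    if not prefix or not os.path.isdir(prefix):
        problems.append("TESSDATA_PREFIX is unset or not a directory")
    return problems


def extract_url(base_url, core, doc_id):
    """Build the /update/extract URL for a core (core name is an assumption)."""
    query = urllib.parse.urlencode({"literal.id": doc_id, "commit": "true"})
    return f"{base_url}/{core}/update/extract?{query}"


def post_document(url, file_path):
    """POST a file's raw bytes to the extract handler and return the HTTP status."""
    with open(file_path, "rb") as handle:
        body = handle.read()
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    issues = tesseract_env_problems(os.environ)
    if issues:
        for issue in issues:
            print("Problem:", issue)
    else:
        # Hypothetical Solr location, core, and file; adjust to your install.
        url = extract_url("http://localhost:8983/solr", "mycore", "doc1")
        print("HTTP status:", post_document(url, "scan.pdf"))
```

Run it from the same account and shell environment that starts Solr, since Solr only sees the variables present in its own process environment.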
By default, Tesseract only ships with the English model. Get models for other languages from here.
Source: https://stackoverflow.com/questions/43017921/configure-tesseract-with-solr-6-4-1