Question
I need to transcribe a multi-page .tif image to text using pytesseract. I have the following code:
> from PIL import Image
> import pytesseract
> pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'
> print(pytesseract.image_to_string(Image.open('CAMARA.tif'), lang="spa"))
The problem is that it only extracts the first page. How can I extract all of them?
Answer 1:
I was able to fix the same problem by calling the convert() method, as below (imagePath is the path to the image file):
from PIL import Image
import pytesseract

image = Image.open(imagePath).convert("RGBA")
text = pytesseract.image_to_string(image)
print(text)
Answer 2:
I guess you have passed only one image, "camara.tif". First you have to convert all the PDF pages into images (you can see this link for doing so).
Then use pytesseract to loop over the images one by one and extract the text from each.
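For a multi-page TIFF like the one in the question, the page-by-page loop described here might be sketched as follows. This is only a sketch under assumptions: the helper names `ocr_all_pages` and `ocr_tiff_file` are made up for illustration, and it assumes Pillow exposes the pages of the TIFF as frames through `n_frames` and `seek()`.

```python
def ocr_all_pages(image, ocr, **ocr_kwargs):
    """Run `ocr` on every frame of a multi-frame image; return one text per page.

    `image` is assumed to behave like a Pillow image opened from a
    multi-page TIFF: it exposes `n_frames` and `seek(index)`.
    """
    texts = []
    # Single-frame images may lack n_frames, so fall back to one page.
    for index in range(getattr(image, "n_frames", 1)):
        image.seek(index)  # make page `index` the current frame
        texts.append(ocr(image, **ocr_kwargs))
    return texts


def ocr_tiff_file(path, lang="spa"):
    """Hypothetical convenience wrapper around Pillow and pytesseract."""
    # Imported here so the helper above stays usable without these packages.
    from PIL import Image
    import pytesseract

    with Image.open(path) as image:
        return ocr_all_pages(image, pytesseract.image_to_string, lang=lang)
```

With Pillow, pytesseract and the Spanish language data installed, `print("\n".join(ocr_tiff_file("CAMARA.tif")))` should print the text of every page in order.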
Answer 3:
I just stumbled over the same problem. What you could do is call tesseract directly:
# test.py
import subprocess
in_filename = 'file_0.tiff'
out_filename = 'out'
lang = 'spa'
subprocess.call(['tesseract', in_filename, '-l', lang, out_filename])
This would process all pages:
$ python test.py
Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
Page 1
Page 2
Page 3
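If you also want the recognized text back in Python, a variant of the same idea might look like the sketch below. The function names are hypothetical, and it assumes that the tesseract executable is on the PATH and that it writes its result to the output base name plus ".txt" (here out.txt).

```python
import subprocess
from pathlib import Path


def build_tesseract_command(in_filename, out_base, lang):
    """Return the argument list for: tesseract <image> <outputbase> -l <lang>."""
    return ["tesseract", str(in_filename), str(out_base), "-l", lang]


def ocr_with_cli(in_filename, out_base="out", lang="spa"):
    """Run tesseract on the file and return the text it wrote to <out_base>.txt."""
    command = build_tesseract_command(in_filename, out_base, lang)
    # check=True raises CalledProcessError if tesseract exits with an error.
    subprocess.run(command, check=True)
    return Path(f"{out_base}.txt").read_text(encoding="utf-8")
```

Calling `ocr_with_cli('file_0.tiff')` would then return the text of all pages as a single string, since tesseract itself walks through the pages of the TIFF.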
Source: https://stackoverflow.com/questions/45292287/pytesseract-and-image-tif-file