Question
I'm learning OCR using PyTesser and Tesseract. As a first milestone, I want to write a tool that recognizes captchas consisting only of digits. I read some tutorials and wrote the following test program.
from pytesser.pytesser import *
from PIL import Image, ImageFilter, ImageEnhance
im = Image.open("test.tiff")
im = im.filter(ImageFilter.MedianFilter())
enhancer = ImageEnhance.Contrast(im)
im = enhancer.enhance(2)
im = im.convert('1')
text = image_to_string(im)
print "text={}".format(text)
I tested my code with the image below, but the result is 2(T?770. I've also tested some other similar images, and in about 80% of cases the results are incorrect.
I'm not familiar with image processing. I have two questions:
Is it possible to tell PyTesser to guess digits only?
I think the image is quite easy for a human to read. If it is this difficult for PyTesser to read a digits-only image, are there any alternatives that can do better OCR?
Any hints are much appreciated.
Answer 1:
I think your code is quite okay. It can recognize 207770. The problem lies in the pytesser installation: the Tesseract bundled with pytesser is out of date. You should download the most recent version and overwrite the corresponding files. You should also edit pytesser.py and change
tesseract_exe_name = 'tesseract'
to
import os.path
tesseract_exe_name = os.path.join(os.path.dirname(__file__), 'tesseract')
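On the digits-only part of the question, which the answer above does not cover: the Tesseract command line accepts configuration variables through -c, and tessedit_char_whitelist restricts the characters it may output. The sketch below is only an illustration under assumptions. The helper names build_tesseract_args and keep_digits are hypothetical and not part of pytesser, and whether the whitelist is honoured depends on the installed Tesseract version and engine, so a post-filter is included as a fallback.

```python
import re


def build_tesseract_args(image_path, output_base, exe="tesseract"):
    """Build a Tesseract command line that asks for digits only.

    Hypothetical helper: pytesser itself does not expose this. The list can
    be passed to subprocess.call() once Tesseract is installed.
    """
    return [
        exe,
        image_path,
        output_base,  # Tesseract writes its result to output_base + ".txt"
        "-c", "tessedit_char_whitelist=0123456789",
    ]


def keep_digits(text):
    """Drop every non-digit character from the OCR output.

    This only removes stray symbols; it cannot recover a digit that was
    misread as something else.
    """
    return re.sub(r"[^0-9]", "", text)
```

For example, keep_digits("2(T?770") returns "2770", which shows why filtering alone is not enough here: the misread characters are lost rather than corrected, so restricting the character set inside Tesseract (or upgrading it, as suggested above) is the more useful step.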
Source: https://stackoverflow.com/questions/24247813/recognize-simple-digits-with-pytesser