Forcing Tesseract to give some answer

Posted by 隐身守侯 on 2020-01-15 11:59:08

Question


I am trying to recognize one line of handwritten digits. Currently I do some preprocessing with Python and OpenCV, split the image into connected components, and feed each component to Tesseract with page segmentation mode (PSM) 10 ("treat the image as a single character") and the character whitelist restricted to "0123456789". I expect Tesseract to return garbage where my connected component segmentation fails and exactly one digit where it succeeds. Instead, Tesseract often returns nothing at all.
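To make the segmentation step concrete, here is a minimal, dependency-free sketch of splitting a binarized image into connected components and ordering them left to right. It is not my actual OpenCV code: the function name component_boxes and the toy input are made up for illustration, and it assumes 4-connectivity on an image given as a list of rows.

```python
from collections import deque


def component_boxes(binary):
    """Return bounding boxes (x0, y0, x1, y1) of 4-connected foreground blobs.

    `binary` is a list of rows; truthy cells are ink. Boxes are inclusive
    and sorted left to right, the order in which the digits would be read.
    """
    height = len(binary)
    width = len(binary[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not binary[y][x] or seen[y][x]:
                continue
            # Flood-fill one blob, tracking its extent as we go.
            queue = deque([(x, y)])
            seen[y][x] = True
            x0 = x1 = x
            y0 = y1 = y
            while queue:
                cx, cy = queue.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                neighbours = ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1))
                for nx, ny in neighbours:
                    inside = 0 <= nx < width and 0 <= ny < height
                    if inside and binary[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((x0, y0, x1, y1))
    return sorted(boxes)


# Two separate strokes on a tiny 3x5 "image" (hypothetical data).
toy = [
    [1, 0, 0, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 1, 1],
]
boxes = component_boxes(toy)
```

Each box would then be cropped out of the original image and passed to Tesseract as its own single-character image.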

I have tried both pytesseract and python-tesseract as Tesseract interfaces for Python. Pytesseract works by locating the tesseract.exe executable, running it from the shell with suitable parameters, and collecting the output. This is how I found out about my problem. After that, I tried python-tesseract, which exposes the full-blown Tesseract API. Naturally, the result was the same.
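A rough sketch of the shell-out approach, in the same spirit as pytesseract but not its actual code. The helper names tesseract_command and run_tesseract are made up here. It assumes a Tesseract 3.x executable on the PATH (newer releases spell the flag --psm) and a config file named digits-only that contains the line tessedit_char_whitelist 0123456789.

```python
import subprocess


def tesseract_command(image_path, output_base, psm=10, lang="eng", config="digits-only"):
    """Build the argument list for a Tesseract 3.x command-line run."""
    return ["tesseract", image_path, output_base, "-l", lang, "-psm", str(psm), config]


def run_tesseract(image_path, output_base):
    """Run Tesseract and return the raw text it wrote to <output_base>.txt."""
    subprocess.check_call(tesseract_command(image_path, output_base))
    with open(output_base + ".txt") as handle:
        return handle.read()


# Only builds the command; calling run_tesseract needs Tesseract installed.
command = tesseract_command("image.png", "image")
```

Keeping the command construction separate from the call makes it easy to print the exact command and rerun it by hand when the wrapper and the command line disagree.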

Below is a sample of 5 images I fed into Tesseract separately (I've also uploaded the same images as separate files here):

I get 1,*,4,*,* on these images, * meaning that Tesseract returned only whitespace.
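For clarity, this is roughly how the raw outputs map to that notation. The raw strings below are hypothetical stand-ins, not Tesseract's exact output, and summarize is just an illustrative helper.

```python
def summarize(outputs):
    """Join per-image OCR results, showing whitespace-only results as '*'."""
    return ",".join(text.strip() or "*" for text in outputs)


# Hypothetical raw strings standing in for Tesseract's output on five images.
raw = ["1\n\n", "\n\n", "4\n\n", " \n", ""]
summary = summarize(raw)
```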

With other page segmentation modes, I get the following:

PSM_SINGLE_CHAR: 1*4**
PSM_SINGLE_BLOCK_VERT_TEXT: **43*
PSM_CIRCLE_WORD: 11***
PSM_SINGLE_LINE: 11491
PSM_AUTO: *****
PSM_SPARSE_TEXT: *****
PSM_SINGLE_WORD: 11499
PSM_AUTO_ONLY: *****
PSM_SINGLE_COLUMN: *****
PSM_SPARSE_TEXT_OS: *****
PSM_SINGLE_BLOCK: 11499
PSM_OSD_ONLY: *****
PSM_AUTO_OSD: *****
PSM_COUNT: 11499

Weirdly, when I run tesseract image.png image -l eng -psm 10 digits-only from the command line against these images, it returns *,*,4,9,*. (digits-only is a config file containing tessedit_char_whitelist 0123456789.)

How do I force Tesseract to give me some answer instead of nothing at all?

Source: https://stackoverflow.com/questions/27321553/forcing-tesseract-to-give-some-answer
