Why does tesseract fail to read text off this simple image?

Submitted by 删除回忆录丶 on 2021-01-24 03:08:51

Question


I have read mountains of posts on pytesseract, but I cannot get it to read the text in a dead simple image; it returns an empty string.

Here is the image:

I have tried scaling it, grayscaling it, adjusting the contrast, thresholding, and blurring, everything the other posts suggest, but my problem is that I don't know what the OCR engine needs in order to work better. Does it want blurry text? High contrast?

Code to try:

import pytesseract
from PIL import Image

print(pytesseract.image_to_string(Image.open(IMAGE_FILE)))  # IMAGE_FILE is a placeholder for the local image path

As the code shows, the image is stored locally on my computer, hence Image.open().


Answer 1:


Trying something along the lines of

import pytesseract 
from PIL import Image 
import requests 
import io

response = requests.get('https://i.stack.imgur.com/J2ojU.png') 
img = Image.open(io.BytesIO(response.content))
text = pytesseract.image_to_string(img, lang='eng', config='--psm 7')

print(text)

With --psm values of 6 or higher, this yielded "Gm" for me.

If the image is stored locally (in your working directory), drop the response and img variables and define text with these lines instead:

image_name = "J2ojU.png" # or whatever name is appropriate
text = pytesseract.image_to_string(Image.open(image_name), lang='eng', config='--psm 7')
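
Since the result depends on the page segmentation mode, it can help to compare several modes side by side. The following is a minimal sketch, not part of the original answer: `build_psm_config` and `try_psm_modes` are hypothetical helper names, the default mode list is just one reasonable choice, and the `ocr` parameter exists only so the loop can be exercised without Tesseract installed. It assumes pytesseract with Tesseract 4 or later, whose modes run from 0 to 13.

```python
def build_psm_config(psm):
    """Return the Tesseract config string for one page segmentation mode."""
    if not 0 <= psm <= 13:
        raise ValueError("page segmentation modes range from 0 to 13")
    return f"--psm {psm}"


def try_psm_modes(image, modes=(6, 7, 8, 10, 13), ocr=None):
    """Run OCR once per mode and return {mode: recognized text}.

    `ocr` is a hypothetical hook: any callable with the same keyword
    arguments as pytesseract.image_to_string can be passed in.
    """
    if ocr is None:
        # Imported lazily so the helper can be loaded without pytesseract.
        import pytesseract
        ocr = pytesseract.image_to_string

    results = {}
    for psm in modes:
        text = ocr(image, lang="eng", config=build_psm_config(psm))
        results[psm] = text.strip()  # drop trailing newlines and form feeds
    return results
```

For the local file above, `try_psm_modes(Image.open(image_name))` would return a dictionary mapping each mode to the text it produced, which makes it easy to see which modes come back empty.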



Answer 2:


There are several reasons:

  1. The edges are not sharp and continuous (by sharp I mean smooth, without jagged teeth).

  2. The image is too small; you need to resize it.

  3. The font is missing (not mandatory, but a trained font greatly improves the chance of recognition).

Based on points 1) and 2), I was able to recognize the text.

I thresholded the image, resized it 3x (point 2), and blurred it to make the edges smooth (point 1).

import cv2
import numpy as np
import pytesseract
import urllib.request  # needed for urllib.request.urlopen below

# Path to the Tesseract executable on this Windows machine; adjust for your install
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'

def url_to_image(url):
    resp = urllib.request.urlopen(url)
    image = np.asarray(bytearray(resp.read()), dtype="uint8")
    image = cv2.imdecode(image, cv2.IMREAD_COLOR)
    return image

url = 'https://i.stack.imgur.com/J2ojU.png'

img = url_to_image(url)



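# Binarize: values above 200 become 255, everything else 0
retval, img = cv2.threshold(img, 200, 255, cv2.THRESH_BINARY)
# Upscale 3x so the glyphs are large enough for Tesseract
img = cv2.resize(img, (0, 0), fx=3, fy=3)
# Blur to smooth the jagged edges left by thresholding and resizing
img = cv2.GaussianBlur(img, (11, 11), 0)
img = cv2.medianBlur(img, 9)
# Show the preprocessed image; press any key to continue
cv2.imshow('asd', img)
cv2.waitKey(0)
cv2.destroyAllWindows()
txt = pytesseract.image_to_string(img)
print('recognition:', txt)
# Output: recognition: Gm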

Note:

This script is handy for testing any image on the web.

Note 2:

All of the processing is tuned to the image you posted.

Note 3:

Text recognition is not easy, and every image needs its own processing. If you try these steps on a different image, they may not work at all. It is important to run recognition on many images so that you learn what tesseract wants.
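
One way to try many variations systematically is to sweep the preprocessing parameters and keep the combinations that yield any text. This is a rough sketch under assumptions: `preprocess` and `sweep` are hypothetical names, the parameter grids are arbitrary starting points rather than recommended values, and the `prep` and `ocr` arguments are only there so other steps or a stub can be swapped in. It reuses the same threshold, resize, and Gaussian blur calls as the script above.

```python
from itertools import product


def preprocess(img, thresh, scale, blur):
    """Threshold, upscale, then optionally blur an OpenCV image."""
    import cv2  # imported lazily; requires opencv-python

    _, out = cv2.threshold(img, thresh, 255, cv2.THRESH_BINARY)
    out = cv2.resize(out, (0, 0), fx=scale, fy=scale)
    if blur > 1:
        # The Gaussian kernel size must be odd, e.g. 5 or 11.
        out = cv2.GaussianBlur(out, (blur, blur), 0)
    return out


def sweep(img, thresholds=(150, 200), scales=(2, 3), blurs=(1, 5, 11),
          prep=preprocess, ocr=None):
    """Return [((thresh, scale, blur), text), ...] for non-empty results."""
    if ocr is None:
        import pytesseract  # imported lazily; requires Tesseract installed
        ocr = pytesseract.image_to_string

    hits = []
    for thresh, scale, blur in product(thresholds, scales, blurs):
        text = ocr(prep(img, thresh, scale, blur)).strip()
        if text:
            hits.append(((thresh, scale, blur), text))
    return hits
```

Calling `sweep(url_to_image(url))` would return a list of `((threshold, scale, blur), text)` pairs. An empty list means none of the combinations produced text, which suggests trying a different kind of preprocessing.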



Source: https://stackoverflow.com/questions/54261255/why-does-tesseract-fail-to-read-text-off-this-simple-image
