I'm trying to read different cropped images from a big file. I manage to read most of them, but some of them return an empty string when I try to read the
Thresholding the image before passing it to pytesseract
increases the accuracy.
import cv2
import numpy as np
import pytesseract
from PIL import Image

# Load the image and convert it to grayscale
img = Image.open('num.png').convert('L')
# Binarize: pixels above 125 become 255, the rest become 0
_, img = cv2.threshold(np.array(img), 125, 255, cv2.THRESH_BINARY)
# Older versions of pytesseract need a Pillow image,
# so convert back if needed
img = Image.fromarray(img.astype(np.uint8))
print(pytesseract.image_to_string(img))
This printed out
5.78 / C02
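To make it clearer what the thresholding step does to the pixel values, here is a minimal NumPy-only sketch. The helper name binary_threshold and the sample array are made up for illustration. It is meant to mirror the cv2.THRESH_BINARY rule used above (pixels strictly above the threshold become the maximum value, everything else becomes 0), not to replace the OpenCV call.

```python
import numpy as np


def binary_threshold(gray, thresh=125, maxval=255):
    """Hypothetical helper: pixels strictly above thresh become maxval, the rest 0."""
    gray = np.asarray(gray)
    return np.where(gray > thresh, maxval, 0).astype(np.uint8)


# Tiny made-up grayscale patch, only to show the effect on individual pixels
sample = np.array([[10, 125, 126],
                   [200, 90, 255]], dtype=np.uint8)
result = binary_threshold(sample)
print(result)
```

A pixel exactly at the threshold (125 here) ends up black, so the threshold value may need some tuning for faint text.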
Edit:
Thresholding alone on the second image returns 11.1. Another step that can help is to set the page segmentation mode to "Treat the image as a single text line." with the config --psm 7. Doing this on the second image returns 11.1 "202 ' with the quotation marks coming from the partial text at the top. To ignore those, you can also restrict which characters Tesseract searches for with a whitelist, using the config -c tessedit_char_whitelist=0123456789.% (digits, the period, and the percent sign). Everything together:
pytesseract.image_to_string(img, config='--psm 7 -c tessedit_char_whitelist=0123456789.%')
This returns 11.1 202. Clearly pytesseract is having a hard time with that percent symbol, and I'm not sure how to improve that with image processing or config changes.
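If you end up trying several page segmentation modes or whitelists, it may help to build the config string in one place. The sketch below is only a convenience wrapper. The names build_tesseract_config and read_single_line are hypothetical, and the defaults are simply the values used in this answer.

```python
def build_tesseract_config(psm=7, whitelist="0123456789.%"):
    """Hypothetical helper: assemble the config string for image_to_string."""
    parts = [f"--psm {psm}"]
    if whitelist:
        # Restrict recognition to the given characters
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


def read_single_line(img, psm=7, whitelist="0123456789.%"):
    """Hypothetical wrapper: OCR an already thresholded image with the config above."""
    # Imported here so the config helper can be used without pytesseract installed
    import pytesseract
    config = build_tesseract_config(psm, whitelist)
    return pytesseract.image_to_string(img, config=config).strip()


print(build_tesseract_config())
```

With the defaults this produces the same config string as the call above, and read_single_line(img) can then be applied to each thresholded crop.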