I am using tesseract
for OCR, via the pytesseract
bindings. Unfortunately, I encounter difficulties when trying to extract text including subscript-sty
You want to do apply pre-processing to your image before feeding it into tesseract
to increase the accuracy of the OCR. I use a combination of PIL
and cv2
to do this here because cv2
has good filters for blur/noise removal (dilation, erosion, threshold) and PIL
makes it easy to enhance the contrast (distinguish the text from the background) and I wanted to show how pre-processing could be done using either... (use of both together is not 100% necessary though, as shown below). You can write this more elegantly- it's just the general idea.
import cv2
import pytesseract
import numpy as np
from PIL import Image, ImageEnhance
img = cv2.imread('test.jpg')
def cv2_preprocess(image_path):
img = cv2.imread(image_path)
# convert to black and white if not already
img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# remove noise
kernel = np.ones((1, 1), np.uint8)
img = cv2.dilate(img, kernel, iterations=1)
img = cv2.erode(img, kernel, iterations=1)
# apply a blur
# gaussian noise
img = cv2.threshold(cv2.GaussianBlur(img, (9, 9), 0), 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
# this can be used for salt and pepper noise (not necessary here)
#img = cv2.adaptiveThreshold(cv2.medianBlur(img, 7), 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 2)
cv2.imwrite('new.jpg', img)
return 'new.jpg'
def pil_enhance(image_path):
image = Image.open(image_path)
contrast = ImageEnhance.Contrast(image)
contrast.enhance(2).save('new2.jpg')
return 'new2.jpg'
img = cv2.imread(pil_enhance(cv2_preprocess('test.jpg')))
text = pytesseract.image_to_string(img)
print(text)
Output:
CH3
The cv2
pre-process produces an image that looks like this:
The enhancement with PIL
gives you:
In this specific example, you can actually stop after the cv2_preprocess
step because that is clear enough for the reader:
img = cv2.imread(cv2_preprocess('test.jpg'))
text = pytesseract.image_to_string(img)
print(text)
output:
CH3
But if you are working with things that don't necessarily start with a white background (i.e. grey scaling converts to light grey instead of white)- I have found the PIL
step really helps there.
Main point is the methods to increase accuracy of the tesseract
typically are:
Doing one of these or all three of them will help... but the brightness/noise can be more generalizable than the other two (at least from my experience).