How to detect subscript numbers in an image using OCR?


I am using tesseract for OCR, via the pytesseract bindings. Unfortunately, I encounter difficulties when trying to extract text including subscript-sty

3 Answers

逝去的感伤

2021-02-14 12:32
You want to apply pre-processing to your image before feeding it into tesseract to increase the accuracy of the OCR. I use a combination of PIL and cv2 here: cv2 has good filters for blur and noise removal (dilation, erosion, thresholding), and PIL makes it easy to enhance the contrast (to distinguish the text from the background). I wanted to show how pre-processing could be done with either library; using both together is not strictly necessary, as shown below. You can write this more elegantly; it's just the general idea.
```
import cv2
import pytesseract
import numpy as np
from PIL import Image, ImageEnhance


def cv2_preprocess(image_path):
  img = cv2.imread(image_path)

  # convert to grayscale if not already
  img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

  # remove noise (a 1x1 kernel is effectively a no-op; try a larger kernel for noisy scans)
  kernel = np.ones((1, 1), np.uint8)
  img = cv2.dilate(img, kernel, iterations=1)
  img = cv2.erode(img, kernel, iterations=1)

  # blur to suppress Gaussian noise, then binarize with Otsu's threshold
  img = cv2.threshold(cv2.GaussianBlur(img, (9, 9), 0), 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

  # this can be used for salt and pepper noise (not necessary here)
  #img = cv2.adaptiveThreshold(cv2.medianBlur(img, 7), 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 2)

  cv2.imwrite('new.jpg', img)
  return 'new.jpg'

def pil_enhance(image_path):
  image = Image.open(image_path)
  contrast = ImageEnhance.Contrast(image)
  contrast.enhance(2).save('new2.jpg')
  return 'new2.jpg'


img = cv2.imread(pil_enhance(cv2_preprocess('test.jpg')))


text = pytesseract.image_to_string(img)
print(text)
```
Output:
```
CH3
```
The cv2 pre-processing step produces an image that looks like this:

[image: output of cv2_preprocess]

The enhancement with PIL gives you:

[image: output of pil_enhance]

In this specific example, you can actually stop after the cv2_preprocess step, because its output is already clear enough for tesseract to read:
```
img = cv2.imread(cv2_preprocess('test.jpg'))
text = pytesseract.image_to_string(img)
print(text)
```
Output:
```
CH3
```
But if you are working with images that don't start with a white background (i.e. the grayscale conversion yields light grey instead of white), I have found that the PIL step really helps.
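To make the light-grey-background case concrete, here is a minimal sketch of a linear contrast stretch, which maps the darkest pixel to 0 and the lightest to 255 so that a light grey background becomes white. The names `stretch_level` and `whiten_background` are hypothetical and not part of the answer above, and `whiten_background` assumes it is given a PIL `Image`. This is a different operation from the `ImageEnhance.Contrast` step, shown only as one possible way to get a similar effect.

```python
def stretch_level(p, lo, hi):
    """Map grey level p from the range [lo, hi] onto the full range [0, 255]."""
    if hi <= lo:
        return p  # flat image: nothing to stretch
    p = min(max(p, lo), hi)  # clamp levels that fall outside the observed range
    return round((p - lo) * 255 / (hi - lo))


def whiten_background(image):
    """Return a greyscale copy of a PIL image with its contrast stretched.

    Assumption: `image` is a PIL.Image.Image instance.
    """
    gray = image.convert("L")
    lo, hi = gray.getextrema()  # darkest and lightest grey level present
    return gray.point(lambda p: stretch_level(p, lo, hi))


# Dark text at level 50 on a light grey background at level 200:
print(stretch_level(50, 50, 200), stretch_level(200, 50, 200))  # 0 255
```

A stretch like this only helps when the background is the lightest thing in the image; a single bright speck would pin the top of the range, so some clipping of the extremes may be needed in practice.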

The main point is that the usual ways to increase tesseract's accuracy are:
1. fix the DPI (rescaling)
2. fix the brightness/noise of the image
3. fix the text size/lines (deskewing/dewarping the text)
Doing one or all three of these will help, but in my experience fixing brightness/noise generalizes better than the other two.
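As an illustration of the third item, skew can be estimated from the positions of the characters themselves. The sketch below uses a hypothetical helper that is not from the answer: it fits a least-squares line through (x, y) points sampled along a text baseline, for example the bottom-centre of each full-height character's bounding box, and returns the angle of that line. Image coordinates are assumed, with y growing downward, so a positive angle means the text slopes down toward the right.

```python
import math


def estimate_skew_degrees(points):
    """Angle in degrees of the least-squares line through (x, y) baseline points."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Sums of squares and cross-products around the mean give the fitted slope sxy / sxx.
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if sxx == 0:
        raise ValueError("points must not all share the same x coordinate")
    return math.degrees(math.atan2(sxy, sxx))


# A baseline that drops 10 pixels over 100 pixels of width (slope 0.1).
print(round(estimate_skew_degrees([(0, 100), (50, 105), (100, 110)]), 2))  # 5.71
```

The image can then be rotated by the estimated angle, for instance with `cv2.warpAffine`; the sign of the rotation depends on the library's convention, so it is worth checking on a test image first.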

走了就别回头了

2021-02-14 12:38

I think this approach is better suited to the general case.

```
import cv2
import pytesseract
from pathlib import Path

image = cv2.imread('test.jpg')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
# Otsu threshold, inverted so the text is white (suitable for sharper black and white pictures)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
contours = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
# findContours returns 2 values in OpenCV 2.4 and 4.x, but 3 values in OpenCV 3.x
contours = contours[0] if len(contours) == 2 else contours[1]
result_list = []
for c in contours:
    x, y, w, h = cv2.boundingRect(c)
    area = cv2.contourArea(c)
    if area > 200:  # skip small specks
        detect_area = image[y:y + h, x:x + w]
        # detect_area = cv2.GaussianBlur(detect_area, (3, 3), 0)
        # --psm 10 treats the crop as a single character;
        # --oem 0 selects the legacy engine, which needs traineddata that includes the legacy model
        predict_char = pytesseract.image_to_string(detect_area, lang='eng', config='--oem 0 --psm 10')
        # strip the trailing whitespace that image_to_string appends
        result_list.append((x, predict_char.strip()))
        cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), thickness=2)

# join the characters in left-to-right order
result = ''.join([char for _, char in sorted(result_list, key=lambda _x: _x[0])])
print(result)  # CH3


output_dir = Path('./temp')
output_dir.mkdir(parents=True, exist_ok=True)
cv2.imwrite(str(output_dir / 'image.png'), image)
cv2.imwrite(str(output_dir / 'clean.png'), thresh)
```
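Because this approach yields one bounding box per character, the boxes can also indicate which characters are subscripts, something the plain recognized string does not preserve. The following is a minimal sketch under stated assumptions, not part of the answer: `mark_subscripts` and its `size_ratio` and `drop_ratio` thresholds are hypothetical, the input is a list of `(x, y, w, h, char)` tuples like those a per-contour loop could collect, and a character counts as a subscript when it is clearly shorter than the tallest character and its bottom edge sits clearly below the baseline of the full-height characters.

```python
def mark_subscripts(boxes, size_ratio=0.75, drop_ratio=0.1):
    """Join characters left to right, prefixing subscripts with '_'.

    boxes: iterable of (x, y, w, h, char), with y growing downward (image coordinates).
    size_ratio and drop_ratio are assumed thresholds to tune for your images.
    """
    boxes = sorted(boxes, key=lambda b: b[0])
    if not boxes:
        return ""
    ref_height = max(h for _, _, _, h, _ in boxes)
    # Baseline estimate: median bottom edge of the full-height characters.
    bottoms = sorted(y + h for _, y, _, h, _ in boxes if h >= size_ratio * ref_height)
    baseline = bottoms[len(bottoms) // 2]
    parts = []
    for _, y, _, h, char in boxes:
        is_small = h < size_ratio * ref_height
        is_low = (y + h) > baseline + drop_ratio * ref_height
        parts.append("_" + char if is_small and is_low else char)
    return "".join(parts)


# Made-up boxes for "CH3" with a small, lowered "3"; input order does not matter.
boxes = [(70, 35, 18, 25, "3"), (0, 10, 30, 40, "C"), (35, 10, 30, 40, "H")]
print(mark_subscripts(boxes))  # CH_3
```

A small character that sits high (a superscript) is left unmarked by this rule, since its bottom edge stays above the baseline.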

MORE REFERENCES

I strongly suggest you look at the following examples, which are useful references for OCR.

Get the location of all text present in image using opencv
Using YOLO or other image recognition techniques to identify all alphanumeric text present in images


误落风尘

2021-02-14 12:45
This is because the subscript font is too small. You can resize the image with a Python package such as cv2 or PIL and run OCR on the resized image, as in the code below.
```
import pytesseract
import cv2

img = cv2.imread('test.jpg')
img = cv2.resize(img, None, fx=2, fy=2)  # scaling factor = 2

data = pytesseract.image_to_string(img)
print(data)
```
OUTPUT:
```
CH3
```
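Rather than hard-coding the factor 2, the scaling factor can be derived from how small the subscript actually is. A minimal sketch, assuming you can measure the pixel height of the smallest character (for instance from a bounding box): `upscale_factor` is a hypothetical helper, and the 20-pixel target and the cap of 4 are assumed defaults to tune, not values taken from the answer.

```python
import math


def upscale_factor(smallest_glyph_px, target_px=20, max_factor=4):
    """Smallest integer factor that makes the smallest glyph at least target_px tall."""
    if smallest_glyph_px <= 0:
        raise ValueError("glyph height must be positive")
    factor = math.ceil(target_px / smallest_glyph_px)
    # Never shrink the image, and cap the factor to keep the image size manageable.
    return max(1, min(max_factor, factor))


print(upscale_factor(12))  # 2
```

The result can be passed as both `fx` and `fy` to `cv2.resize`, in place of the fixed value 2 used above.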