How to detect subscript numbers in an image using OCR?

抹茶落季 2021-02-14 11:46

I am using tesseract for OCR, via the pytesseract bindings. Unfortunately, I encounter difficulties when trying to extract text including subscript-sty

3 Answers
  • 2021-02-14 12:32

    Apply pre-processing to your image before feeding it into tesseract to increase OCR accuracy. I use a combination of PIL and cv2 here: cv2 has good filters for blur and noise removal (dilation, erosion, thresholding), and PIL makes it easy to enhance the contrast (to distinguish the text from the background). I also wanted to show how pre-processing can be done with either library; using both together is not strictly necessary, as shown below. This could be written more elegantly; it is just the general idea.

    import cv2
    import pytesseract
    import numpy as np
    from PIL import Image, ImageEnhance
    
    
    def cv2_preprocess(image_path):
      img = cv2.imread(image_path)
    
      # convert to black and white if not already
      img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    
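      # remove noise (note: a 1x1 kernel leaves the image unchanged;
      # use a larger kernel, e.g. 2x2 or 3x3, for an actual effect)
      kernel = np.ones((1, 1), np.uint8)
      img = cv2.dilate(img, kernel, iterations=1)
      img = cv2.erode(img, kernel, iterations=1)
    
      # blur to suppress Gaussian noise, then binarize with Otsu's threshold
      img = cv2.threshold(cv2.GaussianBlur(img, (9, 9), 0), 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]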
    
      # this can be used for salt and pepper noise (not necessary here)
      #img = cv2.adaptiveThreshold(cv2.medianBlur(img, 7), 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 2)
    
      cv2.imwrite('new.jpg', img)
      return 'new.jpg'
    
    def pil_enhance(image_path):
      image = Image.open(image_path)
      contrast = ImageEnhance.Contrast(image)
      contrast.enhance(2).save('new2.jpg')
      return 'new2.jpg'
    
    
    img = cv2.imread(pil_enhance(cv2_preprocess('test.jpg')))
    
    
    text = pytesseract.image_to_string(img)
    print(text)
    

    Output:

    CH3
    

    The cv2 pre-processing step produces a binarized (black and white) image, and the PIL enhancement then increases its contrast. (The original images are not included here.)

    In this specific example, you can actually stop after the cv2_preprocess step, because its output is already clear enough for tesseract to read:

    img = cv2.imread(cv2_preprocess('test.jpg'))
    text = pytesseract.image_to_string(img)
    print(text)
    

    output:

    CH3
    

    But if you are working with images that don't start with a white background (e.g. grayscale conversion yields light grey instead of white), I have found that the PIL step really helps.
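
    To see why a contrast factor helps with a light-grey background, here is a rough pure-Python sketch of the idea: each pixel is pushed away from the image's mean brightness and clipped to 0..255. As far as I understand, this approximates what ImageEnhance.Contrast does internally; the function name `stretch_contrast` and the sample pixel values are made up for illustration.

```python
def stretch_contrast(pixels, factor):
    """Push each grayscale value away from the image mean, clipping to 0..255."""
    mean = sum(pixels) / len(pixels)
    return [min(255, max(0, round(mean + factor * (p - mean)))) for p in pixels]


# Hypothetical 4-pixel image: light-grey background (200) and one dark text pixel (40)
pixels = [200, 200, 200, 40]
print(stretch_contrast(pixels, 2))  # [240, 240, 240, 0]
```

    With a factor of 2 the background moves closer to white and the text goes to pure black, which is the separation tesseract benefits from.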

    The main point is that the typical ways to increase tesseract's accuracy are:

    1. fix the DPI (rescaling)
    2. fix the brightness/noise of the image
    3. fix the text size/lines (deskewing/dewarping the text)

    Doing one or all three of these will help, but in my experience fixing brightness/noise generalizes better than the other two.
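
    The first item (rescaling to fix the DPI) can be sketched roughly as follows. The 300 DPI target is an assumption based on commonly cited guidance, not something measured here, and `target_size` / `rescale_to_dpi` are hypothetical helper names. You also need to know (or guess) the DPI of the source image yourself.

```python
def target_size(width, height, current_dpi, target_dpi=300):
    """Return the (width, height) an image needs to reach target_dpi."""
    scale = target_dpi / current_dpi
    return round(width * scale), round(height * scale)


def rescale_to_dpi(image_path, current_dpi, target_dpi=300):
    """Load an image and resize it so its effective DPI matches target_dpi."""
    import cv2  # imported lazily so target_size works without OpenCV installed

    img = cv2.imread(image_path)
    if img is None:
        raise FileNotFoundError(image_path)
    height, width = img.shape[:2]
    new_size = target_size(width, height, current_dpi, target_dpi)
    # Cubic interpolation for enlarging, area averaging for shrinking
    interpolation = cv2.INTER_CUBIC if target_dpi > current_dpi else cv2.INTER_AREA
    return cv2.resize(img, new_size, interpolation=interpolation)


print(target_size(640, 480, 96))  # (2000, 1500)
```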

  • 2021-02-14 12:38

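    I think this approach is better suited to the general case: find each character's contour, run OCR on each character separately, and join the results from left to right.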

    import cv2
    import pytesseract
    from pathlib import Path
    
    image = cv2.imread('test.jpg')
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]  # suitable for sharp black-and-white pictures
    contours = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    contours = contours[0] if len(contours) == 2 else contours[1]  # 2-tuple in OpenCV 2.4/4.x, 3-tuple in OpenCV 3.x
    result_list = []
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        area = cv2.contourArea(c)
        if area > 200:
            detect_area = image[y:y + h, x:x + w]
            # detect_area = cv2.GaussianBlur(detect_area, (3, 3), 0)
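            # --psm 10 treats the crop as a single character; --oem 0 selects the legacy engine,
            # which needs legacy-capable traineddata to be installed
            predict_char = pytesseract.image_to_string(detect_area, lang='eng', config='--oem 0 --psm 10')
            # strip the trailing whitespace tesseract appends so the joined result stays on one line
            result_list.append((x, predict_char.strip()))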
            cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), thickness=2)
    
    result = ''.join([char for _, char in sorted(result_list, key=lambda _x: _x[0])])
    print(result)  # CH3
    
    
    output_dir = Path('./temp')
    output_dir.mkdir(parents=True, exist_ok=True)
    cv2.imwrite(str(output_dir / 'image.png'), image)
    cv2.imwrite(str(output_dir / 'clean.png'), thresh)
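
    Since the loop above already has a bounding box for every character, one possible extension is to flag which characters are subscripts from geometry alone. The sketch below is only a heuristic under stated assumptions: a character counts as a subscript if it is clearly shorter than the tallest character and its bottom edge sits below the baseline of the full-size characters. The function name `mark_subscripts`, the two ratio thresholds, and the sample boxes are all hypothetical, and letters with descenders (g, p, y) may need extra handling.

```python
from statistics import median


def mark_subscripts(boxes, height_ratio=0.8, drop_ratio=0.1):
    """boxes: iterable of (x, y, w, h, char), with y measured from the image top.

    Returns the characters ordered left to right, with suspected
    subscripts wrapped as _{...}.
    """
    boxes = list(boxes)
    if not boxes:
        return ""
    tallest = max(h for _, _, _, h, _ in boxes)
    # Baseline estimate: median bottom edge of the full-size characters
    full_size = [b for b in boxes if b[3] >= height_ratio * tallest]
    baseline = median(y + h for _, y, _, h, _ in full_size)

    parts = []
    for x, y, w, h, char in sorted(boxes, key=lambda b: b[0]):
        is_small = h < height_ratio * tallest
        sits_low = (y + h) > baseline + drop_ratio * tallest
        parts.append(f"_{{{char}}}" if is_small and sits_low else char)
    return "".join(parts)


# Hypothetical boxes for "CH3", given out of order on purpose
boxes = [(45, 10, 30, 40, "H"), (10, 10, 30, 40, "C"), (80, 35, 15, 22, "3")]
print(mark_subscripts(boxes))  # CH_{3}
```

    To use it with the loop above, you would append `(x, y, w, h, predict_char)` instead of `(x, predict_char)`.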
    
    

    MORE REFERENCES

    I strongly suggest you refer to the following examples, which are useful references for OCR:

    1. Get the location of all text present in image using opencv
    2. Using YOLO or other image recognition techniques to identify all alphanumeric text present in images

  • 2021-02-14 12:45

    This happens because the subscript font is too small. You can resize the image with a Python package such as cv2 or PIL and run OCR on the resized image, as in the code below.

    import pytesseract
    import cv2
    
    img = cv2.imread('test.jpg')
    img = cv2.resize(img, None, fx=2, fy=2)  # scaling factor = 2
    
    data = pytesseract.image_to_string(img)
    print(data)
    

    OUTPUT:

    CH3
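
    Rather than hard-coding a factor of 2, the factor could be derived from the height of the smallest glyph (the subscript). This is a minimal sketch: the 20-pixel minimum height is an assumed rule of thumb, and `scale_for_min_height` is a hypothetical helper, not part of cv2 or pytesseract.

```python
import math


def scale_for_min_height(glyph_heights, min_height=20):
    """Smallest integer factor that makes every glyph at least min_height pixels tall."""
    smallest = min(glyph_heights)
    return max(1, math.ceil(min_height / smallest))


# Hypothetical glyph heights in pixels: two full-size letters and one subscript
print(scale_for_min_height([40, 40, 11]))  # 2
```

    The result can be passed as both `fx` and `fy` to `cv2.resize(img, None, fx=..., fy=...)` as in the snippet above.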
    