Preprocessing image for Tesseract OCR with OpenCV

前端 未结 5 648
既然无缘
既然无缘 2020-12-04 06:07

I\'m trying to develop an App that uses Tesseract to recognize text from documents taken by a phone\'s cam. I\'m using OpenCV to preprocess the image for better recognition,

相关标签:
5条回答
  • 2020-12-04 06:52
    1. Scanning at 300 dpi (dots per inch) is not officially a standard for OCR (optical character recognition), but it is considered the gold standard.

    2. Converting image to Greyscale improves accuracy in reading text in general.

    I have written a module that reads text in Image which in turn process the image for optimum result from OCR, Image Text Reader .

    import tempfile
    
    import cv2
    import numpy as np
    from PIL import Image
    
    IMAGE_SIZE = 1800
    BINARY_THREHOLD = 180
    
    def process_image_for_ocr(file_path):
        # TODO : Implement using opencv
        temp_filename = set_image_dpi(file_path)
        im_new = remove_noise_and_smooth(temp_filename)
        return im_new
    
    def set_image_dpi(file_path):
        im = Image.open(file_path)
        length_x, width_y = im.size
        factor = max(1, int(IMAGE_SIZE / length_x))
        size = factor * length_x, factor * width_y
        # size = (1800, 1800)
        im_resized = im.resize(size, Image.ANTIALIAS)
        temp_file = tempfile.NamedTemporaryFile(delete=False, suffix='.jpg')
        temp_filename = temp_file.name
        im_resized.save(temp_filename, dpi=(300, 300))
        return temp_filename
    
    def image_smoothening(img):
        ret1, th1 = cv2.threshold(img, BINARY_THREHOLD, 255, cv2.THRESH_BINARY)
        ret2, th2 = cv2.threshold(th1, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        blur = cv2.GaussianBlur(th2, (1, 1), 0)
        ret3, th3 = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        return th3
    
    def remove_noise_and_smooth(file_name):
        img = cv2.imread(file_name, 0)
        filtered = cv2.adaptiveThreshold(img.astype(np.uint8), 255, cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY, 41,
                                         3)
        kernel = np.ones((1, 1), np.uint8)
        opening = cv2.morphologyEx(filtered, cv2.MORPH_OPEN, kernel)
        closing = cv2.morphologyEx(opening, cv2.MORPH_CLOSE, kernel)
        img = image_smoothening(img)
        or_image = cv2.bitwise_or(img, closing)
        return or_image
    
    0 讨论(0)
  • 2020-12-04 06:53

    Note: this should be a comment to Alex I answer, but it's too long so i put it as answer.

    from "An Overview of the Tesseract OCR engine, by Ray Smith, Google Inc." at https://github.com/tesseract-ocr/docs/blob/master/tesseracticdar2007.pdf

    "Processing follows a traditional step-by-step pipeline, but some of the stages were unusual in their day, and possibly remain so even now. The first step is a connected component analysis in which outlines of the components are stored. This was a computationally expensive design decision at the time, but had a significant advantage: by inspection of the nesting of outlines, and the number of child and grandchild outlines, it is simple to detect inverse text and recognize it as easily as black-on-white text. Tesseract was probably the first OCR engine able to handle white-on-black text so trivially."

    So it seems it's not needed to have black text on white background, and should work the opposite too.

    0 讨论(0)
  • 2020-12-04 07:00

    You can play around with the configuration of the OCR by changing the --psm and --oem values, in your case specifically I will suggest using

    --psm 3 --oem 2

    you can also look at the following link for further details here

    0 讨论(0)
  • 2020-12-04 07:01

    I described some tips for preparing images for Tesseract here: Using tesseract to recognize license plates

    In your example, there are several things going on...

    You need to get the text to be black and the rest of the image white (not the reverse). That's what character recognition is tuned on. Grayscale is ok, as long as the background is mostly full white and the text mostly full black; the edges of the text may be gray (antialiased) and that may help recognition (but not necessarily - you'll have to experiment)

    One of the issues you're seeing is that in some parts of the image, the text is really "thin" (and gaps in the letters show up after thresholding), while in other parts it is really "thick" (and letters start merging). Tesseract won't like that :) It happens because the input image is not evenly lit, so a single threshold doesn't work everywhere. The solution is to do "locally adaptive thresholding" where a different threshold is calculated for each neighbordhood of the image. There are many ways of doing that, but check out for example:

    • Adaptive gaussian thresholding in OpenCV with cv2.adaptiveThreshold(...,cv2.ADAPTIVE_THRESH_GAUSSIAN_C,...)
    • Local Otsu's method
    • Local adaptive histogram equalization

    Another problem you have is that the lines aren't straight. In my experience Tesseract can handle a very limited degree of non-straight lines (a few percent of perspective distortion, tilt or skew), but it doesn't really work with wavy lines. If you can, make sure that the source images have straight lines :) Unfortunately, there is no simple off-the-shelf answer for this; you'd have to look into the research literature and implement one of the state of the art algorithms yourself (and open-source it if possible - there is a real need for an open source solution to this). A Google Scholar search for "curved line OCR extraction" will get you started, for example:

    • Text line Segmentation of Curved Document Images

    Lastly: I think you would do much better to work with the python ecosystem (ndimage, skimage) than with OpenCV in C++. OpenCV python wrappers are ok for simple stuff, but for what you're trying to do they won't do the job, you will need to grab many pieces that aren't in OpenCV (of course you can mix and match). Implementing something like curved line detection in C++ will take an order of magnitude longer than in python (* this is true even if you don't know python).

    Good luck!

    0 讨论(0)
  • 2020-12-04 07:04

    I guess you have used the generic approach for Binarization, that is the reason whole image is not binarized uniformly. You can use Adaptive Thresholding technique for binarization. You can also do some skew correction, perspective correction, noise removal for better results.

    Refer to this medium article, to know about the above-mentioned techniques along with code samples.

    0 讨论(0)
提交回复
热议问题