How can I train my Python-based Tesseract OCR to read different national identity cards?

有刺的猬 2020-12-29 15:16

I am using Python to build an OCR system that reads ID cards and returns the exact text from the image, but it is not giving me the right answers as there ...

1 answer
  • 2020-12-29 15:50

    Steps to improve pytesseract recognition:

    1) Clean your image arrays so that they contain only text (font-generated, not handwritten). The edges of the letters should be free of distortion. Apply a threshold (try different values) and some smoothing filters. I also recommend morphological opening/closing, but that is only a bonus. This is an exaggerated example of the kind of array that should go into pytesseract: https://i.ytimg.com/vi/1ns8tGgdpLY/maxresdefault.jpg

    2) Resize the image containing the text you want to recognize to a higher resolution.

    3) Pytesseract should generally recognize letters of any kind, but installing the font in which the text is written greatly increases accuracy.

    How to install new fonts for pytesseract:

    1) Get your desired font in TIFF format.

    2) Upload it to http://trainyourtesseract.com/ and receive the trained data by email.

    3) Add the trained data file (*.traineddata) to this folder: C:\Program Files (x86)\Tesseract-OCR\tessdata

    4) Pass the font names to the pytesseract recognition function through the lang argument:

    • Let's say you have 2 trained fonts: font1.traineddata and font2.traineddata

    • To use both, join their names with a plus sign:

      txt = pytesseract.image_to_string(img, lang='font1+font2')
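
    A misspelled or missing .traineddata file only shows up as an error at recognition time, so it can help to check the tessdata folder before building the lang string. The helper below is a minimal sketch; build_lang_string is a hypothetical name, not part of pytesseract.

```python
from pathlib import Path


def build_lang_string(tessdata_dir, names):
    """Join trained-data names with '+', raising if any file is missing.

    Hypothetical helper: it only checks that <name>.traineddata exists
    in tessdata_dir; it does not validate the file contents.
    """
    tessdata = Path(tessdata_dir)
    missing = [n for n in names if not (tessdata / f"{n}.traineddata").is_file()]
    if missing:
        raise FileNotFoundError(f"no .traineddata file for: {', '.join(missing)}")
    return "+".join(names)
```

    With the folder from step 3, the returned string for ['font1', 'font2'] would then be passed as the lang argument of pytesseract.image_to_string.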

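    Here is code to test recognition on images from the web:

    import urllib.request

    import cv2
    import numpy as np
    import pytesseract

    # Path to the Tesseract executable (Windows install location used in this answer).
    pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'
    # A Python variable named TESSDATA_PREFIX has no effect: Tesseract reads it from
    # the environment. Set it only if Tesseract cannot find its tessdata folder
    # (the expected value depends on the Tesseract version):
    # import os
    # os.environ['TESSDATA_PREFIX'] = 'C:/Program Files (x86)/Tesseract-OCR'

    def url_to_image(url):
        """Download an image and decode it into a BGR OpenCV array."""
        with urllib.request.urlopen(url) as resp:
            data = np.asarray(bytearray(resp.read()), dtype="uint8")
        return cv2.imdecode(data, cv2.IMREAD_COLOR)

    url = 'http://jeroen.github.io/images/testocr.png'
    img = url_to_image(url)

    # Smooth noise, then binarize; try different filters and threshold values.
    # img = cv2.GaussianBlur(img, (5, 5), 0)
    img = cv2.medianBlur(img, 5)
    retval, img = cv2.threshold(img, 150, 255, cv2.THRESH_BINARY)

    txt = pytesseract.image_to_string(img, lang='eng')
    print('recognition:', txt)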
    >>> txt
    'This ts a lot of 12 point text to test the\nocr code and see if it works on all types\nof file format\n\nThe quick brown dog jumped over the\nlazy fox The quick brown dog jumped\nover the lazy fox The quick brown dog\njumped over the lazy fox The quick\nbrown dog jumped over the lazy fox'
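
    The output above still contains a mistake ("ts" instead of "is"). When trying different threshold or blur values, it may help to score each result against text you have typed in by hand. The sketch below uses difflib from the standard library; ocr_similarity is a hypothetical helper name, and the ratio is only a rough similarity measure, not a formal character error rate.

```python
from difflib import SequenceMatcher


def _normalize(text):
    """Collapse all runs of whitespace (including newlines) to single spaces."""
    return " ".join(text.split())


def ocr_similarity(recognized, expected):
    """Return a similarity ratio between 0 and 1 (1 means identical text).

    Hypothetical helper: whitespace differences are ignored on purpose,
    because OCR line breaks rarely match the reference text.
    """
    return SequenceMatcher(None, _normalize(recognized), _normalize(expected)).ratio()
```

    Running the recognition with several preprocessing settings and keeping the one with the highest score is a simple way to pick values for a given card layout.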
    