Pytesseract foreign language extraction using Python

没有蜡笔的小新 2021-02-04 17:23

I am using Python 2.7, Pytesseract-0.1.7 and Tesseract-ocr 3.05.01 on a Windows machine.

I tried to extract text for Korean and Russian languages, and I am positive th

1 answer
  • 2021-02-04 17:40

    You are using Tesseract with a language other than English, so first of all, make sure that the trained data for your language is installed, as shown here (Linux instructions only).
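One way to check this from Python is to look for the language files directly. Tesseract stores each language as a `<lang>.traineddata` file in its `tessdata` directory. The sketch below assumes you know where that directory is (the location varies by platform and install method), and the helper name `missing_languages` is hypothetical, not part of pytesseract.

```python
from pathlib import Path


def missing_languages(tessdata_dir, lang_spec):
    """Return the language codes in lang_spec that have no .traineddata file.

    tessdata_dir: path to Tesseract's tessdata directory (an assumption; the
                  actual location depends on your installation).
    lang_spec:    a language string as passed to Tesseract, e.g. "rus" or
                  "rus+kor" ('+' joins several languages).
    """
    tessdata = Path(tessdata_dir)
    codes = [code for code in lang_spec.split("+") if code]
    return [code for code in codes
            if not (tessdata / f"{code}.traineddata").is_file()]
```

Calling `missing_languages(your_tessdata_dir, "rus+kor")` returns an empty list when both language files are present, and the list of missing codes otherwise.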

    Secondly, I strongly suggest that you switch to Python 3 if you are working with non-ASCII languages (as I do, as a Slovenian). Python 3 works with Unicode out of the box, so it really saves you tons of pain with encoding and decoding strings.
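As a small illustration of that point (the variable names are only for this sketch): in Python 3 a `str` holds Unicode characters, and bytes appear only when you encode or decode explicitly.

```python
# In Python 3, a string literal is a sequence of Unicode characters.
label = 'Сред. Скорость'
print(len(label))            # counts characters, not bytes

# Bytes only appear when you encode explicitly, e.g. to write to a file.
encoded = label.encode('utf-8')
print(len(encoded))          # each Cyrillic letter takes two bytes in UTF-8

# Decoding with the same codec gives back the original text.
decoded = encoded.decode('utf-8')
print(decoded == label)
```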

    # Python 3 required
    from PIL import Image
    import pytesseract

    img = Image.open("T9esw.png")
    img.load()
    # Specify the language to recognize
    text = pytesseract.image_to_string(img, lang="rus")
    print(text)
    i = 'Сред. Скорость'
    print(i)
    if text == i:
        print("Match")
    else:
        print("Not Match")
    

    Which outputs:

    Фред скорасть
    Сред. Скорость
    Not Match
    

    This means the words didn't quite match, but considering the minimal coding effort and the awful quality of the input image, I think the performance is quite amazing. In any case, the example shows that encoding and decoding should no longer be a problem.
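Since exact equality is a strict test for OCR output, one option is to compare the strings by similarity instead. The sketch below uses `difflib.SequenceMatcher` from the standard library; the helper name `similarity`, the case-folding step, and the 0.8 threshold are illustrative assumptions, not something from the original code.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a ratio in [0, 1]; 1.0 means the strings are identical
    after trimming whitespace and ignoring case."""
    a_norm = a.strip().casefold()
    b_norm = b.strip().casefold()
    return SequenceMatcher(None, a_norm, b_norm).ratio()


ocr_text = 'Фред скорасть'      # OCR result from the example above
expected = 'Сред. Скорость'     # the text we hoped to read

score = similarity(ocr_text, expected)
# The threshold is an arbitrary choice for illustration; tune it for your data.
print("Close match" if score >= 0.8 else "Not a close match")
```

A score like this makes "didn't quite match" measurable, which is usually more useful than a plain `==` when the input image is noisy.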
