I am using Python 2.7, Pytesseract-0.1.7 and Tesseract-ocr 3.05.01 on a Windows machine.
I tried to extract text for Korean and Russian languages, and I am positive th
You are using Tesseract with a language other than English, so first of all, make sure that you have the trained data for your language installed, as shown here (Linux instructions only).
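If you are not sure which languages your installation knows about, the following is a minimal sketch of one way to check from Python. It assumes the `tesseract` executable is on your PATH and that `tesseract --list-langs` prints a header line followed by one language code per line. The helper names `parse_langs` and `installed_languages` are made up for this example and are not part of pytesseract.

```python
import subprocess


def parse_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`.

    Assumes a header line ("List of available languages ...") followed by
    one language code per line.
    """
    langs = []
    for line in output.splitlines():
        line = line.strip()
        if not line or line.lower().startswith("list of available languages"):
            continue
        langs.append(line)
    return langs


def installed_languages(tesseract_cmd="tesseract"):
    """Run `tesseract --list-langs` and return the language codes it reports.

    stderr is merged into stdout because some Tesseract builds print the
    list on stderr (an assumption worth checking on your own machine).
    """
    result = subprocess.run(
        [tesseract_cmd, "--list-langs"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return parse_langs(result.stdout)
```

Calling `installed_languages()` and checking that `"rus"` (or `"kor"`) is in the returned list tells you whether the language data was actually found before you pass `lang=` to pytesseract.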
Secondly, I strongly suggest switching to Python 3 if you are working with non-ASCII languages (as I do, being Slovenian). Python 3 works with Unicode out of the box, so it saves you a lot of pain with encoding and decoding strings.
# Python 3 is required!
from PIL import Image
import pytesseract
img = Image.open("T9esw.png")
img.load()
text = pytesseract.image_to_string(img, lang="rus")  # specify the language to recognize
print(text)
i = 'Сред. Скорость'
print(i)
if text == i:
    print("Match")
else:
    print("Not Match")
Which outputs:
Фред скорасть
Сред. Скорость
Not Match
This means the words didn't quite match, but considering the minimal coding effort and the awful quality of the input image, I think the performance is quite amazing. In any case, the example shows that encoding and decoding should no longer be a problem.
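Since OCR output on a poor image will rarely equal the expected string exactly, a similarity score can be more useful than `==`. Below is a rough sketch using the standard library's `difflib` (Python 3). The `normalize` and `similarity` helpers and the 0.8 threshold are my own assumptions for illustration, so tune them for your data.

```python
import difflib
import unicodedata


def normalize(s):
    """Lowercase, drop punctuation and collapse whitespace before comparing."""
    s = unicodedata.normalize("NFC", s).lower()
    s = "".join(ch for ch in s if ch.isalnum() or ch.isspace())
    return " ".join(s.split())


def similarity(a, b):
    """Return a ratio between 0.0 and 1.0; 1.0 means equal after normalizing."""
    return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()


ocr_text = "Фред скорасть"    # what Tesseract returned above
expected = "Сред. Скорость"   # the string we hoped for

THRESHOLD = 0.8  # arbitrary cut-off, assumed for this example
score = similarity(ocr_text, expected)
print("Match" if score >= THRESHOLD else "Not Match", round(score, 2))
```

With this kind of comparison, a couple of misread characters no longer turn a near-correct result into a hard failure.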