Pytesseract foreign language extraction using python

没有蜡笔的小新

I am using Python 2.7, Pytesseract-0.1.7 and Tesseract-ocr 3.05.01 on a Windows machine.

I tried to extract text for Korean and Russian languages, and I am positive th

1 Answer

忘掉有多难

2021-02-04 17:40
You are using Tesseract with a language other than English, so first of all, make sure that the trained data for your language is installed, as shown here (Linux instructions only).
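One way to check this from Python is to ask the `tesseract` binary which languages it can see. The snippet below is only a sketch: it assumes `tesseract` is on your `PATH` and that `tesseract --list-langs` prints a header line starting with "List of available languages" followed by one language code per line. The helper names `parse_langs` and `installed_languages` are made up for this example.

```python
import subprocess


def parse_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the assumed "List of available languages ..." header.
        if not line or line.startswith("List of available languages"):
            continue
        langs.append(line)
    return langs


def installed_languages():
    """Run the tesseract binary (assumed to be on PATH) and return its languages."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # some versions print the list to stderr
        universal_newlines=True,
    )
    return parse_langs(result.stdout)


if __name__ == "__main__":
    langs = installed_languages()
    for wanted in ("rus", "kor"):
        print(wanted, "installed" if wanted in langs else "MISSING")
```

If `rus` or `kor` shows up as missing, the trained data file for that language still has to be installed before `lang="rus"` or `lang="kor"` can work.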

Secondly, I strongly suggest that you switch to Python 3 if you are working with non-ASCII languages (as I do, being Slovenian). Python 3 works with Unicode out of the box, so it really saves you tons of pain with encoding and decoding strings...
```
# Python 3 strongly recommended: str is Unicode by default.
from PIL import Image
import pytesseract

img = Image.open("T9esw.png")
img.load()
# Specify the language to recognise; its trained data must be installed.
text = pytesseract.image_to_string(img, lang="rus")
print(text)

expected = 'Сред. Скорость'
print(expected)
if text == expected:
    print("Match")
else:
    print("Not Match")
```
Which outputs:
```
Фред скорасть
Сред. Скорость
Not Match
```
This means the words didn't quite match, but considering the minimal coding effort and the awful quality of the input image, I think the performance is quite amazing. Anyway, the example shows that encoding and decoding should no longer be a problem.
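Since OCR output on a poor image will rarely equal the expected string exactly, a fuzzy comparison may be more useful than `==`. This is a sketch using `difflib` from the standard library, not part of the code above: the helper names `normalize` and `similarity` are invented here, and the 0.8 threshold is an arbitrary assumption you would tune for your own images.

```python
from difflib import SequenceMatcher


def normalize(s):
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    kept = "".join(ch for ch in s.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def similarity(ocr_text, expected):
    """Return a score between 0.0 and 1.0; 1.0 means identical after normalizing."""
    return SequenceMatcher(None, normalize(ocr_text), normalize(expected)).ratio()


if __name__ == "__main__":
    score = similarity("Фред скорасть", "Сред. Скорость")
    # 0.8 is an assumed cut-off, chosen only for illustration.
    print("Close match" if score > 0.8 else "Not Match")
```

With this approach the two strings from the output above are treated as close, because only a couple of characters differ once case and punctuation are ignored.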