I want to find a easy-to-use OCR python module in linux, I have found pytesser http://code.google.com/p/pytesser/, but it contains a .exe executable file.
I tried ch
python tesseract
http://code.google.com/p/python-tesseract
import cv2.cv as cv
import tesseract
api = tesseract.TessBaseAPI()
api.Init(".","eng",tesseract.OEM_DEFAULT)
api.SetPageSegMode(tesseract.PSM_AUTO)
image=cv.LoadImage("eurotext.jpg", cv.CV_LOAD_IMAGE_GRAYSCALE)
tesseract.SetCvImage(image,api)
text=api.GetUTF8Text()
conf=api.MeanTextConf()
You have a bunch of options here.
One way, as others pointed out is to use tesseract. Looks like there are a bunch of wrappers by now, so best way is to do a quick pypi search for it. The most used ones these days are:
Another useful site for finding similar engines is alternative.to. A few linux based systems according to them are:
You can just wrap tesseract
in a function:
import os
import tempfile
import subprocess
def ocr(path):
temp = tempfile.NamedTemporaryFile(delete=False)
process = subprocess.Popen(['tesseract', path, temp.name], stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
process.communicate()
with open(temp.name + '.txt', 'r') as handle:
contents = handle.read()
os.remove(temp.name + '.txt')
os.remove(temp.name)
return contents
If you want document segmentation and more advanced features, try out OCRopus.
You should try the excellent scikits.learn libraries for machine learning. You can find two codes that are ready to run here and here.
In addition to Blender's answer, that just executs Tesseract executable, I would like to add that there exist other alternatives for OCR that can also be called as external process.
ABBYY comand line OCR utility: http://ocr4linux.com/en:start
It is not free, so worth to consider only if Tesseract accuracy is not good enough for your task, or you need more sophisticated layout analisys or you need to export PDF, Word and other files.
Update: here's comparison of ABBYY and tesseract accuracy: http://www.splitbrain.org/blog/2010-06/15-linux_ocr_software_comparison
Disclaimer: I work for ABBYY