I am not sure whether it is my infrastructure that does this weird stuff or tesseract-ocr itself.
Whenever I use image_to_string in a single-process environment - the
(Note: the information below is based on reviewing the pytesseract.py code; I haven't set up a multi-process test to check.)
There are several Python libraries that interface with tesseract-ocr. You are probably using pytesseract (guessing by the image_to_string function).
This library calls the tesseract-ocr binary as a subprocess and uses temporary files to interface with it. It uses the obsolete tempfile.mktemp(), which does not guarantee unique file names. Further, it does not even use the returned file name as-is, so a second call to tempfile.mktemp() can easily return the same file name.
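To illustrate the general issue with tempfile.mktemp() (this is a standalone sketch using only the standard library, not code taken from pytesseract): mktemp() only returns a name that does not exist at the moment of the call, while mkstemp() actually creates the file and so reserves the name.

```python
import os
import tempfile

# mktemp() only picks a name that does not exist right now. Nothing is
# created on disk, so another process can pick or create the same path
# before this one gets around to using it.
candidate = tempfile.mktemp(suffix=".txt")
name_only = not os.path.exists(candidate)  # no file was created

# mkstemp() creates and opens the file in one step, so the name is taken
# from the moment the call returns.
fd, reserved = tempfile.mkstemp(suffix=".txt")
os.close(fd)
created = os.path.exists(reserved)  # the file really exists

os.remove(reserved)  # clean up the file created by mkstemp()
```

The window between choosing the name and creating the file is where two processes can collide.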
Consider using a different Python interface library for tesseract, e.g. pip install tesseract-ocr, or python-tesseract from Google (https://code.google.com/archive/p/python-tesseract/).
If the problem is actually with the temp files, as I suspect, you may be able to work around this by setting a different temp directory for each of your spawned worker processes:
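import shutil
import tempfile

td = tempfile.mkdtemp()  # private directory for this worker process
tempfile.tempdir = td    # tempfile.mktemp() etc. now default to it
try:
    pass  # your code calling pytesseract.image_to_string() or similar
finally:
    tempfile.tempdir = None
    # rmtree instead of os.rmdir, which fails if any temp file is left behind
    shutil.rmtree(td, ignore_errors=True)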
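The same workaround can be wrapped in a context manager so that each worker just wraps its OCR call in a with block. This is a minimal sketch and untested against pytesseract itself; the name private_tempdir is hypothetical, and it assumes the library picks its temp file names through the tempfile module's default directory.

```python
import contextlib
import shutil
import tempfile


@contextlib.contextmanager
def private_tempdir():
    """Point the tempfile module at a fresh directory for the duration.

    Hypothetical helper: tempfile.tempdir is a process-wide global, so this
    isolates worker processes from each other, not threads.
    """
    previous = tempfile.tempdir
    td = tempfile.mkdtemp()
    tempfile.tempdir = td
    try:
        yield td
    finally:
        # Restore the earlier setting, then remove the directory and
        # anything left inside it.
        tempfile.tempdir = previous
        shutil.rmtree(td, ignore_errors=True)
```

Inside a worker function you would then write `with private_tempdir():` around the call to pytesseract.image_to_string().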