Tesseract 3.x multiprocessing weird behaviour

花落未央 2021-02-09 23:30

I am not sure whether it is my infrastructure that does this weird stuff or tesseract-ocr itself.

Whenever I use image_to_string in a single-process environment - the

1 Answer
  •  暖寄归人
    2021-02-10 00:06

    (NOTE: the information below is based on a review of the pytesseract.py code; I haven't tried to set up a multi-process test to check.)

    There are several Python libraries that interface with tesseract-ocr. You are probably using pytesseract (guessing by the image_to_string function).

    This library calls the tesseract-ocr binary as a subprocess and uses temporary files to exchange data with it. It uses the deprecated tempfile.mktemp(), which only returns a name that was unused at the time of the call and does not create the file, so unique file names are not guaranteed. Further, it does not even use the returned file name as-is, so a second call to tempfile.mktemp() (for example, from another worker process) can easily return the same file name.
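
    For comparison, here is a minimal sketch of the race-free alternative in the standard library. tempfile.mkstemp() creates the file at the moment it picks the name, so two callers cannot be handed the same path. The helper name make_private_temp is hypothetical and only for illustration; this is not what pytesseract does.

```python
import os
import tempfile


def make_private_temp(suffix=".txt"):
    """Create a uniquely named temp file and return its path.

    Hypothetical helper for illustration; not part of pytesseract.
    """
    # mkstemp() creates the file atomically, unlike mktemp(), which only
    # returns a name that happened to be unused when it was called.
    fd, path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)  # only the path is needed here, so close the descriptor
    return path


if __name__ == "__main__":
    a = make_private_temp()
    b = make_private_temp()
    print(a, b)  # two different paths; both files already exist on disk
    os.remove(a)
    os.remove(b)
```

    Because the file exists as soon as the name is returned, a concurrent process asking for a temp file gets a different name.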

    Consider using a different Python interface library for tesseract, e.g. tesseract-ocr (pip install tesseract-ocr) or python-tesseract from Google (https://code.google.com/archive/p/python-tesseract/).

    If the problem is actually with the temp files, as I suspect, you may be able to work around it by setting a different temp directory for each of your spawned worker processes:

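    import shutil
    import tempfile

    td = tempfile.mkdtemp()  # private temp directory for this worker process
    tempfile.tempdir = td    # tempfile functions in this process now use td
    try:
        pass  # your code calling pytesseract.image_to_string() or similar
    finally:
        tempfile.tempdir = None                # restore the default lookup
        shutil.rmtree(td, ignore_errors=True)  # os.rmdir() fails if files are left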
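
    To show how this might fit into a multiprocessing worker, here is a rough, untested-against-pytesseract sketch. The names ocr_task and run_ocr are hypothetical; run_ocr is a stand-in that only reports the temp directory it sees, and it marks the place where the real pytesseract.image_to_string() call would go.

```python
import multiprocessing as mp
import tempfile


def run_ocr(image_path):
    # Hypothetical stand-in: real code would call
    # pytesseract.image_to_string(...) on image_path here.
    return tempfile.gettempdir()


def ocr_task(image_path):
    """Run one OCR job with a temp directory private to this call."""
    with tempfile.TemporaryDirectory(prefix="ocr-") as td:
        tempfile.tempdir = td  # process-wide setting, seen by tempfile.mktemp()
        try:
            return run_ocr(image_path)
        finally:
            tempfile.tempdir = None  # restore the default before td is removed


if __name__ == "__main__":
    with mp.Pool(processes=2) as pool:
        dirs = pool.map(ocr_task, ["a.png", "b.png", "c.png"])
    print(dirs)  # each task reports the private directory it ran in
```

    Note that tempfile.tempdir is a per-process global, so this only isolates separate processes; threads within one process would still share the same setting.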
    
