PyTesseract call working very slow when used along with multiprocessing

时光说笑 2021-01-06 12:19

I have a function that takes in a list of images and produces the output, in a list, after applying OCR to each image. I have another function that controls the input to th

1 answer
  • 2021-01-06 13:21

    I'm the pathos author. If your code takes 1s to run serially, then it's quite possible that it will take longer to run in naive process parallel. There is overhead to working with naive process parallel:

    1. a new python instance has to be spun up on each processor
    2. your function and dependencies need to get serialized and sent to each processor
    3. your data needs to get serialized and sent to the processors
    4. the same costs apply to deserialization on the receiving side
    5. you can run into memory issues from either long-lived pools or lots of data serialization.

    I'd suggest checking a few simple things to see where your issues might be:

    • try the pathos.pools.ThreadPool to use thread parallel instead of process parallel. This can reduce some of the overhead for serialization and spinning up the pool.
    • try the pathos.pools._ProcessPool to change how pathos manages the pool. Without the underscore, pathos keeps the pool around as a singleton, and requires a 'terminate' to explicitly kill the pool. With the underscore, the pool dies when you delete the pool object. Note that your caller function does not close or join (or terminate) the pool.
    • you might want to check how much you are serializing by trying to dill.dumps one of the elements you are trying to process in parallel. Things like big numpy arrays can take a while to serialize. If the size of what is being passed around is large, you might consider using a shared memory array (e.g. a multiprocess.Array or the equivalent version for numpy arrays -- also see: numpy.ctypeslib) to minimize what is being passed between processes.
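
The first and third checks can be sketched roughly as follows. This is a minimal illustration, not the asker's code: it uses the standard library's `multiprocessing.pool.ThreadPool` as a stand-in for `pathos.pools.ThreadPool`, falls back to `pickle` when `dill` is not installed, and `fake_ocr`, `serialized_cost`, and the byte-buffer "images" are hypothetical placeholders for the real OCR call and data.

```python
import pickle
import time
from multiprocessing.pool import ThreadPool  # stdlib stand-in for pathos.pools.ThreadPool

try:
    import dill as serializer  # the serializer pathos relies on
except ImportError:
    serializer = pickle  # fallback so the sketch runs without dill


def serialized_cost(obj):
    """Return (size_in_bytes, seconds) for one serialization pass over obj."""
    start = time.perf_counter()
    payload = serializer.dumps(obj)
    return len(payload), time.perf_counter() - start


def fake_ocr(image_bytes):
    """Hypothetical stand-in for the OCR call; it only returns the byte count."""
    return len(image_bytes)


# Hypothetical "images": raw byte buffers of 1 MB each.
images = [bytes(1_000_000) for _ in range(8)]

# Check how much one element costs to serialize before blaming the pool.
size, seconds = serialized_cost(images[0])
print(f"one element: {size} bytes, {seconds:.4f}s to serialize")

# Threads share the parent's memory, so the items are not serialized
# when they are handed to the workers.
with ThreadPool(4) as pool:
    results = pool.map(fake_ocr, images)
```

If the per-element size is large relative to the work done on it, serialization is a likely culprit; multiplying it by the number of elements gives a rough lower bound on what a process pool has to move in each direction.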

    The latter is a bit more work, but can provide huge savings if you have a lot to serialize. There is no shared memory pool, so you have to do a for loop over the individual multiprocess.Process objects if you need to go that route.
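
A loop over individual processes with a shared array might look something like the sketch below. It assumes the standard library `multiprocessing` module (the `multiprocess` fork that pathos uses mirrors the same interface), and `invert_rows`, `row_bands`, and the image dimensions are hypothetical names and values chosen only for illustration.

```python
import multiprocessing as mp  # stdlib; pathos' `multiprocess` fork mirrors this API


def invert_rows(shared, width, row_start, row_stop):
    """Hypothetical worker: invert pixel values in place for a band of rows."""
    for i in range(row_start * width, row_stop * width):
        shared[i] = 255 - shared[i]


def row_bands(height, n_workers):
    """Split height rows (height >= 1) into at most n_workers contiguous bands."""
    step = -(-height // n_workers)  # ceiling division
    return [(s, min(s + step, height)) for s in range(0, height, step)]


if __name__ == "__main__":
    width, height = 64, 48
    # Shared, zero-initialized unsigned bytes. lock=False is safe here only
    # because each worker writes to a disjoint band of rows.
    image = mp.Array("B", width * height, lock=False)

    # No shared-memory pool exists, so start and join the processes by hand.
    procs = [
        mp.Process(target=invert_rows, args=(image, width, start, stop))
        for start, stop in row_bands(height, 4)
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

    # Every pixel should have been flipped from 0 to 255 by some worker.
    print(all(v == 255 for v in image))
```

Only the band boundaries are passed to each process; the pixel data stays in the shared buffer, which is the point of going this route.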
