Running threads inside processes

Submitted by 断了今生、忘了曾经 on 2020-03-05 06:01:17

Question


I'm running image processing on a huge dataset with multiprocessing, and I'm wondering whether running a ThreadPoolExecutor inside a Pool provides any benefit over simply running a Pool on all items.

The dataset contains multiple folders, each containing images, so my initial thought was to give each folder its own process and each image in that folder its own thread. The other way would be to collect every image and run each one as a task in a process pool.

For instance, each folder as a process and each image as a thread:

from concurrent import futures
from multiprocessing import Pool
from pathlib import Path


def handle_image(image_path: Path):
    pass


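def handle_folder(folder_path: Path):
    # The with-block waits for all submitted tasks when it exits,
    # so no explicit shutdown() call is needed.
    with futures.ThreadPoolExecutor() as e:
        # Consume the iterator so exceptions raised in handle_image surface here.
        list(e.map(handle_image, folder_path.glob("*")))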


if __name__ == '__main__':
    dataset_folder = Path("Folder")
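    with Pool() as p:
        # Iterating over the results waits for every task to finish and
        # re-raises any exception that occurred in a worker process.
        for _ in p.imap_unordered(handle_folder, dataset_folder.iterdir()):
            pass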

Versus each image as a process:

from multiprocessing import Pool
from pathlib import Path


def handle_image(image_path: Path):
    if not image_path.is_file():
        return


if __name__ == '__main__':
    dataset_folder = Path("Folder")
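    with Pool() as p:
        # Chunks of 100 paths are sent to the workers at a time. Iterating over
        # the results waits for completion and re-raises worker exceptions.
        for _ in p.imap_unordered(handle_image, dataset_folder.glob("**/*"), 100):
            pass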

Answer 1:


Your task (image processing) sounds CPU-bound. Under CPython's GIL only one thread executes Python bytecode at a time, so threads won't have enough idle time to let each other execute unless you are delegating to some C library that releases the GIL for most of the processing.
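A rough way to check which case applies is to compare CPU time with wall-clock time for a single call. The sketch below is only an illustration: `cpu_fraction` is a hypothetical helper, not something from the question, and it assumes the function under test does its work in the calling process. A result close to 1 suggests the call is CPU-bound; a much lower value suggests it spends most of its time waiting on I/O. It does not tell you whether a C library releases the GIL.

```python
import time


def cpu_fraction(func, *args):
    """Return CPU time divided by wall-clock time for one call of func(*args).

    Hypothetical helper for illustration. time.process_time() counts CPU time
    of the whole process, so a library that computes in several native threads
    can push the result above 1.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    func(*args)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return cpu / wall if wall > 0 else 0.0
```

For example, `cpu_fraction(handle_image, some_path)` on a few representative images gives a first estimate before committing to either design.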

If, however, processing time is comparable to I/O time, you may get a speedup from up to a few threads per process (cf. the question "400 threads in 20 processes outperform 400 threads in 4 processes while performing an I/O-bound task" for how the timings compare for a much more I/O-bound task).
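A minimal sketch of "a few threads per process", assuming the images are first flattened into batches rather than split by folder. The names `chunked`, `handle_batch`, `process_dataset`, and the two constants are hypothetical and chosen only for illustration; the right thread count and batch size have to be found by measurement.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from multiprocessing import Pool
from pathlib import Path

THREADS_PER_PROCESS = 4  # assumed value; tune by measurement
BATCH_SIZE = 100         # assumed value; images handed to a worker at once


def handle_image(image_path):
    # Placeholder for the real work (read, process, write).
    return image_path


def chunked(iterable, size):
    """Yield lists of up to `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def handle_batch(image_paths):
    # Runs inside a worker process: a small thread pool lets the I/O of one
    # image overlap with the computation of another.
    with ThreadPoolExecutor(max_workers=THREADS_PER_PROCESS) as executor:
        return list(executor.map(handle_image, image_paths))


def process_dataset(dataset_folder):
    images = (p for p in Path(dataset_folder).glob("**/*") if p.is_file())
    with Pool() as pool:
        # Iterating waits for all batches and re-raises worker exceptions.
        for _ in pool.imap_unordered(handle_batch, chunked(images, BATCH_SIZE)):
            pass
```

As in the question's code, `process_dataset(Path("Folder"))` should be called from under an `if __name__ == '__main__':` guard. Batching by a fixed size rather than by folder keeps the processes evenly loaded when folders differ a lot in size.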


As a side note, for large-scale distributed work, you may take a look at one of the 3rd-party implementations of a distributed task queue for Python instead of the built-in pools and map.



Source: https://stackoverflow.com/questions/56486136/running-threads-inside-processes
