How to share data between all processes in Python multiprocessing?

Submitted by 不羁的心 on 2020-05-15 04:45:20

Question


I want to search for a pre-defined list of keywords in a given article and increment the score by 1 whenever a keyword is found in the article. I want to use multiprocessing since the pre-defined list of keywords is very large (10k keywords) and the number of articles is 100k.

I came across this question but it does not address my question.

I tried this implementation but I am getting None as the result.

keywords = ["threading", "package", "parallelize"]

def search_worker(keyword):
    score = 0
    article = """
    The multiprocessing package also includes some APIs that are not in the threading module at all. For example, there is a neat Pool class that you can use to parallelize executing a function across multiple inputs."""

    if keyword in article:
        score += 1
    return score

I tried the two methods below but I am getting three Nones as the result.

Method1:

import multiprocessing as mp

pool = mp.Pool(processes=4)
result = [pool.apply(search_worker, args=(keyword,)) for keyword in keywords]

Method2:

result = pool.map(search_worker, keywords)
print(result)

Actual output: [None, None, None]

Expected output: 3

I am thinking of sending the worker the pre-defined list of keywords and the article together, but I am not sure if I am going in the right direction, as I don't have prior experience with multiprocessing.

Thanks in advance.


Answer 1:


Here's a function using Pool. You can pass it text and keyword_list and it will work. You could use Pool.starmap to pass tuples of (text, keyword), but you would need to deal with an iterable that holds 10k references to text (a sketch of that alternative follows the code below).

from functools import partial
from multiprocessing import Pool

def search_worker(text, keyword):
    return int(keyword in text)

def parallel_search_text(text, keyword_list):
    processes = 4
    chunk_size = 10
    total = 0
    func = partial(search_worker, text)
    with Pool(processes=processes) as pool:
        for result in pool.imap_unordered(func, keyword_list, chunksize=chunk_size):
            total += result

    return total

if __name__ == '__main__':
    texts = []  # a list of texts
    keywords = []  # a list of keywords
    for text in texts:
        print(parallel_search_text(text, keywords))
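
The Pool.starmap alternative mentioned above might look roughly like the sketch below (the function name is just illustrative). It reuses search_worker's two-argument signature, but it has to build a (text, keyword) tuple per keyword, which is the 10k-references drawback noted earlier.

def parallel_search_text_starmap(text, keyword_list):
    # Illustrative sketch: each task gets its own (text, keyword) tuple,
    # so with 10k keywords the reference to text is repeated 10k times.
    args = [(text, keyword) for keyword in keyword_list]
    with Pool(processes=4) as pool:
        results = pool.starmap(search_worker, args, chunksize=10)
    return sum(results)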

There is overhead in creating a pool of workers. It might be worth timeit-testing this against a simple single-process text search function. Repeat calls can be sped up by creating one instance of Pool and passing it into the function.

def parallel_search_text2(text, keyword_list, pool):
    chunk_size = 10
    results = 0
    func = partial(search_worker, text)

    for result in pool.imap_unordered(func, keyword_list, chunksize=chunk_size):
        results += result
    return results

if __name__ == '__main__':
    pool = Pool(processes=4)
    texts = []  # a list of texts
    keywords = []  # a list of keywords
    for text in texts:
        print(parallel_search_text2(text, keywords, pool))
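
One rough way to run the timeit comparison suggested above is sketched below; search_text_serial is a hypothetical single-process baseline, not part of the original answer, and the snippet reuses search_worker and parallel_search_text2 from the code above.

import timeit

def search_text_serial(text, keyword_list):
    # Hypothetical single-process baseline: no pool or pickling overhead.
    return sum(int(keyword in text) for keyword in keyword_list)

if __name__ == '__main__':
    pool = Pool(processes=4)
    text = "The multiprocessing package also includes some APIs ..."
    keywords = ["threading", "package", "parallelize"]  # in practice ~10k keywords
    print(timeit.timeit(lambda: search_text_serial(text, keywords), number=100))
    print(timeit.timeit(lambda: parallel_search_text2(text, keywords, pool), number=100))
    pool.close()
    pool.join()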



Answer 2:


User e.s resolved the main problem in his comment but I'm posting a solution to Om Prakash's comment requesting to pass in:

both article and pre-defined list of keywords to worker method

Here is a simple way to do that. All you need to do is construct a tuple of arguments for each call that you want the worker to process:

from multiprocessing import Pool

def search_worker(article_and_keyword):
    # unpack the tuple
    article, keyword = article_and_keyword

    # count occurrences
    score = 0
    if keyword in article:
        score += 1

    return score

if __name__ == "__main__":
    # the article and the keywords
    article = """The multiprocessing package also includes some APIs that are not in the threading module at all. For example, there is a neat Pool class that you can use to parallelize executing a function across multiple inputs."""
    keywords = ["threading", "package", "parallelize"]

    # construct the arguments for the search_worker; one keyword per worker but same article
    args = [(article, keyword) for keyword in keywords]

    # construct the pool and map to the workers
    with Pool(3) as pool:
        result = pool.map(search_worker, args)
    print(result)

If you're on a later version of Python, I would recommend trying starmap, as that will make this a bit cleaner.
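
As a sketch of what that might look like (Pool.starmap is available from Python 3.3 onward), the pool unpacks each tuple for you, so the worker can take article and keyword as separate parameters:

from multiprocessing import Pool

def search_worker(article, keyword):
    # starmap unpacks each (article, keyword) tuple into two arguments
    return int(keyword in article)

if __name__ == "__main__":
    article = """The multiprocessing package also includes some APIs that are not in the threading module at all. For example, there is a neat Pool class that you can use to parallelize executing a function across multiple inputs."""
    keywords = ["threading", "package", "parallelize"]

    args = [(article, keyword) for keyword in keywords]

    with Pool(3) as pool:
        result = pool.starmap(search_worker, args)
    print(result)  # [1, 1, 1] for this article and these keywords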



Source: https://stackoverflow.com/questions/48162230/how-to-share-data-between-all-process-in-python-multiprocessing
