Question
I want to search for a pre-defined list of keywords in a given article and increment a score by 1 each time a keyword is found in the article. I want to use multiprocessing since the pre-defined list of keywords is very large (10k keywords) and the number of articles is 100k.
I came across this question but it does not address my question.
I tried this implementation but I am getting None as the result.
keywords = ["threading", "package", "parallelize"]

def search_worker(keyword):
    score = 0
    article = """
    The multiprocessing package also includes some APIs that are not in the threading module at all. For example, there is a neat Pool class that you can use to parallelize executing a function across multiple inputs."""
    if keyword in article:
        score += 1
    return score
I tried the two methods below but I am getting three None values as the result.
Method 1:

import multiprocessing as mp

pool = mp.Pool(processes=4)
result = [pool.apply(search_worker, args=(keyword,)) for keyword in keywords]
Method 2:

result = pool.map(search_worker, keywords)
print(result)
Actual output: [None, None, None]
Expected output: 3
I am thinking of sending the worker the pre-defined list of keywords and the article together, but I am not sure if I am going in the right direction, as I don't have prior experience with multiprocessing.
Thanks in advance.
Answer 1:
Here's a function using Pool. You can pass text and keyword_list to it and it will work. You could use Pool.starmap to pass tuples of (text, keyword), but then you would need to deal with an iterable containing 10k references to text.
from functools import partial
from multiprocessing import Pool

def search_worker(text, keyword):
    return int(keyword in text)

def parallel_search_text(text, keyword_list):
    processes = 4
    chunk_size = 10
    total = 0
    func = partial(search_worker, text)
    with Pool(processes=processes) as pool:
        for result in pool.imap_unordered(func, keyword_list, chunksize=chunk_size):
            total += result
    return total

if __name__ == '__main__':
    texts = []     # a list of texts
    keywords = []  # a list of keywords
    for text in texts:
        print(parallel_search_text(text, keywords))
There is overhead in creating a pool of workers. It might be worth timeit-testing this against a simple single-process text search function. Repeat calls can be sped up by creating one instance of Pool and passing it into the function.
def parallel_search_text2(text, keyword_list, pool):
    chunk_size = 10
    results = 0
    func = partial(search_worker, text)
    for result in pool.imap_unordered(func, keyword_list, chunksize=chunk_size):
        results += result
    return results

if __name__ == '__main__':
    pool = Pool(processes=4)
    texts = []     # a list of texts
    keywords = []  # a list of keywords
    for text in texts:
        print(parallel_search_text2(text, keywords, pool))
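As a rough illustration of that timeit comparison, here is a minimal sketch. It assumes parallel_search_text from the first snippet is defined in the same module; single_process_search_text and the sample data are made-up placeholders, not part of the original answer.

import timeit

def single_process_search_text(text, keyword_list):
    # plain loop with no pool overhead, for comparison
    return sum(int(keyword in text) for keyword in keyword_list)

if __name__ == '__main__':
    sample_text = "the quick brown fox jumps over the lazy dog"  # placeholder text
    sample_keywords = ["fox", "dog", "cat"] * 1000               # placeholder keyword list

    t_single = timeit.timeit(lambda: single_process_search_text(sample_text, sample_keywords), number=10)
    t_pooled = timeit.timeit(lambda: parallel_search_text(sample_text, sample_keywords), number=10)
    print(f"single process: {t_single:.3f}s, pooled: {t_pooled:.3f}s")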
Answer 2:
User e.s resolved the main problem in his comment, but I'm posting a solution to Om Prakash's comment requesting to pass in:

both article and pre-defined list of keywords to worker method
Here is a simple way to do that. All you need to do is construct a tuple containing the arguments that you want the worker to process:
from multiprocessing import Pool

def search_worker(article_and_keyword):
    # unpack the tuple
    article, keyword = article_and_keyword

    # count occurrences
    score = 0
    if keyword in article:
        score += 1

    return score

if __name__ == "__main__":
    # the article and the keywords
    article = """The multiprocessing package also includes some APIs that are not in the threading module at all. For example, there is a neat Pool class that you can use to parallelize executing a function across multiple inputs."""
    keywords = ["threading", "package", "parallelize"]

    # construct the arguments for search_worker; one keyword per task but the same article
    args = [(article, keyword) for keyword in keywords]

    # construct the pool and map the argument tuples to the workers
    with Pool(3) as pool:
        result = pool.map(search_worker, args)

    print(result)
If you're on a later version of Python, I would recommend trying starmap, as that will make this a bit cleaner.
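For reference, a minimal sketch of that starmap variant might look like the following; search_worker_2 is a hypothetical two-argument worker introduced here for illustration, not part of the answer above.

from multiprocessing import Pool

def search_worker_2(article, keyword):
    # starmap unpacks each (article, keyword) tuple into two positional arguments
    return int(keyword in article)

if __name__ == "__main__":
    article = """The multiprocessing package also includes some APIs that are not in the threading module at all. For example, there is a neat Pool class that you can use to parallelize executing a function across multiple inputs."""
    keywords = ["threading", "package", "parallelize"]
    args = [(article, keyword) for keyword in keywords]

    with Pool(3) as pool:
        result = pool.starmap(search_worker_2, args)

    print(result)  # [1, 1, 1] for this article and keyword list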
Source: https://stackoverflow.com/questions/48162230/how-to-share-data-between-all-process-in-python-multiprocessing