Parallelize a nested for loop in Python for finding the max value

Submitted by 二次信任 on 2020-02-04 05:48:45

Question


I have been struggling for some time to improve the execution time of this piece of code. Since the calculations are really time-consuming, I think the best solution would be to parallelize the code. The output could also be stored in memory and written to a file afterwards.

I am new to both Python and parallelism, so I find it difficult to apply the concepts explained here and here. I also found this question, but I couldn't manage to figure out how to implement the same for my situation. I am working on a Windows platform, using Python 3.4.

for i in range(0, len(unique_words)):
    max_similarity = 0        
    max_similarity_word = ""
    for j in range(0, len(unique_words)):
        if not i == j:
            similarity = calculate_similarity(global_map[unique_words[i]], global_map[unique_words[j]])
            if similarity > max_similarity:
                 max_similarity = similarity
                 max_similarity_word = unique_words[j]
    file_co_occurring.write(
        unique_words[i] + "\t" + max_similarity_word + "\t" + str(max_similarity) + "\n")

If you need an explanation for the code:

  • unique_words is a list of words (strings)
  • global_map is a dictionary whose keys are words (global_map.keys() contains the same elements as unique_words) and whose values are dictionaries of the following format: {word: value}, where the words are a subset of the values in unique_words
  • for each word, I look for the most similar word based on its value in global_map. I would prefer not to store every similarity in memory, since the maps already take up too much.
  • calculate_similarity returns a value from 0 to 1
  • the result should contain the most similar word for each of the words in unique_words (the most similar word should be different from the word itself, which is why I added the condition if not i == j, though this could also be done by checking that max_similarity is different from 1)
  • if the max_similarity for a word is 0, it's OK if the most similar word is the empty string
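To make the data shapes above concrete, here is a minimal sketch with made-up toy data. The question does not show calculate_similarity, so the version below (Jaccard overlap of the two key sets) is only a hypothetical stand-in that happens to return a value from 0 to 1; the words and numbers are invented as well.

```python
# Toy data in the shape described above; every word and value is made up.
unique_words = ["cat", "dog", "car"]
global_map = {
    "cat": {"dog": 2.0, "car": 1.0},
    "dog": {"cat": 2.0},
    "car": {"cat": 1.0},
}


def calculate_similarity(map1, map2):
    """Hypothetical stand-in, NOT the asker's function.

    Returns the Jaccard overlap of the two key sets, which lies in [0, 1].
    """
    keys1, keys2 = set(map1), set(map2)
    union = keys1 | keys2
    return len(keys1 & keys2) / len(union) if union else 0.0


# global_map has exactly one entry per word in unique_words
assert set(global_map) == set(unique_words)

# "dog" and "car" both co-occur only with "cat", so their key sets coincide
print(calculate_similarity(global_map["dog"], global_map["car"]))
```

With data like this in place, the nested loop from the question can be run unchanged against a small input before trying it on the full word list.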

Answer 1:


Here is a solution that should work for you. I ended up changing a lot of your code so please ask if you have any questions.

This is far from the only way to accomplish this, and in particular it is not a memory-efficient solution: it creates one future per ordered pair of words, so n words produce n(n - 1) futures.

You will need to set max_workers to something that works for you. Usually the number of logical processors in your machine is a good starting point.
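As a rough starting point, the logical processor count can be read from the standard library rather than hard-coded. This is only a sketch; the right value for max_workers still depends on the workload.

```python
import os

# os.cpu_count() returns the number of logical CPUs, or None when it
# cannot be determined, so fall back to a single worker in that case.
max_workers = os.cpu_count() or 1
print(max_workers)
```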

from concurrent.futures import ThreadPoolExecutor, Future
from itertools import permutations
from collections import namedtuple, defaultdict

Result = namedtuple('Result', ('value', 'word'))

def new_calculate_similarity(word1, word2):
    # pair the similarity score with the word it was computed against
    return Result(
        calculate_similarity(global_map[word1], global_map[word2]),
        word2)

with ThreadPoolExecutor(max_workers=4) as executor:
    futures = defaultdict(list)
    # permutations(..., r=2) yields every ordered pair (word1, word2) of different words
    for word1, word2 in permutations(unique_words, r=2):
        futures[word1].append(
            executor.submit(new_calculate_similarity, word1, word2))

    for word in futures:
        # this will block until all calculations have completed for 'word'
        results = map(Future.result, futures[word])
        max_result = max(results, key=lambda r: r.value)
        print(word, max_result.word, max_result.value,
              sep='\t',
              file=file_co_occurring)
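If memory is a concern, one possible variation is to submit one task per word instead of one per pair, so that each task runs the whole inner loop and returns only its best match. The sketch below is an assumption-laden illustration, not the code from the answer: most_similar and find_all are hypothetical helper names, calculate_similarity is a stand-in (Jaccard overlap of the key sets), and the toy data is made up.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def calculate_similarity(map1, map2):
    """Hypothetical stand-in for the asker's function; returns a value in [0, 1]."""
    keys1, keys2 = set(map1), set(map2)
    union = keys1 | keys2
    return len(keys1 & keys2) / len(union) if union else 0.0


def most_similar(word, unique_words, global_map):
    """Run the original inner loop for one word and keep only the best match."""
    best_score, best_word = 0, ""
    for other in unique_words:
        if other == word:
            continue
        score = calculate_similarity(global_map[word], global_map[other])
        if score > best_score:
            best_score, best_word = score, other
    return word, best_word, best_score


def find_all(unique_words, global_map, executor):
    """Submit one task per word; executor.map returns results in input order."""
    worker = partial(most_similar, unique_words=unique_words, global_map=global_map)
    return list(executor.map(worker, unique_words))


# Made-up toy data in the shape the question describes.
unique_words = ["cat", "dog", "car"]
global_map = {
    "cat": {"dog": 2.0, "car": 1.0},
    "dog": {"cat": 2.0, "car": 0.5},
    "car": {"cat": 1.0},
}

with ThreadPoolExecutor(max_workers=4) as executor:
    results = find_all(unique_words, global_map, executor)

for word, best_word, best_score in results:
    print(word, best_word, best_score, sep="\t")
```

This keeps n results in memory rather than n(n - 1) futures, and it preserves the question's behaviour of reporting the empty string when the best similarity is 0. Note that threads share CPython's global interpreter lock, so a pure-Python calculate_similarity may not run faster this way. ProcessPoolExecutor offers the same map interface and could be passed to find_all instead, but on Windows the pool has to be created under an if __name__ == "__main__": guard, and with this sketch the arguments bound by partial (including global_map) would be pickled and sent along with every task.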

Here are the docs for the libraries I used:

  • Futures
  • collections
  • itertools


Source: https://stackoverflow.com/questions/29217088/parallelize-a-nested-for-loop-in-python-for-finding-the-max-value
