Question
I have been struggling for some time to improve the execution time of this piece of code. Since the calculations are really time-consuming, I think the best solution would be to parallelize the code. The output could also be stored in memory and written to a file afterwards.
I am new to both Python and parallelism, so I find it difficult to apply the concepts explained here and here. I also found this question, but I couldn't figure out how to implement the same approach in my situation. I am working on a Windows platform, using Python 3.4.
for i in range(0, len(unique_words)):
    max_similarity = 0
    max_similarity_word = ""
    for j in range(0, len(unique_words)):
        if not i == j:
            similarity = calculate_similarity(global_map[unique_words[i]], global_map[unique_words[j]])
            if similarity > max_similarity:
                max_similarity = similarity
                max_similarity_word = unique_words[j]
    file_co_occurring.write(
        unique_words[i] + "\t" + max_similarity_word + "\t" + str(max_similarity) + "\n")
If you need an explanation for the code:

- unique_words is a list of words (strings)
- global_map is a dictionary whose keys are words (global_map.keys() contains the same elements as unique_words) and whose values are dictionaries of the format {word: value}, where the words are a subset of the values in unique_words
- for each word, I look for the most similar word based on its value in global_map. I'd prefer not to store each similarity in memory since the maps already take too much.
- calculate_similarity returns a value from 0 to 1
- the result should contain the most similar word for each of the words in unique_words (the most similar word should be different than the word itself, that's why I added the condition if not i == j, but this could also be done by checking whether max_similarity is different than 1)
- if the max_similarity for a word is 0, it's OK if the most similar word is the empty string
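To make the data shapes concrete, here is a toy version of the structures described above; the sample values and the stand-in calculate_similarity (mean agreement over shared keys) are invented for illustration and are not the asker's actual data or metric:

```python
# Toy data mirroring the described shapes: every key of global_map appears
# in unique_words, and each value is a {word: value} dictionary.
unique_words = ["cat", "dog", "fish"]
global_map = {
    "cat":  {"dog": 0.9, "fish": 0.1},
    "dog":  {"cat": 0.9, "fish": 0.2},
    "fish": {"cat": 0.1, "dog": 0.2},
}

def calculate_similarity(map1, map2):
    # Stand-in metric: mean agreement over shared keys, always in [0, 1].
    shared = set(map1) & set(map2)
    if not shared:
        return 0.0
    return sum(1 - abs(map1[k] - map2[k]) for k in shared) / len(shared)
```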
Answer 1:
Here is a solution that should work for you. I ended up changing a lot of your code so please ask if you have any questions.
This is far from the only way to accomplish this, and in particular this is not a memory efficient solution.
You will need to set max_workers to something that works for you. Usually the number of logical processors in your machine is a good starting point.
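As a rough way to pick that starting point at runtime (an addition of mine, not part of the original answer):

```python
import os

# os.cpu_count() returns the number of logical CPUs, or None if it
# cannot be determined; fall back to 1 in that case.
max_workers = os.cpu_count() or 1
```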
from concurrent.futures import ThreadPoolExecutor, Future
from itertools import permutations
from collections import namedtuple, defaultdict

Result = namedtuple('Result', ('value', 'word'))

def new_calculate_similarity(word1, word2):
    return Result(
        calculate_similarity(global_map[word1], global_map[word2]),
        word2)

with ThreadPoolExecutor(max_workers=4) as executor:
    futures = defaultdict(list)
    for word1, word2 in permutations(unique_words, r=2):
        futures[word1].append(
            executor.submit(new_calculate_similarity, word1, word2))
    for word in futures:
        # this will block until all calculations have completed for 'word'
        results = map(Future.result, futures[word])
        max_result = max(results, key=lambda r: r.value)
        print(word, max_result.word, max_result.value,
              sep='\t',
              file=file_co_occurring)
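One caveat worth adding to the answer above: since calculate_similarity is pure Python and CPU-bound, CPython's GIL keeps a ThreadPoolExecutor from actually running the computations on multiple cores; ProcessPoolExecutor is the usual drop-in replacement for that case (the submitted functions just need to be picklable, i.e. defined at module level, and on Windows the pool must be created under an `if __name__ == "__main__":` guard). A variant that also submits one task per word instead of one per pair, which cuts scheduling overhead, could look like the sketch below; the unique_words, global_map, and calculate_similarity here are toy placeholders for the asker's real ones:

```python
from concurrent.futures import ThreadPoolExecutor

# Toy placeholders (assumptions) standing in for the asker's real data
# and similarity function.
unique_words = ["cat", "dog", "fish"]
global_map = {
    "cat":  {"dog": 0.9, "fish": 0.1},
    "dog":  {"cat": 0.9, "fish": 0.2},
    "fish": {"cat": 0.1, "dog": 0.2},
}

def calculate_similarity(map1, map2):
    # Stand-in metric: mean agreement over shared keys, in [0, 1].
    shared = set(map1) & set(map2)
    if not shared:
        return 0.0
    return sum(1 - abs(map1[k] - map2[k]) for k in shared) / len(shared)

def best_match(word):
    # One task per word: scan all other words and keep the best score.
    best_value, best_word = 0.0, ""
    for other in unique_words:
        if other == word:
            continue
        value = calculate_similarity(global_map[word], global_map[other])
        if value > best_value:
            best_value, best_word = value, other
    return word, best_word, best_value

# For truly CPU-bound work, swap in ProcessPoolExecutor here (and guard
# the pool creation with `if __name__ == "__main__":` on Windows).
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(best_match, unique_words))
```

Each result tuple is (word, most_similar_word, similarity), so the rows can be written out in the same tab-separated format as the original loop.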
Here are the docs for the libraries I used:

- concurrent.futures
- collections
- itertools
Source: https://stackoverflow.com/questions/29217088/parallelize-a-nested-for-loop-in-python-for-finding-the-max-value