How to do multiprocessing in python on 2m rows running fuzzywuzzy string matching logic? Current code is extremely slow

Submitted on 2019-12-12 06:30:45

Question


I am new to Python and I'm running fuzzywuzzy string-matching logic on a list with 2 million records. The code works and produces output, but it is extremely slow: in 3 hours it has processed only 80 rows. I want to speed things up by processing multiple rows at once.

If it helps - I am running it on a machine with 16 GB of RAM and a 1.9 GHz dual-core CPU.

Below is the code I'm running.

from fuzzywuzzy import process
import pandas as pd

d = []
n = len(Africa_Company)  # original list with 2m string records
for i in range(n - 1):   # start at 0 so the first record is matched too
    word = Africa_Company[i]
    choices = Africa_Company[i+1:n]  # only compare against later entries
    output = None  # so a failed match doesn't reuse the previous result
    try:
        output = process.extractOne(str(word), choices, score_cutoff=85)
    except Exception:
        print (word) #to identify which string is throwing an exception
    print (i) #to know how many rows are processed, can do without this also
    if output:
        d.append({'Company':Africa_Company[i], 
                  'NewCompany':output[0],
                  'Score':output[1], 
                  'Region':'Africa'})
    else:
        d.append({'Company':Africa_Company[i], 
                  'NewCompany':None,
                  'Score':None, 
                  'Region':'Africa'})


Africa_Corrected = pd.DataFrame(d) #output data in a pandas dataframe

Thanks in advance!


Answer 1:


This is a CPU-bound problem. By going parallel you can speed it up by a factor of two at most (because you have two cores). What you really should do is speed up the single-threaded performance. Levenshtein distance is quite slow, so there are lots of opportunities to speed things up.

  1. Use pruning. Don't try to run the full fuzzywuzzy match between two strings if there is no way it can produce a good result. Try to find a simple linear-time check (comparing string lengths, for example) that filters out irrelevant choices before the fuzzywuzzy match.
  2. Consider indexing. Is there some way you can index your list? For example: if your matching is based on whole words, create a hashmap that maps words to strings. Only try to match against choices that have at least one word in common with your current string. A sketch combining pruning and indexing follows after this list.
  3. Preprocessing. Is there work done on the strings in every match that you could do once up front? If, for example, your Levenshtein implementation starts by creating sets out of your strings, consider creating all the sets first so you don't redo the same work in every match.
  4. Is there a better algorithm to use? Maybe Levenshtein distance is not the best algorithm to begin with.
  5. Is the implementation of Levenshtein distance you're using optimal? This goes back to step 3 (preprocessing). Are there other things you can do to speed up the runtime?
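
To make steps 1 and 2 concrete, here is a minimal sketch (not the asker's code) that builds a word-based inverted index and adds a length-based prune. The helper names build_word_index and best_match are made up for illustration, and the prune assumes fuzz.ratio as the scorer, whose score cannot exceed 200*min(len_a, len_b)/(len_a + len_b); the default WRatio scorer does not obey this bound, and the raw-string lengths used here ignore the normalisation that extractOne's default processor applies, so treat the cutoff check as a heuristic.

from collections import defaultdict
from fuzzywuzzy import fuzz, process

def build_word_index(companies):
    """Map each lower-cased word to the indices of the strings containing it."""
    index = defaultdict(set)
    for i, name in enumerate(companies):
        for w in str(name).lower().split():
            index[w].add(i)
    return index

def best_match(i, companies, index, cutoff=85):
    """Match companies[i] only against later strings that share a word."""
    query = str(companies[i])
    qlen = len(query)
    candidates = set()
    for w in query.lower().split():
        candidates |= index[w]
    choices = []
    for j in candidates:
        if j <= i:
            continue  # preserve the original 'later entries only' behaviour
        c = str(companies[j])
        # Pruning: even a perfect character overlap cannot reach the cutoff
        # when the lengths differ too much (upper bound of fuzz.ratio).
        if 200 * min(qlen, len(c)) / (qlen + len(c)) < cutoff:
            continue
        choices.append(c)
    if not choices:
        return None
    return process.extractOne(query, choices,
                              scorer=fuzz.ratio, score_cutoff=cutoff)

Building the index is a single pass over all the words; each row then scores only the handful of candidates that share a word with it instead of all remaining 2 million strings.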

Multiprocessing will only speed things up by a constant factor (depending on the number of cores), whereas indexing can take you to a lower complexity class! So focus on pruning and indexing first, then on steps 3-5. Only when you have squeezed enough out of those steps should you consider multiprocessing; a sketch of what that could look like is below.
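
If the per-row work is still the bottleneck after all of that, a multiprocessing.Pool can spread the rows across cores. This is a minimal sketch only (match_row and _init are made-up names); it reproduces the original quadratic scan, so on the asker's dual-core machine the best case is roughly a 2x speed-up unless it is combined with the pruning and indexing above.

import multiprocessing as mp
from fuzzywuzzy import process

_companies = None  # populated in each worker by the initializer

def _init(companies):
    global _companies
    _companies = companies

def match_row(i):
    """Reproduce one iteration of the original loop for row i."""
    word = str(_companies[i])
    output = process.extractOne(word, _companies[i+1:], score_cutoff=85)
    if output:
        return {'Company': _companies[i], 'NewCompany': output[0],
                'Score': output[1], 'Region': 'Africa'}
    return {'Company': _companies[i], 'NewCompany': None,
            'Score': None, 'Region': 'Africa'}

if __name__ == '__main__':  # required for multiprocessing on Windows
    with mp.Pool(processes=2, initializer=_init,
                 initargs=(Africa_Company,)) as pool:
        d = pool.map(match_row, range(len(Africa_Company) - 1),
                     chunksize=100)

Passing the list through the initializer gives each worker one copy instead of re-pickling 2 million strings for every task; a larger chunksize reduces inter-process overhead further.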



Source: https://stackoverflow.com/questions/41571358/how-to-do-multiprocessing-in-python-on-2m-rows-running-fuzzywuzzy-string-matchin
