Vectorizing or Speeding up Fuzzywuzzy String Matching on PANDAS Column

孤城傲影 2020-12-18 07:57

I am trying to look for potential matches in a pandas column full of organization names. I am currently using iterrows(), but it is extremely slow on a dataframe with ~70,000 rows. How can I vectorize or otherwise speed this up?

3 Answers
  • 2020-12-18 08:42

    This solution leverages apply() and should give a reasonable performance improvement. Feel free to play around with the scorer and change the threshold to meet your needs:

    import pandas as pd, numpy as np
    from fuzzywuzzy import process, fuzz
    
    df = pd.DataFrame([['cliftonlarsonallen llp minneapolis MN'],
            ['loeb and troper llp newyork NY'],
            ["dauby o'connor and zaleski llc carmel IN"],
            ['wegner cpas llp madison WI']],
            columns=['org_name'])
    
    org_list = df['org_name']
    
    threshold = 40
    
    def find_match(x):
        # extract the top 2 matches; the best one is the string itself,
        # so take the second-best entry
        match = process.extract(x, org_list, limit=2, scorer=fuzz.partial_token_sort_ratio)[1]
        # keep the match only if its score clears the threshold
        match = match if match[1] > threshold else np.nan
        return match
    
    df['match found'] = df['org_name'].apply(find_match)
    

    Returns (each match is a (matched string, score, index) tuple):

                                       org_name                                     match found
    0     cliftonlarsonallen llp minneapolis MN             (wegner cpas llp madison WI, 50, 3)
    1            loeb and troper llp newyork NY             (wegner cpas llp madison WI, 46, 3)
    2  dauby o'connor and zaleski llc carmel IN                                             NaN
    3                wegner cpas llp madison WI  (cliftonlarsonallen llp minneapolis MN, 50, 0)
    

    If you would just like to return the matching string itself, then you can modify as follows:

    match = match[0] if match[1] > threshold else np.nan
    

    I've also added @user3483203's list-comprehension suggestion as an alternative option:

    df['match found'] = [find_match(row) for row in df['org_name']]
    

    Note that process.extract() is designed to handle a single query string and apply the passed scoring algorithm to that query against all of the supplied match options. Since each of your 70,000 rows is evaluated as a query against all 70,000 options (the way your code is currently set up), you end up performing len(match_options)**2, i.e. 4,900,000,000, string comparisons. I therefore think the best performance improvements could be achieved by limiting the potential match options via more extensive logic in the find_match() function, e.g. enforcing that the match options start with the same letter as the query; a sketch of that idea follows.
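
    As a minimal sketch of that candidate-limiting idea (the bucketing helper and names are mine, not part of the original answer), you could group the names by first letter once and only fuzzy-match within each bucket:

    import pandas as pd, numpy as np
    from fuzzywuzzy import process, fuzz
    
    df = pd.DataFrame([['cliftonlarsonallen llp minneapolis MN'],
            ['loeb and troper llp newyork NY'],
            ["dauby o'connor and zaleski llc carmel IN"],
            ['wegner cpas llp madison WI']],
            columns=['org_name'])
    
    threshold = 40
    
    # build the buckets once: first letter -> list of names starting with it
    buckets = {letter: grp.tolist()
               for letter, grp in df['org_name'].groupby(df['org_name'].str[0])}
    
    def find_match_limited(x):
        # only compare against names sharing x's first letter, excluding x itself
        candidates = [c for c in buckets.get(x[0], []) if c != x]
        if not candidates:
            return np.nan
        match = process.extractOne(x, candidates, scorer=fuzz.partial_token_sort_ratio)
        return match if match and match[1] > threshold else np.nan
    
    df['match found'] = df['org_name'].apply(find_match_limited)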

  • Using iterrows() is not recommended on dataframes; you could use apply() instead, but that alone probably wouldn't speed things up by much. What is slow is fuzzywuzzy's extract method, where your input is compared with all 70k rows (string-distance methods are computationally expensive). So if you intend to stick with fuzzywuzzy, one solution would be to limit your search, for example to only those entries with the same first letter. Or, if you have another column in your data that could be used as a hint (State, City, ...), restrict the comparison to rows sharing that value; a sketch of the latter follows.
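
    As a minimal sketch of the hint-column idea, assuming a hypothetical state column exists alongside org_name (the column name and sample data are mine, not from the answer):

    import pandas as pd, numpy as np
    from fuzzywuzzy import process, fuzz
    
    # hypothetical data: the 'state' column serves as a blocking key
    df = pd.DataFrame({'org_name': ['cliftonlarsonallen llp minneapolis MN',
                                    'wegner cpas llp madison WI',
                                    'loeb and troper llp newyork NY'],
                       'state': ['MN', 'WI', 'NY']})
    
    def find_match_by_state(row):
        # only compare against organizations in the same state
        candidates = df.loc[(df['state'] == row['state']) &
                            (df['org_name'] != row['org_name']), 'org_name'].tolist()
        if not candidates:
            return np.nan
        return process.extractOne(row['org_name'], candidates, scorer=fuzz.WRatio)
    
    df['match found'] = df.apply(find_match_by_state, axis=1)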

  • 2020-12-18 08:49

    Given your task, you are comparing 70k strings with each other using fuzz.WRatio, so you have a total of 4,900,000,000 comparisons, each of which uses the Levenshtein distance internally, an O(N*M) operation. fuzz.WRatio is a combination of multiple different string-matching ratios with different weights. It selects the best ratio among them, so it even has to calculate the Levenshtein distance multiple times. One goal should therefore be to reduce the search space by excluding some possibilities using a much faster matching algorithm.

    Another issue is that the strings are preprocessed to remove punctuation and to lowercase them. While this is required for the matching (so that e.g. an uppercased word becomes equal to a lowercased one), we can do it ahead of time, so the 70k strings only have to be preprocessed once. I will use RapidFuzz instead of FuzzyWuzzy here, since it is quite a bit faster (I am the author).

    The following version performs more than 10 times as fast as your previous solution in my experiments and applies the following improvements:

    1) it generates a dict mapping the organisations to the preprocessed organisations, so this does not have to be done in each run

    2) it passes a score_cutoff to extractOne so it can skip calculations where it already knows they cannot reach this ratio

    import pandas as pd, numpy as np
    from rapidfuzz import process, utils
    
    # sample data (same as in the first answer); in practice df is your
    # ~70k-row dataframe
    df = pd.DataFrame([['cliftonlarsonallen llp minneapolis MN'],
            ['loeb and troper llp newyork NY'],
            ["dauby o'connor and zaleski llc carmel IN"],
            ['wegner cpas llp madison WI']],
            columns=['org_name'])
    
    org_list = df['org_name']
    
    # preprocess (lowercase, strip non-alphanumeric) every string once, up front
    processed_orgs = {org: utils.default_process(org) for org in org_list}
    
    for i, (query, processed_query) in enumerate(processed_orgs.items()):
        # compare against every organisation except the query itself
        choices = processed_orgs.copy()
        del choices[query]
        # processor=None since the choices are already preprocessed;
        # score_cutoff lets extractOne skip hopeless candidates early
        match = process.extractOne(processed_query, choices,
                                   processor=None, score_cutoff=93)
        if match:
            # for a dict of choices, extractOne returns (choice, score, key)
            df.loc[i, 'fuzzy_match'] = match[2]
            df.loc[i, 'fuzzy_match_score'] = match[1]
    

    Here is a list of the most relevant improvements in RapidFuzz that make it faster than FuzzyWuzzy in this example:

    1) It is implemented fully in C++, while a big part of FuzzyWuzzy is implemented in Python.

    2) When calculating the Levenshtein distance it takes the score_cutoff into account to exit early when the score cannot be reached. This way it can exit in O(1) when the length difference between the strings is too big, or in O(N) when there are too many uncommon characters between the two strings, whereas actually calculating the Levenshtein distance has a time complexity of O(N*M) (see the sketch after this list).

    3) fuzz.WRatio combines the results of multiple other string-matching algorithms like fuzz.ratio, fuzz.token_sort_ratio and fuzz.token_set_ratio, and takes the maximum result after weighting them. While fuzz.ratio has a weighting of 1, fuzz.token_sort_ratio and fuzz.token_set_ratio have one of 0.95. When the score_cutoff is bigger than 95, fuzz.token_sort_ratio and fuzz.token_set_ratio are not calculated anymore, since their results are guaranteed to be smaller than the score_cutoff.

    4) Since extractOne only searches for the best match, it uses the ratio of the current best match as the score_cutoff for the following elements. This way it can often quickly discard elements by using the improvements to the Levenshtein distance calculation from 2).
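
    To illustrate the early exit from 2), here is a minimal sketch (my own example, not from the answer): passing score_cutoff to a RapidFuzz scorer makes it return 0 as soon as the cutoff is provably unreachable:

    from rapidfuzz import fuzz
    
    a = 'cliftonlarsonallen llp minneapolis MN'
    b = 'wegner cpas llp madison WI'
    
    # full calculation: the actual (low) similarity ratio
    print(fuzz.ratio(a, b))
    
    # with a cutoff of 90 the score is provably unreachable, e.g. from the
    # length difference alone, so the scorer bails out early and returns 0
    print(fuzz.ratio(a, b, score_cutoff=90))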
