Python Fuzzy matching strings in list performance

日久生厌 2021-01-05 12:26

I'm checking whether there are similar results (fuzzy matches) across 4 columns of the same dataframe, and I have the following code as an example. When I apply it to the real 40.000 rows x

1 Answer
  • 2021-01-05 13:09

    Most of the speed-up comes from replacing explicit Python loops with a single merge and column-wise operations.

    Import the necessary packages

    from fuzzywuzzy import fuzz
    import pandas as pd
    import numpy as np
    

    Creating dataframe from first list

    dataframecolumn = pd.DataFrame(["apple","tb"])
    dataframecolumn.columns = ['Match']
    

    Creating dataframe from second list

    compare = pd.DataFrame(["adfad","apple","asple","tab"])
    compare.columns = ['compare']
    

    Merge: build the Cartesian product by giving both frames a constant key (a cross join), then drop the pairs that are exact matches

    dataframecolumn['Key'] = 1
    compare['Key'] = 1
    combined_dataframe = dataframecolumn.merge(compare,on="Key",how="left")
    combined_dataframe = combined_dataframe[~(combined_dataframe.Match==combined_dataframe.compare)]
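
The constant-key merge above can also be written as a cross join. This is a minimal sketch, assuming pandas 1.2 or newer (where `merge` accepts `how="cross"`); the names `left`, `right`, and `pairs` are illustrative and not from the original post.

```python
import pandas as pd

left = pd.DataFrame({"Match": ["apple", "tb"]})
right = pd.DataFrame({"compare": ["adfad", "apple", "asple", "tab"]})

# Pair every row of `left` with every row of `right`; no helper "Key" column is needed.
pairs = left.merge(right, how="cross")

# Drop pairs whose two strings are identical, as the original filter does.
pairs = pairs[pairs["Match"] != pairs["compare"]].reset_index(drop=True)
print(pairs)
```

With 2 strings on one side and 4 on the other there are 8 pairs, and removing the identical apple/apple pair leaves 7.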
    

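    Vectorization (note that np.vectorize is a convenience wrapper: it still calls the Python function once per pair)

    def partial_match(x, y):
        # fuzz.ratio scores the two full strings from 0 to 100
        # (despite the function name, this is not fuzz.partial_ratio)
        return fuzz.ratio(x, y)
    partial_match_vector = np.vectorize(partial_match)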
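
It may help to see what `np.vectorize` does with a scorer like this. The NumPy documentation describes it as provided primarily for convenience, not performance, with an implementation that is essentially a for loop. The sketch below compares it with a plain list comprehension. To keep it runnable without fuzzywuzzy, it uses a stand-in scorer `ratio` built on the standard library's `difflib.SequenceMatcher`; that stand-in and the sample lists are assumptions for illustration, not part of the original answer.

```python
from difflib import SequenceMatcher

import numpy as np

def ratio(a, b):
    # Stand-in scorer (assumption): similarity of the two full strings, scaled to 0-100.
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))

left = ["apple", "tb"]
right = ["asple", "tab"]

# np.vectorize broadcasts the inputs and calls `ratio` once per element pair.
vectorized_scores = np.vectorize(ratio)(left, right).tolist()

# The same work written as an explicit Python loop.
loop_scores = [ratio(a, b) for a, b in zip(left, right)]

print(vectorized_scores, loop_scores)
```

Both approaches give the same scores, so any further speed-up has to come from a faster scorer or from scoring fewer pairs.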
    

    Apply the vectorized function to both columns and keep only the pairs whose score meets the threshold

    combined_dataframe['score']=partial_match_vector(combined_dataframe['Match'],combined_dataframe['compare'])
    combined_dataframe = combined_dataframe[combined_dataframe.score>=80]
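
One caveat for larger inputs is that the Cartesian product grows with the product of the two lengths, so the merged frame may not fit in memory. A possible workaround, sketched here under assumptions, is to score the left frame in chunks and keep only the rows above the threshold. The helper `fuzzy_pairs`, its `chunk_size` parameter, and the `difflib`-based stand-in scorer are hypothetical and not from the original answer; `how="cross"` assumes pandas 1.2 or newer.

```python
from difflib import SequenceMatcher

import pandas as pd

def ratio(a, b):
    # Stand-in scorer (assumption): similarity of the two full strings, scaled to 0-100.
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))

def fuzzy_pairs(left, right, threshold=80, chunk_size=1000):
    """Cross join `left` and `right` chunk by chunk, keeping pairs scoring >= threshold."""
    kept = []
    for start in range(0, len(left), chunk_size):
        chunk = left.iloc[start:start + chunk_size]
        pairs = chunk.merge(right, how="cross")
        # Skip identical strings, as in the original filter.
        pairs = pairs[pairs["Match"] != pairs["compare"]].copy()
        pairs["score"] = [ratio(a, b) for a, b in zip(pairs["Match"], pairs["compare"])]
        kept.append(pairs[pairs["score"] >= threshold])
    return pd.concat(kept, ignore_index=True)

left = pd.DataFrame({"Match": ["apple", "tb"]})
right = pd.DataFrame({"compare": ["adfad", "apple", "asple", "tab"]})
result = fuzzy_pairs(left, right, threshold=80, chunk_size=1)
print(result)
```

Only one chunk's worth of pairs is held in memory at a time, at the cost of one merge per chunk.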
    

    Results

    +-------+-----+---------+-------+
    | Match | Key | compare | score |
    +-------+-----+---------+-------+
    | apple | 1   | asple   | 80    |
    | tb    | 1   | tab     | 80    |
    +-------+-----+---------+-------+
    