Is it possible to do a fuzzy match merge with Python pandas?

[愿得一人] 2020-11-22 01:17

I have two DataFrames which I want to merge based on a column. However, due to alternate spellings, different number of spaces, absence/presence of diacritical marks, I woul

11 answers
  •  北恋 (original poster)
    2020-11-22 01:46

    I used the fuzzymatcher package and it worked well for me. Visit this link for more details.

    Use the command below to install it:

    pip install fuzzymatcher
    

    Below is the sample code (already submitted by RobinL above):

    from fuzzymatcher import link_table, fuzzy_left_join
    
    # Columns to match on from df_left
    left_on = ["fname", "mname", "lname",  "dob"]
    
    # Columns to match on from df_right
    right_on = ["name", "middlename", "surname", "date"]
    
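    # The link table potentially contains several matches for each record.
    # df_left and df_right are the two pandas DataFrames being linked.
    # link_table is called directly because it was imported by name above.
    link_table(df_left, df_right, left_on, right_on)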
    

    Errors you may get:

    1. ZeroDivisionError: float division by zero. Refer to this link to resolve it.
    2. OperationalError: no such module: fts4. Download sqlite3.dll from here and replace the DLL file in your Python or Anaconda DLLs folder.

    Pros:

    1. It is fast. In my case, I compared one dataframe with 3,000 rows against another dataframe with 170,000 records. It also uses SQLite3 search across text. So faster than many
    2. It can match across multiple columns of the two dataframes. In my case, I was looking for the closest match based on address and company name. Sometimes the company name is the same, so the address is worth checking too.
    3. It gives you a score for all the closest matches for the same record. You choose the cutoff score.
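To make points 2 and 3 concrete, here is a minimal sketch of the idea of scoring candidates over several columns and keeping only those above a cutoff. It does not use fuzzymatcher and is not how that package computes its scores; it swaps in `difflib.SequenceMatcher` from the standard library. The helper names (`similarity`, `scored_matches`), the sample records, and the 0.8 cutoff are all made up for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio, ignoring case and repeated spaces."""
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def scored_matches(record, candidates, fields, cutoff=0.8):
    """Score each candidate as the mean similarity over `fields`.

    Hypothetical helper: returns (score, candidate) pairs whose score is
    at least `cutoff`, best match first.
    """
    results = []
    for cand in candidates:
        score = sum(similarity(record[f], cand[f]) for f in fields) / len(fields)
        if score >= cutoff:
            results.append((round(score, 3), cand))
    results.sort(key=lambda pair: pair[0], reverse=True)
    return results


# Illustrative data only: one left-hand record and three right-hand candidates.
left = {"company": "Acme Corp", "address": "12 Main Street"}
right = [
    {"company": "ACME  Corp", "address": "12 Main St"},
    {"company": "Acme Corp", "address": "99 Harbour Road"},
    {"company": "Zenith Ltd", "address": "5 Elm Avenue"},
]

matches = scored_matches(left, right, ["company", "address"], cutoff=0.8)
for score, cand in matches:
    print(score, cand)
```

The second candidate has an identical company name but a different address, so averaging over both columns pushes it below the cutoff. That is the situation described in point 2. For dataframes of the sizes mentioned above, a pure-Python loop like this would likely be much slower than an indexed approach.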

    Cons:

    1. The original package installation is buggy.
    2. It requires C++ and Visual Studio to be installed.
    3. It won't work with 64-bit Anaconda/Python.
