Similarity between 2 dataframe columns

Posted by 爷,独闯天下 on 2021-02-08 07:56:07

Question


I have two dataframes, and each has a column called Song. However, the songs are sometimes spelled differently. How can I use difflib (or something similar) to get the Song spelling from one dataframe into a new column of the other dataframe?

Example:

Dataframe1

Song           Artist

like a virgi   madonna


Dataframe2

Song          Rank

like a virgin  2


Result

Song            Artist    SongAlt

like a virgin   Madonna   like a virgi

Answer 1:


Step 1: Merge whatever can be merged

In [67]: df1
Out[67]: 
           Song    Artist
0        mysong  myartist
1  like a virgi   madonna

In [68]: df2
Out[68]: 
            Song  Rank
0         mysong     1
1  like a virgin     2

In [69]: merged = pd.merge(df1, df2, on='Song')

In [70]: merged
Out[70]: 
     Song    Artist  Rank
0  mysong  myartist     1

Step 2: Find out what's remaining

In [71]: unmerged = df2[~df2.isin(merged)].dropna()

In [72]: unmerged
Out[72]: 
            Song  Rank
1  like a virgin   2.0
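
A note on Step 2: when `DataFrame.isin` is given another DataFrame, it compares values only where both the index and the column labels line up, so the expression above works here because the matched row happens to sit at index 0 in both frames. It also turns Rank into a float, because the masked cells become NaN before `dropna`. A possibly more robust sketch, assuming the same sample data as above, filters on the Song column alone:

```python
import pandas as pd

# Sample frames rebuilt from the session above
df1 = pd.DataFrame({"Song": ["mysong", "like a virgi"],
                    "Artist": ["myartist", "madonna"]})
df2 = pd.DataFrame({"Song": ["mysong", "like a virgin"],
                    "Rank": [1, 2]})

merged = pd.merge(df1, df2, on="Song")

# Keep the df2 rows whose Song did not find an exact match.
# Series.isin compares values only, so the index positions do not matter.
unmerged = df2[~df2["Song"].isin(merged["Song"])].copy()
print(unmerged)
```

The `.copy()` is there so that adding columns to `unmerged` later does not trigger a chained-assignment warning.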

Step 3: Use difflib's get_close_matches to get the closest match

In [73]: songs = list(df1['Song'].unique())

In [74]: def closest(a):
    ...:     try:
    ...:         return difflib.get_close_matches(a, songs)[0]
    ...:     except IndexError:
    ...:         return "Not Found"

In [75]: unmerged['closest_song'] = unmerged.apply(lambda row: closest(row['Song']), axis=1)

In [76]: unmerged
Out[76]: 
            Song  Rank  closest_song
1  like a virgin   2.0  like a virgi
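
`get_close_matches` also accepts `n` (maximum number of matches, default 3) and `cutoff` (minimum similarity, default 0.6). The sketch below is a variant of `closest` that exposes the cutoff and returns `None` instead of the "Not Found" string; the function signature and the sample titles are illustrative assumptions, not part of the original answer.

```python
import difflib

songs = ["mysong", "like a virgi"]

def closest(title, candidates, cutoff=0.6):
    """Return the best candidate at or above `cutoff`, or None if none qualifies."""
    matches = difflib.get_close_matches(title, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(closest("like a virgin", songs))               # a near-identical title
print(closest("bohemian rhapsody", songs))           # nothing similar in the list
print(closest("like a virgin", songs, cutoff=0.97))  # stricter threshold
```

Raising the cutoff reduces false matches between unrelated titles, at the cost of missing heavier misspellings.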

Step 4: Get the similarity percentage if you want

In [77]: def similar(a, b):
    ...:     return difflib.SequenceMatcher(None, a, b).ratio()

In [78]: unmerged['Similarity'] = unmerged.apply(lambda row: similar(row['closest_song'], row['Song']), axis=1)

In [79]: unmerged
Out[79]: 
            Song  Rank  closest_song  Similarity
1  like a virgin   2.0  like a virgi        0.96
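
For reference, `SequenceMatcher.ratio()` is defined as 2*M/T, where M is the number of matched characters and T is the combined length of both strings. A small check of the 0.96 figure, using the two titles from the example:

```python
import difflib

a, b = "like a virgi", "like a virgin"
matcher = difflib.SequenceMatcher(None, a, b)

# M: total size of all matching blocks (the final dummy block has size 0)
matched = sum(block.size for block in matcher.get_matching_blocks())

# T: combined length of both strings
total = len(a) + len(b)

ratio_by_hand = 2 * matched / total
print(matched, total, ratio_by_hand)
```

Here all 12 characters of the shorter title match, so the ratio is 24/25.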

Step 5: Merge using the closest values

In [80]: unmerged.rename(columns={'Song': 'Old_Song', 'closest_song': 'Song'}, inplace=True)

In [81]: new = unmerged.merge(df1, on='Song')[['Song', 'Artist', 'Rank']]

In [82]: new
Out[82]: 
           Song   Artist  Rank
0  like a virgi  madonna   2.0

In [83]: pd.concat([merged, new])
Out[83]: 
           Song    Artist  Rank
0        mysong  myartist   1.0
0  like a virgi   madonna   2.0
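
If the goal is the layout from the question (the df2 spelling in Song and the df1 spelling in a new SongAlt column), a more direct route may be to map every df1 title to its closest df2 title in one pass. This is a minimal sketch under that assumption; `closest_title` is a hypothetical helper name, and the default cutoff of 0.6 is used.

```python
import difflib
import pandas as pd

df1 = pd.DataFrame({"Song": ["mysong", "like a virgi"],
                    "Artist": ["myartist", "madonna"]})
df2 = pd.DataFrame({"Song": ["mysong", "like a virgin"],
                    "Rank": [1, 2]})

candidates = df2["Song"].unique().tolist()

def closest_title(title):
    """Return the closest df2 spelling of `title`, or None if nothing is close."""
    matches = difflib.get_close_matches(title, candidates, n=1)
    return matches[0] if matches else None

# Keep the df1 spelling as SongAlt and look up the df2 spelling as Song
result = df1.rename(columns={"Song": "SongAlt"})
result["Song"] = result["SongAlt"].map(closest_title)
result = result[["Song", "Artist", "SongAlt"]]

# Optionally bring in Rank from df2 now that the spellings agree
result = result.merge(df2, on="Song", how="left")
print(result)
```

Exact matches simply map to themselves, so the separate exact-merge step is not needed in this variant. Titles with no close match get `None` in Song and NaN in Rank.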


Source: https://stackoverflow.com/questions/50560174/similarity-between-2-dataframe-columns
