Question
I tried searching for an answer on SO but didn't find anything helpful.
Here is what I'm trying to do:
I have a dataframe (here is a small example of it):
df = pd.DataFrame([[1, 5, 'AADDEEEEIILMNORRTU'], [2, 5, 'AACEEEEGMMNNTT'], [3, 5, 'AAACCCCEFHIILMNNOPRRRSSTTUUY'], [4, 5, 'DEEEGINOOPRRSTY'], [5, 5, 'AACCDEEHHIIKMNNNNTTW'], [6, 5, 'ACEEHHIKMMNSSTUV'], [7, 5, 'ACELMNOOPPRRTU'], [8, 5, 'BIT'], [9, 5, 'APR'], [10, 5, 'CDEEEGHILLLNOOST'], [11, 5, 'ACCMNO'], [12, 5, 'AIK'], [13, 5, 'CCHHLLOORSSSTTUZ'], [14, 5, 'ANNOSXY'], [15, 5, 'AABBCEEEEHIILMNNOPRRRSSTUUVY']],columns=['PartnerId','CountryId','Name'])
My goal is to find the PartnerIds whose Name is similar to at least a certain ratio. Additionally, I only want to compare PartnerIds that have the same CountryId. The matching PartnerIds should be appended to a list and finally written to a new column in the dataframe.
Here is my try:
from difflib import SequenceMatcher

itemDict = {item[0]: {'CountryId': item[1], 'Name': item[2]} for item in df.values}

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

def calculate_similarity(x, itemDict):
    own_name = x['Name']
    country_id = x['CountryId']
    matching_ids = []
    for k, v in itemDict.items():
        if k != x['PartnerId']:
            if v['CountryId'] == country_id:
                ratio = similar(own_name, v['Name'])
                if ratio > 0.7:
                    matching_ids.append(k)
    return matching_ids

df['Similar_IDs'] = df.apply(lambda x: calculate_similarity(x, itemDict), axis=1)
print(df)
The output is:
    PartnerId  CountryId                          Name  Similar_IDs
0           1          5            AADDEEEEIILMNORRTU           []
1           2          5                AACEEEEGMMNNTT           []
2           3          5  AAACCCCEFHIILMNNOPRRRSSTTUUY         [15]
3           4          5               DEEEGINOOPRRSTY         [10]
4           5          5          AACCDEEHHIIKMNNNNTTW           []
5           6          5              ACEEHHIKMMNSSTUV           []
6           7          5                ACELMNOOPPRRTU           []
7           8          5                           BIT           []
8           9          5                           APR           []
9          10          5              CDEEEGHILLLNOOST          [4]
10         11          5                        ACCMNO           []
11         12          5                           AIK           []
12         13          5              CCHHLLOORSSSTTUZ           []
13         14          5                       ANNOSXY           []
14         15          5  AABBCEEEEHIILMNNOPRRRSSTUUVY          [3]
My questions now are:
1.) Is there a more efficient way to compute it? I have about 20,000 rows now and expect a lot more in the near future.
2.) Is it possible to get "rid" of the itemDict and do it directly from the dataframe?
3.) Is another distance measure maybe better to use?
Thanks a lot for your help!
Answer 1:
You can use the difflib module. First, build the cartesian product of all names within each country by joining the table to itself on CountryId with an outer join, then drop the rows that pair a partner with itself:
cols = ['Name', 'CountryId', 'PartnerId']
df = df[cols].merge(df[cols], on='CountryId', how='outer')
df = df.query('PartnerId_x != PartnerId_y')
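To see what the self-join produces, here is a minimal sketch on a tiny made-up frame (the `toy` data and the `pairs` name are illustrative and not part of the question). Each partner ends up paired with every other partner of the same country, in both directions:

```python
import pandas as pd

# Hypothetical toy data: two partners in country 5, one in country 7.
toy = pd.DataFrame(
    [[1, 5, 'ABC'], [2, 5, 'ABD'], [3, 7, 'XYZ']],
    columns=['PartnerId', 'CountryId', 'Name'],
)

cols = ['Name', 'CountryId', 'PartnerId']
# Joining on CountryId pairs every row with every row of the same country.
pairs = toy[cols].merge(toy[cols], on='CountryId', how='outer')
# Drop the rows where a partner is paired with itself.
pairs = pairs.query('PartnerId_x != PartnerId_y')

# Partner 3 has no other partner in its country, so it disappears here.
print(pairs[['PartnerId_x', 'Name_x', 'PartnerId_y', 'Name_y']])
```

Note that the joined frame grows with the square of the group size, so a country with n partners contributes n * (n - 1) rows after the filter.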
In the next step you can apply the function from this answer and filter out all matches:
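from difflib import SequenceMatcher

def match(x):
    # Compare the two names by column label; x[1] would be CountryId
    return SequenceMatcher(None, x['Name_x'], x['Name_y']).ratio()

is_match = df.apply(match, axis=1) > 0.7
df.loc[is_match, ['PartnerId_x', 'Name_x', 'PartnerId_y']]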
Output:
     PartnerId_x                        Name_x  PartnerId_y
44             3  AAACCCCEFHIILMNNOPRRRSSTTUUY           15
54             4               DEEEGINOOPRRSTY           10
138           10              CDEEEGHILLLNOOST            4
212           15  AABBCEEEEHIILMNNOPRRRSSTUUVY            3
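The question asked for the matches as a list in a new column. One possible way to get there from the pair table is to group the matching pairs by PartnerId_x and map the result back onto the original frame. The sketch below uses five rows from the question's example; the names `pairs`, `ratios`, `matches` and `similar` are my own and only illustrative:

```python
from difflib import SequenceMatcher
import pandas as pd

# A few rows taken from the question's example frame.
df = pd.DataFrame(
    [[3, 5, 'AAACCCCEFHIILMNNOPRRRSSTTUUY'],
     [4, 5, 'DEEEGINOOPRRSTY'],
     [8, 5, 'BIT'],
     [10, 5, 'CDEEEGHILLLNOOST'],
     [15, 5, 'AABBCEEEEHIILMNNOPRRRSSTUUVY']],
    columns=['PartnerId', 'CountryId', 'Name'],
)

cols = ['Name', 'CountryId', 'PartnerId']
pairs = df[cols].merge(df[cols], on='CountryId', how='outer')
pairs = pairs.query('PartnerId_x != PartnerId_y')

# Score each pair of names, addressing the columns by label.
ratios = pairs.apply(
    lambda row: SequenceMatcher(None, row['Name_x'], row['Name_y']).ratio(),
    axis=1,
)
matches = pairs[ratios > 0.7]

# Collect the matching ids per partner; partners without a match get [].
similar = matches.groupby('PartnerId_x')['PartnerId_y'].agg(list)
df['Similar_IDs'] = df['PartnerId'].map(similar).apply(
    lambda v: v if isinstance(v, list) else []
)
print(df[['PartnerId', 'Similar_IDs']])
```

This also answers the second question in passing: no separate itemDict is needed, since everything is derived from the dataframe itself.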
If you don't have enough memory, you can instead iterate over the rows of the joined data frame:
lst = []
for idx, row in df.iterrows():
    if SequenceMatcher(None, row['Name_x'], row['Name_y']).ratio() > 0.7:
        lst.append(row[['PartnerId_x', 'Name_x', 'PartnerId_y']])
pd.concat(lst, axis=1).T
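If even the joined frame is too large, a variation that never materializes the pair table might help. The sketch below is not from the answer above: it walks each CountryId group with itertools.combinations and uses difflib's real_quick_ratio() and quick_ratio(), which are documented as cheap upper bounds on ratio(), to skip hopeless pairs early. It scores each unordered pair once and records it in both directions, which assumes the score is symmetric; difflib notes that ratio() can depend on argument order, so treat that as an approximation. The function name `similar_ids` and the `threshold` parameter are hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations
import pandas as pd

def similar_ids(df, threshold=0.7):
    """Map each PartnerId to similar PartnerIds in the same country."""
    result = defaultdict(list)
    for _, group in df.groupby('CountryId'):
        records = list(zip(group['PartnerId'].tolist(), group['Name'].tolist()))
        for (id_a, name_a), (id_b, name_b) in combinations(records, 2):
            sm = SequenceMatcher(None, name_a, name_b)
            # Cheap upper bounds first; the exact ratio only if both pass.
            if (sm.real_quick_ratio() > threshold
                    and sm.quick_ratio() > threshold
                    and sm.ratio() > threshold):
                # Assumption: the score is treated as symmetric.
                result[id_a].append(id_b)
                result[id_b].append(id_a)
    return result

# A few rows taken from the question's example frame.
df = pd.DataFrame(
    [[3, 5, 'AAACCCCEFHIILMNNOPRRRSSTTUUY'],
     [4, 5, 'DEEEGINOOPRRSTY'],
     [8, 5, 'BIT'],
     [10, 5, 'CDEEEGHILLLNOOST'],
     [15, 5, 'AABBCEEEEHIILMNNOPRRRSSTUUVY']],
    columns=['PartnerId', 'CountryId', 'Name'],
)

result = similar_ids(df)
df['Similar_IDs'] = [result.get(pid, []) for pid in df['PartnerId'].tolist()]
print(df[['PartnerId', 'Similar_IDs']])
```

The number of comparisons is still quadratic per country, so this mainly saves memory and some constant-factor time. For much larger data, a blocking step (for example, only comparing names of similar length) would probably be needed as well.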
Source: https://stackoverflow.com/questions/59783162/pandas-similarity-matching