Pandas Similarity Matching

问题

I tried searching the answer in SO but didnt find any help.

Here is what I´m trying to do:
I have a dataframe (here is a small example of it):

 df = pd.DataFrame([[1, 5, 'AADDEEEEIILMNORRTU'], [2, 5, 'AACEEEEGMMNNTT'], [3, 5, 'AAACCCCEFHIILMNNOPRRRSSTTUUY'], [4, 5, 'DEEEGINOOPRRSTY'], [5, 5, 'AACCDEEHHIIKMNNNNTTW'], [6, 5, 'ACEEHHIKMMNSSTUV'], [7, 5, 'ACELMNOOPPRRTU'], [8, 5, 'BIT'], [9, 5, 'APR'], [10, 5, 'CDEEEGHILLLNOOST'], [11, 5, 'ACCMNO'], [12, 5, 'AIK'], [13, 5, 'CCHHLLOORSSSTTUZ'], [14, 5, 'ANNOSXY'], [15, 5, 'AABBCEEEEHIILMNNOPRRRSSTUUVY']],columns=['PartnerId','CountryId','Name'])

My goal is to find the PartnerIds which Name is similar at least up to a certain ratio.
Additionally I only want to compare PartnerIds that have the same CountryId. The matching PartnerIds should be appended to a list and finally written in a new column in the dataframe.

Here is my try:

itemDict = {item[0]: {'CountryId': item[1], 'Name': item[2]} for item in df.values}

from difflib import SequenceMatcher
def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

def calculate_similarity(x,itemDict):
    own_name = x['Name']
    country_id = x['CountryId']
    matching_ids = []
    for k, v in itemDict.items():

        if k != x['PartnerId']:
            if v['CountryId'] == country_id:

                ratio = similar(own_name,v['Name'])


                if ratio > 0.7:

                    matching_ids.append(k)
    return matching_ids

df['Similar_IDs'] = df.apply(lambda x: calculate_similarity(x,itemDict),axis=1)
print(df)

The output is:

    PartnerId  CountryId                          Name Similar_IDs
0           1          5            AADDEEEEIILMNORRTU          []
1           2          5                AACEEEEGMMNNTT          []
2           3          5  AAACCCCEFHIILMNNOPRRRSSTTUUY        [15]
3           4          5               DEEEGINOOPRRSTY        [10]
4           5          5          AACCDEEHHIIKMNNNNTTW          []
5           6          5              ACEEHHIKMMNSSTUV          []
6           7          5                ACELMNOOPPRRTU          []
7           8          5                           BIT          []
8           9          5                           APR          []
9          10          5              CDEEEGHILLLNOOST         [4]
10         11          5                        ACCMNO          []
11         12          5                           AIK          []
12         13          5              CCHHLLOORSSSTTUZ          []
13         14          5                       ANNOSXY          []
14         15          5  AABBCEEEEHIILMNNOPRRRSSTUUVY         [3]

My questions now are:
1.) Is there a more efficient way to compute it? I have about 20.000 rows now and a lot more in the near future.
2.) Is it possible to get "rid" of the itemDict and do it directly from the dataframe?
3.) Is another distance measure maybe better to use?

Thanks a lot for your help!

回答1:

You can use the module difflib. First, you need to make a cartesian product of all strings by joining the table to itself using outer join:

cols = ['Name', 'CountryId', 'PartnerId']
df = df[cols].merge(df[cols], on='CountryId', how='outer')    
df = df.query('PartnerId_x != PartnerId_y')

In the next step you can apply the function from this answer and filter out all matches:

def match(x):
    return SequenceMatcher(None, x[0], x[1]).ratio()

match = df.apply(match, axis=1) > 0.7
df.loc[match, ['PartnerId_x', 'Name_x', 'PartnerId_y']]

Output:

     PartnerId_x                        Name_x  PartnerId_y
44             3  AAACCCCEFHIILMNNOPRRRSSTTUUY           15
54             4               DEEEGINOOPRRSTY           10
138           10              CDEEEGHILLLNOOST            4
212           15  AABBCEEEEHIILMNNOPRRRSSTUUVY            3

If you don't have enough memory you can try to iterate over the rows of a data frame:

lst = []
for idx, row in df.iterrows():
    if SequenceMatcher(None, row['Name_x'], row['Name_y']).ratio() > 0.7:
        lst.append(row[['PartnerId_x', 'Name_x', 'PartnerId_y']])

pd.concat(lst, axis=1).T

来源：https://stackoverflow.com/questions/59783162/pandas-similarity-matching

标签

python

pandas

similarity