Pandas replace strings with fuzzy match in the same column

Question

I have a column in a dataframe that looks like this:

 OWNER
 --------------
 OTTO J MAYER
 OTTO MAYER 
 DANIEL J ROSEN
 DANIEL ROSSY
 LISA CULLI
 LISA CULLY 
 LISA CULLY
 CITY OF BELMONT
 CITY OF BELMONT CITY

Some of the names in my data frame are misspelled or have extra or missing characters. I need a column in which each name is replaced by a close match from the same column, so that all similar names are grouped under a single name.

For example, this is what I expect from the data frame above:

 NAME
 --------------
 OTTO J MAYER
 OTTO J MAYER 
 DANIEL J ROSEN
 DANIEL ROSSY
 LISA CULLY
 LISA CULLY 
 LISA CULLY
 CITY OF BELMONT
 CITY OF BELMONT

OTTO MAYER is replaced with OTTO J MAYER because the two are very similar. The DANIELs stay the same because they are not a close enough match. The LISA CULLI and LISA CULLY entries all end up with the same value, and so on.

I have some code from another post on Stack Overflow that solves a similar problem, but it uses a dictionary of names. I'm having trouble reworking it to produce the output I need.

Here is what I have currently:

import pandas as pd

d = pd.DataFrame({'OWNER' : pd.Series(['OTTO J MAYER', 'OTTO MAYER','DANIEL J ROSEN','DANIEL ROSSY',
                                      'LISA CULLI', 'LISA CULLY'])})
names = d['OWNER']
names = names.values
names

import difflib 


def best_match(tokens, names):
    for i,t in enumerate(tokens):
        closest = difflib.get_close_matches(t, names, n=1)
        if len(closest) > 0:
            return i, closest[0]
    return None

def fuzzy_replace(x, y):

    names = y # just a simple replacement list
    tokens = x.split()
    res = best_match(tokens, y)
    if res is not None:
        pos, replacement = res
        return u" ".join(tokens)
    return x

d["OWNER"].apply(lambda x: fuzzy_replace(x, names))

Answer 1:

difflib.get_close_matches is indeed suited to the task, but splitting each name into tokens does not help. To tell the names apart as specified, raise the cutoff score to about 0.8 (the default is 0.6), and to make sure every possible alias is returned, raise the maximum number of matches to len(names) (the default is 3). Two rules then decide which name to prefer:

If a name occurs more often than the others, choose that one.
Otherwise, choose the first of the tied names (Series.mode() returns ties in sorted order).
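
As a quick check of the 0.8 cutoff, the similarity scores for the sample names can be computed directly. This is a small sketch using only difflib; the names list is copied from the question's sample, and similarity is a hypothetical helper introduced here for illustration (it uses the same SequenceMatcher ratio that get_close_matches relies on).

```python
import difflib

names = ['OTTO J MAYER', 'OTTO MAYER', 'DANIEL J ROSEN', 'DANIEL ROSSY',
         'LISA CULLI', 'LISA CULLY']


def similarity(a, b):
    """Return the SequenceMatcher ratio of two strings (0.0 to 1.0)."""
    return difflib.SequenceMatcher(None, a, b).ratio()


otto_score = similarity('OTTO J MAYER', 'OTTO MAYER')        # about 0.91
daniel_score = similarity('DANIEL J ROSEN', 'DANIEL ROSSY')  # about 0.77

# With the defaults (n=3, cutoff=0.6) the two DANIELs still count as a match.
loose = difflib.get_close_matches('DANIEL ROSSY', names)

# A cutoff of 0.8 keeps the DANIELs apart but still pairs the OTTOs;
# n=len(names) makes sure no alias is dropped from the result.
strict = difflib.get_close_matches('DANIEL ROSSY', names, n=len(names), cutoff=0.8)
otto = difflib.get_close_matches('OTTO MAYER', names, n=len(names), cutoff=0.8)

print(loose, strict, otto)
```

The two DANIEL names score below 0.8 while the OTTO names score above it, which is why a cutoff near 0.8 separates them for this sample. Other data may need a different threshold.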

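def fuzzy_replace(x, names):
    # Every name in the column that scores at least 0.8 against x.
    aliases = difflib.get_close_matches(x, names, len(names), .8)
    if not aliases:  # x has already been merged into a different spelling
        return
    # Most frequent alias; Series.mode() returns ties in sorted order.
    closest = pd.Series(aliases).mode()[0]
    # Assign the result back instead of passing inplace positionally
    # (it is keyword-only in recent pandas).
    d['OWNER'] = d['OWNER'].replace(aliases, closest)

# Iterate over a snapshot, since the column is reassigned inside the loop.
for x in d["OWNER"].tolist(): fuzzy_replace(x, d['OWNER'])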
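
For readers who would rather not modify the global dataframe inside a helper, the same idea can be sketched as a function that returns a new Series. This is only one possible variant, not the answer's own code: unify_names and its cutoff parameter are hypothetical names introduced here, the sample data is the question's column with trailing spaces stripped, and ties are broken with collections.Counter (first alias returned by get_close_matches) rather than with Series.mode().

```python
import difflib
from collections import Counter

import pandas as pd


def unify_names(owners, cutoff=0.8):
    """Return a copy of `owners` with close matches collapsed to one spelling."""
    result = owners.copy()
    for i in range(len(result)):
        current = result.tolist()
        # Every spelling currently in the column that is close to this one.
        # The name itself is always included because it scores 1.0.
        aliases = difflib.get_close_matches(
            current[i], current, n=len(current), cutoff=cutoff
        )
        # The most frequent alias wins. On a tie, Counter keeps the alias it
        # saw first, and get_close_matches lists the best-scoring names first.
        preferred = Counter(aliases).most_common(1)[0][0]
        result = result.replace(list(set(aliases)), preferred)
    return result


owners = pd.Series([
    'OTTO J MAYER', 'OTTO MAYER', 'DANIEL J ROSEN', 'DANIEL ROSSY',
    'LISA CULLI', 'LISA CULLY', 'LISA CULLY',
    'CITY OF BELMONT', 'CITY OF BELMONT CITY',
])
unified = unify_names(owners)
print(unified.tolist())
```

On this sample the result matches the NAME column requested in the question. Each row is compared against the whole column, so the cost grows quadratically with the number of rows; for large columns it may be worth running the function on the unique values only and mapping the result back.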

Source: https://stackoverflow.com/questions/58904764/pandas-replace-strings-with-fuzzy-match-in-the-same-column

Tags

python

regex

pandas

fuzzy-comparison

difflib