Question
I have a column in a dataframe that is like this:
OWNER
--------------
OTTO J MAYER
OTTO MAYER
DANIEL J ROSEN
DANIEL ROSSY
LISA CULLI
LISA CULLY
LISA CULLY
CITY OF BELMONT
CITY OF BELMONT CITY
Some of the names in my data frame are misspelled or have extra or missing characters. I need a column where each name is replaced by its close match from the same column, so that all similar names end up grouped under one name.
For example, this is what I expect from the data frame above:
NAME
--------------
OTTO J MAYER
OTTO J MAYER
DANIEL J ROSEN
DANIEL ROSSY
LISA CULLY
LISA CULLY
LISA CULLY
CITY OF BELMONT
CITY OF BELMONT
OTTO MAYER is replaced with OTTO J MAYER because the two are very similar. The DANIELs stay the same because they are not close enough to match. The LISA CULLs all get the same value, and so on.
I have some code from another Stack Overflow post that solves a similar problem, but it uses a dictionary of names. I'm having trouble reworking that code to produce the output I need.
Here is what I have currently:
import difflib
import pandas as pd

d = pd.DataFrame({'OWNER' : pd.Series(['OTTO J MAYER', 'OTTO MAYER','DANIEL J ROSEN','DANIEL ROSSY',
                                       'LISA CULLI', 'LISA CULLY'])})
names = d['OWNER']
names = names.values
names

def best_match(tokens, names):
    for i,t in enumerate(tokens):
        closest = difflib.get_close_matches(t, names, n=1)
        if len(closest) > 0:
            return i, closest[0]
    return None

def fuzzy_replace(x, y):
    names = y # just a simple replacement list
    tokens = x.split()
    res = best_match(tokens, y)
    if res is not None:
        pos, replacement = res
        return u" ".join(tokens)
    return x

d["OWNER"].apply(lambda x: fuzzy_replace(x, names))
Answer 1:
difflib.get_close_matches is indeed suited to the task, but splitting the name into tokens does not help. To tell the names apart as specified, we have to raise the cutoff score to about 0.8, and to make sure that all possible names are returned, raise the maximum number of matches to len(names). That leaves two cases for deciding which name to prefer:
- If a name occurs more often than the others, choose that one.
- Otherwise choose the one occurring first.
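The cutoff and the two preference rules can be checked in isolation. The following is only a sketch using the standard library: similarity and preferred_name are hypothetical helper names, and "occurring first" is assumed to mean first in the list of candidates passed in.

```python
import difflib
from collections import Counter

def similarity(a, b):
    """Ratio in [0, 1]; the same kind of score get_close_matches compares to its cutoff."""
    return difflib.SequenceMatcher(None, a, b).ratio()

def preferred_name(candidates):
    """Most frequent name wins; on a tie, the earliest name in `candidates` wins."""
    counts = Counter(candidates)        # keys keep first-seen order
    return max(counts, key=counts.get)  # max() returns the first of equal maxima

# Pairs from the question: only the first clears a 0.8 cutoff.
print(round(similarity("OTTO MAYER", "OTTO J MAYER"), 3))      # 0.909
print(round(similarity("DANIEL ROSSY", "DANIEL J ROSEN"), 3))  # 0.769

print(preferred_name(["LISA CULLI", "LISA CULLY", "LISA CULLY"]))  # LISA CULLY
print(preferred_name(["OTTO J MAYER", "OTTO MAYER"]))              # OTTO J MAYER
```

With the default cutoff of 0.6 the two DANIEL names would be merged, which is why the cutoff has to go up.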
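def fuzzy_replace(x, names):
    # every name scoring at least 0.8 against x, duplicates included
    aliases = difflib.get_close_matches(x, names, len(names), .8)
    closest = pd.Series(aliases).mode()
    closest = aliases[0] if closest.empty else closest[0]
    # assign the result back; newer pandas versions reject a positional inplace flag
    d['OWNER'] = d['OWNER'].replace(aliases, closest)

for x in d["OWNER"]:
    fuzzy_replace(x, d['OWNER'])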
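If you would rather not mutate the global d, the same idea can be written as a function that takes a list of names and returns a new list. This is a minimal sketch, not the code above: canonicalize and its cutoff parameter are hypothetical names, it uses only the standard library, and it breaks ties by position in the column instead of using pd.Series.mode().

```python
import difflib
from collections import Counter

def canonicalize(names, cutoff=0.8):
    """Return a new list in which close matches share one preferred name."""
    result = list(names)
    for name in names:
        # All current entries scoring at least `cutoff` against `name`.
        alias_set = set(
            difflib.get_close_matches(name, result, n=len(result), cutoff=cutoff)
        )
        if not alias_set:
            continue  # `name` was already replaced and nothing close remains
        # Count the aliases in column order so ties go to the earliest entry.
        counts = Counter(r for r in result if r in alias_set)
        preferred = max(counts, key=counts.get)
        result = [preferred if r in alias_set else r for r in result]
    return result

owners = [
    "OTTO J MAYER", "OTTO MAYER", "DANIEL J ROSEN", "DANIEL ROSSY",
    "LISA CULLI", "LISA CULLY", "LISA CULLY",
    "CITY OF BELMONT", "CITY OF BELMONT CITY",
]
for owner in canonicalize(owners):
    print(owner)
```

To apply it to the data frame, something like d['NAME'] = canonicalize(d['OWNER'].tolist()) should work. Note that every name is compared against the whole column, so the cost grows quadratically with the number of rows.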
Source: https://stackoverflow.com/questions/58904764/pandas-replace-strings-with-fuzzy-match-in-the-same-column