Question
I have a list of words:
lst = ['dog', 'cat', 'mate', 'mouse', 'zebra', 'lion']
I also have a pandas dataframe:
df = pd.DataFrame({'input': ['dog', 'kat', 'leon', 'moues'], 'suggested_class': ['a', 'a', 'a', 'a']})
input suggested_class
dog a
kat a
leon a
moues a
I would like to populate the suggested_class column with the value from lst that is the closest match (the highest Levenshtein similarity ratio, i.e. the smallest Levenshtein distance) to the word in the input column. I am using the fuzzywuzzy package to calculate that.
The expected output would be:
input suggested_class
dog dog
kat cat
leon lion
moues mouse
I'm aware that one could implement something with the autocorrect package, like df.suggested_class = [autocorrect.spell(w) for w in df.input], but this would not work for my situation.
I've tried something like this (using from fuzzywuzzy import fuzz):
for word in lst:
    for n in range(0, len(df.input)):
        if fuzz.ratio(df.input.iloc[n], word) >= 70:
            df.suggested_class.iloc[n] = word
        else:
            df.suggested_class.iloc[n] = "unknown"
which only works for a fixed threshold. I've been able to capture the maximum ratio with:
max([fuzz.ratio(df.input.iloc[0], word) for word in lst])
but am having trouble relating that to a word from lst, and subsequently populating suggested_class with that word.
Answer 1:
Since you mention fuzzywuzzy, you can use its process module:
from fuzzywuzzy import process
# process.extract returns a list of (match, score) tuples; [0][0] is the best-matching word
df['suggested_class'] = df['input'].apply(lambda x: process.extract(x, lst, limit=1)[0][0])
df
Out[1365]:
input suggested_class
0 dog dog
1 kat cat
2 leon lion
3 moues mouse
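For reference, the idea of "pick the word with the smallest Levenshtein distance" can also be sketched without fuzzywuzzy. The following is a minimal pure-Python illustration, not what process.extract does internally: fuzzywuzzy scores with a normalized ratio (WRatio by default) rather than a raw edit distance, so results can differ on other data. The helper names levenshtein and closest_word, and the max_distance and default parameters, are hypothetical and introduced here only for illustration.

```python
def levenshtein(a, b):
    """Count the single-character insertions, deletions, and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def closest_word(word, candidates, max_distance=None, default="unknown"):
    """Return the candidate with the smallest edit distance to word.

    If max_distance is given and even the best candidate is further away,
    return default instead (hypothetical fallback, mirroring the "unknown"
    label used in the question).
    """
    best = min(candidates, key=lambda c: levenshtein(word, c))
    if max_distance is not None and levenshtein(word, best) > max_distance:
        return default
    return best


lst = ['dog', 'cat', 'mate', 'mouse', 'zebra', 'lion']
inputs = ['dog', 'kat', 'leon', 'moues']
suggestions = [closest_word(w, lst) for w in inputs]
print(suggestions)
```

With the dataframe from the question, the same helper could be applied as df['suggested_class'] = df['input'].apply(lambda w: closest_word(w, lst)). Passing max_distance gives a fallback for inputs that are not close to any word in lst.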
Source: https://stackoverflow.com/questions/49680416/most-likely-word-based-on-max-levenshtien-distance