Question
I have a list of medicine names (regular_list) and a list of new names (new_list). I want to check whether the names in new_list are already present in regular_list. The issue is that the names in new_list could contain typos, and I want those names to be treated as matches to the regular list. I know that using stringdist is a solution to the problem, but I need a machine learning algorithm.
Answer 1:
As was already mentioned in "machine learning to overcome typo errors", machine learning tools are overkill for such a task, but the simplest possibility would be to merge the two approaches.
On one hand, you can compute the edit distance between a given word x and each of the dictionary words d_i. Additionally, you can train a per-word classifier c(d_i, distance(x, d_i)) returning True (class 1) if a given edit distance has been learned to be sufficient to consider x a misspelled version of d_i. This can give you a more general model than not using machine learning, as you can have a different threshold for each dictionary word (some words are misspelled more often than others), but obviously you have to prepare a training set in the form of (misspelled_word, correct_one) pairs (and also add (correct_one, correct_one)).
You can use any type of binary classifier for such a task, as long as it can work on real-valued input data.
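A minimal sketch of the idea above, assuming Python and a tiny made-up training set. The per-word classifier here is the simplest possible one, a learned distance threshold (a decision stump); the names edit_distance, train_thresholds, and match, as well as the medicine names and misspellings, are hypothetical and only for illustration. Negative examples for a word d_i are taken to be the training words that belong to other dictionary entries, which is an assumption not spelled out in the answer.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with two-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),   # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def train_thresholds(pairs, dictionary):
    """Learn one distance threshold per dictionary word.

    pairs: iterable of (misspelled_word, correct_one).
    For each word d, a sample is positive if its correct form is d and
    negative otherwise; the threshold with the best training accuracy wins
    (ties go to the smaller, stricter threshold).
    """
    # Also add (correct_one, correct_one), as the answer suggests.
    samples = list(pairs) + [(d, d) for d in dictionary]
    thresholds = {}
    for d in dictionary:
        data = [(edit_distance(x, d), target == d) for x, target in samples]
        candidates = sorted({dist for dist, _ in data})

        def score(t):
            correct = sum((dist <= t) == label for dist, label in data)
            return (correct, -t)

        thresholds[d] = max(candidates, key=score)
    return thresholds


def match(x, thresholds):
    """Return the closest dictionary word whose classifier accepts x, else None."""
    hits = []
    for d, t in thresholds.items():
        dist = edit_distance(x, d)
        if dist <= t:  # classifier c(d, distance(x, d)) says "class 1"
            hits.append((dist, d))
    return min(hits)[1] if hits else None


# Hypothetical data, for illustration only.
regular_list = ["aspirin", "ibuprofen", "paracetamol"]
training_pairs = [
    ("asprin", "aspirin"), ("aspirine", "aspirin"),
    ("ibuprophen", "ibuprofen"), ("ibuprofin", "ibuprofen"),
    ("paracetmol", "paracetamol"), ("parcetamol", "paracetamol"),
]

thresholds = train_thresholds(training_pairs, regular_list)
new_list = ["ibuprophen", "asprin", "morphine"]
for name in new_list:
    print(name, "->", match(name, thresholds))
```

With this toy data, "ibuprofen" ends up with a looser threshold than the other two words because one of its training misspellings is two edits away, which is exactly the per-word flexibility described above. The threshold rule could be swapped for any binary classifier that accepts a numeric feature, for instance logistic regression on the distance.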
Source: https://stackoverflow.com/questions/18374749/machine-learning-algorithm-for-spelling-check