Normalizing the edit distance

大兔子大兔子 提交于 2019-11-30 05:00:47

问题


I have a question that can we normalize the levenshtein edit distance by dividing the e.d value by the length of the two strings? I am asking this because, if we compare two strings of unequal length, the difference between the lengths of the two will be counted as well. for eg: ed('has a', 'has a ball') = 4 and ed('has a', 'has a ball the is round') = 15. if we increase the length of the string, the edit distance will increase even though they are similar. Therefore, I can not set a value, what a good edit distance value should be.


回答1:


Yes, normalizing the edit distance is one way to put the differences between strings on a single scale from "identical" to "nothing in common".

A few things to consider:

  1. Whether or not the normalized distance is a better measure of similarity between strings depends on the application. If the question is "how likely is this word to be a misspelling of that word?", normalization is a way to go. If it's "how much has this document changed since the last version?", the raw edit distance may be a better option.
  2. If you want the result to be in the range [0, 1], you need to divide the distance by the maximum possible distance between two strings of given lengths. That is, length(str1)+length(str2) for the LCS distance and max(length(str1), length(str2)) for the Levenshtein distance.
  3. The normalized distance is not a metric, as it violates the triangle inequality.


来源:https://stackoverflow.com/questions/45783385/normalizing-the-edit-distance

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!