Algorithm to find edit distance to all substrings

放肆的年华 submitted on 2019-12-05 17:10:26

Question


Given two strings s and t, I need to find the edit distance (Levenshtein distance) from each substring of s to t. More precisely, for each position i in s, I need the minimum edit distance to t over all substrings of s starting at position i.

For example:

t = "ab"    
s = "sdabcb"

And I need to get something like:

{2,1,0,2,2}

Explanation:

1st position:
distance("ab", "sd") = 4 (2 * subst)
distance("ab", "sda") = 3 (2 * delete + insert)
distance("ab", "sdab") = 2 (2 * delete)
distance("ab", "sdabc") = 3 (3 * delete)
distance("ab", "sdabcb") = 4 (4 * delete)
So, minimum is 2

2nd position:
distance("ab", "da") = 2 (delete + insert)
distance("ab", "dab") = 1 (delete)
distance("ab", "dabc") = 2 (2*delete)
....
So, minimum is 1

3rd position:
distance("ab", "ab") = 0
...
minimum is 0

and so on.

I can solve this by brute force, of course, but is there a faster algorithm?

Thanks for help.


Answer 1:


The Wagner-Fischer algorithm gives you the answer for all prefixes "for free".

http://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorithm

The last row of the Wagner-Fischer matrix contains the edit distance from each prefix of s to t.

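So as a first crack at your problem: for each i, run Wagner-Fischer on the suffix of s that starts at i, and select the smallest element in the last row. Each run costs O(|s| * |t|), so the whole thing is O(|s|^2 * |t|).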
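
A minimal Python sketch of this approach, under some assumptions: the function names `last_row` and `min_distance_per_start` are made up for illustration, every operation costs 1 (standard Levenshtein), and the empty prefix is skipped because the question asks about non-empty substrings. The question's worked example appears to count a substitution as 2, so some individual distances differ from the ones listed there.

```python
def last_row(t, s):
    """Wagner-Fischer with rows indexed by t and columns by s.

    Returns the last row: entry j is the edit distance between t and
    the prefix s[:j]. Only two rows are kept in memory at a time.
    """
    prev = list(range(len(s) + 1))  # empty t versus s[:j] costs j insertions
    for i, tc in enumerate(t, start=1):
        cur = [i]  # t[:i] versus the empty prefix costs i deletions
        for j, sc in enumerate(s, start=1):
            cur.append(min(
                prev[j] + 1,               # delete tc
                cur[j - 1] + 1,            # insert sc
                prev[j - 1] + (tc != sc),  # match or substitute
            ))
        prev = cur
    return prev


def min_distance_per_start(s, t):
    """For each start i, the minimum distance from t to any non-empty
    substring of s beginning at i (one Wagner-Fischer run per suffix)."""
    return [min(last_row(t, s[i:])[1:]) for i in range(len(s))]


print(min_distance_per_start("sdabcb", "ab"))  # [2, 1, 0, 1, 1, 1]
```

The first three values agree with the minima worked out in the question.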

I will be curious to see if anyone else knows (or can find) a better approach.




Answer 2:


Finding the best-matching substring of a given string is easy. You take the normal Levenshtein algorithm and modify it slightly.

FIRST: Instead of filling the first row of the matrix with 0,1,2,3,4,5,... you fill it entirely with zeros.

SECOND: Then you run the algorithm as usual.

THIRD: Instead of returning the last cell of the last row, you search for the smallest value in the last row and return it.

Example: needle: "aba", haystack: "c abba c" --> result = 1 (converting abba -> aba)

I tested it and it works.

This is much faster than stepping through the string character by character, as you do in your question, because you only build the matrix once.
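
The three steps in this answer can be sketched in Python roughly as follows. The names `last_row_free_start`, `best_substring_distance` and `min_distance_per_start_single_pass` are hypothetical, and unit costs are assumed. The last function is not part of the answer: it is one possible way to get the per-start-position minima the question asks for, by running the same matrix on the reversed strings, so that "ends at position j" in the reversed haystack corresponds to "starts at position i" in the original.

```python
def last_row_free_start(needle, haystack):
    """Levenshtein matrix whose first row is all zeros.

    Entry j of the returned last row is the minimum edit distance
    between needle and any substring of haystack ending at position j
    (the empty substring is included).
    """
    prev = [0] * (len(haystack) + 1)  # FIRST: a match may start anywhere for free
    for i, nc in enumerate(needle, start=1):  # SECOND: run the usual recurrence
        cur = [i]
        for j, hc in enumerate(haystack, start=1):
            cur.append(min(
                prev[j] + 1,               # delete nc
                cur[j - 1] + 1,            # insert hc
                prev[j - 1] + (nc != hc),  # match or substitute
            ))
        prev = cur
    return prev


def best_substring_distance(needle, haystack):
    """THIRD: the smallest value in the last row."""
    return min(last_row_free_start(needle, haystack))


def min_distance_per_start_single_pass(s, t):
    """Assumption beyond the answer: reverse both strings so that
    end positions in the matrix map to start positions in s."""
    row = last_row_free_start(t[::-1], s[::-1])
    return row[1:][::-1]


print(best_substring_distance("aba", "c abba c"))          # 1
print(min_distance_per_start_single_pass("sdabcb", "ab"))  # [2, 1, 0, 1, 1, 1]
```

With unit costs, allowing the empty substring does not change the per-position result for a non-empty t, since a one-character substring is never farther from t than the empty string is. With other cost models that may no longer hold.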



Source: https://stackoverflow.com/questions/8139958/algorithm-to-find-edit-distance-to-all-substrings
