Optimizing Levenshtein distance algorithm

问题

I have a stored procedure that uses Levenshtein distance to determine the result closest to what the user typed. The only thing really affecting the speed is the function that calculates the Levenshtein distance for all the records before selecting the record with the lowest distance (I've verified this by putting a 0 in place of the call to the Levenshtein function). The table has 1.5 million records, so even the slightest adjustment may shave off a few seconds. Right now the entire thing runs over 10 minutes. Here's the method I'm using:

ALTER function dbo.Levenshtein
( 
    @Source nvarchar(200), 
    @Target nvarchar(200) 
) 
RETURNS int
AS
BEGIN
DECLARE @Source_len int, @Target_len int, @i int, @j int, @Source_char nchar, @Dist int, @Dist_temp int, @Distv0 varbinary(8000), @Distv1 varbinary(8000)

SELECT @Source_len = LEN(@Source), @Target_len = LEN(@Target), @Distv1 = 0x0000, @j = 1, @i = 1, @Dist = 0

WHILE @j <= @Target_len
BEGIN
    SELECT @Distv1 = @Distv1 + CAST(@j AS binary(2)), @j = @j + 1
END

WHILE @i <= @Source_len
BEGIN
    SELECT @Source_char = SUBSTRING(@Source, @i, 1), @Dist = @i, @Distv0 = CAST(@i AS binary(2)), @j = 1

WHILE @j <= @Target_len
BEGIN
    SET @Dist = @Dist + 1
    SET @Dist_temp = CAST(SUBSTRING(@Distv1, @j+@j-1, 2) AS int) +
                  CASE WHEN @Source_char = SUBSTRING(@Target, @j, 1) THEN 0 ELSE 1 END

    IF @Dist > @Dist_temp
    BEGIN
        SET @Dist = @Dist_temp
    END

    SET @Dist_temp = CAST(SUBSTRING(@Distv1, @j+@j+1, 2) AS int)+1

    IF @Dist > @Dist_temp SET @Dist = @Dist_temp
    BEGIN
        SELECT @Distv0 = @Distv0 + CAST(@Dist AS binary(2)), @j = @j + 1
    END
END

SELECT @Distv1 = @Distv0, @i = @i + 1
END

RETURN @Dist
END

Where should I go from here?

回答1:

The way I've done this in the past is to store the "database" (actually a dictionary of words for a spelling correcter) as a trie.

Then I used a branch-and-bound routine to look up nearest matching entries. For small distances, the time it takes is exponential in the distance. For large distances, it is linear in the size of the dictionary, just as you are seeing now.

Branch-and-bound is basically a depth-first tree walk of the trie, but with an error budget. At each node, you keep track of the current levenshtein distance, and if it exceeds the budget, you prune that branch of the tree.

First you do the walk with a budget of zero. That will only find exact matches. If you don't find a match, then you walk it with a budget of one. That will find matches at a distance of 1. If you don't find any, then you do it with a budget of 2, and so on. This sounds inefficient, but since each walk takes so much more time than the previous one, the time is dominated by the last walk that you make.

Added: outline of code (pardon my C):

// dumb version of trie node, indexed by letter. You can improve.
typedef struct tnodeTag {
  tnodeTag* p[128];
} tnode;

tnode* top; // the top of the trie

void walk(tnode* p, char* s, int budget){
  int i;
  if (*s == 0){
    if (p == NULL){
      // print the current trie path
    }
  }
  else if (budget >= 0){
    // try deleting this letter
    walk(p, s+1, budget-1);
    // try swapping two adjacent letters
    if (s[1]){
      swap(s[0], s[1]);
      walk(p, s, budget-1);
      swap(s[0], s[1]);
    }
    if (p){
      for (i = 0; i < 128; i++){
        // try exact match
        if (i == *s) walk(p->p[i], s+1, budget);
        // try replacing this character
        if (i != *s) walk(p->p[i], s+1, budget-1);
        // try inserting this letter
        walk(p->p[i], s, budget-1);
      }
    }
  }
}

Basically, you simulate deleting a letter by skipping it and searching at the same node. You simulate inserting a letter by descending the trie without advancing s. You simulate replacing a letter by acting as if the letter matched, even though it doesn't. When you get the hang of it, you can add other possible mismatches, like replacing 0 with O and 1 with L or I - dumb stuff like that.

You probably want to add a character array argument to represent the current word you are finding in the trie.

来源：https://stackoverflow.com/questions/2918771/optimizing-levenshtein-distance-algorithm

标签

optimization

levenshtein-distance

edit-distance