Levenshtein distance: how to better handle words swapping positions?

忘掉有多难 2021-01-30 02:22

I've had some success comparing strings using the PHP levenshtein function.

However, for two strings which contain substrings that have swapped positions, the algorithm

9 Answers
  • 2021-01-30 03:20

    I've been implementing Levenshtein in a spell checker.

    What you're asking for is counting transpositions as 1 edit.

    This is easy if you only want to count transpositions of adjacent words. For transpositions of words 2 or more positions apart, however, the extra work is, in the worst case, factorial in max(wordorder1.length(), wordorder2.length()). Adding a sub-algorithm that expensive to an already quadratic algorithm is not a good idea.

    This is how it would work.

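    // x counts words consumed from wordorder1, y from wordorder2
    cost = (wordorder1[x-1] == wordorder2[y-1]) ? 0 : 1;
    best = min(workarray[x-1, y] + 1,         // deletion
               workarray[x, y-1] + 1,         // insertion
               workarray[x-1, y-1] + cost);   // match or substitution
    if (x > 1 && y > 1
        && wordorder1[x-1] == wordorder2[y-2]
        && wordorder1[x-2] == wordorder2[y-1])
    {
      // the last two words are swapped: count the swap as 1 edit
      best = min(best, workarray[x-2, y-2] + 1);
    }
    workarray[x, y] = best;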
    

    That handles ONLY adjacent (touching) transpositions. If you want all transpositions, then for every position you would have to work backwards from that point, comparing

    wordorder1[n] == wordorder2[n-2], ..., wordorder1[n] == wordorder2[0]
    

    So you see why they don't include this in the standard method.
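As an illustration of the adjacent-transposition case described above, here is a minimal sketch in Python (not PHP) of a word-level "optimal string alignment" distance. The function name `word_osa_distance` and the whitespace tokenization are assumptions made for this example, not part of the original answer; it covers only swaps of neighbouring words, each counted as 1 edit.

```python
def word_osa_distance(s1: str, s2: str) -> int:
    """Edit distance over words: insert, delete, substitute,
    or swap two adjacent words, each costing 1."""
    a, b = s1.split(), s2.split()  # assumption: words are whitespace-separated

    # d[x][y] = distance between the first x words of a and the first y words of b
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for x in range(len(a) + 1):
        d[x][0] = x  # delete all x words
    for y in range(len(b) + 1):
        d[0][y] = y  # insert all y words

    for x in range(1, len(a) + 1):
        for y in range(1, len(b) + 1):
            cost = 0 if a[x - 1] == b[y - 1] else 1
            d[x][y] = min(
                d[x - 1][y] + 1,         # deletion
                d[x][y - 1] + 1,         # insertion
                d[x - 1][y - 1] + cost,  # match or substitution
            )
            # adjacent transposition: the last two words appear swapped
            if x > 1 and y > 1 and a[x - 1] == b[y - 2] and a[x - 2] == b[y - 1]:
                d[x][y] = min(d[x][y], d[x - 2][y - 2] + 1)
    return d[len(a)][len(b)]


print(word_osa_distance("the quick brown fox", "the brown quick fox"))
```

A word that moves more than one position, as in "a b c" versus "c a b", is still charged as a deletion plus an insertion, which is the limitation the answer points out.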

  • 2021-01-30 03:20

    I believe this is a prime example of a problem for a vector-space search engine.

    In this technique, each document essentially becomes a vector with as many dimensions as there are distinct words in the entire corpus; similar documents then occupy neighboring areas of that vector space. One nice property of this model is that queries are also just documents: to answer a query, you simply calculate its position in the vector space, and your results are the closest documents you can find. I am sure there are ready-made solutions for PHP out there.

    To make the vector-space results fuzzier, you could consider stemming or a similar natural language processing technique, and use Levenshtein to construct secondary queries for similar words that occur in your overall vocabulary.
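The vector-space idea above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: raw word counts serve as vector components (no TF-IDF weighting or stemming), words are split on whitespace, and the names `cosine_similarity` and `rank` are hypothetical helpers invented for this example.

```python
import math
from collections import Counter


def cosine_similarity(doc1: str, doc2: str) -> float:
    """Cosine of the angle between two bag-of-words count vectors."""
    v1 = Counter(doc1.lower().split())
    v2 = Counter(doc2.lower().split())
    # only words present in both documents contribute to the dot product
    dot = sum(v1[w] * v2[w] for w in v1.keys() & v2.keys())
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm1 * norm2)


def rank(query: str, documents: list) -> list:
    """Treat the query as a document and sort the corpus by closeness to it."""
    return sorted(documents, key=lambda d: cosine_similarity(query, d), reverse=True)


print(rank("quick fox", ["lazy dog", "fox quick brown"]))
```

Because the vectors ignore word order entirely, two strings containing the same words in swapped positions score as (numerically) identical.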

  • 2021-01-30 03:22

    Eliminate the words that appear in both strings, and then run Levenshtein on what remains.
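One possible reading of this suggestion, sketched in Python: drop every word the two strings share (matching occurrences one-for-one), then compute the ordinary character-level Levenshtein distance on the leftovers. The helper names `levenshtein` and `distance_ignoring_shared_words` are made up for this example, and the whitespace tokenization is an assumption.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Plain character-level Levenshtein distance, keeping two rows in memory."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match or substitution
        prev = cur
    return prev[-1]


def distance_ignoring_shared_words(s1: str, s2: str) -> int:
    w1, w2 = s1.split(), s2.split()
    shared = Counter(w1) & Counter(w2)  # multiset of words found in both strings

    def strip_shared(words):
        budget = shared.copy()
        kept = []
        for w in words:
            if budget[w] > 0:
                budget[w] -= 1  # cancel this occurrence against the other string
            else:
                kept.append(w)
        return " ".join(kept)

    return levenshtein(strip_shared(w1), strip_shared(w2))


print(distance_ignoring_shared_words("quick brown fox", "brown quick fax"))
```

Note that this discards word order for the shared words completely, so it cannot tell a reordered string from an identical one.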
