Levenshtein distance: how to better handle words swapping positions?

忘掉有多难

I've had some success comparing strings using the PHP levenshtein function.

However, for two strings which contain substrings that have swapped positions, the algorithm

9 Answers

醉酒成梦

2021-01-30 03:20
I've been implementing Levenshtein in a spell checker.

What you're asking for is counting transpositions as 1 edit.

This is easy if you only want to count transpositions of adjacent words. For words that have moved two or more positions, however, the extra work added to the algorithm is, in the worst case, !(max(wordorder1.length(), wordorder2.length())). Adding a non-linear sub-algorithm to an already quadratic algorithm is not a good idea.

This is how it would work.
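```
// x indexes wordorder1 and y indexes wordorder2 (both 1-based);
// cost is 0 when wordorder1[x] == wordorder2[y], otherwise 1.
best = min(workarray[x-1, y] + 1,        // deletion
           workarray[x, y-1] + 1,        // insertion
           workarray[x-1, y-1] + cost);  // match or substitution
if (x > 1 && y > 1 &&
    wordorder1[x] == wordorder2[y-1] && wordorder1[x-1] == wordorder2[y])
{
  // adjacent transposition, counted as 1 edit
  best = min(best, workarray[x-2, y-2] + 1);
}
workarray[x, y] = best;
```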
That covers ONLY touching (adjacent) transpositions. If you want all transpositions, then for every position you would have to work backwards from that point, comparing
```
wordorder1[n] == wordorder2[n-2] ... wordorder1[n] == wordorder2[0] ...
```
So you see why they don't include this in the standard method.
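
As a rough illustration of the adjacent-only case, here is a minimal Python sketch that runs the recurrence over words instead of characters. The function name `word_osa_distance` and the whitespace tokenisation are assumptions made for this example, not something from the PHP `levenshtein` function, and it deliberately does not attempt the expensive "all transpositions" variant.

```python
def word_osa_distance(a: str, b: str) -> int:
    """Edit distance over words; swapping two adjacent words costs 1.

    Hypothetical helper for illustration: words are split on whitespace.
    """
    w1, w2 = a.split(), b.split()
    # d[x][y] = distance between the first x words of w1 and the first y of w2
    d = [[0] * (len(w2) + 1) for _ in range(len(w1) + 1)]
    for x in range(len(w1) + 1):
        d[x][0] = x  # delete all x words
    for y in range(len(w2) + 1):
        d[0][y] = y  # insert all y words
    for x in range(1, len(w1) + 1):
        for y in range(1, len(w2) + 1):
            cost = 0 if w1[x - 1] == w2[y - 1] else 1
            d[x][y] = min(
                d[x - 1][y] + 1,         # deletion
                d[x][y - 1] + 1,         # insertion
                d[x - 1][y - 1] + cost,  # match or substitution
            )
            # adjacent transposition: the last two words are swapped
            if (x > 1 and y > 1
                    and w1[x - 1] == w2[y - 2]
                    and w1[x - 2] == w2[y - 1]):
                d[x][y] = min(d[x][y], d[x - 2][y - 2] + 1)
    return d[len(w1)][len(w2)]
```

With this sketch, `word_osa_distance("quick brown fox", "brown quick fox")` counts the swap as a single edit, while a word that moves two positions is still charged as a deletion plus an insertion.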
半阙折子戏

2021-01-30 03:20

I believe this is a prime example for using a vector-space search engine.

In this technique, each document essentially becomes a vector with as many dimensions as there are distinct words in the entire corpus; similar documents then occupy neighboring areas of that vector space. One nice property of this model is that queries are also just documents: to answer a query, you simply calculate its position in vector space, and your results are the closest documents you can find. I am sure there are ready-made solutions for PHP out there.
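
To make the idea concrete, here is a small Python sketch of a bag-of-words vector space with cosine similarity. It is only an assumption about how such a search could look: the names `to_vector`, `cosine_similarity` and `search` are hypothetical, tokenisation is plain whitespace splitting, and a real engine would normally add term weighting such as TF-IDF.

```python
import math
from collections import Counter


def to_vector(text: str) -> Counter:
    """Bag-of-words vector: one dimension per distinct lowercase word."""
    return Counter(text.lower().split())


def cosine_similarity(v1: Counter, v2: Counter) -> float:
    """Cosine of the angle between two word-count vectors (0.0 if either is empty)."""
    dot = sum(v1[w] * v2[w] for w in v1.keys() & v2.keys())
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


def search(query: str, documents: list) -> list:
    """Return the documents ordered from most to least similar to the query."""
    q = to_vector(query)
    return sorted(
        documents,
        key=lambda doc: cosine_similarity(q, to_vector(doc)),
        reverse=True,
    )
```

Because word order is not part of the vector, "john smith" and "smith john" land on the same point, which is exactly the property the question is after.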

To make the vector-space results fuzzier, you could apply stemming or a similar natural-language-processing technique, and use Levenshtein to construct secondary queries from similar words that occur in your overall vocabulary.
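
One possible reading of the secondary-query idea, sketched in Python: each query word is expanded with vocabulary words within a small edit distance. The helper names `levenshtein` and `expand_query` and the threshold `max_distance` are assumptions for illustration only, and stemming is left out.

```python
from typing import Iterable, List


def levenshtein(s: str, t: str) -> int:
    """Classic character-level Levenshtein distance, computed row by row."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        cur = [i]
        for j, ct in enumerate(t, start=1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (cs != ct),  # match or substitution
            ))
        prev = cur
    return prev[-1]


def expand_query(query: str, vocabulary: Iterable[str],
                 max_distance: int = 1) -> List[str]:
    """Return the query words plus vocabulary words close to each of them."""
    vocab = sorted(set(vocabulary))  # sorted for a deterministic result
    expanded = []
    for word in query.lower().split():
        expanded.append(word)
        expanded.extend(
            v for v in vocab
            if v != word and levenshtein(word, v) <= max_distance
        )
    return expanded
```

The expanded word list can then be fed to the vector-space search as a second, more forgiving query.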

被撕碎了的回忆

2021-01-30 03:22

Eliminate the words that the two strings have in common, and then use Levenshtein on what remains.
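
A minimal Python sketch of that suggestion, assuming "duplicate words" means words that occur in both strings. The names `levenshtein` and `distance_without_shared_words` are hypothetical, and words are split on whitespace.

```python
from collections import Counter


def levenshtein(s: str, t: str) -> int:
    """Classic character-level Levenshtein distance, computed row by row."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        cur = [i]
        for j, ct in enumerate(t, start=1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (cs != ct),  # match or substitution
            ))
        prev = cur
    return prev[-1]


def distance_without_shared_words(a: str, b: str) -> int:
    """Drop words present in both strings, then compare the leftovers."""
    w1, w2 = a.split(), b.split()
    shared = Counter(w1) & Counter(w2)  # multiset intersection of the words

    def strip_shared(words):
        remaining = shared.copy()
        kept = []
        for w in words:
            if remaining[w] > 0:
                remaining[w] -= 1  # consume one shared occurrence
            else:
                kept.append(w)
        return " ".join(kept)

    return levenshtein(strip_shared(w1), strip_shared(w2))
```

Under this reading, two strings made of the same words in a different order come out at distance 0, and only the words that differ contribute to the score.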
