I've had some success comparing strings using the PHP levenshtein function.
However, for two strings which contain substrings that have swapped positions, the algorithm
I've been implementing levenshtein in a spell checker.
What you're asking for is counting transpositions as 1 edit.
This is easy if you only want to count transpositions of words one position apart. For transpositions of words two or more positions apart, however, the extra work is, in the worst case, factorial in max(wordorder1.length(), wordorder2.length()). Adding a non-linear sub-algorithm to an already quadratic algorithm is not a good idea.
This is how it would work.
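// x indexes wordorder1 and y indexes wordorder2; workarray is offset by one
cost = (wordorder1[x-1] == wordorder2[y-1]) ? 0 : 1;
workarray[x, y] = min(workarray[x-1, y] + 1,        // deletion
                      workarray[x, y-1] + 1,        // insertion
                      workarray[x-1, y-1] + cost);  // substitution
if (x > 1 && y > 1
    && wordorder1[x-1] == wordorder2[y-2]
    && wordorder1[x-2] == wordorder2[y-1])
{
    // the two neighbouring words are swapped: count that as 1 edit
    workarray[x, y] = min(workarray[x, y], workarray[x-2, y-2] + 1);
}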
That covers only touching (adjacent) transpositions. If you want all transpositions, you would have to work backwards from every position n, comparing wordorder1[n] against wordorder2[n-2], wordorder2[n-3], and so on down to wordorder2[0].
So you see why they don't include this in the standard method.
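For reference, here is a minimal runnable sketch of the adjacent-transposition variant described above, applied to words rather than characters. It is written in Python rather than PHP purely for brevity, and the function names osa_distance and word_osa_distance are made up for this example. This variant is usually called optimal string alignment; it is not the full Damerau-Levenshtein distance and it does not handle words that moved two or more positions.

```python
def osa_distance(a, b):
    """Levenshtein distance extended with adjacent transpositions.

    a and b can be strings (compared per character) or lists of words
    (compared per word). A swap of two neighbouring items costs 1.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for x in range(rows):
        d[x][0] = x  # delete everything from a
    for y in range(cols):
        d[0][y] = y  # insert everything from b

    for x in range(1, rows):
        for y in range(1, cols):
            cost = 0 if a[x - 1] == b[y - 1] else 1
            d[x][y] = min(
                d[x - 1][y] + 1,         # deletion
                d[x][y - 1] + 1,         # insertion
                d[x - 1][y - 1] + cost,  # substitution or match
            )
            # Adjacent transposition: the last two items are swapped.
            if (x > 1 and y > 1
                    and a[x - 1] == b[y - 2]
                    and a[x - 2] == b[y - 1]):
                d[x][y] = min(d[x][y], d[x - 2][y - 2] + 1)
    return d[-1][-1]


def word_osa_distance(s1, s2):
    """Compare two strings word by word (split on whitespace)."""
    return osa_distance(s1.split(), s2.split())
```

With this, "the quick brown fox" and "the brown quick fox" are 1 edit apart, while "a b c" and "c b a" still cost 2 because the swapped words are not neighbours.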
I believe this is a prime example for using a vector-space search engine.
In this technique, each document essentially becomes a vector with as many dimensions as there are different words in the entire corpus; similar documents then occupy neighboring areas in that vector space. One nice property of this model is that queries are also just documents: to answer a query, you simply calculate its position in vector space, and your results are the closest documents you can find. I am sure there are get-and-go solutions for PHP out there.
To fuzzify results from vector space, you could consider stemming or a similar natural language processing technique, and use Levenshtein to construct secondary queries for similar words that occur in your overall vocabulary.
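To make the idea concrete, here is a toy sketch of the vector-space model using raw word counts and cosine similarity. It is in Python, not PHP, and the names to_vector, cosine_similarity and search are invented for illustration. A real engine would typically add term weighting and stemming, which this sketch leaves out.

```python
import math
from collections import Counter


def to_vector(text):
    """Bag-of-words vector: word -> number of occurrences (lowercased)."""
    return Counter(text.lower().split())


def cosine_similarity(v1, v2):
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(v1[w] * v2[w] for w in v1.keys() & v2.keys())
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty text is similar to nothing
    return dot / (norm1 * norm2)


def search(query, documents):
    """Treat the query as a document and rank documents by similarity."""
    q = to_vector(query)
    scored = [(cosine_similarity(q, to_vector(doc)), doc) for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Because word order is discarded, strings whose substrings have swapped positions score as identical here, which is exactly the case plain Levenshtein penalizes.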
Eliminate duplicate words between the two strings and then use Levenshtein.
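One way to read this suggestion, sketched in Python: drop every word that occurs in both strings (respecting how often it occurs), then run plain Levenshtein on what is left. That reading is an assumption, and the helper name distance_ignoring_shared_words is made up for this example.

```python
from collections import Counter


def levenshtein(a, b):
    """Classic character-level Levenshtein distance, two-row version."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def distance_ignoring_shared_words(s1, s2):
    """Remove words common to both strings, then compare the remainder."""
    w1, w2 = s1.split(), s2.split()
    shared = Counter(w1) & Counter(w2)  # multiset intersection

    def strip_shared(words):
        remaining = Counter(shared)  # copy, so each string is stripped once
        kept = []
        for w in words:
            if remaining[w] > 0:
                remaining[w] -= 1  # drop one shared occurrence
            else:
                kept.append(w)
        return " ".join(kept)

    return levenshtein(strip_shared(w1), strip_shared(w2))
```

Note that this ignores word order entirely for the shared words, so "quick brown fox" and "fox brown quick" come out at distance 0.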