Levenshtein distance: how to better handle words swapping positions?

忘掉有多难 2021-01-30 02:22

I've had some success comparing strings using the PHP levenshtein function.

However, for two strings which contain substrings that have swapped positions, the algorithm counts those as whole new substrings and reports a large distance, even though the strings are otherwise very similar. How can I handle word swaps better?

9 Answers
  • 2021-01-30 02:58

    Take this answer and make the following change:

    void match(trie t, char* w, string s, int budget){
      if (budget < 0) return;
      if (*w=='\0') print s;
      foreach (char c, subtrie t1 in t){
        /* try matching or replacing c */
        match(t1, w+1, s+c, (*w==c ? budget : budget-1));
        /* try deleting c */
        match(t1, w, s, budget-1);
      }
      /* try inserting *w */
      match(t, w+1, s + *w, budget-1);
      /* TRY SWAPPING FIRST TWO CHARACTERS */
      if (w[1]){
        swap(w[0], w[1]);
        match(t, w, s, budget-1);
        swap(w[0], w[1]);
      }
    }
    

    This is for dictionary search in a trie, but for matching to a single word, it's the same idea. You're doing branch-and-bound, and at any point, you can make any change you like, as long as you give it a cost.
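For matching a single pair of words, the same branch-and-bound idea can be sketched in Python. The `within_budget` helper below is a hypothetical illustration (not from the original answer): it explores the same operations as the trie code — match/replace, delete, insert, and the added adjacent-character swap — and succeeds if the total cost stays within the budget.

```python
def within_budget(a, b, budget):
    """Return True if `a` can be turned into `b` using at most `budget`
    single-character edits (replace, insert, delete) or adjacent swaps."""
    if budget < 0:
        return False
    if not a:
        return len(b) <= budget  # insert the rest of b
    if not b:
        return len(a) <= budget  # delete the rest of a
    # try matching or replacing the first character
    if within_budget(a[1:], b[1:], budget if a[0] == b[0] else budget - 1):
        return True
    # try deleting a's first character, or inserting b's first character
    if within_budget(a[1:], b, budget - 1) or within_budget(a, b[1:], budget - 1):
        return True
    # try swapping a's first two characters
    if len(a) >= 2 and within_budget(a[1] + a[0] + a[2:], b, budget - 1):
        return True
    return False
```

With the swap operation, `within_budget("abcd", "acbd", 1)` succeeds, whereas plain Levenshtein distance between those strings is 2.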

  • 2021-01-30 02:59

    Explode on spaces, sort the array, implode, then do the Levenshtein.
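A Python sketch of the same trick (with a minimal `levenshtein` helper, since Python's standard library has none):

```python
def levenshtein(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete
                           cur[j - 1] + 1,               # insert
                           prev[j - 1] + (ca != cb)))    # replace/match
        prev = cur
    return prev[-1]

def sorted_levenshtein(a, b):
    """Split on spaces, sort the words, rejoin, then compare."""
    return levenshtein(" ".join(sorted(a.split())),
                       " ".join(sorted(b.split())))
```

For example, `sorted_levenshtein("The quick brown fox", "brown quick The fox")` is 0 while the plain distance is large. Note the trade-off: this treats any reordering as free, so it also erases word-order differences that may be meaningful.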

  • 2021-01-30 03:05

    If the first string is A and the second one is B:

    1. Split A and B into words
    2. For every word in A, find the best matching word in B (using levenshtein)
    3. Remove that word from B and put it in B* at the same index as the matching word in A.
    4. Now compare A and B*

    Example:

    A: The quick brown fox
    B: Quick blue fox the
    B*: the Quick blue fox
    

    You could improve step 2 by doing it in multiple passes, finding only exact matches at first, then finding close matches for words in A that don't have a companion in B* yet, then less close matches, etc.
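Steps 1–3 can be sketched in Python with a greedy best-match pass (a single pass rather than the multi-pass refinement suggested above; `levenshtein` is a minimal helper, and ties go to the word appearing earlier in B):

```python
def levenshtein(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def reorder(a, b):
    """For every word in A, pull the closest remaining word out of B,
    placing it at the same index (steps 2 and 3)."""
    words_a, remaining = a.split(), b.split()
    b_star = []
    for wa in words_a:
        if not remaining:
            break
        best = min(remaining, key=lambda wb: levenshtein(wa, wb))
        b_star.append(best)
        remaining.remove(best)
    return " ".join(b_star)
```

Running `reorder("The quick brown fox", "Quick blue fox the")` reproduces the example's B*, `"the Quick blue fox"`, which can then be compared against A with plain Levenshtein (step 4).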

  • 2021-01-30 03:08

    You can also try this (just an extra suggestion):

    $one = metaphone("The quick brown fox"); // 0KKBRNFKS
    $two = metaphone("brown quick The fox"); // BRNKK0FKS
    $three = metaphone("The quiet swine flu"); // 0KTSWNFL
    
    similar_text($one, $two, $percent1); // 66.666666666667
    similar_text($one, $three, $percent2); // 47.058823529412
    similar_text($two, $three, $percent3); // 23.529411764706
    

    This shows that the first and second codes are more similar to each other than either is to the third.
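PHP's `similar_text` counts matched characters by a recursive longest-common-substring method; Python's `difflib.SequenceMatcher` works on the same principle, so the ordering above can be reproduced on the Metaphone codes (this sketch uses the codes printed in the answer rather than recomputing Metaphone, which the Python standard library lacks):

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # ratio() = 2 * matched characters / total characters,
    # found by recursive longest-common-substring matching
    return SequenceMatcher(None, a, b).ratio()

one, two, three = "0KKBRNFKS", "BRNKK0FKS", "0KTSWNFL"
r1 = similarity(one, two)      # highest: same blocks, reordered
r2 = similarity(one, three)
r3 = similarity(two, three)    # lowest
```

The resulting ratios preserve the ordering of the `similar_text` percentages shown above.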

  • 2021-01-30 03:17

    N-grams

    Use N-grams, which support multiple-character transpositions across the whole text.

    The general idea is that you split the two strings in question into all the possible 2–3 character substrings (n-grams) and treat the number of shared n-grams between the two strings as their similarity metric. This can then be normalized by dividing the shared count by the total number of n-grams in the longer string. This is trivial to calculate, but fairly powerful.

    For the example sentences:

    A. The quick brown fox
    B. brown quick The fox
    C. The quiet swine flu
    

    A and B share 18 2-grams

    A and C share only 8 2-grams

    out of 20 total possible.
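A set-based sketch of this metric (exact counts depend on tokenization details such as case handling and padding, so they may differ slightly from the figures above, but the ordering holds):

```python
def ngrams(s, n=2):
    """All n-character substrings of s (spaces included)."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(a, b, n=2):
    """Shared n-grams divided by the n-gram count of the longer string."""
    na, nb = ngrams(a, n), ngrams(b, n)
    return len(na & nb) / max(len(na), len(nb))

A = "The quick brown fox"
B = "brown quick The fox"
C = "The quiet swine flu"
```

Here `ngram_similarity(A, B)` comes out far higher than the similarity of either sentence to C.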

    This has been discussed in more detail in the Gravano et al. paper.

    tf-idf and cosine similarity

    A not-so-trivial alternative, but grounded in information theory, would be to use term frequency–inverse document frequency (tf-idf) to weigh the tokens, construct sentence vectors, and then use cosine similarity as the similarity metric.

    The algorithm is:

    1. Calculate 2-character token frequencies (tf) per sentence.
    2. Calculate inverse sentence frequencies (idf): idf(t) = log(N / df(t)), where N is the number of sentences in the corpus (in this case 3) and df(t) is the number of sentences containing token t. The token th appears in every sentence, so it carries zero information content (log(3/3) = 0).
    3. Produce the tf-idf matrix by multiplying corresponding cells in the tf and idf tables.
    4. Finally, calculate the cosine similarity for every sentence pair, cos(A, B) = (A · B) / (|A| |B|), where A and B are the tf-idf weight vectors of the two sentences. The range is from 0 (not similar) to 1 (equal).
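The four steps above can be sketched with the Python standard library alone (a toy sketch over character-bigram tokens, not a tuned implementation):

```python
import math
from collections import Counter

sentences = ["The quick brown fox", "brown quick The fox", "The quiet swine flu"]

def tokens(s):
    return [s[i:i + 2] for i in range(len(s) - 1)]  # 2-character tokens

# 1. term frequencies per sentence
tfs = [Counter(tokens(s)) for s in sentences]

# 2. inverse sentence frequencies: log(N / df); tokens appearing in
#    every sentence (like "Th") get weight log(3/3) = 0
N = len(sentences)
df = Counter(t for tf in tfs for t in tf)
idf = {t: math.log(N / df[t]) for t in df}

# 3. tf-idf vectors, one per sentence
vecs = [{t: tf[t] * idf[t] for t in tf} for tf in tfs]

# 4. cosine similarity between two weight vectors
def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

On this tiny corpus, `cosine(vecs[0], vecs[1])` comes out well above `cosine(vecs[0], vecs[2])`: every bigram the first and third sentences share occurs in all three sentences and so carries zero weight.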

    Levenshtein modifications and Metaphone

    Regarding the other answers: the Damerau–Levenshtein modification supports only the transposition of two adjacent characters. Metaphone was designed to match words that sound the same, not for similarity matching.

  • 2021-01-30 03:18

    It's easy: just use the Damerau-Levenshtein distance on the words instead of letters.
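A sketch of this in Python, using the optimal-string-alignment variant of Damerau-Levenshtein over arbitrary sequences, so it works on word lists as well as strings:

```python
def damerau_levenshtein(a, b):
    """Optimal-string-alignment distance; `a` and `b` can be any
    sequences, including lists of words."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = int(a[i - 1] != b[j - 1])
            d[i][j] = min(d[i - 1][j] + 1,         # delete
                          d[i][j - 1] + 1,         # insert
                          d[i - 1][j - 1] + cost)  # replace/match
            # transposition of two adjacent elements (words, here)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]
```

With word tokens, `damerau_levenshtein("the quick brown fox".split(), "quick the brown fox".split())` is 1: swapping two adjacent words costs a single edit, where word-level Levenshtein would charge 2 substitutions. Non-adjacent swaps still cost 2, since this variant only transposes neighbors.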
