Improving search result using Levenshtein distance in Java


南方客 2021-01-31 03:08

I have the following working Java code that searches for a word in a list of words, and it works as expected:

public class Levenshtein {
    privat


      
      
        
5 answers

心在旅途 (original poster) 2021-01-31 04:00

You can modify Levenshtein Distance by adjusting the scoring when consecutive characters match.

Whenever consecutive characters match, the score can be reduced, which makes the search more relevant.

For example, say the factor by which we reduce the score is 10. If a word contains the substring "job", we reduce the score by 10 when we match "j", by a further 20 when we match "jo" (10 + 20 = 30 in total), and by a further 30 when we match "job" (10 + 20 + 30 = 60 in total).
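
To make that arithmetic concrete, here is a minimal sketch of how the reduction accumulates over a run of matches. The helper name `cumulativeBonus` is an assumption introduced only for illustration; it is not part of the answer's program.

```cpp
// Hypothetical helper (not in the original program): total reduction earned
// by a run of `runLength` consecutive matching characters, when the k-th
// match in the run subtracts factor * k from the score.
int cumulativeBonus(int runLength, int factor) {
    int total = 0;
    for (int k = 1; k <= runLength; ++k) {
        total += factor * k;  // with factor = 10: 10, then 20, then 30, ...
    }
    return total;
}
```

With a factor of 10, a run of length 3 such as "job" gives 10 + 20 + 30 = 60, so an exact match of "job" against "job" ends up with a score of -60.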

I have written the C++ code below:

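#include <algorithm>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

#define INF -10000000   // sentinel: memo cell not computed yet
#define FACTOR 10       // bonus multiplier for consecutive matches

using namespace std;

// memo[i][j][count]; assumes every word is shorter than 100 characters
double memo[100][100][100];

// Score for inputWord[i..] against checkWord[j..]; count is the number of
// consecutive matching characters immediately before positions i and j.
double Levenshtein(const string& inputWord, const string& checkWord, int i, int j, int count){
    const int n = inputWord.length(), m = checkWord.length();
    if(i == n && j == m) return 0;
    if(i == n) return m - j;   // only insertions remain
    if(j == m) return n - i;   // only deletions remain
    if(memo[i][j][count] != INF) return memo[i][j][count];

    double ans1 = 0, ans2 = 0, ans3 = 0, ans = 0;
    if(inputWord[i] == checkWord[j]){
        // match: extend the run and subtract a bonus that grows with its length
        ans1 = Levenshtein(inputWord, checkWord, i+1, j+1, count+1) - (FACTOR*(count+1));
        // or skip a character of either word, which resets the run
        ans2 = Levenshtein(inputWord, checkWord, i+1, j, 0) + 1;
        ans3 = Levenshtein(inputWord, checkWord, i, j+1, 0) + 1;
        ans = min(ans1, min(ans2, ans3));
    }else{
        // mismatch: only deletion or insertion is considered, each costing 1
        ans1 = Levenshtein(inputWord, checkWord, i+1, j, 0) + 1;
        ans2 = Levenshtein(inputWord, checkWord, i, j+1, 0) + 1;
        ans = min(ans1, ans2);
    }
    return memo[i][j][count] = ans;
}

int main(void) {
    string word = "job";
    string wordList[40];
    vector< pair<double, string> > ans;
    for(int i = 0;i < 40;i++){
        cin >> wordList[i];
        // reset the memo table before scoring each word
        for(int j = 0;j < 100;j++) for(int k = 0;k < 100;k++){
            for(int m = 0;m < 100;m++) memo[j][k][m] = INF;
        }
        ans.push_back( make_pair(Levenshtein(word, wordList[i],
            0, 0, 0), wordList[i]) );
    }
    // lowest score first, i.e. best match first
    sort(ans.begin(), ans.end());
    for(size_t i = 0;i < ans.size();i++){
        cout << ans[i].second << " " << ans[i].first << endl;
    }
    return 0;
}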


Link to demo: http://ideone.com/4UtCX3

Here FACTOR is set to 10. You can experiment with other words and choose an appropriate value.

Also note that the complexity of this modified Levenshtein distance has increased: it is now O(n^3) instead of O(n^2), because the state also tracks a counter for how many consecutive characters have matched so far.
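
The cubic cost is easier to see when the same recurrence is filled in bottom-up, with one loop per state variable. The following is only a sketch under assumptions: the name `runAwareDistance` and the `factor` parameter are introduced here for illustration, and it uses a dynamically sized table instead of the fixed `memo` array, so it is not the code from the demo link.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical bottom-up variant of the recurrence above.
// dp[i][j][c] is the best score for the suffixes a[i..] and b[j..] when the
// c characters immediately before them were consecutive matches.
double runAwareDistance(const std::string& a, const std::string& b,
                        double factor = 10.0) {
    const int n = static_cast<int>(a.size());
    const int m = static_cast<int>(b.size());
    const int maxRun = std::min(n, m);  // a run cannot exceed the shorter word

    std::vector<std::vector<std::vector<double>>> dp(
        n + 1, std::vector<std::vector<double>>(
                   m + 1, std::vector<double>(maxRun + 1, 0.0)));

    // Three nested loops over (i, j, c): this is where O(n^3) comes from.
    for (int i = n; i >= 0; --i) {
        for (int j = m; j >= 0; --j) {
            for (int c = 0; c <= maxRun; ++c) {
                if (i == n || j == m) {
                    // One word is exhausted: pay 1 per leftover character.
                    dp[i][j][c] = (n - i) + (m - j);
                    continue;
                }
                // Skip a character of either word; the run resets to 0.
                double best = std::min(dp[i + 1][j][0], dp[i][j + 1][0]) + 1.0;
                // Extend the run; the guard only keeps c + 1 inside the table.
                if (a[i] == b[j] && c < maxRun) {
                    best = std::min(best,
                                    dp[i + 1][j + 1][c + 1] - factor * (c + 1));
                }
                dp[i][j][c] = best;
            }
        }
    }
    return dp[0][0][0];
}
```

For instance, `runAwareDistance("job", "job")` evaluates to -60 and `runAwareDistance("job", "xyz")` to 6, since with no matching characters the only option is three deletions plus three insertions.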

You can tune the score further by making the penalty grow gradually once a run of consecutive matches is followed by a mismatch, instead of always adding the fixed penalty of 1 as the current code does.

Also, in the above solution you can discard the strings whose score is >= 0, since they are not relevant at all. You can also choose a different threshold to make the search more accurate.
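
A rough sketch of that filtering step follows. It assumes the scores are already stored as (score, word) pairs, like the `ans` vector in the program above; the function name `keepRelevant` and the `threshold` parameter are hypothetical names chosen for this example.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: keeps only the (score, word) pairs whose score is
// strictly below `threshold`, ordered from lowest (best) score to highest.
std::vector<std::pair<double, std::string>> keepRelevant(
        std::vector<std::pair<double, std::string>> scored,
        double threshold = 0.0) {
    // Erase-remove idiom: drop every entry with score >= threshold.
    scored.erase(
        std::remove_if(scored.begin(), scored.end(),
                       [threshold](const std::pair<double, std::string>& entry) {
                           return entry.first >= threshold;
                       }),
        scored.end());
    std::sort(scored.begin(), scored.end());
    return scored;
}
```

Calling `keepRelevant(ans)` before printing would drop every word that scored 0 or more; passing a negative threshold such as `keepRelevant(ans, -20.0)` would be stricter. Which threshold works best depends on FACTOR and on the word list, so it needs to be tuned by experiment.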
    
             
                                                        
            
            
              
                