Getting the closest string match

后端未结

关注

 13  720

难免孤独 2020-11-22 10:57

I need a way to compare multiple strings to a test string and return the string that closely resembles it:

TEST STRING: THE BROWN FOX JUMPED OVER THE RED COW


      
      
        
          13条回答        

        
                    
            
            
                         
                
              
              
                
                   感情败类
                                             
                
                
                (楼主)
            
              
              
                2020-11-22 11:28
              

            
            
                        
To query a large set of text in efficient manner you can use the concept of Edit Distance/ Prefix Edit Distance.


  Edit Distance ED(x,y): minimal number of transfroms to get from term x to term y


But computing ED between each term and query text is resource and time intensive. Therefore instead of calculating ED for each term first we can extract possible matching terms using a technique called Qgram Index. and then apply ED calculation on those selected terms.

An advantage of Qgram index technique is it supports for Fuzzy Search.

One possible approach to adapt QGram index is build an Inverted Index using Qgrams. In there we store all the words which consists with particular Qgram, under that Qgram.(Instead of storing full string you can use unique ID for each string). You can use Tree Map data structure in Java for this.
Following is a small example on storing of terms


  col : colmbia, colombo, gancola, tacolama


Then when querying, we calculate the number of common Qgrams between query text and available terms.

Example: x = HILLARY, y = HILARI(query term)
Qgrams
$$HILLARY$$ -> $$H, $HI, HIL, ILL, LLA, LAR, ARY, RY$, Y$$
$$HILARI$$ -> $$H, $HI, HIL, ILA, LAR, ARI, RI$, I$$
number of q-grams in common = 4


number of q-grams in common = 4.

For the terms with high number of common Qgrams, we calculate the ED/PED against the query term and then suggest the term to the end user.

you can find an implementation of this theory in following project(See "QGramIndex.java"). Feel free to ask any questions. https://github.com/Bhashitha-Gamage/City_Search

To study more about Edit Distance, Prefix Edit Distance Qgram index please watch the following video of Prof. Dr Hannah Bast https://www.youtube.com/embed/6pUg2wmGJRo (Lesson starts from 20:06)
    
             
                                                        
            
            
              
                
                0
              
                   
                
               讨论(0)
              
                                                  
              
              
                          
             
       
          
              
                                       
     查看其它13个回答


            
                         
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
                              			
        
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复