What is a good strategy to group similar words?

Say I have a list of movie names with misspellings and small variations like this:

 "Pirates of the Caribbean: The Curse of the Black Pearl"
 "Pirates o


                      


      
      
        
5 Answers

        
                         				            
            
           
            
                              
                
              
              
                
                  半阙折子戏        
                
              
                            
                2020-12-29 11:48
              
            
            
                                                                       
Have a look at "fuzzy matching". The thread linked below covers some great tools that calculate the similarity between strings.

I'm especially fond of the difflib module:

>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']
>>> import keyword
>>> get_close_matches('wheel', keyword.kwlist)
['while']
>>> get_close_matches('apple', keyword.kwlist)
[]
>>> get_close_matches('accept', keyword.kwlist)
['except']


https://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison
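
Applied to the movie-title problem, the same function can map a misspelled title onto a list of known-good titles. The following is only a sketch: the `CANONICAL_TITLES` list, the `closest_title` helper, and the 0.8 cutoff are assumptions made for illustration, not something from the question.

```python
from difflib import get_close_matches

# Hypothetical reference list; in practice this would come from your own data.
CANONICAL_TITLES = [
    "Pirates of the Caribbean: The Curse of the Black Pearl",
    "Pirates of the Caribbean: Dead Man's Chest",
    "The Lord of the Rings: The Fellowship of the Ring",
]


def closest_title(raw, candidates=CANONICAL_TITLES, cutoff=0.8):
    """Return the best match for `raw`, or None if nothing clears `cutoff`."""
    # n=1 asks difflib for the single highest-scoring candidate only.
    matches = get_close_matches(raw, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    print(closest_title("Pirates of the Caribean: The Curse of the Black Perl"))
    print(closest_title("Completely unrelated string"))
```

The cutoff is a trade-off: a higher value rejects more junk but also misses heavily mangled titles, so it is worth tuning on a sample of your own data.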
                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  失恋的感觉        
                
              
                            
                2020-12-29 11:53
              
            
            
                                                                       
I believe there are in fact two distinct problems.

The first is spelling correction. You can find one written in Python here:

http://norvig.com/spell-correct.html

The second is more functional. Here is what I'd do after the spelling correction: I would write a relation function.

Define related(sentence1, sentence2) to hold if and only if sentence1 and sentence2 share rare words. By rare, I mean anything other than common words (the, what, is, etc.). Have a look at TF-IDF weighting, which is used to determine whether two documents are related based on their words. A quick search turned up this:

https://code.google.com/p/tfidf/ 
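
To make the relation function concrete, here is a minimal sketch that uses only the IDF half of TF-IDF: each shared word contributes its inverse document frequency, so words that appear in nearly every title count for almost nothing. The helper names (`tokenize`, `idf_weights`) and the threshold of 1.0 are assumptions for illustration; this is not the API of the library linked above.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def idf_weights(documents):
    """Map each word to log(N / number of documents containing it)."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))  # count each word once per document
    total = len(documents)
    return {word: math.log(total / count) for word, count in doc_freq.items()}


def related(sentence1, sentence2, idf, threshold=1.0):
    """True if the words shared by both sentences are rare enough overall."""
    shared = set(tokenize(sentence1)) & set(tokenize(sentence2))
    # Words missing from the IDF table contribute nothing.
    return sum(idf.get(word, 0.0) for word in shared) >= threshold


if __name__ == "__main__":
    titles = [
        "The Curse of the Black Pearl",
        "Curse of the Black Pearl",
        "The Fellowship of the Ring",
        "The Two Towers",
        "The Return of the King",
        "Of Mice and Men",
    ]
    idf = idf_weights(titles)
    print(related(titles[0], titles[1], idf))  # share "curse", "black", "pearl"
    print(related(titles[2], titles[4], idf))  # share only "the" and "of"
```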
                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  我寻月下人不归        
                
              
                            
                2020-12-29 12:01
              
            
            
                                                                       
You might notice that similar strings share a long common subsequence, for example:


  "Bla bla bLa" and "Bla bla bRa" => the common subsequence is "Bla bla ba" (notice the third word)


To find the longest common subsequence you can use a dynamic programming algorithm. A closely related measure, computed the same way, is the Levenshtein distance, which is small for very similar strings and grows as the strings diverge: http://en.wikipedia.org/wiki/Levenshtein_distance.
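
For reference, the dynamic programming recurrence for Levenshtein distance fits in a few lines. This is a plain textbook sketch that keeps only two rows of the table; for large lists a compiled library would likely be faster.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string `a` into string `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("Bla bla bLa", "Bla bla bRa"))
    print(levenshtein("kitten", "sitting"))
```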

For faster matching you could also try adapting the Soundex algorithm, which encodes words by how they sound: http://en.wikipedia.org/wiki/Soundex.
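
A sketch of the American Soundex variant is shown below, on the assumption that it would be applied word by word to build a cheap key, so that only titles with matching keys need the more expensive distance computation. The `soundex` and `title_key` names are illustrative.

```python
# Consonant groups that share a Soundex digit.
_CODES = {}
for _digit, _letters in (("1", "bfpv"), ("2", "cgjkqsxz"), ("3", "dt"),
                         ("4", "l"), ("5", "mn"), ("6", "r")):
    for _letter in _letters:
        _CODES[_letter] = _digit


def soundex(word):
    """Return the 4-character Soundex code of a single word ("" if no letters)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = _CODES.get(letters[0], "")
    for char in letters[1:]:
        if char in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(char, "")
        if code and code != previous:
            result += code
        previous = code  # vowels reset the run, so a repeated code counts again
    return (result + "000")[:4]  # pad with zeros, then truncate to 4 characters


def title_key(title):
    """Soundex code of every word in a title, usable as a grouping key."""
    return tuple(code for code in map(soundex, title.split()) if code)


if __name__ == "__main__":
    print(soundex("Caribbean"), soundex("Caribean"))
    print(title_key("Pirates of the Caribbean") == title_key("Pirats of the Caribean"))
```

Soundex was designed for English surnames, so it tends to over-merge short words; treat matching keys as candidates to check, not as confirmed duplicates.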

After calculating the distance between all your strings, you have to cluster them. The simplest method is k-means, but it requires you to choose the number of clusters in advance (strictly speaking, k-means works on coordinates, so with pairwise distances a variant such as k-medoids is the closer fit). If you don't know the number of clusters, use hierarchical clustering. Note that the number of clusters in your situation is the number of distinct movie titles + 1 (for strings too badly misspelled to assign).
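
As a rough illustration of the hierarchical route, the sketch below performs single-linkage clustering cut at a fixed distance: any two strings closer than the cutoff end up in the same group, directly or through a chain of neighbours. To stay self-contained it uses `difflib.SequenceMatcher` as a stand-in distance instead of Levenshtein, and the 0.25 cutoff is an arbitrary assumption; both can be swapped through the parameters.

```python
from difflib import SequenceMatcher


def distance(a, b):
    """Stand-in distance in [0, 1]: 0 for identical strings (ignoring case)."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster(strings, max_distance=0.25, dist=distance):
    """Group strings by single linkage, cutting the hierarchy at max_distance."""
    parent = list(range(len(strings)))  # union-find forest over string indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    # Merge every pair that is close enough; this is O(n^2) distance calls.
    for i in range(len(strings)):
        for j in range(i + 1, len(strings)):
            if dist(strings[i], strings[j]) <= max_distance:
                parent[find(j)] = find(i)

    groups = {}
    for index, text in enumerate(strings):
        groups.setdefault(find(index), []).append(text)
    return list(groups.values())


if __name__ == "__main__":
    titles = [
        "Pirates of the Caribbean",
        "Pirates of the Caribean",
        "pirates of the carribean",
        "The Matrix",
        "The Matrx",
        "Up",
    ]
    for group in cluster(titles):
        print(group)
```

Single linkage can chain unrelated titles together through intermediate misspellings, so a stricter cutoff or complete linkage may suit noisy data better.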
                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  你的背包        
                
              
                            
                2020-12-29 12:02
              
            
            
                                                                       
To add another tip to Fredrik's answer, you could also take inspiration from search-engine-like code, such as this:

import re

# BASE_DIR, remove_tags() and dofind() are not shown here; they are
# expected to come from the rest of the linked source.
TITLE_RE = re.compile('<title>(.*)</title>')

def dosearch(terms, searchtype, case, affffdir, files=None):
    """Return a flat list [title, file, title, file, ...] of matching HTML files."""
    found = []
    for file in files or []:
        # Only HTML files are searched.
        if not file.lower().endswith(("html", "htm")):
            continue
        with open(BASE_DIR + affffdir + file, 'r') as handle:
            filecontents = handle.read()
        # Grab the page title from the raw HTML before the tags are stripped.
        titletmp = TITLE_RE.search(filecontents)
        title = titletmp.group(1) if titletmp else ""
        filecontents = remove_tags(filecontents).strip()
        if dofind(filecontents, case, searchtype, terms) > 0:
            found.append(title)
            found.append(file)
    return found


Source and more information: http://www.zackgrossbart.com/hackito/search-engine-python/

Regards,

Max
                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  春和景丽        
                
              
                            
                2020-12-29 12:02
              
            
            
                                                                       
One approach would be to pre-process all the strings before you compare them: convert everything to lowercase and standardize whitespace (e.g., replace any run of whitespace with a single space). If punctuation is not important to your end goal, you can remove all punctuation characters as well.

Levenshtein distance is commonly used to measure the similarity of two strings. It should help you group strings that differ only by small spelling errors.
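
The pre-processing step might look like the sketch below. The `normalize` name and the choice to delete ASCII punctuation outright (instead of replacing it with a space) are assumptions; pick whichever suits your titles.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_DROP_PUNCTUATION = str.maketrans("", "", string.punctuation)


def normalize(title, strip_punctuation=True):
    """Lowercase, optionally drop punctuation, and collapse whitespace."""
    text = title.lower()
    if strip_punctuation:
        text = text.translate(_DROP_PUNCTUATION)
    # split() with no arguments splits on any whitespace run and trims the ends.
    return " ".join(text.split())


if __name__ == "__main__":
    print(normalize("  Pirates of the   Caribbean:\tThe Curse of the Black Pearl "))
```

Running every title through the same function before computing distances means that differences in case, spacing, and punctuation no longer count as edits.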
                                                                        
                                                        
            
            
              
                