Computation of Kullback-Leibler (KL) distance between text-documents using numpy

前端未结

关注

 3  1743

My goal is to compute the KL distance between the following text documents:

1)The boy is having a lad relationship
2)The boy is having a boy relationship
3)It is


                      
              相关标签:


      
      
        
          3条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  不思量自难忘°        
                
              
                            
                2021-02-08 05:35
              
            
            
                                                                       
After a bit of googling to undersand the KL concept, I think that your problem is due to the vectorization : you're comparing the number of appearance of different words. You should either link your column indice to one word, or use a dictionnary:

#  The boy is having a lad relationship It lovely day in NY
1)[1   1   1  1      1 1   1            0  0      0   0  0]
2)[1   2   1  1      1 0   1            0  0      0   0  0]
3)[0   0   1  0      1 0   0            1  1      1   1  1]


Then you can use your kl function.

To automatically vectorize to a dictionnary, see How to count the frequency of the elements in a list? (collections.Counter is exactly what you need). Then you can loop over the union of the keys of the dictionaries to compute the KL distance.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  情歌与酒        
                
              
                            
                2021-02-08 05:42
              
            
            
                                                                       
A potential issue might be in your NP definition of KL. Read the wikipedia page for formula: http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence

Note that you multiply (p-q) by the log result. In accordance with the KL formula, this should only be p:

 return np.sum(np.where(p != 0,(p) * np.log10(p / q), 0))


That may help...
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  长情又很酷        
                
              
                            
                2021-02-08 05:49
              
            
            
                                                                       
Though I hate to add another answer, there are two points here. First, as Jaime pointed out in the comments, KL divergence (or distance - they are, according to the following documentation, the same) is designed to measure the difference between probability distributions. This means basically that what you pass to the function should be two array-likes, the elements of each of which sum to 1. 

Second, scipy apparently does implement this, with a naming scheme more related to the field of information theory. The function is "entropy":

scipy.stats.entropy(pk, qk=None, base=None)


http://docs.scipy.org/doc/scipy-dev/reference/generated/scipy.stats.entropy.html

From the docs:


  If qk is not None, then compute a relative entropy (also known as
  Kullback-Leibler divergence or Kullback-Leibler distance) S = sum(pk *
  log(pk / qk), axis=0).


The bonus of this function as well is that it will normalize the vectors you pass it if they do not sum to 1 (though this means you have to be careful with the arrays you pass - ie, how they are constructed from data).

Hope this helps, and at least a library provides it so don't have to code your own.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复