Computation of Kullback-Leibler (KL) distance between text documents using numpy

梦毁少年i · 2021-02-08 05:33

My goal is to compute the KL distance between the following text documents:

1) The boy is having a lad relationship
2) The boy is having a boy relationship
3) It is

3 Answers
  •  长情又很酷 · 2021-02-08 05:49

    Though I hate to add another answer, there are two points here. First, as Jaime pointed out in the comments, KL divergence (or distance; per the documentation linked below, they are the same thing) is designed to measure the difference between probability distributions. This means that what you pass to the function should be two array-likes, the elements of each of which sum to 1.
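    For instance, here is a minimal sketch (my own illustration, assuming a simple whitespace bag-of-words representation, which the question does not specify) of turning two of the documents above into probability vectors over a shared vocabulary:

    import numpy as np

    doc1 = "The boy is having a lad relationship".lower().split()
    doc2 = "The boy is having a boy relationship".lower().split()

    vocab = sorted(set(doc1) | set(doc2))                      # shared vocabulary
    counts1 = np.array([doc1.count(w) for w in vocab], dtype=float)
    counts2 = np.array([doc2.count(w) for w in vocab], dtype=float)

    pk = counts1 / counts1.sum()                               # each vector now sums to 1
    qk = counts2 / counts2.sum()

    # Caveat: KL divergence is infinite wherever qk is 0 but pk is not
    # (here "lad" appears only in doc1), so smoothing the counts is often needed.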

    Second, scipy apparently does implement this, with a naming scheme more related to the field of information theory. The function is "entropy":

    scipy.stats.entropy(pk, qk=None, base=None)
    

    http://docs.scipy.org/doc/scipy-dev/reference/generated/scipy.stats.entropy.html

    From the docs:

    If qk is not None, then compute a relative entropy (also known as Kullback-Leibler divergence or Kullback-Leibler distance) S = sum(pk * log(pk / qk), axis=0).

    A further bonus of this function is that it will normalize the vectors you pass it if they do not sum to 1 (though this means you have to be careful with the arrays you pass in, i.e., how they are constructed from the data).
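    As a quick check (my own sketch, with made-up counts), calling entropy on raw, unnormalized counts gives the same value as applying the documented formula to explicitly normalized vectors:

    import numpy as np
    from scipy.stats import entropy

    counts_p = np.array([3., 1., 2., 2.])          # made-up word counts
    counts_q = np.array([2., 2., 2., 2.])

    kl_from_counts = entropy(counts_p, counts_q)   # entropy() normalizes internally

    pk = counts_p / counts_p.sum()
    qk = counts_q / counts_q.sum()
    kl_manual = np.sum(pk * np.log(pk / qk))       # S = sum(pk * log(pk / qk))

    print(np.isclose(kl_from_counts, kl_manual))   # True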

    Hope this helps; at least a library provides it, so you don't have to code your own.
