Computation of Kullback-Leibler (KL) distance between text-documents using numpy

梦毁少年i 2021-02-08 05:33

My goal is to compute the KL distance between the following text documents:

1) The boy is having a lad relationship
2) The boy is having a boy relationship
3) It is


        
3 Answers
  •  不思量自难忘°
    2021-02-08 05:35

    After a bit of googling to understand the KL concept, I think your problem is due to the vectorization: at each position you are comparing the counts of different words. You should either tie each column index to one word, or use a dictionary:

    #  The boy is having a lad relationship It lovely day in NY
    1)[1   1   1  1      1 1   1            0  0      0   0  0]
    2)[1   2   1  1      1 0   1            0  0      0   0  0]
    3)[0   0   1  0      1 0   0            1  1      1   1  1]
    

    Then you can use your kl function.
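
As a rough illustration of that step, here is a minimal sketch that applies a KL computation to the count vectors above. The asker's own kl function is not shown in this thread, so `kl_divergence` and its `eps` smoothing constant are hypothetical choices made for this example; some smoothing is assumed because a word that is missing from the second document would otherwise produce a division by zero.

```python
import numpy as np

# Count vectors from the table above. Columns are, in order:
# The, boy, is, having, a, lad, relationship, It, lovely, day, in, NY
counts = np.array([
    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    [1, 2, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1],
], dtype=float)


def kl_divergence(p_counts, q_counts, eps=1e-9):
    """KL(P || Q) from two raw count vectors over the same vocabulary.

    Hypothetical helper, not the asker's function. eps is an assumed
    additive smoothing constant that keeps every probability positive.
    """
    p = p_counts + eps
    q = q_counts + eps
    p = p / p.sum()  # normalise counts into probabilities
    q = q / q.sum()
    return float(np.sum(p * np.log(p / q)))


d12 = kl_divergence(counts[0], counts[1])  # document 1 vs document 2
d13 = kl_divergence(counts[0], counts[2])  # document 1 vs document 3
print(d12, d13)
```

The result depends strongly on `eps`, and KL is not symmetric, so `kl_divergence(a, b)` and `kl_divergence(b, a)` generally differ.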

    To vectorize automatically into a dictionary, see "How to count the frequency of the elements in a list?" (collections.Counter is exactly what you need). Then you can loop over the union of the dictionaries' keys to compute the KL distance.
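
The dictionary route could look roughly like the sketch below. It uses only the two complete sentences from the question; the whitespace tokenisation, the `kl_from_counters` name, and the `eps` smoothing constant are all assumptions made for this example rather than anything given in the thread.

```python
import math
from collections import Counter

# The two complete documents from the question.
docs = [
    "The boy is having a lad relationship",
    "The boy is having a boy relationship",
]


def kl_from_counters(p_counts, q_counts, eps=1e-9):
    """KL(P || Q) computed over the union of both vocabularies.

    Hypothetical helper. eps is an assumed additive smoothing constant
    so that a word absent from one document does not give log(0).
    """
    vocab = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + eps * len(vocab)
    q_total = sum(q_counts.values()) + eps * len(vocab)
    total = 0.0
    for word in vocab:
        # A Counter returns 0 for a missing key, so no KeyError here.
        p = (p_counts[word] + eps) / p_total
        q = (q_counts[word] + eps) / q_total
        total += p * math.log(p / q)
    return total


# Simple whitespace split (assumption: no lowercasing or punctuation handling).
c1 = Counter(docs[0].split())
c2 = Counter(docs[1].split())
kl_12 = kl_from_counters(c1, c2)
print(kl_12)
```

Because each word is looked up by key, the counts of "boy" and "lad" can no longer end up in the same column by accident.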
