tf idf similarity

淺唱寂寞╮ submitted on 2019-12-12 11:48:00

Question


I am using TF/IDF to calculate similarity. For example, suppose I have the following two documents.

Doc A => cat dog
Doc B => dog sparrow

Intuitively, their similarity should be 50%, but when I calculate TF/IDF I get the following:

Tf values for Doc A

dog tf = 0.5
cat tf = 0.5

Tf values for Doc B

dog tf = 0.5
sparrow tf = 0.5

IDF values for Doc A

dog idf = -0.4055
cat idf = 0

IDF values for Doc B

dog idf = -0.4055 (0.6931 using the formula without the +1)
sparrow idf = 0

TF/IDF value for Doc A

0.5 * (-0.4055) + 0.5 * 0 = -0.20275

TF/IDF values for Doc B

0.5 * (-0.4055) + 0.5 * 0 = -0.20275

Now it looks like the similarity is -0.20275. Is that right? Am I missing something, or is there a further step? Please tell me so I can calculate that too.

I used the TF/IDF formula given on Wikipedia.


Answer 1:


Let's see if I get your question: You want to calculate the TF/IDF similarity between the two documents:

Doc A: cat dog

and

Doc B: dog sparrow

I take it that this is your whole corpus, so |D| = 2. The TFs are indeed 0.5 for all words. To calculate the IDF of 'dog', take log(|D| / |{d : dog in d}|) = log(2/2) = 0. Similarly, the IDFs of 'cat' and 'sparrow' are log(2/1) = log(2) = 1 (I use 2 as the log base to make this easier).

Therefore, the TF/IDF value for 'dog' is 0.5*0 = 0, and the TF/IDF value for 'cat' and 'sparrow' is 0.5*1 = 0.5.

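To measure the similarity between the two documents, calculate the cosine between their vectors in the (cat, sparrow, dog) space: (0.5, 0, 0) for Doc A and (0, 0.5, 0) for Doc B. The only shared term, 'dog', has weight 0 because it occurs in every document, so the dot product, and therefore the cosine, is 0.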
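The calculation in this answer can be reproduced with a short script. This is only a sketch under the answer's assumptions: the two documents are the whole corpus, TF is the relative frequency, and IDF is the unsmoothed log base 2 of |D| divided by the document frequency. The helper names (`tf`, `idf`, `tfidf_vector`, `cosine`) are illustrative, and the vector components are ordered alphabetically (cat, dog, sparrow) rather than (cat, sparrow, dog).

```python
import math

# The two-document corpus from the question.
docs = {"A": ["cat", "dog"], "B": ["dog", "sparrow"]}
vocab = sorted({t for tokens in docs.values() for t in tokens})

def tf(term, tokens):
    # Relative term frequency: occurrences divided by document length.
    return tokens.count(term) / len(tokens)

def idf(term):
    # Unsmoothed IDF, log base 2: log2(|D| / number of docs containing term).
    df = sum(1 for tokens in docs.values() if term in tokens)
    return math.log2(len(docs) / df)

def tfidf_vector(tokens):
    # One TF*IDF weight per vocabulary term, in vocab order.
    return [tf(t, tokens) * idf(t) for t in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vec_a = tfidf_vector(docs["A"])
vec_b = tfidf_vector(docs["B"])
print(vocab)                 # ['cat', 'dog', 'sparrow']
print(vec_a)                 # [0.5, 0.0, 0.0]
print(vec_b)                 # [0.0, 0.0, 0.5]
print(cosine(vec_a, vec_b))  # 0.0
```

Note that the similarity is a cosine between two per-term weight vectors, not a sum of the TF*IDF products within a single document.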

To sum it up:

  1. You have an error in the IDF calculations.
  2. This error creates wrong TF/IDF values.
  3. The Wikipedia article does not explain the use of TF/IDF for similarity well enough. I like Manning, Raghavan & Schütze's explanation much better.



Answer 2:


I think you have to take ln instead of log.
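For cosine similarity, the choice of base may matter less than it seems: with the unsmoothed IDF log(N / df), changing the base multiplies every IDF by the same positive constant, and the cosine does not change under such uniform scaling. A small sketch of this, assuming a hypothetical three-document corpus that is not part of the question and an illustrative helper named `similarity`:

```python
import math

# Hypothetical corpus, chosen so that documents 0 and 2 share a term
# ("cat") that does not occur in every document.
docs = [["cat", "dog"], ["dog", "sparrow"], ["dog", "cat", "fish"]]

def similarity(log_fn):
    """Cosine similarity of docs[0] and docs[2] using log_fn for the IDF."""
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    df = {t: sum(t in d for d in docs) for t in vocab}

    def vec(d):
        # Relative TF times unsmoothed IDF, log_fn(N / df).
        return [d.count(t) / len(d) * log_fn(n / df[t]) for t in vocab]

    u, v = vec(docs[0]), vec(docs[2])
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Natural log, base 2, and base 10 give the same cosine up to rounding.
sims = [similarity(f) for f in (math.log, math.log2, math.log10)]
print(sims)
```

This does not hold for variants that add a constant after taking the logarithm, since the scaling is then no longer uniform.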




Answer 3:


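from math import log10, sqrt

# getidf(token) is not shown in this answer; it is assumed to be defined
# elsewhere and to return the inverse document frequency of `token`.

def calctfidfvec(tfvec, withidf):
    """Return a unit-length, log-weighted TF (or TF-IDF) vector.

    tfvec maps each token to its raw count (>= 1) in one document.
    """
    tfidfvec = {}
    veclen = 0.0  # accumulates the squared Euclidean norm

    for token in tfvec:
        # Sublinear TF scaling: 1 + log10(count)
        if withidf:
            tfidf = (1 + log10(tfvec[token])) * getidf(token)
        else:
            tfidf = 1 + log10(tfvec[token])
        tfidfvec[token] = tfidf
        veclen += tfidf ** 2

    # Normalize to unit length so that a dot product equals the cosine
    if veclen > 0:
        norm = sqrt(veclen)
        for token in tfvec:
            tfidfvec[token] /= norm

    return tfidfvec

def cosinesim(vec1, vec2):
    """Cosine similarity of two unit-length sparse vectors (dicts)."""
    # Only tokens present in both vectors contribute to the dot product
    commonterms = set(vec1).intersection(vec2)
    sim = 0.0
    for token in commonterms:
        sim += vec1[token] * vec2[token]

    return sim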
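
The functions above depend on a `getidf` helper and on term-count vectors that the answer does not show. The sketch below fills those gaps under stated assumptions: `corpus`, the body of `getidf`, and `unit_tfidf` are hypothetical, the IDF is taken to be the unsmoothed log10(N / df), and `unit_tfidf` is a compact restatement of the same log weighting and normalization so that the example runs on its own.

```python
from collections import Counter
from math import log10, sqrt

# Hypothetical tokenized corpus (not from the original answer).
corpus = [
    "cat dog dog".split(),
    "dog sparrow".split(),
    "cat cat fish".split(),
]

def getidf(token):
    # Assumed definition: unsmoothed IDF, log10(N / document frequency).
    df = sum(1 for doc in corpus if token in doc)
    return log10(len(corpus) / df) if df else 0.0

def unit_tfidf(tokens):
    # Raw counts -> (1 + log10(count)) * idf -> scaled to unit length.
    counts = Counter(tokens)
    weights = {t: (1 + log10(c)) * getidf(t) for t, c in counts.items()}
    norm = sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else weights

def cosinesim(vec1, vec2):
    # For unit-length vectors the dot product over shared tokens is the cosine.
    return sum(vec1[t] * vec2[t] for t in set(vec1) & set(vec2))

vecs = [unit_tfidf(doc) for doc in corpus]
print(cosinesim(vecs[0], vecs[2]))  # documents 0 and 2 share "cat"
print(cosinesim(vecs[0], vecs[1]))  # documents 0 and 1 share "dog"
print(cosinesim(vecs[1], vecs[2]))  # no shared tokens
```

Because each vector is normalized up front, comparing one document against many others needs only a sparse dot product per pair.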


Source: https://stackoverflow.com/questions/1986943/tf-idf-similarity
