How to Calculate cosine similarity with tf-idf using Lucene and Java

Question

I have a query and a set of documents, and I need to rank the documents by their tf-idf cosine similarity to the query. What support does Lucene offer for this? Which quantities can I get directly from Lucene (is there a method that returns tf and idf?), and how do I compute cosine similarity with it (is there a function that returns the cosine similarity when I pass in the query vector and the document vector?)

Thanks in advance.

Answer 1:

Lucene's scoring already uses a modified version of cosine similarity, so if you need the raw cosine similarity itself, it's probably doable. I recommend the official page that discusses Lucene scoring.

If you want to extract that info on your own, this would be an outline of the steps for tf:

index the corpus;
open an IndexReader;
iterate over all doc ids, from 0 to maxDoc() - 1;
call getTermFreqVector(doc, fieldName) for each id to get its term frequency vector (tfv);
iterate over the parallel arrays tfv.getTerms() and tfv.getTermFrequencies().

As for the docFreq, call IndexReader.terms() and iterate over the TermEnum it returns, calling termEnum.docFreq() for each term.
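
Once the term frequencies and document frequencies have been read out of the index, the cosine similarity itself can be computed in plain Java. The following is only a minimal sketch under stated assumptions, not Lucene's own scoring formula: the class TfIdfCosine and its method names are hypothetical, the weighting tf * ln(N / df) is one common tf-idf variant picked for illustration, and numDocs stands for the number of documents in the corpus. The terms and freqs parameters mirror the parallel arrays described above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
final class TfIdfCosine {

    // Builds a sparse tf-idf vector from parallel term/frequency arrays.
    // Assumed weighting: tf * ln(numDocs / df).
    static Map<String, Double> tfIdfVector(String[] terms, int[] freqs,
                                           Map<String, Integer> docFreq, int numDocs) {
        Map<String, Double> vec = new HashMap<>();
        for (int i = 0; i < terms.length; i++) {
            Integer df = docFreq.get(terms[i]);
            if (df == null || df == 0) {
                continue; // term unknown to the corpus, so it has no idf
            }
            double idf = Math.log((double) numDocs / df);
            vec.put(terms[i], freqs[i] * idf);
        }
        return vec;
    }

    // Cosine similarity: dot(a, b) / (|a| * |b|); 0 if either vector has zero length.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) {
                dot += e.getValue() * w;
            }
        }
        double na = norm(a);
        double nb = norm(b);
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / (na * nb);
    }

    // Euclidean length of a sparse vector.
    static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }
}
```

To rank documents, build one vector for the query from its own term counts, one vector per document, and sort the documents by descending cosine value.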

Source: https://stackoverflow.com/questions/10173202/how-to-calculate-cosine-similarity-with-tf-idf-using-lucene-and-java

Tags

java

lucene

tf-idf

cosine-similarity
