How to Calculate cosine similarity with tf-idf using Lucene and Java

Posted by 99封情书 on 2019-12-08 02:14:44

Question


I have a query and a set of documents, and I need to rank the documents by their tf-idf cosine similarity to the query. What support does Lucene offer for computing this? Which parameters can I get directly from Lucene (is there a method that returns tf and idf?), and how do I compute cosine similarity with Lucene (is there a function that returns the cosine similarity directly if I pass it the query vector and the document vector?)

Thanks in advance.


Answer 1:


Lucene's default scoring is already a modified form of cosine similarity, so if you need the raw cosine similarity itself, it is probably doable. I recommend the official page that discusses Lucene scoring.

If you want to extract that info on your own, this would be an outline of the steps for tf:

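  1. index the corpus (with term vectors stored for the field, otherwise step 4 returns null);
  2. open an IndexReader;
  3. iterate over all doc ids, 0 to maxDoc();
  4. call getTermFreqVector(doc, fieldName) on the reader for each doc id;
  5. iterate over the parallel arrays tfv.getTerms() and tfv.getTermFrequencies() of the returned vector.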

As for the docFreq, call IndexReader.terms() and iterate over the returned TermEnum, calling termEnum.docFreq() for each term.
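Once you have the term frequencies and document frequencies out of the index, the rest is plain arithmetic. Here is a minimal sketch of that last part in plain Java, with no Lucene calls in it. The class and method names (TfIdfCosine, tfIdf, cosine) are made up for illustration, and the weighting tf * ln(N / df) is an assumption: it is one common tf-idf variant, not the formula Lucene uses for its own scores.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: turns raw counts into tf-idf
// weights and compares two weighted vectors by cosine similarity.
class TfIdfCosine {

    // Weight each term as tf * ln(numDocs / docFreq).
    // Assumption: this idf variant is chosen for illustration only.
    static Map<String, Double> tfIdf(Map<String, Integer> termFreqs,
                                     Map<String, Integer> docFreqs,
                                     int numDocs) {
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            Integer df = docFreqs.get(e.getKey());
            if (df == null || df == 0) {
                continue; // term does not occur in the index, skip it
            }
            double idf = Math.log((double) numDocs / df);
            weights.put(e.getKey(), e.getValue() * idf);
        }
        return weights;
    }

    // Cosine similarity of two sparse vectors: dot(a, b) / (|a| * |b|).
    // Returns 0 if either vector has zero length.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        // Made-up counts standing in for values read from the index.
        int numDocs = 10;
        Map<String, Integer> docFreqs = new HashMap<>();
        docFreqs.put("lucene", 2);
        docFreqs.put("java", 5);
        docFreqs.put("cosine", 1);

        Map<String, Integer> queryTf = new HashMap<>();
        queryTf.put("lucene", 1);
        queryTf.put("cosine", 1);

        Map<String, Integer> docTf = new HashMap<>();
        docTf.put("lucene", 3);
        docTf.put("java", 4);

        double score = cosine(tfIdf(queryTf, docFreqs, numDocs),
                              tfIdf(docTf, docFreqs, numDocs));
        System.out.println("cosine similarity = " + score);
    }
}
```

To rank, you would build one such map per document from its term vector, build one for the query, and sort the documents by the cosine value in descending order.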



Source: https://stackoverflow.com/questions/10173202/how-to-calculate-cosine-similarity-with-tf-idf-using-lucene-and-java
