How to calculate cosine similarity with tf-idf using Lucene and Java

Submitted by 南笙酒味 on 2019-12-06 06:00:37

Lucene's scoring is already built on a modified form of cosine similarity, so getting at the raw cosine similarity (CS) yourself should be doable. I recommend the official documentation page that discusses Lucene scoring.

If you want to extract that information on your own, this is an outline of the steps for the term frequencies (tf):

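  1. index the corpus, with term vectors stored for the field you care about (otherwise step 4 returns null);
  2. open an IndexReader;
  3. iterate over all doc ids, from 0 up to (but not including) maxDoc(), skipping deleted documents;
  4. call getTermFreqVector(doc, fieldName) for each doc id;
  5. iterate over the parallel arrays tfv.getTerms() and tfv.getTermFrequencies().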

As for the document frequencies (docFreq), call IndexReader.terms() and iterate over the returned TermEnum, calling termEnum.docFreq() for each term.
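
Once the term frequencies and document frequencies have been collected, the cosine similarity itself is plain arithmetic. The following is a minimal sketch in plain Java with no Lucene dependency. The class name TfIdfCosine and its methods are hypothetical helpers, and the weighting tf * (1 + ln(numDocs / docFreq)) is one common textbook variant chosen for illustration; it is not Lucene's own scoring formula.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: not part of Lucene.
final class TfIdfCosine {

    // Turn the two parallel arrays (terms and their frequencies) into a map,
    // mirroring what step 5 above reads from a term frequency vector.
    static Map<String, Integer> toTfMap(String[] terms, int[] freqs) {
        Map<String, Integer> tf = new HashMap<>();
        for (int i = 0; i < terms.length; i++) {
            tf.put(terms[i], freqs[i]);
        }
        return tf;
    }

    // Weight each term as tf * (1 + ln(numDocs / docFreq)).
    // ASSUMPTION: a textbook tf-idf variant, not Lucene's formula.
    static Map<String, Double> tfIdf(Map<String, Integer> tf,
                                     Map<String, Integer> docFreq,
                                     int numDocs) {
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            Integer df = docFreq.get(e.getKey());
            if (df == null || df == 0) {
                continue; // term has no document frequency; skip it
            }
            double idf = 1.0 + Math.log((double) numDocs / df);
            weights.put(e.getKey(), e.getValue() * idf);
        }
        return weights;
    }

    // Cosine similarity: dot(a, b) / (|a| * |b|); 0 if either vector is empty.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double na = norm(a);
        double nb = norm(b);
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / (na * nb);
    }

    // Euclidean length of a sparse weight vector.
    private static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }
}
```

To use it, build one tf map per document from the arrays read in step 5, build a single docFreq map from the term enumeration, and then call cosine(tfIdf(tfA, docFreq, n), tfIdf(tfB, docFreq, n)), where n is the number of documents in the index.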
