I have a query and a set of documents, and I need to rank the documents by their tf-idf cosine similarity to the query. What support does Lucene offer for this? Which values can I get directly from Lucene (is there a method that returns tf and idf?), and how do I compute cosine similarity with it (is there a function that returns the cosine similarity if I pass in the query vector and the document vector)?
Thanks in advance.
Lucene's default scoring is already a modified version of cosine similarity, so computing the raw cosine similarity yourself is probably doable. I recommend the official page that discusses Lucene scoring.
If you want to extract that info on your own, this would be an outline of the steps for tf:
- index the corpus, storing term vectors for the field (otherwise no term frequency vector is available);
- open an `IndexReader`;
- iterate over all doc ids, from 0 up to (but not including) `maxDoc()`;
- call `getTermFreqVector(doc, fieldName)` for each doc;
- iterate over the parallel arrays `tfv.getTerms()` and `tfv.getTermFrequencies()`.
As for the docFreq, use `IndexReader.terms()` and iterate over the result, calling `termEnum.docFreq()` for each term.
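Once the term frequencies and document frequencies have been pulled out of the index, the remaining arithmetic needs nothing from Lucene. Here is a minimal sketch in plain Java, assuming the values have already been copied into maps. The class `TfIdfCosine` and its methods are hypothetical names for illustration, and the weighting `tf * ln(N / df)` is the common textbook variant, not the formula Lucene's own scorer uses.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
class TfIdfCosine {

    // Textbook idf = ln(numDocs / docFreq); an assumption, not Lucene's formula.
    static double idf(int numDocs, int docFreq) {
        return Math.log((double) numDocs / docFreq);
    }

    // Convert raw term frequencies into tf-idf weights.
    // Terms with no known document frequency are skipped.
    static Map<String, Double> weigh(Map<String, Integer> termFreqs,
                                     Map<String, Integer> docFreqs,
                                     int numDocs) {
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            Integer df = docFreqs.get(e.getKey());
            if (df != null && df > 0) {
                weights.put(e.getKey(), e.getValue() * idf(numDocs, df));
            }
        }
        return weights;
    }

    // Cosine similarity of two sparse vectors; returns 0 if either has zero length.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Made-up counts standing in for values read from the index.
        Map<String, Integer> docFreqs = new HashMap<>();
        docFreqs.put("lucene", 1);
        docFreqs.put("search", 2);

        Map<String, Integer> queryTf = new HashMap<>();
        queryTf.put("lucene", 1);

        Map<String, Integer> docTf = new HashMap<>();
        docTf.put("lucene", 2);
        docTf.put("search", 1);

        int numDocs = 4;
        double score = cosine(weigh(queryTf, docFreqs, numDocs),
                              weigh(docTf, docFreqs, numDocs));
        System.out.println("cosine similarity = " + score);
    }
}
```

To rank, compute this score for the query against every document and sort in descending order. The query is treated as just another short document here, which is a simplification.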
Source: https://stackoverflow.com/questions/10173202/how-to-calculate-cosine-similarity-with-tf-idf-using-lucene-and-java