Get cosine similarity between two documents in Lucene

Backend · Unresolved · 7 answers · 2096 views
野性不改 2020-11-27 03:13

I have built an index in Lucene. Without specifying a query, I just want to get a score (cosine similarity or another distance?) between two documents in the index.

7 Answers
  • 2020-11-27 04:00

    You can find a better solution at http://darakpanand.wordpress.com/2013/06/01/document-comparison-by-cosine-methodology-using-lucene/#more-53. The steps are:

    • Java code builds a term vector from the content with the help of Lucene (see http://lucene.apache.org/core/).
    • The cosine similarity between the two documents is then computed using the commons-math.jar library.
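The cosine step in the second bullet can also be sketched in plain Java, without commons-math. This is a minimal hand-rolled sketch (the class and method names are mine, not from the linked post); the term-frequency maps are assumed to have been extracted from Lucene's term vectors already:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class CosineSketch {
    /** Cosine similarity between two raw term-frequency maps. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        // Union of both vocabularies defines the vector dimensions.
        Set<String> vocab = new HashSet<>(a.keySet());
        vocab.addAll(b.keySet());
        double dot = 0, normA = 0, normB = 0;
        for (String term : vocab) {
            int fa = a.getOrDefault(term, 0);
            int fb = b.getOrDefault(term, 0);
            dot += (double) fa * fb;
            normA += (double) fa * fa;
            normB += (double) fb * fb;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Integer> d1 = new HashMap<>();
        d1.put("lucene", 2); d1.put("index", 1);
        Map<String, Integer> d2 = new HashMap<>();
        d2.put("lucene", 1); d2.put("search", 1);
        System.out.println(cosine(d1, d2)); // 2 / sqrt(5 * 2)
    }
}
```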
  • 2020-11-27 04:01

    As Julia points out, Sujit Pal's example is very useful, but the Lucene 4 API has substantial changes. Here is a version rewritten for Lucene 4.

    import java.io.IOException;
    import java.util.*;
    
    import org.apache.commons.math3.linear.*;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.core.SimpleAnalyzer;
    import org.apache.lucene.document.*;
    import org.apache.lucene.document.Field.Store;
    import org.apache.lucene.index.*;
    import org.apache.lucene.store.*;
    import org.apache.lucene.util.*;
    
    public class CosineDocumentSimilarity {
    
        public static final String CONTENT = "Content";
    
        private final Set<String> terms = new HashSet<>();
        private final RealVector v1;
        private final RealVector v2;
    
        CosineDocumentSimilarity(String s1, String s2) throws IOException {
            Directory directory = createIndex(s1, s2);
            IndexReader reader = DirectoryReader.open(directory);
            Map<String, Integer> f1 = getTermFrequencies(reader, 0);
            Map<String, Integer> f2 = getTermFrequencies(reader, 1);
            reader.close();
            v1 = toRealVector(f1);
            v2 = toRealVector(f2);
        }
    
        Directory createIndex(String s1, String s2) throws IOException {
            Directory directory = new RAMDirectory();
            Analyzer analyzer = new SimpleAnalyzer(Version.LUCENE_CURRENT);
            IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_CURRENT,
                    analyzer);
            IndexWriter writer = new IndexWriter(directory, iwc);
            addDocument(writer, s1);
            addDocument(writer, s2);
            writer.close();
            return directory;
        }
    
        /* Indexed, tokenized, stored. */
        public static final FieldType TYPE_STORED = new FieldType();
    
        static {
            TYPE_STORED.setIndexed(true);
            TYPE_STORED.setTokenized(true);
            TYPE_STORED.setStored(true);
            TYPE_STORED.setStoreTermVectors(true);
            TYPE_STORED.setStoreTermVectorPositions(true);
            TYPE_STORED.freeze();
        }
    
        void addDocument(IndexWriter writer, String content) throws IOException {
            Document doc = new Document();
            Field field = new Field(CONTENT, content, TYPE_STORED);
            doc.add(field);
            writer.addDocument(doc);
        }
    
        double getCosineSimilarity() {
            return (v1.dotProduct(v2)) / (v1.getNorm() * v2.getNorm());
        }
    
        public static double getCosineSimilarity(String s1, String s2)
                throws IOException {
            return new CosineDocumentSimilarity(s1, s2).getCosineSimilarity();
        }
    
        Map<String, Integer> getTermFrequencies(IndexReader reader, int docId)
                throws IOException {
            Terms vector = reader.getTermVector(docId, CONTENT);
        TermsEnum termsEnum = vector.iterator(null);
            Map<String, Integer> frequencies = new HashMap<>();
            BytesRef text = null;
            while ((text = termsEnum.next()) != null) {
                String term = text.utf8ToString();
                int freq = (int) termsEnum.totalTermFreq();
                frequencies.put(term, freq);
                terms.add(term);
            }
            return frequencies;
        }
    
        RealVector toRealVector(Map<String, Integer> map) {
            RealVector vector = new ArrayRealVector(terms.size());
            int i = 0;
            for (String term : terms) {
                int value = map.containsKey(term) ? map.get(term) : 0;
                vector.setEntry(i++, value);
            }
        return vector.mapDivide(vector.getL1Norm());
        }
    }
    
  • 2020-11-27 04:01

    Calculating cosine similarity in Lucene 4.x differs from 3.x. The following post has a detailed explanation, with all the necessary code, for calculating cosine similarity in Lucene 4.10.2: ComputerGodzilla: Calculated Cosine Similarity in Lucene!

  • 2020-11-27 04:02

    When indexing, there's an option to store term frequency vectors.

    At runtime, look up the term frequency vectors for both documents using IndexReader.getTermFreqVector(), and look up the document frequency of each term using IndexReader.docFreq(). That gives you all the components needed to calculate the cosine similarity between the two documents.
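Putting those components together, the TF-IDF weighting and cosine can be sketched in plain Java. This is a hand-rolled sketch (class and method names are mine): the term-frequency and document-frequency maps are assumed to have been filled from the calls above, and the idf formula mirrors Lucene's classic 1 + ln(N / (df + 1)) shape, which is an assumption here rather than something this answer specifies:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class TfIdfCosine {
    /**
     * Cosine similarity between two documents given raw term frequencies
     * and per-term document frequencies over an index of numDocs documents.
     */
    static double cosine(Map<String, Integer> tf1, Map<String, Integer> tf2,
                         Map<String, Integer> docFreq, int numDocs) {
        Set<String> vocab = new HashSet<>(tf1.keySet());
        vocab.addAll(tf2.keySet());
        double dot = 0, n1 = 0, n2 = 0;
        for (String term : vocab) {
            // Rarer terms get higher weight; +1 avoids division by zero.
            double idf = 1 + Math.log((double) numDocs / (docFreq.getOrDefault(term, 0) + 1));
            double w1 = tf1.getOrDefault(term, 0) * idf;
            double w2 = tf2.getOrDefault(term, 0) * idf;
            dot += w1 * w2;
            n1 += w1 * w1;
            n2 += w2 * w2;
        }
        return dot / (Math.sqrt(n1) * Math.sqrt(n2));
    }
}
```

Because both documents use the same idf per term, identical tf maps always give a similarity of 1.0 regardless of the weighting.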

    An easier way might be to submit doc A as a query (adding all of its words to the query as OR terms, boosting each by its term frequency) and look for doc B in the result set.
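The intuition behind that doc-as-query trick can be sketched without Lucene: scoring a candidate document against an OR query whose terms are doc A's words, each boosted by its frequency in A, amounts to the (un-normalized) dot product of the two term-frequency vectors. A hand-rolled illustration (not Lucene's actual scoring formula, which also folds in idf and length normalization):

```java
import java.util.Map;

public class DocAsQuery {
    /**
     * Score a candidate document against a "query" document: each query term
     * contributes (boost = its frequency in the query doc) * (its frequency
     * in the candidate). This is exactly the dot product of the tf vectors.
     */
    static double score(Map<String, Integer> queryDoc, Map<String, Integer> candidate) {
        double s = 0;
        for (Map.Entry<String, Integer> e : queryDoc.entrySet()) {
            s += e.getValue() * candidate.getOrDefault(e.getKey(), 0);
        }
        return s;
    }
}
```

Since Lucene's real scoring adds idf and length normalization, the resulting ranking only approximates the cosine ordering.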

  • 2020-11-27 04:04

    I know the question has been answered, but for people who might come here in the future, a nice example of the solution can be found here:

    http://sujitpal.blogspot.ch/2011/10/computing-document-similarity-using.html

  • 2020-11-27 04:13

    If you don't need to store documents in Lucene and just want to calculate the similarity between two docs, here is faster code (Scala, from my blog http://chepurnoy.org/blog/2014/03/faster-cosine-similarity-between-two-dicuments-with-scala-and-lucene/):

    import org.apache.lucene.analysis.core.StopAnalyzer
    import org.apache.lucene.analysis.en.EnglishMinimalStemFilter
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute
    import org.apache.lucene.util.Version

    def extractTerms(content: String): Map[String, Int] = {
         val analyzer = new StopAnalyzer(Version.LUCENE_46)
         val ts = new EnglishMinimalStemFilter(analyzer.tokenStream("c", content))
         val charTermAttribute = ts.addAttribute(classOf[CharTermAttribute])
    
         val m = scala.collection.mutable.Map[String, Int]()
    
         ts.reset()
         while (ts.incrementToken()) {
             val term = charTermAttribute.toString
             val newCount = m.get(term).map(_ + 1).getOrElse(1)
             m += term -> newCount       
         }
         ts.end()
         ts.close()

         m.toMap
     }
    
    def similarity(t1: Map[String, Int], t2: Map[String, Int]): Double = {
         //word, t1 freq, t2 freq
         val m = scala.collection.mutable.HashMap[String, (Int, Int)]()
    
         val sum1 = t1.foldLeft(0d) {case (sum, (word, freq)) =>
             m += word ->(freq, 0)
             sum + freq
         }
    
         val sum2 = t2.foldLeft(0d) {case (sum, (word, freq)) =>
             m.get(word) match {
                 case Some((freq1, _)) => m += word ->(freq1, freq)
                 case None => m += word ->(0, freq)
             }
             sum + freq
         }
    
         val (p1, p2, p3) = m.foldLeft((0d, 0d, 0d)) {case ((s1, s2, s3), e) =>
             val fs = e._2
             val f1 = fs._1 / sum1
             val f2 = fs._2 / sum2
             (s1 + f1 * f2, s2 + f1 * f1, s3 + f2 * f2)
         }
    
         val cos = p1 / (Math.sqrt(p2) * Math.sqrt(p3))
         cos
     }  
    

    So, to calculate the similarity between text1 and text2, just call similarity(extractTerms(text1), extractTerms(text2)).
