Near-duplicate detection in Solr

Solr is being used to search through a database of user-generated listings. These listings are imported into Solr from MySQL via the DataImportHandler.

Problem: Quite often, users repost the same listing to the database, sometimes with minor changes to avoid being easily detected as a duplicate post.

How should I implement near-duplicate detection with Solr? I do not mind having near-duplicate listings in the Solr index as long as they do not appear in the search results.

I guess there are 4 possible places to do this near-duplicate detection:

1. When the user submits the listing (PHP is being used here)
2. During the data import from MySQL to Solr
3. After the data import from MySQL
4. When a search is being done

What is the recommended way to do this? Thank you!

I'm not familiar with Solr, so I would implement the near-duplicate detection when the user submits the listing. There are quite a few different algorithms for detecting near-duplicates, such as the Jaccard index.

I wrote a small script to compare the similarity coefficients for a few sample sentences:

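<?php

$input1 = "Hello there, this is a test 1, you see it's almost the same";
$input2 = "Hello there, this is a test 2, you saw it, it's almost the same";
$input3 = "this is very different from the others, but who knows ?";

echo dice_coefficient($input1, $input1) . "<br />"; // prints 1
echo dice_coefficient($input1, $input2) . "<br />"; // prints 0.81481481481481
echo dice_coefficient($input1, $input3) . "<br />"; // prints 0.25
echo dice_coefficient($input2, $input3);            // prints 0.24

// Word-overlap similarity of two strings: 0 means no shared words, 1 means the same words.
// Note: 2 * |A ∩ B| / (|A| + |B|) is strictly the Sørensen-Dice coefficient,
// a close relative of the Jaccard index |A ∩ B| / |A ∪ B|.
function dice_coefficient($a, $b) {
    // Split on single spaces; punctuation stays attached to the words.
    $a_arr = explode(" ", $a);
    $b_arr = explode(" ", $b);
    // Words of $a that also occur in $b.
    $intersect_a_b = array_intersect($a_arr, $b_arr);
    return 2 * count($intersect_a_b) / (count($a_arr) + count($b_arr));
}
?>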

As you can see, a result of 1 means the two sentences are identical or use the same words in a different order. The smaller the value, the less the two sentences have in common. This is a rather simple implementation. You could set a threshold, for example 0.4, put any submission that scores above it in a queue, and then review those listings manually. This is not efficient, but it gives you the idea; it is up to you to develop a more complex and automated system or algorithm. You may also want to take a look here.
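
To make the threshold idea concrete, here is a minimal sketch of a check that could run when a listing is submitted. It uses the set-based Jaccard index, |A ∩ B| / |A ∪ B|; the formula 2 * |A ∩ B| / (|A| + |B|) in the script above is strictly the Sørensen-Dice coefficient, which tends to give higher scores for the same pair. The function names (tokenize, jaccard_index, find_near_duplicates), the listing IDs, and the ASCII-only tokenization are assumptions for illustration, and 0.4 is just the example threshold from above, not a tuned value.

```php
<?php
// Hypothetical helpers for illustration only; not part of Solr or any library.

// Lowercase the text and return its unique words.
// Assumption: ASCII text; anything that is not a-z, 0-9 or an apostrophe separates words.
function tokenize(string $text): array
{
    $words = preg_split("/[^a-z0-9']+/", strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    return array_values(array_unique($words));
}

// Jaccard index of the two word sets: |A ∩ B| / |A ∪ B|, between 0 and 1.
function jaccard_index(string $a, string $b): float
{
    $setA = tokenize($a);
    $setB = tokenize($b);
    $union = count(array_unique(array_merge($setA, $setB)));
    if ($union === 0) {
        return 1.0; // two texts without any words are treated as identical
    }
    return count(array_intersect($setA, $setB)) / $union;
}

// Return [id => score] for every existing listing whose score reaches the
// threshold, most similar first. These are the candidates for manual review.
function find_near_duplicates(string $newListing, array $existing, float $threshold = 0.4): array
{
    $matches = [];
    foreach ($existing as $id => $text) {
        $score = jaccard_index($newListing, $text);
        if ($score >= $threshold) {
            $matches[$id] = $score;
        }
    }
    arsort($matches);
    return $matches;
}

$listing1 = "Hello there, this is a test 1, you see it's almost the same";
$listing2 = "Hello there, this is a test 2, you saw it, it's almost the same";
$listing3 = "this is very different from the others, but who knows ?";

echo jaccard_index($listing1, $listing2) . "\n"; // 0.6875
echo jaccard_index($listing1, $listing3) . "\n"; // 0.15

// The IDs 101 and 102 are made up; only listing 101 reaches the 0.4 threshold.
$flagged = find_near_duplicates($listing2, [101 => $listing1, 102 => $listing3]);
echo implode(", ", array_keys($flagged)) . "\n"; // 101
```

In a real system the existing listings would come from MySQL, and comparing a new listing against every row does not scale, so you would probably narrow the candidates first (for example same user, same category, or recent posts) before scoring them.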

Source: https://stackoverflow.com/questions/12766309/near-duplicate-detection-in-solr

Tags: php, solr, lucene, duplicates, dataimporthandler