I am currently working on a streaming API that generates a lot of textual content. As expected, the API gives out a lot of duplicates and we also have a business requirement to
First, normalize the text to all lowercase (or uppercase) characters, replace all non-letters with a white space, compress all multiple white spaces to one, remove leading and trailing white space; for speed I would perform all these operations in one pass of the text. Next take the MD5
hash (or something faster) of the resulting string. Do a database lookup of the MD5
hash (as two 64 bit integers) in a table, if it exists, it is an exact duplicate, if not, add it to the table and proceed to the next step. You will want to age off old hashes based either on time or memory usage.
To find near duplicates the normalized string needs to be converted into potential signatures (hashes of substrings), see the SpotSigs
paper and blog post by Greg Linden. Suppose the routine Sigs()
does that for a given string, that is, given the normalized string x
, Sigs(x)
returns a small (1-5) set of 64 bit integers. You could use something like the SpotSigs
algorithm to select the substrings in the text for the signatures, but making your own selection method could perform better if you know something about your data. You may also want to look at the simhash algorithm (the code is here).
Given the Sigs()
the problem of efficiently finding the near duplicates is commonly called the set similarity joins problem. The SpotSigs
paper outlines some heuristics to trim the number of sets a new set needs to be compared to as does the simhash
method.
http://micvog.com/2013/09/08/storm-first-story-detection/ has some nice implementation notes