I am currently working on a streaming API that generates a lot of textual content. As expected, the API emits many duplicates, and we also have a business requirement to
http://micvog.com/2013/09/08/storm-first-story-detection/ has some nice implementation notes.
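
For the plain-duplicates part of the problem, a seen-set of hashes may already go a long way. Below is a minimal sketch, assuming items arrive as plain strings and that "duplicate" means identical after lowercasing and collapsing whitespace. The names `normalize` and `drop_duplicates` are hypothetical and do not come from the linked post, and this approach does not catch reworded near-duplicates.

```python
import hashlib
import re
from typing import Iterable, Iterator


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match.

    Assumption: this is what counts as "the same" item; adjust as needed.
    """
    return re.sub(r"\s+", " ", text).strip().lower()


def drop_duplicates(stream: Iterable[str]) -> Iterator[str]:
    """Yield each item only the first time its normalized form is seen."""
    # The set grows without bound here; a long-running stream would need
    # some eviction policy or time window (not shown).
    seen = set()
    for text in stream:
        digest = hashlib.sha1(normalize(text).encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield text
```

Usage would look something like `for item in drop_duplicates(incoming): handle(item)`, where `incoming` and `handle` stand in for whatever the stream consumer actually is.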