Let's say I have 100000 email bodies, and 2000 of them contain an arbitrary common string like "the quick brown fox jumps over the lazy dog" or "lorem ipsum dolor sit amet".
I'm not sure if this is what you want, but check out the longest common substring problem and the algorithms used by diff utilities.
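To give an idea of what the longest common substring problem looks like in practice, here is a minimal sketch of the textbook dynamic-programming solution for two strings. The function name `longest_common_substring` and the sample email bodies are made up for illustration, and this is only one way to do it.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return one longest substring that appears in both a and b."""
    best_len = 0   # length of the longest match found so far
    best_end = 0   # end index (exclusive) of that match in a
    # prev[j] holds the length of the common suffix of a[:i-1] and b[:j]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # extend the match that ended at the previous characters
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return a[best_end - best_len:best_end]


# Hypothetical email bodies, invented for this example
email_1 = "Hi Bob, the quick brown fox jumps over the lazy dog. Bye"
email_2 = "Dear Al: the quick brown fox jumps over the lazy dog!"
shared = longest_common_substring(email_1, email_2)
print(repr(shared))
```

Note that this takes time proportional to len(a) * len(b) for a single pair, so it is fine for comparing two bodies but running it on every pair out of 100000 emails would be very slow. For that scale you would probably want something built on a suffix structure or on hashing fixed-length chunks, which I haven't sketched here.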