Let's say I have 100,000 email bodies, and 2,000 of them contain an arbitrary common string like "the quick brown fox jumps over the lazy dog" or "lorem ipsum dolor sit amet". How can I find these common phrases without knowing them in advance?
Something like this might work, depending on whether you care about word boundaries. In pseudo-code (where LCS
is a function for computing the Longest Common Subsequence):
someMinimumLengthParameter = 20;
foundPhrases = [];
do {
    lcs = LCS(mailbodies);
    // Remove the phrase from every body first, so the next iteration
    // finds the next-longest phrase instead of looping on this one.
    for (body in mailbodies) {
        body.remove(lcs);
    }
    if (lcs in ignoredPhrases) continue;
    foundPhrases += lcs;
} while (lcs.length > someMinimumLengthParameter);
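As a rough Python sketch of that loop (my own names throughout): I use `difflib.SequenceMatcher.find_longest_match`, which gives the longest common *substring* of two strings rather than a subsequence, and fold it pairwise across the bodies as a stand-in for a multi-string LCS.

```python
from difflib import SequenceMatcher

def longest_common_substring(a, b):
    # find_longest_match returns the longest contiguous matching block.
    m = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]

def common_phrases(bodies, min_length=20, ignored=()):
    bodies = list(bodies)
    found = []
    while True:
        # Fold pairwise to approximate the substring shared by every body.
        phrase = bodies[0]
        for b in bodies[1:]:
            phrase = longest_common_substring(phrase, b)
        if len(phrase) <= min_length:
            break
        if phrase not in ignored:
            found.append(phrase.strip())
        # Strip the phrase so the next pass finds the next-longest one.
        bodies = [b.replace(phrase, "") for b in bodies]
    return found
```

Note this only works if the phrase really occurs in *every* body in the list you pass in; with 2,000 matches out of 100,000 you would run it on candidate clusters, not the whole corpus. It is also quadratic-ish and far too slow for 100,000 full bodies as written.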
I'm not sure if this is what you want, but check out the longest common substring problem and diff utility algorithms.
Have a look at N-grams. The most common phrases will necessarily contribute the most common N-grams. I'd start out with word trigrams and see where that leads. (Space required is N times the length of the text, so you can't let N get too big.) If you save the positions and not just a count, you can then see if the trigrams can be extended to form common phrases.
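A minimal sketch of the trigram approach (the corpus and all names here are made up for illustration): index every word trigram with the positions where it occurs, then keep the ones that appear more than once. Saved positions let you later check whether neighbouring trigrams overlap and merge them into longer phrases.

```python
from collections import defaultdict

def word_ngrams(text, n=3):
    # Word trigrams by default; normalises case and splits on whitespace.
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

# Tiny stand-in for the 100,000 mail bodies.
bodies = [
    "hello team the quick brown fox jumps over the lazy dog regards",
    "fyi the quick brown fox jumps over the lazy dog see attached",
    "unrelated message about quarterly numbers",
]

# Map each trigram to its (body index, word offset) occurrences,
# not just a count, so frequent trigrams can be extended later:
# if (a, b, c) at offset i and (b, c, d) at offset i + 1 are both
# frequent, they likely belong to one longer phrase "a b c d".
positions = defaultdict(list)
for doc_id, body in enumerate(bodies):
    for offset, gram in enumerate(word_ngrams(body)):
        positions[gram].append((doc_id, offset))

common = {gram: occ for gram, occ in positions.items() if len(occ) > 1}
```

Space is roughly N times the text size, as noted above, so word trigrams are a reasonable starting point before trying larger N.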