If these are pure text documents, or you have a method to extract the text from the documents, you can use a technique called shingling.
You first compute a hash of each whole document. If the hashes match, the documents are identical and you are done.
If not, you break each document down into smaller chunks. These are your 'shingles.'
Once you have the shingles, hash each one and compare the sets of shingle hashes: the more hashes two documents share, the more similar they actually are.
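Here's a minimal sketch of that pipeline in Python (the word-level shingles, shingle size of four, and SHA-256 are my assumptions; the technique itself doesn't mandate any particular choice):

```python
import hashlib

def shingles(text, size=4):
    """Overlapping word-level shingles, `size` words each."""
    words = text.split()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}

def shingle_similarity(doc_a, doc_b, size=4):
    """Jaccard similarity of the documents' shingle-hash sets (0.0 to 1.0)."""
    # Whole-document hashes first: if they match, the documents are identical.
    if hashlib.sha256(doc_a.encode()).hexdigest() == hashlib.sha256(doc_b.encode()).hexdigest():
        return 1.0
    a = {hashlib.sha256(s.encode()).hexdigest() for s in shingles(doc_a, size)}
    b = {hashlib.sha256(s.encode()).hexdigest() for s in shingles(doc_b, size)}
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)
```

A score of 1.0 means the documents are identical; anything close to it means they share most of their shingles.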
The other technique you can use is to generate character n-grams of the entire documents, count the n-grams the two documents share, and produce a weighted similarity score. An n-gram splits text into overlapping chunks of n characters: with a space of padding at each end, 'apple' becomes ' ap', 'app', 'ppl', 'ple', 'le ' (these are 3-grams, or trigrams). This approach can become quite computationally expensive over a large number of documents, or over two very large documents. Common n-grams like 'the', ' th', 'th ', etc. also need to be weighted lower so they don't inflate the score.
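Here's a rough sketch of that scoring in Python, assuming trigrams and a small hand-picked set of common grams to down-weight; in practice you'd derive the weights from corpus frequencies:

```python
from collections import Counter

# Hypothetical stop-grams; a real list would come from corpus statistics.
COMMON = {"the", " th", "th ", "he ", "ing", " an", "and"}

def char_ngrams(text, n=3):
    """Overlapping character n-grams, with one space of padding on each end."""
    padded = " " + text + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def ngram_similarity(doc_a, doc_b, n=3, common_weight=0.1):
    """Weighted Dice coefficient over character n-gram counts (0.0 to 1.0)."""
    a = Counter(char_ngrams(doc_a, n))
    b = Counter(char_ngrams(doc_b, n))

    def weight(gram):
        return common_weight if gram in COMMON else 1.0

    # Overlap counts, with common grams contributing only a fraction.
    shared = sum(weight(g) * min(a[g], b[g]) for g in a.keys() & b.keys())
    total = (sum(weight(g) * c for g, c in a.items())
             + sum(weight(g) * c for g, c in b.items()))
    return 2 * shared / total if total else 0.0
```

Note that comparing every pair of documents this way is quadratic in the number of documents, which is where the cost bites; hashing the n-grams (as with the shingles above) at least keeps each pairwise comparison cheap.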
I've posted about this on my blog, and there are links in the post to a few other articles on the subject: "Shingling - it's not just for roofers".
Best of luck!