I have to store two files A and B which are both very large (like 100 GB). However, B is likely to be similar in large parts to A, so I could store A and diff(A, B). There are two in
That is exactly the problem known as "data deduplication". The most commonly used approach is:

- split the files into chunks (either fixed-size or, better, content-defined via a rolling hash),
- calculate a cryptographic fingerprint (e.g. SHA-1) for each chunk,
- look each fingerprint up in an index: chunks that are already stored are replaced by a reference, and only new chunks are stored.
Such a data deduplication algorithm is not as exact as e.g. xdelta, but it is faster and more scalable for large data sets. Chunking and fingerprinting run at around 50 MB/s per core (in Java). The index size depends on the amount of redundancy, the chunk size, and the data size. For 200 GB it should fit in memory at a chunk size of e.g. 16 KB: 200 GB / 16 KB is roughly 12.5 million chunks, so even at a few tens of bytes per fingerprint entry the index stays well under 1 GB.
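A minimal sketch of that approach in Java is shown below, as an illustration rather than a production implementation. It uses fixed-size 16 KB chunks, SHA-1 fingerprints, and an in-memory HashMap as the index; the file names (`chunks.bin`, the `*.recipe` output) are made up for this example. A real deduplication system would rather use content-defined chunking (a rolling hash such as Rabin fingerprints) so that an insertion in B does not shift all later chunk boundaries.

```java
import java.io.*;
import java.security.MessageDigest;
import java.util.*;

// Sketch of chunk-based deduplication: fixed-size chunks + SHA-1 fingerprints.
public class DedupSketch {

    static final int CHUNK_SIZE = 16 * 1024; // 16 KB chunks

    // fingerprint (hex) -> offset of the stored chunk in the chunk store
    static final Map<String, Long> index = new HashMap<>();

    public static void main(String[] args) throws Exception {
        // e.g. java DedupSketch A B  -- chunks B shares with A are stored only once
        try (FileOutputStream chunkStore = new FileOutputStream("chunks.bin")) {
            for (String file : args) {
                storeFile(file, chunkStore);
            }
        }
    }

    static void storeFile(String path, FileOutputStream chunkStore) throws Exception {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        long newBytes = 0, dupBytes = 0;
        try (InputStream in = new BufferedInputStream(new FileInputStream(path));
             PrintWriter recipe = new PrintWriter(path + ".recipe")) {
            byte[] buf = new byte[CHUNK_SIZE];
            int n;
            while ((n = in.readNBytes(buf, 0, CHUNK_SIZE)) > 0) {
                sha1.reset();
                sha1.update(buf, 0, n);
                String fp = toHex(sha1.digest());
                Long offset = index.get(fp);
                if (offset == null) {
                    // unseen chunk: append it to the chunk store and index it
                    offset = chunkStore.getChannel().position();
                    chunkStore.write(buf, 0, n);
                    index.put(fp, offset);
                    newBytes += n;
                } else {
                    dupBytes += n; // duplicate chunk: only a reference is kept
                }
                // the "recipe" lists the chunk references needed to rebuild the file
                recipe.println(fp + " " + offset + " " + n);
            }
        }
        System.out.printf("%s: %d bytes new, %d bytes deduplicated%n", path, newBytes, dupBytes);
    }

    static String toHex(byte[] digest) {
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) sb.append(String.format("%02x", b));
        return sb.toString();
    }
}
```

To reconstruct a file, you read its recipe and copy each referenced chunk from the stored offsets; storing A first and then B this way writes the parts B shares with A only once.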
Bentley and McIlroy's compression approach is very similar (it is used e.g. by Google's BigTable); however, I am not aware of any out-of-the-box command-line tools that use this compression technique.
The "fs-c" open source project contains most of the code that is necessary. However, fs-c itself tries only to measure the redundancies and the analzye files in-memory or using a Hadoop cluster.