Algorithm for efficient diffing of huge files

深忆病人 · 2021-01-31 05:21

I have to store two files A and B which are both very large (like 100 GB). However, B is likely to be similar in large parts to A, so I could store A and diff(A, B). There are two in…

5 answers
  • 南方客 (OP) · 2021-01-31 06:09

    That is exactly the problem known as "data deduplication". The most commonly used approach is:

    • Read over the files in blocks:
      • Split the data into so-called "chunks". The most widely used approach is content-defined chunking using Rabin's fingerprinting method (Code). That chunking approach leads to better deduplication on most data sets than statically sized chunks do (e.g. shown here).
      • Fingerprint the chunks using a cryptographic fingerprinting method, e.g. SHA-256.
      • Store the fingerprints in an index and look up, for each chunk, whether its fingerprint is already known. If the fingerprint is known, there is no need to store the chunk a second time; only when the fingerprint is not known does the data have to be stored. (A minimal sketch of this pipeline follows the list.)
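
    Here is a minimal sketch of that pipeline in Java. It is not the exact algorithm described above: a simple multiplicative hash over the bytes of the current chunk stands in for Rabin's fingerprinting over a sliding window, and the fingerprint index is a plain in-memory HashSet. The class and constant names (DedupSketch, MASK, MIN_CHUNK, MAX_CHUNK) are made up for illustration.

    ```java
    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.HashSet;
    import java.util.Set;

    public class DedupSketch {
        // The boundary test fires on average every 2^14 bytes, i.e. roughly 16 KB chunks.
        private static final int MASK = (1 << 14) - 1;
        private static final int MIN_CHUNK = 2 * 1024;
        private static final int MAX_CHUNK = 64 * 1024;

        public static void main(String[] args) throws IOException, NoSuchAlgorithmException {
            Set<String> index = new HashSet<>(); // fingerprints of chunks already stored
            long total = 0, unique = 0;
            MessageDigest sha = MessageDigest.getInstance("SHA-256");

            for (String file : args) { // e.g. "java DedupSketch A B"
                try (InputStream in = new BufferedInputStream(new FileInputStream(file))) {
                    byte[] chunk = new byte[MAX_CHUNK];
                    int len = 0, hash = 0, b;
                    while ((b = in.read()) != -1) {
                        chunk[len++] = (byte) b;
                        // Simplified content-defined boundary test: a multiplicative hash of
                        // the bytes seen so far in this chunk. A real implementation would
                        // use Rabin fingerprinting over a fixed sliding window instead.
                        hash = hash * 31 + (b & 0xFF);
                        if ((len >= MIN_CHUNK && (hash & MASK) == 0) || len == MAX_CHUNK) {
                            sha.update(chunk, 0, len);
                            if (index.add(toHex(sha.digest()))) unique += len; // new chunk: store it
                            total += len;
                            len = 0;
                            hash = 0;
                        }
                    }
                    if (len > 0) { // flush the trailing partial chunk
                        sha.update(chunk, 0, len);
                        if (index.add(toHex(sha.digest()))) unique += len;
                        total += len;
                    }
                }
            }
            System.out.printf("read %d bytes, %d bytes unique (%.1f%% deduplicated)%n",
                    total, unique, 100.0 * (total - unique) / Math.max(1, total));
        }

        private static String toHex(byte[] digest) {
            StringBuilder sb = new StringBuilder();
            for (byte x : digest) sb.append(String.format("%02x", x & 0xff));
            return sb.toString();
        }
    }
    ```

    Running it over A and then B would report how many bytes of B are already covered by chunks of A; an actual store would additionally have to persist the chunk data and, per file, a recipe of fingerprints so the file can be reconstructed.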

    Such a data deduplication algorithm is not as exact as e.g. xdelta, but it is faster and more scalable for large data sets. Chunking and fingerprinting run at around 50 MB/s per core (Java). The index size depends on the redundancy, the chunk size, and the data size. For 200 GB, it should fit in memory for chunk sizes of e.g. 16 KB.
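
    As a rough back-of-the-envelope check (the per-entry overhead figure is an assumption, not a measurement): 200 GB split into 16 KB chunks gives about 13 million chunks; at a 32-byte SHA-256 fingerprint plus perhaps 50–100 bytes of hash-table overhead per entry, the index needs on the order of 1–2 GB, which fits comfortably in RAM on most machines.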

    Bentley and McIlroy's compression approach is very similar (it is used e.g. by Google's BigTable); however, I am not aware of any out-of-the-box command line tool that uses that compression technique.

    The "fs-c" open source project contains most of the code that is necessary. However, fs-c itself tries only to measure the redundancies and the analzye files in-memory or using a Hadoop cluster.
