Algorithm for efficient diffing of huge files

前端 未结 5 945
深忆病人
深忆病人 2021-01-31 05:21

I have to store two files A and B which are both very large (like 100GB). However B is likely to be similar in big parts to A so i could store A and diff(A, B). There are two in

相关标签:
5条回答
  • 2021-01-31 05:49

    Take a look at RSYNCs algorithm, as it's designed pretty much to do exactly this so it can efficiently copy deltas. And the algorithm is pretty well documented, as I recall.

    0 讨论(0)
  • 2021-01-31 05:59

    You can use rdiff, which works very well with large files. Here I create a diff of two big files A and B:

    1. Create a signature of one file, with e.g.

      rdiff signature A sig.txt
      
    2. using the generated signature file sig.txt and the other big file, create the delta:

      rdiff delta sig.txt B delta
      
    3. now delta contains all the information you need to recreate file B when you have both A and delta. To recreate B, run

      rdiff patch A delta B
      

    In Ubuntu, just run sudo apt-get install rdiff to install it. It is quite fast, I get about 40 MB per second on my PC. I have just tried it on a 8GB file, and the memory used by rsync was about 1MB.

    0 讨论(0)
  • 2021-01-31 06:05

    Depending on your performance requirements, you could get away with sampling the chunks you fingerprint, and growing them when they match. That way you don't have to run a checksum on your entire large file.

    If you need arbitrary byte alignments and you really care about performance, look at the simhash algorithm, and use it to find similar but unaligned blocks.

    0 讨论(0)
  • 2021-01-31 06:09

    That is exactly the problem known as "data deduplication". The most commonly used approach is:

    • Read over the files in blocks:
      • Split the data of the so called "chunks". The most often used approach is called "Content defined Chunking using Rabins Fingerprinting method" (Code). Using that chunking approach leads to a better deduplication on most data set then using static sized chunks (e.g. shown here).
      • Fingerprint the chunks using a cryptographic fingerprinting method, e.g. SHA-256.
      • Store the fingerprints in an index and lookup for each chunk if the fingerprint is already known. If the fingerprint is known, there is no need to store the chunk a second time. Only when the fingerprint is not known, the data has to be stored.

    Such an data deduplication algorithm is not as exact as e.g. xdelta, but it is faster and more scalable for large data sets. The chunking and fingerprinting is performed with around 50 MB/s per core (Java). The index size depends on the redundancies, the chunk size and the data size. For 200 GB, it should fit in memory for chunk sizes of e.g. 16KB.

    Bentleys and Mciloys compression approach is very similar (used e.g. by Googles BigTable), however I am not aware of any out-of-the box command line tools using the compression technique.

    The "fs-c" open source project contains most of the code that is necessary. However, fs-c itself tries only to measure the redundancies and the analzye files in-memory or using a Hadoop cluster.

    0 讨论(0)
  • 2021-01-31 06:12

    one question is what is the record size in your files, i.e. can the offsets change byte by byte or do the files consist of, say, 1024B blocks. Assuming the data is byte-oriented, you could do the following:

    1. Create a suffix array for the file A. This array is a permutation of all index values to the file A. If A has 2^37 bytes then the index array is easiest represented by 64-bit integers, so every byte (offset to the file) corresponds to 8 bytes in the index array, so the index array will be 2^40 bytes long then. E.g. 800 GB, say. You can also index only every 1024th location, say, to reduce the size of the index array. This then detoriates the quality of packing depending on how long the average runs of copyable fragments are.

    2. Now then to greedily pack the file B you start from its beginning at offset o=0 and then use the index array to find the longest match in A that matches the data starting at 'o'. You output the pair in the packed file. This takes in your case without any encoding 16 bytes, so if the run is < 16 bytes you actually lose space. This can be easily remedied by using then bit-level encoding and using a bit marker to mark whether you encode an isolated byte (marker + 8 bits = 9 bits) or an offset/length pair (marker + 40 bits + 40 bits = 81 bits), say. After packing the longest fragment at o, increase o to the next byte after the fragment and repeat until at end of file.

    The construction and use of a suffix array is easy and you should find references easily. In high-speed applications people use suffix trees or suffix tries instead, which are more complex to manipulate but provide faster lookup. In your case you're going to have the array on secondary storage and if the run speed of the packing phase is not an issue a suffix array should be enough.

    0 讨论(0)
提交回复
热议问题