What is the fastest hash algorithm to check if two files are equal?

野性不改 2020-12-07 10:15

What is the fastest way to create a hash function which will be used to check if two files are equal?

Security is not very important.

Edit: I am sending a file over a network connection, and want to make sure that the files on both sides are equal.

12 Answers
  • 2020-12-07 11:05

    In any case you have to read each file fully (except when the sizes already mismatch), so just read both files and compare them block by block.

    Using a hash only costs extra CPU and gains you nothing. Since you are not writing anything, the OS cache will effectively drop the data you read, so under Linux just use the cmp tool.
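
    A minimal sketch of that read-and-compare approach in Java (the class and method names are mine; it assumes Java 12+ so that Files.mismatch is available):

        import java.io.IOException;
        import java.nio.file.Files;
        import java.nio.file.Path;

        public class FileCompare {
            /** True when both files have identical contents. */
            public static boolean sameContent(Path a, Path b) throws IOException {
                // Fast fail: files of different sizes can never be equal.
                if (Files.size(a) != Files.size(b)) {
                    return false;
                }
                // Files.mismatch reads both files block by block under the hood and
                // returns -1 only when no differing byte is found.
                return Files.mismatch(a, b) == -1L;
            }
        }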

  • 2020-12-07 11:05

    Why do you want to hash it?

    If you want to make sure that two files are equal then by definition you will have to read the entire file (unless they are literally the same file, in which case you can tell from the file-system metadata). Anyway, there is no reason to hash: just read over both files and see if they are the same. Hashing only makes it less efficient, and even if the hashes match, you still aren't sure the files really are equal.

    Edit: This answer was posted before the question specified anything about a network. It just asked about comparing two files. Now that I know there is a network hop between the files, I would say just use an MD5 hash and be done with it.
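
    If you go the MD5 route, the JDK alone is enough; here is a sketch (the class and method names are illustrative, not from the answer):

        import java.io.IOException;
        import java.io.InputStream;
        import java.nio.file.Files;
        import java.nio.file.Path;
        import java.security.MessageDigest;
        import java.security.NoSuchAlgorithmException;

        public class Md5OfFile {
            /** Streams the file through an MD5 digest and returns the hex string. */
            public static String md5Hex(Path file) throws IOException, NoSuchAlgorithmException {
                MessageDigest md5 = MessageDigest.getInstance("MD5");
                byte[] buffer = new byte[8192];
                try (InputStream in = Files.newInputStream(file)) {
                    int read;
                    while ((read = in.read(buffer)) != -1) {
                        md5.update(buffer, 0, read);    // digest is updated as we stream
                    }
                }
                StringBuilder hex = new StringBuilder();
                for (byte b : md5.digest()) {
                    hex.append(String.format("%02x", b));
                }
                return hex.toString();
            }
        }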

  • 2020-12-07 11:10

    You could try MurmurHash, which was specifically designed to be fast and is pretty simple to code. You might want to add a second, more secure hash if MurmurHash returns a match, though, just to be sure.
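
    A sketch of what that can look like in Java, assuming Guava is on the classpath for its Murmur3 implementation (the class name is just illustrative):

        import java.io.File;
        import java.io.IOException;

        import com.google.common.hash.HashCode;
        import com.google.common.hash.Hashing;
        import com.google.common.io.Files;

        public class MurmurFileHash {
            /** 128-bit Murmur3 hash of a file's contents, as a hex string. */
            public static String murmur3Hex(File file) throws IOException {
                HashCode hash = Files.asByteSource(file).hash(Hashing.murmur3_128());
                return hash.toString();
            }
        }

    Two files whose Murmur3 hashes differ are certainly different; when they match, you can fall back to a stronger hash or a byte-by-byte comparison, as suggested above.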

  • 2020-12-07 11:12

    For this type of application, Adler32 is probably the fastest algorithm with a reasonable level of security. For bigger files you may calculate multiple hash values, for example one per 5 MB block of the file, hence decreasing the chance of errors (i.e. of cases where the hashes are the same yet the file contents differ). Furthermore, this multi-hash setup allows the hash calculation to be multi-threaded.

    Edit: (Following Steven Sudit's remark)
    A word of caution if the files are small!
    Adler32's "cryptographic" properties, or rather its weaknesses, are well known, particularly for short messages. For this reason the solution proposed should be avoided for files smaller than a few kilobytes.
    Nevertheless, in the question the OP explicitly seeks a fast algorithm and waives concerns about security. Furthermore, the quest for speed may plausibly imply that one is dealing with "big" files rather than small ones. In this context, Adler32, possibly applied in parallel to file chunks of, say, 5 MB, remains a very valid answer. Adler32 is known for its simplicity and speed. Its reliability, while lower than that of CRCs of the same length, is quite acceptable for messages over 4000 bytes.
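
    A sketch of the per-block idea with the JDK's java.util.zip.Adler32 (the 5 MB block size mirrors the figure above; the rest is an illustrative choice):

        import java.io.IOException;
        import java.io.InputStream;
        import java.nio.file.Files;
        import java.nio.file.Path;
        import java.util.ArrayList;
        import java.util.List;
        import java.util.zip.Adler32;

        public class BlockAdler32 {
            private static final int BLOCK_SIZE = 5 * 1024 * 1024; // 5 MB per block

            /** One Adler32 checksum per 5 MB block; equal files yield equal lists. */
            public static List<Long> blockChecksums(Path file) throws IOException {
                List<Long> checksums = new ArrayList<>();
                byte[] block = new byte[BLOCK_SIZE];
                try (InputStream in = Files.newInputStream(file)) {
                    int read;
                    // readNBytes (Java 9+) fills the block completely unless EOF is hit,
                    // so block boundaries are deterministic across files.
                    while ((read = in.readNBytes(block, 0, BLOCK_SIZE)) > 0) {
                        Adler32 adler = new Adler32();
                        adler.update(block, 0, read);
                        checksums.add(adler.getValue());
                    }
                }
                return checksums;
            }
        }

    Since the blocks are independent, each one can also be checksummed in its own thread.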

  • 2020-12-07 11:15

    Unless you're using a really complicated and/or slow hash, loading the data from the disk is going to take much longer than computing the hash (unless you use RAM disks or top-end SSDs).

    So to compare two files, use this algorithm:

    • Compare sizes
    • Compare dates (be careful here: this can give you the wrong answer; you must test whether this is the case for you or not)
    • Compare the hashes

    This allows for a fast fail (if the sizes are different, you know that the files are different).

    To make things even faster, you can compute the hash once and save it along with the file. Also save the file's date and size in this extra file, so you can quickly tell when the hash has to be recomputed, or delete the hash file whenever the main file changes.
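
    A minimal sketch of that ordering (size, then date, then hash); the FileHasher interface is a stand-in for whichever hash you pick:

        import java.io.IOException;
        import java.nio.file.Files;
        import java.nio.file.Path;
        import java.util.Arrays;

        public class QuickCompare {

            /** Any file-hashing strategy (MD5, Adler32, MurmurHash, ...). */
            public interface FileHasher {
                byte[] hash(Path file) throws IOException;
            }

            public static boolean probablyEqual(Path a, Path b, FileHasher hasher) throws IOException {
                // 1. Fast fail on size.
                if (Files.size(a) != Files.size(b)) {
                    return false;
                }
                // 2. Shortcut on dates: equal timestamps suggest equality, but as noted
                //    above this can be wrong, so test whether it holds in your case.
                if (Files.getLastModifiedTime(a).equals(Files.getLastModifiedTime(b))) {
                    return true;
                }
                // 3. Otherwise fall back to comparing the hashes.
                return Arrays.equals(hasher.hash(a), hasher.hash(b));
            }
        }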

  • 2020-12-07 11:15

    The following code finds duplicate files; it comes from a personal project of mine that sorts pictures and also removes duplicates. In my experience, first using a fast hashing algorithm like CRC32 and then doing MD5 or SHA-1 was actually slower and brought no improvement, since most files of the same size really were duplicates, so running the hashing twice was more expensive in CPU time. This approach may not be right for every type of project, but it is definitely true for image files. Here I am doing the MD5 or SHA-1 hashing only on files with the same size.

    PS: It depends on Apache Commons Codec to generate the hashes efficiently.

    Sample usage: new DuplicateFileFinder("MD5").findDuplicateFilesList(filesList);

        import java.io.File;
        import java.io.FileInputStream;
        import java.io.IOException;
        import java.util.ArrayList;
        import java.util.Collection;
        import java.util.HashMap;
        import java.util.HashSet;
        import java.util.Iterator;
        import java.util.List;
        import java.util.Map;
        import java.util.Set;
    
        import org.apache.commons.codec.digest.DigestUtils;
    
        /**
         * Finds duplicate files using MD5/SHA-1 hashing, which is applied only to files that share the same size.
         *  
         * @author HemantSingh
         *
         */
        public class DuplicateFileFinder {
    
            private HashProvider hashProvider;
            // Used only for logging purpose.
            private String hashingAlgo;
    
            public DuplicateFileFinder(String hashingAlgo) {
                this.hashingAlgo = hashingAlgo;
                if ("SHA1".equalsIgnoreCase(hashingAlgo)) {
                    hashProvider = new Sha1HashProvider();
                } else if ("MD5".equalsIgnoreCase(hashingAlgo)) {
                    hashProvider = new Md5HashProvider();
                } else {
                    throw new RuntimeException("Unsupported hashing algorithm:" + hashingAlgo + " Please use either SHA1 or MD5.");
                }
            }
    
            /**
             * This API returns the list of duplicate files reference.
             * 
             * @param files
             *            - List of all the files which we need to check for duplicates.
             * @return A list of duplicate-file groups; e.g. if a file a.JPG has 3
             *         copies, one element of the returned list will be a list holding
             *         those three File references.
             */
            public List<List<File>> findDuplicateFilesList(List<File> files) {
                // First bucket the files by size; only files of the same size can be duplicates.
                Map<Long, List<File>> fileSizeMap = new HashMap<Long, List<File>>();
                // A Set ensures each candidate size is processed only once, even when
                // more than two files share that size.
                Set<Long> potDuplicateFilesSize = new HashSet<Long>();
    
                for (Iterator<File> iterator = files.iterator(); iterator.hasNext();) {
                    File file = (File) iterator.next();
                    Long fileLength = Long.valueOf(file.length());
                    List<File> filesOfSameLength = fileSizeMap.get(fileLength);
                    if (filesOfSameLength == null) {
                        filesOfSameLength = new ArrayList<File>();
                        fileSizeMap.put(fileLength, filesOfSameLength);
                    } else {
                        potDuplicateFilesSize.add(fileLength);
                    }
                    filesOfSameLength.add(file);
                }
    
                // If we don't have any potential duplicates then skip further processing.
                if (potDuplicateFilesSize.size() == 0) {
                    return null;
                }
    
                System.out.println(potDuplicateFilesSize.size() + " candidate file sizes will go through the " + hashingAlgo + " hash check to verify duplicates.");

                // Now scan the potential duplicate files and eliminate false positives using the configured hash.
                List<List<File>> finalListOfDuplicates = new ArrayList<List<File>>();
                for (Iterator<Long> potDuplicatesFileSizeIterator = potDuplicateFilesSize
                        .iterator(); potDuplicatesFileSizeIterator.hasNext();) {
                    Long fileSize = (Long) potDuplicatesFileSizeIterator.next();
                    List<File> potDupFiles = fileSizeMap.get(fileSize);
                    Map<String, List<File>> trueDuplicateFiles = new HashMap<String, List<File>>();
                    for (Iterator<File> potDuplicateFilesIterator = potDupFiles.iterator(); potDuplicateFilesIterator
                            .hasNext();) {
                        File file = (File) potDuplicateFilesIterator.next();
                        try {
                            String md5Hex = hashProvider.getHashHex(file);
                            List<File> listOfDuplicatesOfAFile = trueDuplicateFiles.get(md5Hex);
                            if (listOfDuplicatesOfAFile == null) {
                                listOfDuplicatesOfAFile = new ArrayList<File>();
                                trueDuplicateFiles.put(md5Hex, listOfDuplicatesOfAFile);
                            }
                            listOfDuplicatesOfAFile.add(file);
                        } catch (IOException e) {
                            e.printStackTrace();
                        }
                    }
                    Collection<List<File>> dupsOfSameSizeList = trueDuplicateFiles.values();
                    for (Iterator<List<File>> dupsOfSameSizeListIterator = dupsOfSameSizeList.iterator(); dupsOfSameSizeListIterator
                            .hasNext();) {
                        List<File> list = (List<File>) dupsOfSameSizeListIterator.next();
                        // It is a duplicate only if we have more than one copy of it.
                        if (list.size() > 1) {
                            finalListOfDuplicates.add(list);
                            System.out.println("Duplicate sets found: " + finalListOfDuplicates.size());
                        }
                    }
                }
    
                return finalListOfDuplicates;
            }
    
            abstract class HashProvider {
                abstract String getHashHex(File file) throws IOException ;
            }
    
            class Md5HashProvider extends HashProvider {
                String getHashHex(File file) throws IOException {
                    // Close the stream ourselves; DigestUtils does not close it for us.
                    try (FileInputStream in = new FileInputStream(file)) {
                        return DigestUtils.md5Hex(in);
                    }
                }
            }

            class Sha1HashProvider extends HashProvider {
                String getHashHex(File file) throws IOException {
                    try (FileInputStream in = new FileInputStream(file)) {
                        return DigestUtils.sha1Hex(in);
                    }
                }
            }
        }
    